Unsupported data type

Causes:

The loop assigns one struct variable to another one. But the assignment operator is not defined inside the structure, so there is no translation of this struct assignment in terms of scalars.
The compiler does not support certain data types because there is no corresponding SIMD instruction.
The compiler cannot vectorize a loop containing complex, long, numeric types that do not fit in the vector register width.

C++ Example:

struct char4 {
    char c1;
    char c2;
    char c3;
    char c4;
};

extern struct char4 *a;
void vecmsg_testcore003 ()
{
    int i;
    const struct char4 n = {0, 0, 0, 0};
    #pragma omp simd
    for(i = 0; i < 1024; i++) {
        a[i] = n;
    }
}

Recommendations

Provide struct assignment operators in terms of scalars. For example:

struct char4 {
    char c1, c2, c3, c4;
    // operator= must be a non-static member function in C++.
    char4 &operator=(const char4 &x) {
        c1 = x.c1;
        c2 = x.c2;
        c3 = x.c3;
        c4 = x.c4;
        return *this;
    }
};

Use standard data types.
Use instruction sets that support wider vectors.

Not inner loop

Cause: In nested loop structures, the compiler targets the innermost loop for vectorization. The outer loop, by default, is not a target for vectorization; however, it may be a target for parallelization.
C++ Example:

#include <iostream>
#define N 25
int main()
{
    int a[N][N], b[N], i;
    for(int j = 0; j < N; j++)
    {
        for(int i = 0; i < N; i++)
            a[j][i] = 0;
        b[j] = 1;
    }
    int sum = __sec_reduce_add(a[:][:]) + __sec_reduce_add(b[:]);
    return 0;
}

Recommendation

In some cases it is possible to collapse a nested loop structure into a single loop structure using a directive before the outer loop. The n argument is an integer that specifies how many loops to collapse into one loop for vectorization.

Target	ICL/ICC/ICPC Directive	IFORT Directive
Outer loop	#pragma omp simd collapse(n), #pragma omp simd, or #pragma simd	!$OMP SIMD COLLAPSE(n), !$OMP SIMD, or !DIR$ SIMD
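
As a minimal sketch of the collapse approach, the following uses the OpenMP collapse(2) clause to treat both loops as one iteration space. The fill_grid function and the values it stores are hypothetical and not part of the original example. If OpenMP SIMD support is not enabled, the compiler ignores the pragma and the result is the same.

```cpp
constexpr int N = 25;

// Hypothetical helper: stores row * N + column in every grid element.
// collapse(2) asks the compiler to treat the two perfectly nested loops
// as a single iteration space of N * N iterations when vectorizing.
void fill_grid(int (&a)[N][N]) {
    #pragma omp simd collapse(2)
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            a[j][i] = j * N + i;
        }
    }
}
```

The collapse clause requires the loops to be perfectly nested, so a statement such as b[j] = 1 between the inner loop and the end of the outer loop would have to move into a separate loop first.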

Remainder loop vectorization possible but seems inefficient

Cause: The compiler vectorizer determined the remainder loop will not benefit from vectorization.
C++ Example:

#include <iostream>
#define N 70
int main() {
    static short tab1[N],
    tab2[N];
    int i, j;
    static short const data[] = {-32768, -256, -255, -128, -127, -1, 0, 1, 127, 128, 255, 256, 32767};
    for (j = i = 0; i < N; i++)
    {
        tab1[i] = i;
        tab2[i] = data[j++];
        if (j > 12) j = 0;
    }
    int sum = __sec_reduce_add(tab1[:]) + __sec_reduce_add(tab2[:]);
    return 0;
}

Recommendations

Force remainder vectorization using a directive before the loop:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source loop	#pragma vector vecremainder	!DIR$ SIMD VECREMAINDER

Disable remainder vectorization using a directive before the loop:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source loop	#pragma vector novecremainder	!DIR$ SIMD NOVECREMAINDER

Loop vectorization possible but seems inefficient

Cause: The compiler vectorizer determined the loop will not benefit from vectorization. Common reasons include:

Non-unit stride memory access
Indirect memory access
Low iteration count

C++ Example: The compiler vectorizer determines the cost of creating a vector operand (non-unit stride access in the vector operand creation) is significant when compared to the number/type of computations in which those vector operands are used.

#include <iostream>
#define N 100
struct s1 {
    int a, b, c;
};
int main() {
    // Zero-initialize so the sums are well defined.
    s1 arr[N] = {}, sum = {};
    for(int i = 0; i < N; i++) {
        sum.a += arr[i].a;
        sum.b += arr[i].b;
        sum.c += arr[i].c;
    }
    std::cout << sum.a << "\t" << sum.b << "\t" << sum.c << "\n";
    return 0;
}

Recommendations

If you still believe vectorization might result in a speedup, override the compiler cost model using a directive before the loop:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source loop	#pragma vector or #pragma vector always	!DIR$ VECTOR or !DIR$ VECTOR ALWAYS

Alternatively, use a compiler option to always vectorize loops. The compiler will still test for dependencies and will not vectorize the loop unless it is safe.

Windows* OS - ICL and IFORT Option	Linux* OS - ICC/ICPC and IFORT Option
/Qvec-threshold0	-vec-threshold0

Require vectorization using a directive before the loop. The compiler will not perform a dependency analysis; it is your responsibility to ensure vectorization is safe:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source loop	#pragma simd or #pragma omp simd	!DIR$ SIMD or !$OMP SIMD

Rewrite the data structure/loop to have more regular memory accesses.
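
One possible rewrite, sketched under the assumption that the three fields can be stored separately, replaces the array of structures with a structure of arrays so that each field is read with unit stride. The names S1Arrays, Sums, and sum_fields are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t N = 100;

// Structure of arrays: each field is contiguous in memory, so a loop
// over one field touches consecutive addresses (unit stride).
struct S1Arrays {
    int a[N];
    int b[N];
    int c[N];
};

struct Sums {
    int a, b, c;
};

// Sums each field; every read in the loop body is a unit-stride access.
Sums sum_fields(const S1Arrays &arr) {
    Sums sum = {0, 0, 0};
    for (std::size_t i = 0; i < N; i++) {
        sum.a += arr.a[i];
        sum.b += arr.b[i];
        sum.c += arr.c[i];
    }
    return sum;
}
```

Whether this layout pays off depends on how the rest of the program accesses the data, so it is worth measuring before and after.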

Conditional assignment to a scalar

Causes:

The loop has an assignment operation of a structure variable and there is a complex condition controlling this assignment.
The loop contains a conditional statement and one of the following is true:
- The conditional statement controls the assignment of a scalar value and the value of this variable is used in any of the next iterations or after the loop executes. Exception: loops searching for max, min values and their indices in the array.
- The value of the scalar when loop execution ends depends on the loop executing iterations in strict order.

C++ Example:

void foo(int *A, int *restrict B, int n, int* x) {
    int i;

    #pragma omp simd
    for (i = 0; i < n; i++)
    {
        if (A[i] > i)
            *x = i;
        else
            B[i] = *x;
    }

    B[i] = *x++;
}

Recommendations

Simplify or remove conditions in the loop by:

Dividing the loop into a group of sequential loops
Or using multiple temporary variables instead of one scalar variable
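
To illustrate dividing a loop into sequential loops, here is a small sketch with hypothetical functions fused and split. It assumes the arrays a and b do not overlap. The first loop of split has no conditional scalar assignment and is a plain vectorization candidate; the second keeps the order-dependent search on its own.

```cpp
// Original form: one loop mixes an independent array update with a
// conditional scalar assignment whose final value depends on order.
int fused(const int *a, int *b, int n, int threshold) {
    int last = -1;
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2;
        if (a[i] > threshold) last = i;
    }
    return last;
}

// Split form: the array update and the scalar search run as two
// sequential loops and produce the same results as fused().
int split(const int *a, int *b, int n, int threshold) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2;
    }
    int last = -1;
    for (int i = 0; i < n; i++) {
        if (a[i] > threshold) last = i;
    }
    return last;
}
```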

Read More:

C++ information at https://software.intel.com/en-us/articles/cdiag15336
Fortran information at https://software.intel.com/en-us/articles/fdiag15336
Vectorization Resources for Intel® Advisor Users

Assumed dependence between lines

Anti-dependency - Write after read (WAR) - is assumed in a loop.
True dependency - Read after write (RAW) - is assumed in a loop.

C++ Example: When the compiler tries to vectorize for SSE2 architecture, it chooses a vector length of 4 (because the data type it operates on is int). But when considering a vector operand instead of scalar operands for this loop, there is an overlap between the input vector and output vector. Anti-dependency occurs when the k value is positive; true dependency occurs when k value is negative.

#include <stdlib.h>
#define N 70
int main(int argc, char *argv[])
{
    int k = atoi(argv[1]);
    int a[N], i;
    for(i = abs(k); i < N; i++)
        a[i] = a[i+k] + 1;
    return 0;
}

Recommendations

Rewrite code to remove dependencies.
Run a Dependencies analysis to check if the loop has real dependencies.

If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:

Directive to ignore all dependencies in the loop

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma simd or #pragma omp simd	!DIR$ SIMD or !$OMP SIMD

Directive to ignore only vector dependencies (which is safer)

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma ivdep	!DIR$ IVDEP

restrict keyword

If anti-dependency exists, use a directive where k is smaller than the distance between dependent items in anti-dependency. This enables vectorization, as dependent items are put into different vectors:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma simd vectorlength(k)	!DIR$ SIMD VECTORLENGTH(k)
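
For the first recommendation (rewriting the code to remove dependencies), one option is an out-of-place version that reads from one array and writes to another. The sketch below is an assumption about how that might look, not the original author's fix: shift_add, in, and out are hypothetical names, and __restrict__ is a compiler extension accepted by GCC, Clang, and the Intel compilers on Linux.

```cpp
// Out-of-place version: reads come only from `in` and writes go only to
// `out`, so no iteration can read a value that another iteration wrote.
// Assumption: `in` and `out` never overlap (stated via __restrict__).
void shift_add(const int *__restrict__ in, int *__restrict__ out,
               int n, int k) {
    const int lo = k < 0 ? -k : 0;     // first i with i + k >= 0
    const int hi = k > 0 ? n - k : n;  // one past the last i with i + k < n
    for (int i = lo; i < hi; i++) {
        out[i] = in[i + k] + 1;
    }
}
```

Elements of out outside the range [lo, hi) are left untouched. Note that the results differ from the in-place loop when k is negative, because the in-place loop reads values it has already updated.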

Non-standard loop is not a vectorization candidate (C++)

Causes:

There is more than one loop exit point.
A SIMD loop uses C++ exception handling or an OpenMP critical construct.
The compiler cannot determine which function is passed as a function parameter.

Below are examples for all three scenarios.
C++ Example 1: There is more than one loop exit point.

void no_vec(float a[], float b[], float c[])
{
    int i = 0;
    while (i < 100) {
        a[i] = b[i] * c[i];
        // this is a data-dependent exit condition:
        if (a[i] < 0.0)
            break;
        ++i;
    }
}

Exception: Loops searching for an array element, as in the example below, can be automatically vectorized when array a[i] is aligned.

for (i = 0; i < n; ++i) {
    if (a[i] == to_find) {
        index = i;
        break;
    }
}

C++ Example 2: A SIMD loop uses C++ exception handling or an OpenMP critical construct.

#define N 1000
int foo() {
#pragma omp simd
    for (int i = 0; i < N; i++) {
        try {
            printf ("throw exception 11\n");
            throw 11;
        }
        catch (int t) {
            printf ("caught exception %d\n", t);
            if (t != 11) {
#pragma omp critical
                {
                    printf ("TEST FAILED\n");
                    exit (0);
                }
            }
        }
    }
    printf ("TEST PASSED\n");
    exit (0);
}

C++ Example 3: The compiler cannot determine which function is passed as a function parameter.

#include <iostream>
int a[100];
int b[100];

int g(int i, int y) {
    return b[i]+y;
}

__declspec(noinline) void doit1(int x(int,int), int y) {
    int i;
#pragma parallel
    for(i = 0; i < 100; i++)
        a[i] = x(i,y);
}

Recommendations

For Example 1, where there is more than one loop exit point: Ensure loops have a single entry and a single exit point.
For Example 2, where a SIMD loop uses C++ exception handling or an OpenMP critical construct: Remove C++ exception handling and OpenMP critical constructs from loops.
For Example 3, where the compiler cannot determine which function is passed as a function parameter: There is no resolution unless you can tell the compiler during compile time which function will be called within the loop body.
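
For Example 3, one way to make the callee known at compile time is to pass it as a template parameter instead of a function pointer. This is a sketch under that assumption: the template doit and the lambda wrapper are hypothetical and not part of the original example.

```cpp
int a[100];
int b[100];

inline int g(int i, int y) {
    return b[i] + y;
}

// The callable is a template parameter, so each instantiation knows
// exactly which function the loop body calls and can inline it.
template <typename F>
void doit(F f, int y) {
    for (int i = 0; i < 100; i++) {
        a[i] = f(i, y);
    }
}
```

A call such as doit([](int i, int y) { return g(i, y); }, 5) gives the compiler a unique callable type. Passing g directly would deduce a plain function pointer type again, which may leave the compiler with the same problem.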

Non-standard loop is not a vectorization candidate (Fortran)

Causes:

There is more than one loop exit point.
The iteration count is data dependent.
The loop contains a subroutine or function call that prevents vectorization.
There are other complex control structures. For example: There may be multiple GOTO statements.

Below are examples for the first three scenarios.
Fortran Example 1: There is more than one loop exit point.

subroutine d_15043(a,b,c,n)
    implicit none
    real, intent(in ), dimension(n) :: a, b
    real, intent(out), dimension(n) :: c
    integer, intent(in)             :: n
    integer                         :: i

    do i=1,n
        if(a(i) < 0.) exit
        c(i) = sqrt(a(i)) * b(i)
    enddo
end subroutine d_15043

Fortran Example 2: The iteration count is data dependent.

subroutine d_15043_2(a,b,c,n)
    implicit none
    real, intent(in ), dimension(n) :: a, b
    real, intent(out), dimension(n) :: c
    integer, intent(in)             :: n
    integer                         :: i

    i = 1
    do while (a(i) > 0.)
        c(i) = sqrt(a(i)) * b(i)
        i = i + 1
    enddo
end subroutine d_15043_2

Fortran Example 3: The loop contains a subroutine or function that prevents vectorization.

subroutine d_15043_3(a,b,c,n)
    implicit none
    real, intent(in ), dimension(n) :: a, b
    real, intent(out), dimension(n) :: c
    integer, intent(in)             :: n
    integer                         :: i

    do i=1,n
        call my_sub(a(i),b(i),c(i))
    enddo
end subroutine d_15043_3

Recommendations

For Example 1, where there is more than one loop exit point: Ensure:
- The loop has a single entry and a single exit point.
- The iteration count is constant and known to the loop on entry.
This loop can be vectorized if you replace exit with cycle, although the behavior is different.
For Example 2, where the iteration count is data dependent: Replace the do while construct with a counted do loop. For example:
```
do i=1,n
    if(a(i) > 0.) c(i) = sqrt(a(i)) * b(i)
enddo
```
If necessary, the iteration count can be pre-computed.
For Example 3, where the loop contains a subroutine or function call that prevents vectorization: Do one of the following:
- Inline the subroutine. For example: Use interprocedural optimization.
- Convert to a SIMD-enabled subroutine. For example: Use the !$OMP DECLARE SIMD directive.

Vector dependence prevents vectorization

Cause: The compiler detected or assumed a vector dependence in the loop.
C++ Example:

int foo(float *A, int n) {
    int inx = 0;
    float max = A[0];
    int i;
    for (i=0;i < n;i++) {
        if (max < A[i]) {
            max = A[i];
            inx = i*i;
        }
    }
    return inx;
}

Fortran Example:

integer function foo(a, n)
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real :: max
    integer :: inx, i

    inx = 1
    max = a(1)
    do i=1,n
        if (max < a(i)) then
            max = a(i)
            inx = i*i
        endif
    end do

    foo = inx
end function

Recommendations

Rewrite code to remove dependencies.
Run a Dependencies analysis to check if the loop has real dependencies. There are two types of dependencies:
- True dependency - Read after write (RAW)
- Anti-dependency - Write after read (WAR)

If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:

Directive to ignore all dependencies in the loop

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma simd or #pragma omp simd	!DIR$ SIMD or !$OMP SIMD

Directive to ignore only vector dependencies (which is safer)

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma ivdep	!DIR$ IVDEP

restrict keyword

If anti-dependency exists, use a directive where k is smaller than the distance between dependent items in anti-dependency. This enables vectorization, as dependent items are put into different vectors:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma simd vectorlength(k)	!DIR$ SIMD VECTORLENGTH(k)
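
For the C++ example above, one possible rewrite keeps only the position of the maximum inside the loop and computes i*i afterwards. The conditional assignment section lists loops searching for max and min values and their indices as an exception, so this form may be easier for the compiler to handle. The function name foo_rewritten is hypothetical, and the sketch assumes n is at least 1, as the original does when it reads A[0].

```cpp
// Track only the index of the maximum inside the loop, then derive the
// squared index once after the loop instead of on every update.
int foo_rewritten(const float *A, int n) {
    int idx = 0;
    float max = A[0];
    for (int i = 0; i < n; i++) {
        if (max < A[i]) {
            max = A[i];
            idx = i;
        }
    }
    return idx * idx;
}
```

The strict comparison means the first occurrence of the maximum wins, which matches the original loop.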

Call to function cannot be vectorized (C++)

Causes:

The loop has a call to a function that has no vector version.
A user-defined vector function cannot be vectorized because the function body invokes other functions that cannot be vectorized.

C++ Example:

#include <iostream>
#include <complex>
using namespace std;
int main() {
    float c[10];
    c[:] = 0.f;
    for(int i = 0; i < 10; i++)
        cout << c[i] << "\n";
    return 0;
}

Recommendations

If possible, define a vector version for the function using a construct:

Target	ICL/ICC/ICPC Construct
Source function	#pragma omp declare simd
Source function	__declspec(vector) (Windows OS) or __attribute__((vector)) (Linux OS)
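
As a minimal sketch of the OpenMP construct, the following declares a vector variant of a small function and calls it from a SIMD loop. The names scale_root and apply are hypothetical. If OpenMP SIMD support is not enabled, the pragmas are ignored and the code runs as ordinary scalar code.

```cpp
#include <cmath>

// Requests a vector variant of scale_root in addition to the scalar one.
#pragma omp declare simd
float scale_root(float x) {
    return std::sqrt(x) * 0.5f;
}

// The loop can call the vector variant when one has been generated.
void apply(const float *in, float *out, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        out[i] = scale_root(in[i]);
    }
}
```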

Call to function cannot be vectorized (Fortran)

Cause: A function call inside the loop is preventing auto-vectorization.
Fortran Example:

Program foo
    implicit none
    integer, parameter  :: nx = 100000000
    real(8)             :: x, xp, sumx
    integer             :: i
    interface
        real(8) function bar(x, xp)
            real(8), intent(in) :: x, xp
        end
    end interface

    sumx = 0.
    xp   = 1.
    do i = 1,nx
        x = 1.D-8*real(i,8)
        sumx = sumx + bar(x,xp)
    enddo
    print *, 'Sum =',sumx
end

real(8) function bar(x, xp)
    implicit none
    real(8), intent(in) :: x, xp

    bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3  + 0.2*(x-xp)**4
    bar = bar / sqrt(x**2 + xp**2)
end

Recommendations

If possible, define a vector version for the function using a construct:

Target	IFORT Construct
Source function	!$OMP DECLARE SIMD
Source function	ELEMENTAL keyword or !DIR$ ATTRIBUTES VECTOR

In this example you can vectorize the loop and function call using OpenMP* 4.0 or Intel® Cilk™ Plus explicit vector programming capabilities.

Add a !$OMP DECLARE SIMD directive to the function bar() and compile with the /Qopenmp-simd option (-qopenmp-simd on Linux) to generate a vectorized version of bar(). Add the same directive to the interface block for bar() inside program foo. The UNIFORM clause specifies that xp is a non-varying argument and has the same value for each loop iteration in the caller being vectorized. Thus x is the only vector argument. Without UNIFORM, the compiler must determine if xp could also be a vector argument.

real(8) function bar(x, xp)
!$OMP DECLARE SIMD (bar) UNIFORM(xp)
    implicit none
    real(8), intent(in) :: x, xp

    bar = 1. - 2.*(x-xp) + 3.*(x-xp)**2 - 1.5*(x-xp)**3  + 0.2*(x-xp)**4
    bar = bar / sqrt(x**2 + xp**2)
end

The code now generates a vectorized version of function bar(); however, the loop inside foo is still not vectorized because the compiler sees dependencies between loop iterations carried by both x and sumx. Unaided, the compiler could determine how to auto-vectorize a loop with just these dependencies, or vectorize a loop with just the function call, but not both. We can tell the compiler to vectorize the loop with a !$OMP SIMD directive that specifies the properties of x and sumx:

Program foo
    implicit none
    integer, parameter  :: nx = 100000000
    real(8)             :: x, xp, sumx
    integer             :: i

    interface
        real(8) function bar(x, xp)
        !$OMP DECLARE SIMD (bar) UNIFORM(xp)
            real(8), intent(in) :: x, xp
        end
    end interface

    sumx = 0.
    xp   = 1.

    !$OMP SIMD  private(x)  reduction(+:sumx)
    do i = 1,nx
        x = 1.D-8*real(i,8)
        sumx = sumx + bar(x,xp)
    enddo
    print *, 'Sum =',sumx
end

The loop now vectorizes successfully, and running the application shows a performance speedup.

For small functions such as bar() , inlining may be a simpler and more efficient way to achieve vectorization of loops containing function calls. When the caller and callee are in separate source files, as above, build the application with interprocedural optimization ( -ipo or /Qipo ). When the caller and callee are in the same source file, inlining of small functions is enabled by default at optimization level O2 and above.

Cannot compute loop iteration count before executing the loop (C++)

Causes:

The loop iteration count is not available before the loop executes.
The compiler cannot determine if there is aliasing between all the pointers used inside the loop and loop boundaries.

C++ Example 1: The upper bound of the loop iteration count is controlled by bar(), whose implementation is available in this compilation unit. Because the loop iteration count is not available before the loop executes, the compiler cannot determine:

How to map the loop to vector registers
If it needs to create peeled and remainder loops
Whether it has enough iterations to saturate at least one vector register

void foo(float *A) {
    int i;
    int OuterCount = 90;
    while (OuterCount > 0) {
        for (i = 1; i < bar(int(A[0])); i++) {
            A[i] = i + 4;
        }
        OuterCount--;
    }
}

C++ Example 2: The compiler cannot determine if there is aliasing between all the pointers used inside the loop and loop boundaries.

struct Dim { int x, y, z; };
Dim dim;
double* B;

void foo (double* A) {
    for (int i = 0; i < dim.x; i++) {
        A[i] = B[i];
    }
}

Recommendations

For Example 1, where the loop iteration count is not available before the loop executes: If the loop iteration count and iterations lower bound can be calculated for the whole loop:
- Move the calculation outside the loop using an additional variable.
- Rewrite the loop to avoid goto statements or other early exits from the loop that prevent vectorization.
- Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
```
void foo(float *A) {
    int i;
    int OuterCount = 90;
    int limit = bar(int(A[0]));
    while (OuterCount > 0) {
        for (i=1; i < limit; i++) {
            A[i] = i + 4;
        }
        OuterCount--;
    }
}
```

For Example 2, where the compiler cannot determine if there is aliasing between all the pointers used inside the loop and loop boundaries: Assign the loop boundary value to a local variable. In most cases, this is enough for the compiler to determine aliasing may not occur.
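
A minimal sketch of that change for Example 2 follows. The local variable n is a hypothetical name. Because its address is never taken, a store through A cannot modify it, so the trip count is fixed on loop entry.

```cpp
struct Dim { int x, y, z; };
Dim dim;
double *B;

void foo(double *A) {
    // Copy the global loop bound into a local variable before the loop.
    const int n = dim.x;
    for (int i = 0; i < n; i++) {
        A[i] = B[i];
    }
}
```

This addresses only the loop bound. The compiler may still have to assume that A and B overlap, which the directive and keyword options in this section cover.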

You can use a directive to accomplish the same thing automatically.

Target	ICL/ICC/ICPC Directive
Source loop	#pragma simd or #pragma omp simd

Do not use global variables or indirect accesses as loop boundaries unless you also use one of the following:

Directive to ignore vector dependencies

Target	ICL/ICC/ICPC Directive
Source loop	#pragma ivdep

restrict keyword

Cannot compute loop iteration count before executing the loop (Fortran)

Cause: The loop iteration count is not available before the loop executes.
Fortran Example:

subroutine foo(a, n)
    implicit none
    integer, intent(in) :: n
    double precision, intent(inout) :: a(n)
    integer :: bar
    integer :: i

    i=1
    100    CONTINUE
    a(i)=0
    i=i+1
    if (i <= bar()) goto 100

end subroutine foo

Recommendations

If the loop iteration count and iterations lower bound can be calculated for the whole loop:

Move the calculation outside the loop using an additional variable.
Rewrite the loop to avoid goto statements or other early exits from the loop that prevent vectorization.
Identify the loop iterations lower bound using a constant.

Volatile assignment was not vectorized

Cause: Any usage of volatile variables in the loop causes this diagnostic.
C++ Example:

volatile int32_t x;
int32_t a[c_size];
for (int32_t i = 0; i < c_size; ++i) {
    a[i] = exp(x + i);
    x = a[i];
}

Recommendations

Avoid using volatile variables. For example, reassign them to regular variables.
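
As a sketch of reassigning a volatile to a regular variable, the following reads a volatile value once before the loop and uses a local copy inside it. The names scale and scale_all are hypothetical. This assumes nothing else needs to change or observe the volatile during the loop; if that assumption does not hold, the rewrite changes program behavior.

```cpp
#include <cstdint>

volatile int32_t scale = 3;

// Read the volatile once, then run the loop on a regular local copy so
// the loop body contains no volatile access.
void scale_all(int32_t *a, int n) {
    const int32_t s = scale;  // single volatile read
    for (int i = 0; i < n; ++i) {
        a[i] *= s;
    }
}
```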
Read More:

C++ information at https://software.intel.com/en-us/articles/cdiag15529
Fortran information at https://software.intel.com/en-us/articles/fdiag15529
Vectorization Resources for Intel® Advisor Users

Compile time constraints prevent loop optimization

Cause: Internal time limits for the optimization level prevented the compiler from determining a vectorization approach for this loop.

Recommendations

When specifying code optimization, use the following compiler option to enable the compiler vectorization engine and provide detailed diagnostics about vectorization possibilities for this loop.

Windows* OS - ICL and IFORT Option	Linux* OS - ICC/ICPC and IFORT Option
/O3	-O3

Inner loop throttling prevents vectorization of this outer loop

Cause: The inner loop has an irregular structure. For example, it may have non-constant lower and upper bounds, a non-constant step for iterations, more than one entry, some assembly parts, volatile variables, long jumps, or complex switch clauses.

Recommendations

See the inner loop message for more details and simplify the inner loop structure.
Read More:

C++ information at https://software.intel.com/en-us/articles/cdiag15536
Fortran information at https://software.intel.com/en-us/articles/fdiag15536
Vectorization Resources for Intel® Advisor Users

Outer loop was not auto-vectorized

Cause: The compiler vectorizer determined outer loop vectorization is not possible using auto-vectorization.
C++ Example:

void foo(float **a, float **b, int N) {
    int i, j;
#pragma ivdep
    for (i = 0; i < N; i++) {
        float *ap = a[i];
        float *bp = b[i];
        for (j = 0; j < N; j++) {
            ap[j] = bp[j];
        }
    }
}

Fortran Example:

subroutine foo(a, n1, n)
    implicit none
    integer, intent(in) :: n, n1
    real, intent(inout) :: a(n,n1)
    integer :: i, j
    do i=1,n1
        do j=2,n
            a(j,i) = a(j-1,i)+1
        end do
    end do
end subroutine foo

Recommendations

Run a Dependencies analysis to check if the loop has real dependencies. There are two types of dependencies:
- True dependency - Read after write (RAW)
- Anti-dependency - Write after read (WAR)

If no dependencies exist, use one of the following to tell the compiler it is safe to vectorize:

Directive to ignore all dependencies in the loop

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma simd or #pragma omp simd	!DIR$ SIMD or !$OMP SIMD

Directive to ignore only vector dependencies (which is safer)

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma ivdep	!DIR$ IVDEP

restrict keyword

If anti-dependency exists, use a directive where k is smaller than the distance between dependent items in anti-dependency. This enables vectorization, as dependent items are put into different vectors:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Source Loop	#pragma simd vectorlength(k)	!DIR$ SIMD VECTORLENGTH(k)

If using the O3 compiler option, use a directive before the inner and outer loops to request vectorization of the outer loop:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Inner loop	#pragma novector	!DIR$ NOVECTOR
Outer loop	#pragma vector always	!DIR$ VECTOR ALWAYS

Inner loop was already vectorized

Cause: The inner loop in a nested loop is vectorized.
C++ Example:

#define N 1000
float A[N][N];
void foo(int n) {
    int i,j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[i][j]++;
        }
    }
}

Fortran Example:

subroutine foo(a, n1, n)
    implicit none
    integer, intent(in) :: n, n1
    real, intent(inout) :: a(n1,n1)
    integer :: i, j

    do i=1,n
        do j=1,n
            a(j,i) = a(j,i) + 1
        end do
    end do
end subroutine foo

Recommendations

Force vectorization of the outer loop:

In some cases it is possible to collapse a nested loop structure into a single loop structure using a directive before the outer loop. The n argument is an integer that specifies how many loops to collapse into one loop for vectorization:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Outer loop	#pragma omp simd collapse(n), #pragma omp simd, or #pragma simd	!$OMP SIMD COLLAPSE(n), !$OMP SIMD, or !DIR$ SIMD

If using the O3 compiler option, use a directive before the inner and outer loops to request vectorization of the outer loop:

Target	ICL/ICC/ICPC Directive	IFORT Directive
Inner loop	#pragma novector	!DIR$ NOVECTOR
Outer loop	#pragma vector always	!DIR$ VECTOR ALWAYS
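
The placement of the two directives from the table might look like the sketch below. The function name increment is hypothetical. Both pragmas are specific to the Intel compilers; other compilers ignore unknown pragmas, possibly with a warning, and produce the same results.

```cpp
constexpr int N = 64;
float A[N][N];

// Adds 1 to the top-left n x n block of A. Assumes n <= N.
void increment(int n) {
    // Request vectorization of the outer loop.
    #pragma vector always
    for (int i = 0; i < n; i++) {
        // Ask the compiler not to vectorize the inner loop.
        #pragma novector
        for (int j = 0; j < n; j++) {
            A[i][j] += 1.0f;
        }
    }
}
```

Whether outer loop vectorization is actually faster than the default inner loop vectorization depends on the access pattern, so compare timings for both.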

Low trip count

Cause: The loop lacks sufficient iterations to benefit from vectorization.
C++ Example:

#define TTT char
TTT A[15];
TTT foo(int n) {
    TTT sum=0;
    int i;
    for (i = 0; i < n; i++) {
        sum+=A[i];
    }
    return sum;
}

Fortran Example:

integer (kind=1) :: A(15), sum, i
sum=0
do i=1,15
    sum=sum+A(i)
end do

Recommendations

Rewrite your code to increase the number of loop iterations to fill at least one full vector.
Run a Trip Counts analysis to check the number of iterations and loop efficiency. A loop with iterations equal to a power of 2 can vectorize even if the trip count is low.
Do not vectorize a loop with so few iterations (because it incurs overhead).

Tell the compiler to enforce vectorization using a directive, and compare performance before and after vectorization.

Target	ICL/ICC/ICPC Construct	IFORT Construct
Source loop	#pragma omp simd or #pragma simd	!$OMP SIMD or !DIR$ SIMD
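
One way to fill a full vector, sketched here under the assumption of 16 lanes for 8-bit data (for example, a 128-bit register), is to pad the array with zeros up to a multiple of the lane count and loop over the padded length. All names in this sketch are hypothetical.

```cpp
#include <cstdint>

constexpr int kLanes = 16;  // assumed vector width in 8-bit elements
constexpr int kCount = 15;  // number of real elements
constexpr int kPadded = ((kCount + kLanes - 1) / kLanes) * kLanes;

// Elements beyond kCount stay zero, so they do not change the sum.
int8_t A[kPadded] = {};

int8_t sum_padded() {
    int8_t sum = 0;
    // The trip count is a compile-time constant and a multiple of kLanes.
    for (int i = 0; i < kPadded; i++) {
        sum = static_cast<int8_t>(sum + A[i]);
    }
    return sum;
}
```

Padding works for a sum because zero is neutral for addition. Other operations need a different neutral value, or the padded results must be discarded afterwards.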

Loop with early exits cannot be vectorized unless it meets search loop idiom criteria

Cause: The compiler did not recognize a search idiom in a loop that may exit early. For example: The loop body contains:

A conditional exit or GOTO statement followed by calculations
A potential exception - the compiler considers an exception another possible exit (C++ only)

C++ Example:
Early exit

void c15520(float a[], float b[], float c[], int n)
{
    int i;
    for(i=0; i<n; i++)
    {
        if(a[i] < 0.) break;
        c[i] = sqrt(a[i]) * b[i];
    }
}

Exception

// For Compiler 16.1 and higher this example generates Diagnostic 15333 instead
__attribute__((vector)) void f1(double);
int main()
{
    int n = 10000;
    double a[n];
    #pragma simd
    for(int i = 0 ; i < n ; i++)
        f1(a[i]);
}

Fortran Example:

subroutine f15520(a,b,c,n)
  implicit none
  real, intent(in ), dimension(n) :: a, b
  real, intent(out), dimension(n) :: c
  integer, intent(in)             :: n
  integer                         :: i

  do i=1,n
     if(a(i).lt.0.) exit
     c(i) = sqrt(a(i)) * b(i)
  enddo

end subroutine f15520

Recommendations

Split the loop into two loops:
- A search loop that has an early exit but still meets the search idiom criteria
- A computational loop without early exits
Ensure the loop has a single entry and a single exit point.
Avoid exceptions within the loop body by marking functions as nothrow .

C++ Example:
Split the loop into a search loop and computational loop.

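void c15520(float a[], float b[], float c[], int n)
{
    int i, j;
    // Search loop: find the index of the first negative element.
    for(i=0; i<n; i++)
    {
        if(a[i] < 0.) break;
    }

    // Computational loop: process only the elements before that index.
    for(j=0; j<i; j++)
    {
        c[j] = sqrt(a[j]) * b[j];
    }
}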

Mark the function in the loop as nothrow .

__attribute__((vector, nothrow)) void f1(double);
int main()
{
    int n = 10000;
    double a[n];
    #pragma simd
    for(int i = 0 ; i < n ; i++)
        f1(a[i]);
}

Fortran Example:
Split the loop into a search loop and computational loop.

subroutine f15520(a,b,c,n)
    implicit none
    real, intent(in ), dimension(n) :: a, b
    real, intent(out), dimension(n) :: c
    integer, intent(in)             :: n
    integer                         :: i, j

    do i=1,n
        if(a(i).lt.0.) exit
    enddo
         
    do j=1,i-1
        c(j) = sqrt(a(j)) * b(j)
    enddo

end subroutine f15520

Exception handling for a call prevents vectorization

Cause: The compiler automatically generates a try block for a program block (that is, code inside {}) when it allocates a large, local object or array on the heap (because the object is too big to allocate on the stack) and a function within the block could throw an exception.
C++ Example:

__attribute__((vector)) void f1(double);
int main()
{
    int n = 10000;
    double a[n];
    #pragma simd
    for(int i = 0 ; i < n ; i++)
        f1(a[i]);
}

Recommendations

Avoid exceptions within a vectorizable loop body by marking functions as nothrow .

__attribute__((vector, nothrow)) void f1(double);

Non-vectorizable loop instance from multiversioning (C++)

Cause: The compiler doesn't get enough information from the code to create one version of the loop. In the example below, the compiler takes a defensive stand and generates both vectorized and non-vectorized versions of the loop because it assumes memory aliasing (the pointers could be pointing to overlapping memory locations).
C++ Example:

void foo(float *a, float *b, float *c){
    for(int i = 0 ; i < 256; i++)
        c[i] = a[i] * b[i];
    return;
}

Recommendations

If you are sure that there is no memory aliasing, then use __restrict__ keywords to qualify the pointers passed as arguments as non-overlapping in memory.
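
Applied to the example above, the qualified signature might look like this sketch. It assumes the caller guarantees that a, b, and c never overlap; if they do, the behavior is undefined. __restrict__ is a compiler extension accepted by GCC, Clang, and the Intel compilers on Linux.

```cpp
// The __restrict__ qualifiers promise that a, b, and c never overlap,
// so the compiler does not need a separate non-vectorized version.
void foo(const float *__restrict__ a, const float *__restrict__ b,
         float *__restrict__ c) {
    for (int i = 0; i < 256; i++) {
        c[i] = a[i] * b[i];
    }
}
```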

Non-vectorizable loop instance from multiversioning (Fortran)

Cause: The compiler doesn't get enough information from the code to create one version of the loop. In the example below, the compiler takes a defensive stand and generates three versions of the loop: for k=0, k>0, and k<0. The version for k<0 cannot be safely vectorized because each later iteration may depend on the result of earlier iterations.
Fortran Example:

subroutine add(k, a)
    integer :: k
    real :: a(20)

    do i = 1, 20
        a(i) = a(i+k) * 2.0
    end do
end subroutine add

Recommendations

To override the compiler default behavior, insert the !DIR$ IVDEP directive. The IVDEP directive tells the compiler it can safely ignore potential dependencies, so it does not need to generate special code for the case of k<0.