Intel® C++ Compiler 16.0 User and Reference Guide
This topic provides more information on the interaction between the auto-vectorizer and loops.
Combine the [Q]parallel and [Q]x options to instruct the Intel® C++ Compiler to attempt both Auto-Parallelization and automatic loop vectorization in the same compilation.
Using the [Q]parallel option enables parallelization for both Intel® microprocessors and non-Intel microprocessors. The resulting executable may get additional performance gains on Intel® microprocessors compared to non-Intel microprocessors. The parallelization can also be affected by certain options, such as /arch (Windows*), -m (Linux* and OS X*), or [Q]x.
Using the [Q]x option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gains on Intel® microprocessors compared to non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch (Windows*), -m (Linux* and OS X*), or [Q]x.
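The following command lines are illustrative sketches only (the source file name sample.cpp is hypothetical); they request auto-parallelization together with vectorization targeting Intel® AVX, plus an optimization report covering both phases:

> icc -c -parallel -xAVX -qopt-report=2 -qopt-report-phase=par,vec sample.cpp      (Linux* and OS X*)
> icl /c /Qparallel /QxAVX /Qopt-report:2 /Qopt-report-phase:par,vec sample.cpp    (Windows*)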
In most cases, the compiler will consider outermost loops for parallelization and innermost loops for vectorization. If deemed profitable, however, the compiler may even apply loop parallelization and vectorization to the same loop.
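As an illustration, consider the following sketch of a loop nest (the function name scale_rows is hypothetical and the code is not taken from this guide; it assumes the -restrict option, as in Example 2 below). When compiled with the options described above, the compiler may distribute the outer loop across threads and vectorize the inner loop:

void scale_rows(int m, int n, float *restrict a, const float *restrict v)
{
  for (int i = 0; i < m; i++) {      // outer loop: candidate for auto-parallelization
    for (int j = 0; j < n; j++) {    // inner loop: candidate for vectorization
      a[i*n + j] = a[i*n + j] * v[j];   // scale each row element-wise by v
    }
  }
}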
See Programming with Auto-parallelization and Programming Guidelines for Vectorization.
In some rare cases, a successful loop parallelization (either automatically or by means of OpenMP* directives) may affect the messages reported by the compiler for a non-vectorizable loop in a non-intuitive way; for example, in the cases where the Qopt-report:2 Qopt-report-phase:vec (Windows) or qopt-report=2 qopt-report-phase=vec (Linux and OS X) options indicate that loops were not successfully vectorized.
For integer loops, the 128-bit Intel® Streaming SIMD Extensions (Intel® SSE) and the Intel® Advanced Vector Extensions (Intel® AVX) provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types, with limited support for the 64-bit integer data type.
Vectorization may proceed if the final precision of integer wrap-around arithmetic is preserved. A 32-bit shift-right operator, for instance, is not vectorized in 16-bit mode if the final stored value is a 16-bit integer. Also, note that because the Intel® SSE and the Intel® AVX instruction sets are not fully orthogonal (shifts on byte operands, for instance, are not supported), not all integer operations can actually be vectorized.
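As a sketch (the function name add_shorts is hypothetical; compile with -restrict, as in Example 2 below), a loop such as the following operates entirely on 16-bit integers with wrap-around semantics preserved, so it is a typical candidate for the 16-bit integer SIMD instructions:

void add_shorts(int n, short *restrict a,
                const short *restrict b, const short *restrict c)
{
  for (int i = 0; i < n; i++) {
    a[i] = (short)(b[i] + c[i]);   // 16-bit add; the wrap-around result is stored as 16 bits
  }
}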
For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, Intel® SSE provides SIMD instructions for the following arithmetic operators:
addition (+)
subtraction (-)
multiplication (*)
division (/)
Additionally, Intel® SSE provides SIMD instructions for the binary MIN and MAX operators and the unary SQRT operator. SIMD versions of several other mathematical operators (such as the trigonometric functions SIN, COS, and TAN) are supported in software in a vector mathematical run-time library that is provided with the Intel® Compiler.
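As a sketch (the function name wave is hypothetical and the code is not taken from this guide; it assumes the -restrict option, as in Example 2 below), the following loop combines the arithmetic operators above with sqrtf and the transcendental sinf, which the compiler may map to the vector mathematical run-time library:

#include <math.h>

void wave(int n, float *restrict y,
          const float *restrict x, const float *restrict w)
{
  for (int i = 0; i < n; i++) {
    // +, *, and / map to SIMD arithmetic, sqrtf to a SIMD square root,
    // and sinf may be handled by the vector math run-time library
    y[i] = sinf(w[i] * x[i]) / sqrtf(1.0f + x[i] * x[i]);
  }
}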
To be vectorizable, loops must be:
Countable: The loop trip count must be known at entry to the loop at runtime, though it need not be known at compile time (that is, the trip count can be a variable but the variable must remain constant for the duration of the loop). This implies that exit from the loop must not be data-dependent.
Single entry and single exit: This is implied by the requirement that the loop be countable. Consider the following example of a loop that is not vectorizable due to a second, data-dependent exit:
Example 1: Non-vectorizable Loop

void no_vec(float a[], float b[], float c[])
{
  int i = 0;
  while (i < 100) {
    a[i] = b[i] * c[i];
    // this is a data-dependent exit condition:
    if (a[i] < 0.0)
      break;
    ++i;
  }
}

> icc -c -O2 -qopt-report=2 -qopt-report-phase=vec two_exits.cpp
two_exits.cpp(4) (col. 9): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Contain straight-line code: SIMD instructions perform the same operation on data elements from multiple iterations of the original loop, so different iterations cannot follow different control flow; that is, they must not branch. It follows that switch statements are not allowed. However, if statements are allowed if they can be implemented as masked assignments, which is usually the case. The calculation is performed for all data elements, but the result is stored only for those elements for which the mask evaluates to true. To illustrate this point, consider the following example, which may be vectorized:
Example 2: Evaluation of a Vectorizable Loop

#include <math.h>

void quad(int length, float *a, float *b, float *c,
          float *restrict x1, float *restrict x2)
{
  for (int i=0; i<length; i++) {
    float s = b[i]*b[i] - 4*a[i]*c[i];
    if ( s >= 0 ) {
      s = sqrt(s);
      x2[i] = (-b[i]+s)/(2.*a[i]);
      x1[i] = (-b[i]-s)/(2.*a[i]);
    }
    else {
      x2[i] = 0.;
      x1[i] = 0.;
    }
  }
}

> icc -c -restrict -qopt-report=2 -qopt-report-phase=vec quad.cpp
quad.cpp(5) (col. 3): remark: LOOP WAS VECTORIZED.
Innermost loop of a nest: The only exception is when an original outer loop is transformed into an inner loop as a result of a prior optimization phase, such as unrolling, loop collapsing, or interchange, or when an original outermost loop is transformed into an innermost loop due to loop materialization.
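For illustration (the function name add_matrices is hypothetical; the sketch assumes the -restrict option, as in Example 2), in a loop nest such as the following only the innermost j loop is normally considered for vectorization, unless a prior transformation such as collapsing the nest makes a different loop innermost:

void add_matrices(int m, int n, float *restrict a,
                  const float *restrict b, const float *restrict c)
{
  for (int i = 0; i < m; i++) {       // outer loop: not a direct vectorization candidate
    for (int j = 0; j < n; j++) {     // innermost loop: considered for vectorization
      a[i*n + j] = b[i*n + j] + c[i*n + j];
    }
  }
}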
Without function calls: Even a print statement is sufficient to prevent a loop from getting vectorized. The vectorization report message is typically: non-standard loop is not a vectorization candidate. The two major exceptions are for intrinsic math functions and for functions that may be inlined.
Intrinsic math functions such as sin(), log(), fmax(), and so on, are allowed, because the compiler runtime library contains vectorized versions of these functions. See the table below for a list of these functions; most exist in both float and double versions.
acos  | ceil   | fabs  | round
acosh | cos    | floor | sin
asin  | cosh   | fmax  | sinh
asinh | erf    | fmin  | sqrt
atan  | erfc   | log   | tan
atan2 | erfinv | log10 | tanh
atanh | exp    | log2  | trunc
cbrt  | exp2   | pow   |
The loop in the following example may be vectorized because sqrt() is vectorizable and func() gets inlined. Inlining is enabled at default optimization for functions in the same source file. An inlining report may be obtained by setting the options Qopt-report:2 Qopt-report-phase:ipo (Windows) or qopt-report=2 qopt-report-phase=ipo (Linux).
Example 3: Inlining of a Vectorizable Loop

#include <math.h>

float func(float x, float y, float xp, float yp)
{
  float denom;
  denom = (x-xp)*(x-xp) + (y-yp)*(y-yp);
  denom = 1./sqrtf(denom);
  return denom;
}

float trap_int(float y, float x0, float xn, int nx, float xp, float yp)
{
  float x, h, sumx;
  int i;

  h = (xn-x0) / nx;
  sumx = 0.5*( func(x0,y,xp,yp) + func(xn,y,xp,yp) );

  for (i=1; i<nx; i++) {
    x = x0 + i*h;
    sumx = sumx + func(x,y,xp,yp);
  }

  sumx = sumx * h;
  return sumx;
}

// Command line and output:
> icc -c -qopt-report=2 -qopt-report-phase=vec trap_integ.c
trap_integ.c(16) (col. 3): remark: LOOP WAS VECTORIZED.
The vectorizable operations are different for floating-point and integer data.
Integer Array Operations
The statements within the loop body may contain char, unsigned char, short, unsigned short, int, and unsigned int data. Calls to functions such as sqrt and fabs are also supported. Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division (via a run-time library call), multiplication, min, and max. You can mix data types, but doing so may lower efficiency. Examples of operators where you can mix data types are multiplication, shift, and unary operators.
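As a sketch (the function name mask_and_clip is hypothetical; the example assumes the -restrict option, as in Example 2), the following loop stays within the supported integer operation set, using addition, bitwise AND, and a clipping pattern on a single 32-bit integer data type:

void mask_and_clip(int n, int *restrict out,
                   const int *restrict a, const int *restrict b)
{
  for (int i = 0; i < n; i++) {
    int v = a[i] + (b[i] & 0xFF);    // addition and bitwise AND on 32-bit integers
    out[i] = v > 255 ? 255 : v;      // clipping idiom the compiler can map to a SIMD min
  }
}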
No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64, __m128, and __m256 data types are not vectorizable. The loop body cannot contain any function calls. Use of Intel® SSE intrinsics (for example, _mm_add_ps) or Intel® AVX intrinsics (for example, _mm256_add_ps) is not allowed.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804