Using Automatic Vectorization

Vectorization Speed-up

Where does the vectorization speedup come from? Consider the following sample code fragment, where a, b and c are integer arrays:

Sample Code Fragment
for (i=0; i<=MAX; i++) c[i]=a[i]+b[i];

If vectorization is not enabled (that is, you compile using the O1 or [Q]vec- options), the compiler processes each iteration separately, leaving most of each SIMD register unused even though the register could hold three additional integers. If vectorization is enabled (you compile using O2 or higher), the compiler may use that additional register space to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (O2) or higher.
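As a rough illustration of the idea (not of the code the compiler actually emits), the sketch below writes the same element-wise addition twice in plain C: once one element at a time, and once in groups of four, which is how a 128-bit SIMD register holding four 32-bit integers would be used. The function names and the constant LANES are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumption: four 32-bit ints fit in one 128-bit SIMD register */

/* Scalar form: one addition per loop iteration. */
void add_scalar(const int *a, const int *b, int *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Conceptual model of the vectorized form: each trip of the main loop
   stands for a single SIMD add that handles LANES elements at once. */
void add_four_wide(const int *a, const int *b, int *c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        /* a real vector loop would perform these four adds in one instruction */
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)  /* remainder loop for the leftover elements */
        c[i] = a[i] + b[i];
}
```

Both functions produce the same output array; the second one only makes the grouping and the remainder loop visible.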

Note

Using this option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel® microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as /arch (Windows*), -m (Linux* and OS X*), or [Q]x.

Tip

To allow comparisons between vectorized and not-vectorized code, disable vectorization using the /Qvec- (Windows*) or -no-vec (Linux* or OS X*) option; enable vectorization using the O2 option.

To get information on whether a loop was vectorized or not, enable generation of the optimization report using the options /Qopt-report:1 /Qopt-report-phase:vec (Windows) or -qopt-report=1 -qopt-report-phase=vec (Linux and OS X). These options generate a separate report in an *.optrpt file that includes optimization messages. In Visual Studio, the program source is annotated with the report's messages, or you can read the resulting .optrpt file using a text editor. A message appears for every loop that is vectorized, such as:

Example: Vectorization Report
> icl /Qopt-report:1 /Qopt-report-phase:vec Multiply.c
Multiply.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

The source line number (92 in the above example) refers to either the beginning or the end of the loop.

To get details about the type of loop transformations and optimizations that took place, use the [Q]opt-report-phase option by itself or along with the [Q]opt-report option.

To get information on whether the loop was vectorized using the Visual Studio* IDE, select Project > Properties > C/C++ > Diagnostics > Optimization Diagnostic Level as Level 1 (/Qopt-report:1) and Optimization Diagnostic Phase as Loop Nest Optimization (/Qopt-report-phase:loop). To get a diagnostic message for every loop that was not vectorized, with a brief explanation of why the loop was not vectorized, select /Qopt-report-phase:vec.

How significant is the performance enhancement? To evaluate performance enhancement yourself, run vec_samples:

Open an Intel® Compiler command line window.
- On Windows*: Under the Start menu item for your Intel product, select an icon under Compiler and Performance Libraries > Command Prompt with Intel Compiler
- On Linux* and OS X*: Source an environment script such as compilervars.sh or the compilervars.csh in the <installdir>/bin directory and use the attribute appropriate for the architecture.

Navigate to the <install-dir>\Samples\<locale>\C++\ directory. On Windows, unzip the sample project vec_samples.zip to a writable directory. This small application multiplies a vector by a matrix using the following loop:

Example: Vector Matrix Multiplication
for (j = 0;j < size2; j++) { b[i] += a[i][j] * x[j]; }
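For context, the inner loop above sits inside an outer loop over the rows of the matrix. A minimal sketch of such a routine follows; the function name matvec, the fixed column count COLS, and the parameter names are assumptions for illustration and are not taken from the vec_samples source.

```c
#include <assert.h>

#define COLS 4  /* hypothetical fixed row length for this sketch */

/* For each row i, adds the dot product of row i of a with x to b[i]. */
void matvec(int rows, int cols, double a[][COLS], double b[], const double x[])
{
    assert(cols <= COLS);
    for (int i = 0; i < rows; i++) {
        /* inner loop: unit-stride reads of a[i][j] and x[j],
           accumulating into the loop-invariant element b[i] */
        for (int j = 0; j < cols; j++) {
            b[i] += a[i][j] * x[j];
        }
    }
}
```

The inner loop is the candidate for vectorization because consecutive values of j touch adjacent memory locations in both a and x.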

Build and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option. Note the time taken for the application to run.

Example: Building and Running an Application without Auto-vectorization

// (Linux* and OS X* with EDG compiler)
icc -O2 -no-vec  Multiply.c -o NoVectMult 
./NoVectMult

// (OS X* with CLANG compiler)
icl -O2 -no-vec  Multiply.c -o NoVectMult 
./NoVectMult

// (Windows*)
icl /O2 /Qvec- Multiply.c /FeNoVectMult 
NoVectMult

Now build and run the application, this time with auto-vectorization. Note the time taken for the application to run.

Example: Building and Running an Application with Auto-vectorization

// (Linux* and OS X* with EDG compiler)
icc -O2 -qopt-report=1 -qopt-report-phase=vec Multiply.c -o VectMult
./VectMult

// (OS X* with CLANG compiler)
icl -O2 -qopt-report=1 -qopt-report-phase=vec Multiply.c -o VectMult
./VectMult

// (Windows*)
icl /O2 /Qopt-report:1 /Qopt-report-phase:vec Multiply.c /FeVectMult 
VectMult

When you compare the timing of the two runs, you may see that the vectorized version runs faster. The non-vectorized version runs only slightly faster than a build compiled with the O1 option would.

Obstacles to Vectorization

The following do not always prevent vectorization, but frequently either prevent it or cause the compiler to decide that vectorization would not be worthwhile.

Non-contiguous memory access: Four consecutive integers or floating-point values, or two consecutive doubles, may be loaded directly from memory in a single SSE instruction. But if the four integers are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient. The most common examples of non-contiguous memory access are loops with non-unit stride or with indirect addressing, as in the examples below. The compiler rarely vectorizes such loops, unless the amount of computational work is large compared to the overhead from non-contiguous memory access.

Example: Non-contiguous Memory Access

// arrays accessed with stride 2
for (int i=0; i<SIZE; i+=2)  b[i] += a[i] * x[i];

// inner loop accesses a with stride SIZE
for (int j=0; j<SIZE; j++)  {
  for (int i=0; i<SIZE; i++)   b[i] += a[i][j] * x[j];
}

// indirect addressing of x using index array
for (int i=0; i<SIZE; i+=2)  b[i] += a[i] * x[index[i]];

The typical message from the vectorization report is: vectorization possible but seems inefficient, although indirect addressing may also result in the following report: Existence of vector dependence.
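When the loop bounds allow it, one common remedy for the stride-SIZE case is to swap the two loops so that the inner loop walks along a row. This is a general technique (loop interchange) rather than something prescribed by the text above, and the function names and the small SIZE below are invented for the sketch. With integer data both functions compute the same result; only the access pattern of a differs. With floating-point data the changed summation order could alter rounding.

```c
#include <assert.h>

#define SIZE 4  /* small hypothetical size for illustration */

/* Inner loop over i: a[i][j] is read with stride SIZE. */
void accumulate_strided(int b[SIZE], int a[SIZE][SIZE], const int x[SIZE])
{
    assert(a && b && x);
    for (int j = 0; j < SIZE; j++)
        for (int i = 0; i < SIZE; i++)
            b[i] += a[i][j] * x[j];
}

/* Loops swapped: the inner loop over j reads a[i][j] and x[j] with stride 1. */
void accumulate_unit_stride(int b[SIZE], int a[SIZE][SIZE], const int x[SIZE])
{
    assert(a && b && x);
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            b[i] += a[i][j] * x[j];
}
```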

Data dependencies: Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.

The simplest case is when data elements that are written (stored to) do not appear in any other iteration of the individual loop. In this case, all the iterations of the original loop are independent of each other, and can be executed in any order, without changing the result. The loop may be safely executed using any parallel method, including vectorization. All the examples considered so far fall into this category.

When a variable is written in one iteration and read in a subsequent iteration, there is a “read-after-write” dependency, also known as a flow dependency, as in this example:

Example: Flow Dependency
A[0]=0;
for (j=1; j<MAX; j++) A[j]=A[j-1]+1;
// this is equivalent to:
A[1]=A[0]+1; A[2]=A[1]+1; A[3]=A[2]+1; A[4]=A[3]+1;

So the value of j gets propagated to all A[j]. This cannot safely be vectorized: if the first two iterations are executed simultaneously by a SIMD instruction, the value of A[1] is used by the second iteration before it has been calculated by the first iteration.
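To make the failure concrete, the sketch below imitates a four-wide SIMD execution in plain C: it first reads all four inputs A[j-1] of a group, then writes the four results. The helper names and the group width of four are assumptions chosen for the example. Because of the flow dependency, the grouped version reads stale values and produces a different array from the serial loop.

```c
#include <assert.h>

#define WIDTH 4  /* assumed number of elements handled by one SIMD instruction */

/* Serial reference: each iteration sees the value written by the previous one. */
void fill_serial(int *A, int n)
{
    assert(n >= 1);
    A[0] = 0;
    for (int j = 1; j < n; j++)
        A[j] = A[j - 1] + 1;
}

/* Naive "vectorized" model: load WIDTH inputs first, then store WIDTH results.
   This mirrors a SIMD load/add/store sequence and is incorrect for this loop. */
void fill_grouped(int *A, int n)
{
    assert(n >= 1);
    A[0] = 0;
    int j = 1;
    for (; j + WIDTH <= n; j += WIDTH) {
        int in[WIDTH];
        for (int k = 0; k < WIDTH; k++)   /* vector load of A[j-1] .. A[j+2] */
            in[k] = A[j - 1 + k];
        for (int k = 0; k < WIDTH; k++)   /* vector add, then store to A[j] .. A[j+3] */
            A[j + k] = in[k] + 1;
    }
    for (; j < n; j++)                    /* scalar remainder */
        A[j] = A[j - 1] + 1;
}
```

Only the first element of each group is computed from an up-to-date input; the others use whatever the array held before the group started.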

When a variable is read in one iteration and written in a subsequent iteration, this is a write-after-read dependency, also known as an anti-dependency, as in the following example:

Example: Write-after-read Dependency
for (j=1; j<MAX; j++) A[j-1]=A[j]+1;
// this is equivalent to:
A[0]=A[1]+1; A[1]=A[2]+1; A[2]=A[3]+1; A[3]=A[4]+1;

This write-after-read dependency is not safe for general parallel execution, since the iteration with the write may execute before the iteration with the read. However, for vectorization, no iteration with a higher value of j can complete before an iteration with a lower value of j, and so vectorization is safe (that is, it gives the same result as non-vectorized code) in this case. The following example, however, may not be safe, since vectorization might cause some elements of A to be overwritten by the first SIMD instruction before being used for the second SIMD instruction.

Example: Unsafe Vectorization
for (j=1; j<MAX; j++) {
  A[j-1]=A[j]+1;
  B[j]=A[j]*2;
}
// the updates to A are equivalent to:
A[0]=A[1]+1; A[1]=A[2]+1; A[2]=A[3]+1; A[3]=A[4]+1;
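The hazard in this loop can be imitated in plain C by executing each statement for a whole group of four iterations before moving on to the next statement, as a naive statement-by-statement vectorization would. The function names and the group width are assumptions for this sketch; a real compiler would either decline to vectorize or arrange the loads so that the original values of A are kept.

```c
#include <assert.h>

#define WIDTH 4  /* assumed number of elements handled by one SIMD instruction */

/* Serial reference: B[j] uses A[j] before iteration j+1 overwrites it. */
void update_serial(int *A, int *B, int n)
{
    assert(n >= 1);
    for (int j = 1; j < n; j++) {
        A[j - 1] = A[j] + 1;
        B[j] = A[j] * 2;
    }
}

/* Naive grouped model: statement 1 runs for WIDTH iterations, then statement 2. */
void update_grouped(int *A, int *B, int n)
{
    assert(n >= 1);
    int j = 1;
    for (; j + WIDTH <= n; j += WIDTH) {
        int t[WIDTH];
        for (int k = 0; k < WIDTH; k++)   /* first SIMD instruction: load A[j..j+3], add 1 */
            t[k] = A[j + k] + 1;
        for (int k = 0; k < WIDTH; k++)   /* ...and store to A[j-1..j+2] */
            A[j - 1 + k] = t[k];
        for (int k = 0; k < WIDTH; k++)   /* second SIMD instruction reloads A[j..j+3]; */
            B[j + k] = A[j + k] * 2;      /* A[j..j+2] were just overwritten */
    }
    for (; j < n; j++) {                  /* scalar remainder */
        A[j - 1] = A[j] + 1;
        B[j] = A[j] * 2;
    }
}
```

In this model A ends up identical to the serial result, but most elements of B are computed from values of A that the first statement has already replaced.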

Read-after-read situations are not really dependencies, and do not prevent vectorization or parallel execution. If a variable is unwritten, it does not matter how often it is read.
Write-after-write, or ‘output’, dependencies, where the same variable is written to in more than one iteration, are in general unsafe for parallel execution, including vectorization.

There is one important exception that appears to contain all of the above types of dependency:

Example: Dependency Exception
sum=0;
for (j=1; j<MAX; j++) sum = sum + A[j]*B[j];

Although sum is both read and written in every iteration, the compiler recognizes such reduction idioms and is able to vectorize them safely. The vector-matrix multiplication loop shown earlier (b[i] += a[i][j] * x[j]) is another example of a reduction, with a loop-invariant array element in place of a scalar.
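One way to picture how a reduction can be vectorized is with one partial sum per SIMD lane, combined after the loop. The sketch below models this in plain C for integer data; the function names, the lane count of four, and the use of long for the accumulator are assumptions for illustration, and the code a compiler generates may differ. For floating-point data the regrouped additions can change rounding slightly.

```c
#include <assert.h>

#define WIDTH 4  /* assumed number of lanes in one SIMD register */

/* Serial form of the reduction, starting at j=1 as in the example above. */
long dot_serial(const int *A, const int *B, int n)
{
    assert(n >= 0);
    long sum = 0;
    for (int j = 1; j < n; j++)
        sum = sum + (long)A[j] * B[j];
    return sum;
}

/* Model of a vectorized reduction: WIDTH independent partial sums. */
long dot_partial_sums(const int *A, const int *B, int n)
{
    assert(n >= 0);
    long lane[WIDTH] = {0, 0, 0, 0};
    int j = 1;
    for (; j + WIDTH <= n; j += WIDTH)
        for (int k = 0; k < WIDTH; k++)       /* one SIMD multiply-add per group */
            lane[k] += (long)A[j + k] * B[j + k];
    long sum = lane[0] + lane[1] + lane[2] + lane[3];  /* combine the lanes */
    for (; j < n; j++)                        /* scalar remainder */
        sum += (long)A[j] * B[j];
    return sum;
}
```

Each lane's partial sum depends only on its own previous value, so the iterations within a group no longer depend on one another.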

These types of dependencies between loop iterations are sometimes known as loop-carried dependencies.

The above examples are of proven dependencies. The compiler cannot safely vectorize a loop if there is even a potential dependency. Consider the following example:

Example: Potential Dependency
for (i = 0; i < size; i++) { c[i] = a[i] * b[i]; }

In the above example, the compiler needs to determine whether, for some iteration i, c[i] might refer to the same memory location as a[i] or b[i] for a different iteration. Such memory locations are sometimes said to be aliased. For example, if a[i] pointed to the same memory location as c[i-1], there would be a read-after-write dependency as in the earlier example. If the compiler cannot exclude this possibility, it will not vectorize the loop unless you provide the compiler with hints.
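The following sketch shows how such aliasing can arise from a perfectly legal call. The function names and the test data are hypothetical. When c is passed as a + 1, every c[i] is the same object as a[i+1], so each iteration reads the value written by the one before it.

```c
#include <assert.h>

/* The loop from the example above, wrapped in a function with pointer arguments. */
void multiply(int *c, const int *a, const int *b, int size)
{
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        c[i] = a[i] * b[i];
    }
}

/* Calls multiply with an output that overlaps the input, shifted by one element. */
int overlapped_call_result(void)
{
    int buf[5] = {1, 2, 2, 2, 2};
    int two[4] = {2, 2, 2, 2};
    multiply(buf + 1, buf, two, 4);  /* c[i] aliases a[i+1] */
    return buf[4];                   /* serial execution yields 1*2*2*2*2 */
}
```

Serial execution doubles the value at each step, so the last element becomes 16. A version that loaded a[0] through a[3] in one step before storing would compute the last element from the original contents of buf instead, which is why the compiler must rule out this kind of call before vectorizing.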

Helping the Intel® C++ Compiler to Vectorize

Sometimes the Intel® C++ Compiler has insufficient information to decide to vectorize a loop. There are several ways to provide additional information to the compiler:

Pragmas:

#pragma ivdep: may be used to tell the compiler that it may safely ignore any potential data dependencies. (The compiler will not ignore proven dependencies.) Use of this pragma when there are dependencies may lead to incorrect results.

There are cases where the compiler cannot tell by a static dependency analysis that it is safe to vectorize. Consider the following loop:

Loop Example
void copy(char *cp_a, char *cp_b, int n) {
  for (int i = 0; i < n; i++) {
    cp_a[i] = cp_b[i];
  }
}

Without more information, a vectorizing compiler must conservatively assume that the memory regions accessed by the pointer variables cp_a and cp_b may (partially) overlap, which gives rise to potential data dependencies that prohibit straightforward conversion of this loop into SIMD instructions. At this point, the compiler may decide to keep the loop serial or, as done by the Intel® C++ Compiler, generate a run-time test for overlap, where the loop in the true-branch can be converted into SIMD instructions:

Example: True-branch Loop
if (cp_a + n < cp_b || cp_b + n < cp_a)
  /* vector loop */
  for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
else
  /* serial loop */
  for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];

Run-time data-dependency testing provides a generally effective way to exploit implicit parallelism in C or C++ code at the expense of a slight increase in code size and testing overhead. If the function copy is only used in specific ways, however, you can assist the vectorizing compiler as follows:

If the function is mainly used for small values of n or for overlapping memory regions, you can simply prevent vectorization and, hence, the corresponding run-time overhead by inserting a #pragma novector hint before the loop.

Conversely, if the loop is guaranteed to operate on non-overlapping memory regions, you can provide this information to the compiler by means of a #pragma ivdep hint before the loop, which informs the compiler that conservatively assumed data dependencies that prevent vectorization can be ignored. This results in vectorization of the loop without run-time data-dependency testing.

Example: Ignoring Data Dependencies with `#pragma ivdep`
void copy(char *cp_a, char *cp_b, int n) {
#pragma ivdep
  for (int i = 0; i < n; i++) {
    cp_a[i] = cp_b[i];
  }
}

Note

You can also use the restrict keyword.

#pragma loop_count (n): may be used to advise the compiler of the typical trip count of the loop. This may help the compiler to decide whether vectorization is worthwhile, or whether or not it should generate alternative code paths for the loop.
#pragma vector always: asks the compiler to vectorize the loop if it is safe to do so, whether or not the compiler thinks that will improve performance.
#pragma vector aligned: asserts that data within the following loop is aligned (to a 16-byte boundary, for Intel® SSE instruction sets).
#pragma novector: asks the compiler not to vectorize a particular loop.
#pragma vector nontemporal: gives a hint to the compiler that data will not be reused, and therefore to use streaming stores that bypass cache.
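A minimal sketch of where some of these pragmas go is shown below. Each pragma is placed on its own line directly before the loop it applies to. The function names and the trip count of 8 are hypothetical, and the trip-count pragma is written here with the loop_count spelling. Compilers that do not recognize these pragmas normally ignore them, possibly with a warning, so the functions behave identically either way.

```c
#include <assert.h>

/* Hint: this loop typically runs for about 8 iterations. */
void scale_small(float *v, int n, float s)
{
    assert(n >= 0);
#pragma loop_count(8)
    for (int i = 0; i < n; i++)
        v[i] *= s;
}

/* Request vectorization whenever it is safe, regardless of the cost estimate. */
void scale_forced(float *v, int n, float s)
{
    assert(n >= 0);
#pragma vector always
    for (int i = 0; i < n; i++)
        v[i] *= s;
}

/* Request that this particular loop stay scalar. */
void scale_serial(float *v, int n, float s)
{
    assert(n >= 0);
#pragma novector
    for (int i = 0; i < n; i++)
        v[i] *= s;
}
```

The pragmas only influence how the loop is compiled; they do not change what it computes, so the vectorization report is the place to check whether a hint had any effect.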

Keywords: The restrict keyword may be used to assert that the memory referenced by a pointer is not aliased, i.e. that it is not accessed in any other way. The keyword requires the use of either the [Q]restrict or [Q]c99 compiler options. The example under #pragma ivdep above can also be handled using the restrict keyword.

You may use the restrict keyword in the declarations of cp_a and cp_b, as shown below, to inform the compiler that each pointer variable provides exclusive access to a certain memory region. The restrict qualifier in the argument list lets the compiler know that there are no other aliases to the memory to which the pointers point. In other words, the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live. Without the restrict keyword the code may still be vectorized, but the compiler then checks for aliasing at run time; the keyword makes that check unnecessary. You may have to use an extra compiler option, such as the [Q]restrict option for the Intel® C++ Compiler.

Example: Restrict Keyword
void copy(char * __restrict cp_a, char * __restrict cp_b, int n) {
  for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
}

This method is convenient when the exclusive access property holds for pointer variables that are used in a large portion of code with many loops, because it avoids the need to annotate each of the vectorizable loops individually. Note, however, that both the loop-specific #pragma ivdep hint and the pointer variable-specific restrict hint must be used with care, because incorrect usage may change the semantics intended in the original program.

Another example is the following loop that may also not get vectorized because of a potential aliasing problem between pointers a, b and c:

Example: Potential Unsupported Loop Structure
void add(float *a, float *b, float *c) {
  for (int i=0; i<SIZE; i++) {
    c[i] += a[i] + b[i];
  }
}

If the restrict keyword is added to the parameters, the compiler trusts that you will not access the memory in question through any other pointer, and it vectorizes the code:

Example: Using Pointers with the Restrict Keyword
// let the compiler know the pointers are safe with restrict
void add(float * __restrict a, float * __restrict b, float * __restrict c) {
  for (int i=0; i<SIZE; i++) {
    c[i] += a[i] + b[i];
  }
}

The down-side of using restrict is that not all compilers support this keyword, so your source code may lose portability. If you care about source code portability you may want to consider using the [Q]ansi-alias compiler option instead. However, compiler options work globally, so you have to make sure they do not cause harm to other code fragments.
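Another way some projects limit the portability cost is to hide the qualifier behind a macro that expands to whatever spelling the current compiler accepts, or to nothing. The sketch below is one possible arrangement under stated assumptions: the macro name VEC_RESTRICT is invented here, the compiler checks are not exhaustive, and the function takes an explicit length n in place of SIZE.

```c
#include <assert.h>

/* Choose a spelling of the qualifier for the compiler in use (assumed checks). */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define VEC_RESTRICT restrict       /* standard keyword since C99 */
#elif defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER) || defined(__INTEL_COMPILER)
#  define VEC_RESTRICT __restrict     /* widely supported extension */
#else
#  define VEC_RESTRICT                /* unknown compiler: drop the hint */
#endif

/* Same computation as the add example, with the qualifier behind the macro. */
void add_portable(const float *VEC_RESTRICT a, const float *VEC_RESTRICT b,
                  float *VEC_RESTRICT c, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        c[i] += a[i] + b[i];
    }
}
```

When the macro expands to nothing the code still compiles and runs correctly; the compiler simply loses the no-alias guarantee and may fall back to a run-time check or a serial loop.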

Options/switches: You can use options to enable different levels of optimizations to achieve automatic vectorization:
- Interprocedural optimization (IPO): Enable IPO using the [Q]ip option within a single source file, or using [Q]ipo across source files. IPO may provide the compiler with additional information (trip counts, alignment, or data dependencies) about a loop. Enabling IPO may also allow inlining of function calls.
- Disambiguation of pointers and arrays: Use the options /Oa (Windows*) or -fno-alias (Linux* or OS X*) to assert there is no aliasing of memory references, that is, the same memory location is not accessed via different arrays or pointers. Other options make more limited assertions, for example, /Qalias-args- (Windows*) or -fargument-noalias (Linux* or OS X*) asserts that function arguments cannot alias each other (that is, they cannot overlap).
  The /Qansi-alias (Windows*) or -ansi-alias (Linux* or OS X*) options allow the compiler to assume strict adherence to the aliasing rules in the ISO C standard. Use these options responsibly; if you use them when memory is aliased, the result may be incorrect.
  
  Note
  
  When you specify the [Q]ansi-alias option, the ansi-alias checker is enabled by default. To disable the ansi-alias checker, you must specify -no-ansi-alias-check (Linux* and OS X*) or /Qansi-alias-check- (Windows*).
  
  Use the [Q]ansi-alias-check option to enable the ansi-alias checker. The ansi-alias checker checks the source code for potential violations of ANSI aliasing rules and disables unsafe optimizations related to the code for those statements that are identified as potential violations.
- High-level optimizations (HLO): Enable HLO with option O3. This will enable additional loop optimizations that make it easier for the compiler to vectorize the transformed loops. The HLO report, obtained using the [Q]opt-report-phase[:]loop option or the corresponding IDE selection, tells you whether some of these additional transformations occurred.
