Issue: Ineffective peeled/remainder loop(s) present

All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
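To make the terminology concrete, the following is a minimal sketch of how a trip count might split across a peeled loop, the vectorized loop body, and a remainder loop. The helper `split_trip_count`, the `loop_split` struct, and the assumption that the number of peeled iterations is known up front are all hypothetical and used only for illustration; the decomposition a compiler actually chooses may differ.

```c
/* Hypothetical helper: how n source iterations are distributed when a loop
   is vectorized with vector length vl and peel_iters scalar iterations are
   peeled off first (for example, to reach an aligned address). */
struct loop_split {
    int peeled;     /* scalar iterations executed before the vector body */
    int body;       /* source iterations executed in the vector body */
    int remainder;  /* leftover iterations executed after the body */
};

struct loop_split split_trip_count(int n, int vl, int peel_iters)
{
    struct loop_split s;
    s.peeled = peel_iters < n ? peel_iters : n;
    int rest = n - s.peeled;
    s.body = (rest / vl) * vl;      /* whole vectors only */
    s.remainder = rest - s.body;
    return s;
}
```

With n = 10, vl = 8, and three peeled iterations, no iteration reaches the vector body: all ten run in the peeled and remainder loops, which is the situation this issue describes.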

Recommendation: Specify the expected loop trip count

Confidence:

%level%

The compiler cannot statically detect the trip count. To fix: Identify the expected number of iterations using a directive: #pragma loop_count.

Example: Iterate through a loop a minimum of three, maximum of ten, and average of five times:

#include <stdio.h>
		
int mysum(int start, int end, int a) 
{
    int iret=0; 
    #pragma loop_count min(3), max(10), avg(5)
    for (int i=start;i<=end;i++)
        iret += a;
    return iret; 
} 
              
int main() 
{
    int t;
    t = mysum(1, 10, 3);
    printf("t1=%d\r\n",t);
    t = mysum(2, 6, 2);
    printf("t2=%d\r\n",t);
    t = mysum(5, 12, 1);
    printf("t3=%d\r\n",t);
}

Read More:

Recommendation: Disable unrolling

Confidence:

%level%

The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling or decrease the unroll factor using a directive: #pragma nounroll or #pragma unroll.

Example: Disable automatic loop unrolling using #pragma nounroll

void nounroll(int a[], int b[], int c[], int d[]) 
{
    #pragma nounroll
    for (int i = 1; i < 100; i++) 
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    } 
}

Read More:

Recommendation: Use a smaller vector length

Confidence:

%level%

The compiler chose a vector length, but the trip count might be smaller than that vector length. To fix: Specify a smaller vector length using a directive: #pragma simd vectorlength.

Example: Specify vector length using #pragma simd vectorlength(4)

void f(int a[], int b[], int c[], int d[]) 
{
    #pragma simd vectorlength(4)
    for (int i = 1; i < 100; i++) 
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    } 
}

Read More:

Recommendation: Align data

Confidence:

%level%

One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.

Dynamic Data:

To align dynamic data, replace malloc() and free() with _mm_malloc() and _mm_free(). To tell the compiler the data is aligned, use __assume_aligned() before the source loop. Also consider using #include <aligned_new> to enable automatic allocation of aligned data.

Static Data:

To align static data, use __declspec(align()). To tell the compiler the data is aligned, use __assume_aligned() before the source loop.

Example - Dynamic Data:

Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:

float *array;
array = (float *)_mm_malloc(ARRAY_SIZE*sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);

Example - Static Data:

Align static data using a 64-byte boundary:

__declspec(align(64)) float array[ARRAY_SIZE];

Read More:

Recommendation: Add data padding

Confidence:

%level%

The trip count is not a multiple of vector length. To fix: Do one of the following:

Increase the size of objects and add iterations so the trip count is a multiple of vector length.
Increase the size of static and automatic objects, and use a compiler option to add data padding.

Windows* OS	Linux* OS
/Qopt-assume-safe-padding	-qopt-assume-safe-padding

Note: These compiler options apply only to Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Option -qopt-assume-safe-padding is the replacement compiler option for -opt-assume-safe-padding, which is deprecated.

When you use one of these compiler options, the compiler does not add any padding for static and automatic objects. Instead, it assumes that code can access up to 64 bytes beyond the end of the object, wherever the object appears in your application. To satisfy this assumption, you must increase the size of static and automatic objects in your application.

Optional: Specify the trip count, if it is not constant, using a directive: #pragma loop_count
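The first option above (larger objects plus extra iterations) can be sketched as follows. This is an illustrative sketch only: `VLEN`, `padded_count`, and `add_one_padded` are hypothetical names, and a vector length of eight elements is an assumption rather than something the compiler guarantees.

```c
#include <stddef.h>

#define VLEN 8   /* assumed vector length in elements */

/* Round n up to the next multiple of VLEN. */
size_t padded_count(size_t n)
{
    return (n + VLEN - 1) / VLEN * VLEN;
}

/* a and b must each hold padded_count(n) elements; the extra elements are
   padding, and the caller ignores the values computed for them. */
void add_one_padded(float *a, const float *b, size_t n)
{
    size_t m = padded_count(n);
    for (size_t i = 0; i < m; i++)
        a[i] = b[i] + 1.0f;
}
```

Because the loop always runs a multiple of VLEN iterations, no remainder loop is needed, at the cost of a few wasted elements per array.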

Read More:

Recommendation: Collect trip counts data

Confidence:

%level%

The Survey Report lacks trip counts data that might generate more precise recommendations. To fix: Run a Trip Counts analysis.

Recommendation: Force vectorized remainder

Confidence:

%level%

The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: #pragma simd vecremainder or #pragma vector vecremainder.

Example: Force the compiler to vectorize the remainder loop using #pragma simd vecremainder

void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i; 
    #pragma simd vecremainder 
    for (i=0; i<n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    } 
}

Read More:

Issue: Data type conversions present

There are multiple data types within loops. Utilize hardware vectorization support more effectively by avoiding data type conversion.

Recommendation: Use the smallest data type

Confidence:

%level%

The source loop contains data types of different widths. To fix: Use the smallest data type that gives the needed precision to use the entire vector register width.

Example: If only 16 bits are needed, using a short rather than an int can make the difference between eight-way or four-way SIMD parallelism, respectively.
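A minimal sketch of that trade-off, assuming a 128-bit vector register (the width at which the eight-way versus four-way figures hold). The function names are hypothetical.

```c
#include <stdint.h>

/* 16-bit elements: a 128-bit register holds eight of them. */
void add_const_i16(int16_t *dst, const int16_t *src, int n, int16_t k)
{
    for (int i = 0; i < n; i++)
        dst[i] = (int16_t)(src[i] + k);   /* result narrowed back to 16 bits */
}

/* 32-bit elements: the same register holds only four. */
void add_const_i32(int32_t *dst, const int32_t *src, int n, int32_t k)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + k;
}
```

Keeping both source and destination at the same width also avoids the conversions that the issue text warns about.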

Issue: User function call(s) present

User-defined functions in the loop body are preventing the compiler from vectorizing the loop.

Recommendation: Enable inline expansion

Confidence:

%level%

Inlining of user-defined functions is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified or the 2 argument to enable inlining of any function at compiler discretion.

Windows* OS	Linux* OS
/Ob1 or /Ob2	-inline-level=1 or -inline-level=2
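As a sketch of what the 1 argument relies on, the helper below carries an inline keyword so that it is a candidate for inline expansion at that level. The names `clamp01` and `clamp_all` are hypothetical, and whether the call is actually inlined remains the compiler's decision.

```c
/* Marked inline: a candidate for expansion when inlining of functions
   declared inline is enabled. */
static inline float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* Once clamp01 is expanded in place, the loop body contains no call. */
void clamp_all(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = clamp01(a[i]);
}
```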

Read More:

Recommendation: Vectorize user function(s) inside loop

Confidence:

%level%

Some user-defined function(s) are not vectorized or inlined by the compiler. To fix: Do one of the following:

Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:

Target	Directive
Source loop	#pragma simd or #pragma omp simd
Inner function definition or declaration	#pragma omp declare simd

If using the Ob or inline-level compiler option to control inline expansion with the 1 argument, use an inline keyword to enable inlining or replace the 1 argument with 2 to enable inlining of any function at compiler discretion.

Example:

#pragma omp declare simd 
int f (int x) 
{
    return x+1;
}
#pragma omp simd
for (int k = 0; k < N; k++)
{
    a[k] = f(k);
}

Read More:

Issue: Serialized user function call(s) present

User-defined functions in the loop body are not vectorized.

Recommendation: Enable inline expansion

Confidence:

%level%

Inlining of user-defined functions is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified or the 2 argument to enable inlining of any function at compiler discretion.

Windows* OS	Linux* OS
`/Ob1` or `/Ob2`	`-inline-level=1` or `-inline-level=2`

Read More:

Recommendation: Vectorize serialized function(s) inside loop

Confidence:

%level%

Some user-defined function(s) are not vectorized or inlined by the compiler. To fix: Do one of the following:

Enforce vectorization of the source loop by means of SIMD instructions and/or create a SIMD version of the function(s) using a directive:

Target	Directive
Source loop	#pragma simd or #pragma omp simd
Inner function definition or declaration	#pragma omp declare simd

If using the Ob or inline-level compiler option to control inline expansion with the 1 argument, use an inline keyword to enable inlining or replace the 1 argument with 2 to enable inlining of any function at compiler discretion.

Example:

#pragma omp declare simd 
int f (int x) 
{
    return x+1;
}
#pragma omp simd
for (int k = 0; k < N; k++)
{
    a[k] = f(k);
}

Read More:

Issue: Scalar math function call(s) present

Math functions in the loop body are preventing the compiler from effectively vectorizing the loop. Improve performance by enabling vectorized math call(s).

Recommendation: Enable inline expansion

Confidence:

%level%

Inlining is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified or the 2 argument to enable inlining of any function at compiler discretion.

Windows* OS	Linux* OS
/Ob1 or /Ob2	-inline-level=1 or -inline-level=2

Alternatively, use the #include <mathimf.h> header instead of the standard #include <math.h> header to call highly optimized and accurate mathematical functions commonly used in applications that rely heavily on floating point computations.

Read More:

Recommendation: Use the Intel short vector math library for vector intrinsics

Confidence:

%level%

Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:

Use the -mveclibabi=svml compiler option to specify the Intel short vector math library ABI type for vector intrinsics.
Use the -ftree-vectorize and -funsafe-math-optimizations compiler options to enable vector math functions.
Use the -L/path/to/intel/lib and -lsvml compiler options to specify an SVML ABI-compatible library at link time.

Example:

gcc program.c -O2 -ftree-vectorize -funsafe-math-optimizations -mveclibabi=svml -L/opt/intel/lib/intel64 -lm -lsvml -Wl,-rpath=/opt/intel/lib/intel64

#include <math.h>
#include <stdio.h>
#include <stdlib.h>   // srand, rand
#define N 100000

int main()
{
   double angles[N], results[N];
   int i;
   srand(86456);

   for (i = 0; i < N; i++)
   {
      angles[i] = rand();
   }

   // the loop will be auto-vectorized
   for (i = 0; i < N; i++)
   {
      results[i] = cos(angles[i]);
   }

   return 0;
}

Read More:

Recommendation: Use a Glibc library with vectorized SVML functions

Confidence:

%level%

Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:

Upgrade the Glibc library to version 2.22 or higher. It supports SIMD directives in OpenMP* 4.0 or higher.
Upgrade the GNU* gcc compiler to version 4.9 or higher. It supports vectorized math function options.
Use the -fopenmp and -ffast-math compiler options to enable vector math functions.
Use appropriate OpenMP SIMD directives to enable vectorization.

Note: Also use the -I/path/to/glibc/install/include and -L/path/to/glibc/install/lib compiler options if you have multiple Glibc libraries installed on the host.

Example:

gcc program.c -O2 -fopenmp -ffast-math -lrt -lm -mavx2 -I/opt/glibc-2.22/include -L/opt/glibc-2.22/lib -Wl,--dynamic-linker=/opt/glibc-2.22/lib/ld-linux-x86-64.so.2

#include <math.h>
#include <stdio.h>
#include <stdlib.h>   // srand, rand
#define N 100000

int main()
{
   double angles[N], results[N];
   int i;
   srand(86456);

   for (i = 0; i < N; i++)
   {
      angles[i] = rand();
   }

   #pragma omp simd
   for (i = 0; i < N; i++)
   {
      results[i] = cos(angles[i]);
   }

   return 0;
}

Read More:

Recommendation: Vectorize math function calls inside loops

Confidence:

%level%

Your application calls serialized versions of math functions when you use the precise floating point model. To fix: Do one of the following:

Add fast-transcendentals compiler option to replace calls to transcendental functions with faster calls.

Windows* OS	Linux* OS
/Qfast-transcendentals	-fast-transcendentals

CAUTION: This may reduce floating point accuracy.

Enforce vectorization of the source loop using a directive: #pragma simd or #pragma omp simd

Example:

void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
  int i; 
  #pragma omp simd
  for (i=0; i<n; i++)
  {
    a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
  } 
}

Read More:

Recommendation: Change the floating point model

Confidence:

%level%

Your application calls serialized versions of math functions when you use the strict floating point model. To fix: Do one of the following:

Use the fast floating point model to enable more aggressive optimizations or the precise floating point model to disable optimizations that are not value-safe on fast transcendental functions.

Windows* OS	Linux* OS
/fp:fast	-fp-model fast
/fp:precise /Qfast-transcendentals	-fp-model precise -fast-transcendentals

CAUTION: This may reduce floating point accuracy.

Use the precise floating point model and enforce vectorization of the source loop using a directive: #pragma simd or #pragma omp simd

Example:

icc program.c -O2 -qopenmp -fp-model precise -fast-transcendentals

// Two independent loops, each vectorized with its own directive
#pragma omp simd
for (i = 0; i < N; i++)
{
  a[i] = b[i] * c[i];
}
#pragma omp simd
for (i = 0; i < N; i++)
{
  d[i] = e[i] * f[i];
}

Read More:

Issue: System function call(s) present

System function call(s) in the loop body are preventing the compiler from vectorizing the loop.

Recommendation: Remove system function call(s) inside loop

Confidence:

%level%

Typically system function or subroutine calls cannot be vectorized; even a print statement is sufficient to prevent vectorization. To fix: Avoid using system function calls in loops.
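One way to follow this advice, sketched under the assumption that the output does not need to be interleaved with the computation: do the arithmetic in one loop and the printing in another. The function names are hypothetical.

```c
#include <stdio.h>

/* Compute first, with no calls in the loop body. */
void square_all(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Print afterwards, in a separate loop that does not need to vectorize. */
void print_all(const float *v, int n)
{
    for (int i = 0; i < n; i++)
        printf("%d: %f\n", i, v[i]);
}
```

Only the second loop contains the system call, so the compute loop remains a candidate for vectorization.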

Issue: OpenMP function call(s) present

OpenMP* function call(s) in the loop body are preventing the compiler from effectively vectorizing the loop.

Recommendation: Move OpenMP call(s) outside the loop body

Confidence:

%level%

OpenMP calls prevent automatic vectorization when the compiler cannot move the calls outside the loop body, such as when OpenMP calls are not invariant. To fix:

Split the OpenMP parallel loop section into two using directives.

Target	Directive
Outer section	#pragma omp parallel
Inner section	#pragma omp for nowait

Move the OpenMP calls outside the loop when possible.

Example:

Original code:

#pragma omp parallel for private(tid, nthreads)
for (int k = 0; k < N; k++)
{
   tid = omp_get_thread_num(); // this call inside loop prevents vectorization
   nthreads = omp_get_num_threads(); // this call inside loop prevents vectorization
   ...
}

Revised code:

#pragma omp parallel private(tid, nthreads)
{
   // Move OpenMP calls here
   tid = omp_get_thread_num();
   nthreads = omp_get_num_threads();
   
   #pragma omp for nowait
   for (int k = 0; k < N; k++)
   {
      ...
   }
}

Read More:

Recommendation: Remove OpenMP lock functions

Confidence:

%level%

Locking objects slows loop execution. To fix: Rewrite the code without OpenMP lock functions.

Example:

Allocating separate arrays for each thread and then merging them after a parallel section may improve speed (but consume more memory).

Original code:

int A[n];
list<int> L;
...
omp_lock_t lock_obj;
omp_init_lock(&lock_obj);
#pragma omp parallel for shared(L, A, lock_obj) default(none)
for (int i = 0; i < n; ++i)
{
   // A[i] calculation
   ...
   if (A[i]<1.0)
   {
      omp_set_lock(&(lock_obj));
      L.insert(L.begin(), A[i]);
      omp_unset_lock(&(lock_obj));
   }
}
omp_destroy_lock(&lock_obj);

Revised code:

int A[n];
list<int> L;
omp_set_num_threads(nthreads_all);
...
vector<list<int>> L_by_thread(nthreads_all); // separate list for each thread
#pragma omp parallel shared(L, L_by_thread, A) default(none)
{
   int k = omp_get_thread_num();
   #pragma omp for nowait
   for (int i = 0; i < n; ++i)
   {
      // A[i] calculation
      ...
      if (A[i]<1.0)
      {
         L_by_thread[k].insert(L_by_thread[k].begin(), A[i]);
      }
   }
}

// merge data into single list
for (int k = 0; k < L_by_thread.size(); k++)
{
  L.splice(L.end(), L_by_thread[k]);
}

Read More:

Calling Functions on the CPU to Modify the Coprocessor's Execution Environment, Lock Routines section in OpenMP Run-time Library Routines, omp for, omp parallel sections
Getting Started with Intel Compiler Pragmas and Directives and Vectorization Resources for Intel® Advisor XE Users

Issue: Indirect function call(s) present

Indirect function call(s) in the loop body are preventing the compiler from vectorizing the loop. Indirect calls, sometimes called indirect jumps, get the callee address from a register or memory; direct calls encode the callee address in the call instruction itself. Even if you force loop vectorization, indirect calls remain serialized.

Recommendation: Remove indirect call(s) inside loop

Confidence:

%level%

Indirect function or subroutine calls cannot be vectorized. To fix: Avoid using indirect calls in loops.
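A common way to avoid the indirect call is to select the operation once, outside the loop, so that each loop body contains only a direct call. The sketch below assumes the set of possible callees is small and known in advance; `op_kind`, `apply_indirect`, and `apply_direct` are hypothetical names.

```c
typedef enum { OP_ADD_ONE, OP_DOUBLE } op_kind;   /* hypothetical operation tag */

static inline float add_one(float x)   { return x + 1.0f; }
static inline float double_it(float x) { return x * 2.0f; }

/* Before: the callee is read from a pointer on every iteration. */
void apply_indirect(float *a, int n, float (*fn)(float))
{
    for (int i = 0; i < n; i++)
        a[i] = fn(a[i]);
}

/* After: choose once outside the loop; each loop contains a direct call. */
void apply_direct(float *a, int n, op_kind op)
{
    switch (op) {
    case OP_ADD_ONE:
        for (int i = 0; i < n; i++)
            a[i] = add_one(a[i]);
        break;
    case OP_DOUBLE:
        for (int i = 0; i < n; i++)
            a[i] = double_it(a[i]);
        break;
    }
}
```

The direct calls can then be inlined, leaving loop bodies with no calls at all.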

Recommendation: Improve branch prediction

Confidence:

%level%

For 64-bit applications, branch prediction performance can be negatively impacted when the branch target is more than 4 GB away from the branch. This is more likely to happen when the application is split into shared libraries. To fix: Do the following:

Upgrade the Glibc library to version 2.23 or higher.
Set the environment variable: export LD_PREFER_MAP_32BIT_EXEC=1

Read More:

Issue: Assumed dependency present

The compiler assumed there is an anti-dependency (Write after read - WAR) or a true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.

Recommendation: Confirm dependency is real

Confidence:

%level%

There is no confirmation that a real (proven) dependency is present in the loop. To confirm: Run a Dependencies analysis.

Recommendation: Resolve dependency

Confidence:

%level%

The Dependencies analysis shows there is a real (proven) dependency in the loop. To fix: Do one of the following:

If there is an anti-dependency, enable vectorization using the directive #pragma omp simd safelen(length), where length is smaller than the distance between dependent iterations in the anti-dependency. For example:
```
#pragma omp simd safelen(4)
for (i = 0; i < n - 4; i += 4)
{
    a[i + 4] = a[i] * c;
}
```
If there is a reduction pattern dependency in the loop, enable vectorization using the directive #pragma omp simd reduction(operator:list). For example:
```
#pragma omp simd reduction(+:sumx)
for (k = 0;k < size2; k++) 
{
    sumx += x[k]*b[k];
}
```
Rewrite the code to remove the dependency. Use programming techniques such as variable privatization.
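Variable privatization, mentioned in the last option, can be sketched as follows. The function names are hypothetical, and this is only an illustration: compilers often privatize a simple scalar like this on their own, so the technique tends to matter more for larger temporaries.

```c
/* Before: t is declared outside the loop, so every iteration appears to
   read and write the same storage. */
void scale_shared_temp(float *out, const float *in, int n)
{
    float t;
    for (int i = 0; i < n; i++) {
        t = in[i] * 2.0f;
        out[i] = t + 1.0f;
    }
}

/* After: t is private to each iteration. */
void scale_private_temp(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++) {
        float t = in[i] * 2.0f;
        out[i] = t + 1.0f;
    }
}
```

Both versions compute the same result; the second makes it explicit that no value of t is carried from one iteration to the next.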

Read More:

Recommendation: Enable vectorization

Confidence:

%level%

The Dependencies analysis shows there is no real dependency in the loop for the given workload. Tell the compiler it is safe to vectorize using the restrict keyword or a directive:

Directive	Outcome
#pragma simd or #pragma omp simd	Ignores all dependencies in the loop
#pragma ivdep	Ignores only assumed vector dependencies (which is safest)

Example:

#pragma ivdep
for (i = 0; i < n - 4; i += 4)
{
    a[i + 4] = a[i] * c;
}

Read More:

Issue: Vector register spilling possible

Possible register spilling was detected and all vector registers are in use. This may negatively impact performance, because the spilled variable must be loaded to and unloaded from main memory. Improve performance by decreasing vector register pressure.

Recommendation: Decrease unroll factor

Confidence:

%level%

The current directive unroll factor increases vector register pressure. To fix: Decrease unroll factor using a directive: #pragma nounroll or #pragma unroll.

Example:

void nounroll(int a[], int b[], int c[], int d[]) 
{
    #pragma nounroll
    for (int i = 1; i < 100; i++) 
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    } 
}

Read More:

Recommendation: Split loop into smaller loops

Confidence:

%level%

Possible register spilling along with high vector register pressure is preventing effective vectorization. To fix: Use the directive #pragma distribute_point or rewrite your code to distribute the source loop. This can decrease register pressure as well as enable software pipelining and improve both instruction and data cache use.

Example:

#define NUM 1024 
void loop_distribution_pragma2(
       double a[NUM], double b[NUM], double c[NUM],
       double x[NUM], double y[NUM], double z[NUM] ) 
{
   int i;
   // After distribution or splitting the loop.
   for (i=0; i< NUM; i++) 
   {
      a[i] = a[i] +i;
      b[i] = b[i] +i;
      c[i] = c[i] +i;
      #pragma distribute_point
      x[i] = x[i] +i;
      y[i] = y[i] +i;
      z[i] = z[i] +i;
   } 
}
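The alternative mentioned above, rewriting the code to distribute the loop by hand, might look like the following sketch. The function name `loop_distribution_manual` is hypothetical, and the split is valid here only because the two halves of the body do not depend on each other.

```c
#define NUM 1024

void loop_distribution_manual(
       double a[NUM], double b[NUM], double c[NUM],
       double x[NUM], double y[NUM], double z[NUM])
{
   int i;
   /* First half of the original body. */
   for (i = 0; i < NUM; i++)
   {
      a[i] = a[i] + i;
      b[i] = b[i] + i;
      c[i] = c[i] + i;
   }
   /* Second half, now a separate loop with its own register budget. */
   for (i = 0; i < NUM; i++)
   {
      x[i] = x[i] + i;
      y[i] = y[i] + i;
      z[i] = z[i] + i;
   }
}
```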

Read More:

Issue: Possible inefficient memory access patterns present

Inefficient memory access patterns may result in significant vector code execution slowdown or block automatic vectorization by the compiler. Improve performance by investigating.

Recommendation: Confirm inefficient memory access patterns

Confidence:

%level%

There is no confirmation that inefficient memory access patterns are present. To confirm: Run a Memory Access Patterns analysis.

Issue: Inefficient memory access patterns present

There is a high percentage of memory instructions with irregular (variable or random) stride accesses. Improve performance by investigating and handling accordingly.

Recommendation: Use SoA instead of AoS

Confidence:

%level%

An array is the most common type of data structure containing a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can hinder effective vector processing. To fix: Rewrite code to organize data using SoA instead of AoS.
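The difference between the two layouts can be sketched as follows. The type and function names and the size `NPOINTS` are hypothetical, chosen only to show the resulting access strides.

```c
#define NPOINTS 1024   /* hypothetical array size */

/* AoS: x, y, z of one point are adjacent; x values are 3 floats apart. */
struct point_aos { float x, y, z; };

/* SoA: all x values are contiguous, so x[i] is a unit-stride access. */
struct points_soa {
    float x[NPOINTS];
    float y[NPOINTS];
    float z[NPOINTS];
};

float sum_x_aos(const struct point_aos *p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].x;          /* stride of sizeof(struct point_aos) bytes */
    return s;
}

float sum_x_soa(const struct points_soa *p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p->x[i];         /* stride of sizeof(float) bytes */
    return s;
}
```

In the SoA version, consecutive iterations load adjacent floats, which maps directly onto contiguous vector loads.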

Read More:

Recommendation: Use Intel SDLT

Confidence:

%level%

The cost of rewriting code to organize data using SoA instead of AoS may outweigh the benefit. To fix: Use Intel SIMD Data Layout Templates (Intel SDLT), introduced in version 16.1 of the Intel compiler, to mitigate the cost. Intel SDLT is a C++11 template library that may reduce code rewrites to just a few lines.

Read More:

Recommendation: Reorder loops

Confidence:

%level%

This loop has less efficient memory access patterns than a nearby outer loop. To fix: Run a Memory Access Patterns analysis on the outer loop. If the memory access patterns are more efficient for the outer loop, reorder the loops if possible.
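Reordering is sketched below for a row-major C array, assuming the loop nest has no dependency that forbids the interchange. The function names and the sizes `ROWS` and `COLS` are hypothetical.

```c
#define ROWS 256
#define COLS 256

/* Before: the inner loop walks down a column, so consecutive iterations
   are COLS elements apart in memory (C arrays are row-major). */
void scale_column_inner(float m[ROWS][COLS], float k)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            m[i][j] *= k;
}

/* After: the loops are interchanged, so the inner loop has unit stride. */
void scale_row_inner(float m[ROWS][COLS], float k)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] *= k;
}
```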

Issue: Potential underutilization of FMA instructions

Your current hardware supports the AVX2 instruction set architecture (ISA), which enables the use of fused multiply-add (FMA) instructions. Improve performance by utilizing FMA instructions.

Recommendation: Target the AVX2 ISA

Confidence:

%level%

Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, no AVX2-specific code executed for this loop. To fix: Use the xCORE-AVX2 compiler option to generate AVX2-specific code, or the axCORE-AVX2 compiler option to enable multiple, feature-specific, auto-dispatch code generation, including AVX2.

Windows* OS	Linux* OS
/QxCORE-AVX2 or /QaxCORE-AVX2	-xCORE-AVX2 or -axCORE-AVX2

Read More:

ax, Qax; x, Qx
Code Generation Options in the Intel® C++ Compiler 16.0 User and Reference Guide
Compiling for the Intel® Xeon Phi™ processor x200 and the Intel® AVX-512 ISA and Vectorization Resources for Intel® Advisor Users

Recommendation: Target a specific ISA instead of using the xHost option

Confidence:

%level%

Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, no AVX2-specific code executed for this loop. To fix: Instead of using the xHost compiler option, which limits optimization opportunities by the host ISA, use the axCORE-AVX2 compiler option to compile for machines with and without AVX2 support, or the xCORE-AVX2 compiler option to compile for machines with AVX2 support only.

Windows* OS	Linux* OS
/QxCORE-AVX2 or /QaxCORE-AVX2	-xCORE-AVX2 or -axCORE-AVX2

Read More:

ax, Qax; x, Qx
Code Generation Options in the Intel® C++ Compiler 16.0 User and Reference Guide
Compiling for the Intel® Xeon Phi™ processor x200 and the Intel® AVX-512 ISA and Vectorization Resources for Intel® Advisor Users

Recommendation: Explicitly enable FMA generation when using the strict floating-point model

Confidence:

%level%

Static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, but the strict floating-point model disables FMA instruction generation by default. To fix: Override this behavior using the fma compiler option.

Windows* OS	Linux* OS
/Qfma	-fma

Read More:

fma, Qfma
Floating-Point Operations and Code Generation Options in the Intel® C++ Compiler 16.0 User and Reference Guide
Vectorization Resources for Intel® Advisor Users

Recommendation: Force vectorization if possible

Confidence:

%level%

The loop contains FMA instructions (so vectorization could be beneficial) but is not vectorized. To fix: Review corresponding compiler diagnostics to check if vectorization enforcement is possible and profitable.

Read More:

Vectorization Resources for Intel® Advisor Users

Issue: Inefficient processing of SIMD-enabled functions possible

Vector declaration defaults for your SIMD-enabled functions may result in extra computations or ineffective memory access patterns. Improve performance by overriding defaults.

Recommendation: Target a specific processor type

Confidence:

%level%

The default instruction set architecture (ISA) for SIMD-enabled functions is inefficient for your host processor because it could result in extra memory operations between registers. To fix: Do one of the following to add a processor clause to your vector declaration.

Add processor(cpuid) to your #pragma omp declare simd directive.
For Windows* OS: Add processor(cpuid) to your __declspec(vector()) declaration.
For Linux* OS: Add processor(cpuid) to your __attribute__((vector())) declaration.

Read More: