Issue: Ineffective peeled/remainder loop(s) present

All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.

Recommendation: Specify the expected loop trip count

The compiler cannot statically detect the trip count. To fix: Identify the expected number of iterations using a directive: #pragma loop_count.

Example: Iterate through a loop a minimum of three, maximum of ten, and average of five times:

#include <stdio.h>
		
int mysum(int start, int end, int a) 
{
    int iret=0; 
    #pragma loop_count min(3), max(10), avg(5)
    for (int i=start;i<=end;i++)
        iret += a;
    return iret; 
} 
              
int main() 
{
    int t;
    t = mysum(1, 10, 3);
    printf("t1=%d\r\n",t);
    t = mysum(2, 6, 2);
    printf("t2=%d\r\n",t);
    t = mysum(5, 12, 1);
    printf("t3=%d\r\n",t);
}

Read More:

Getting Started with Intel Compiler Pragmas and Directives and Vectorization Resources for Intel® Advisor Users

Recommendation: Disable unrolling

The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling using a directive, #pragma nounroll, or decrease the unroll factor using a directive, #pragma unroll(n).

Example: Disable automatic loop unrolling using #pragma nounroll

void nounroll(int a[], int b[], int c[], int d[]) 
{
    #pragma nounroll
    for (int i = 1; i < 100; i++) 
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    } 
}
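
The other half of this recommendation, decreasing the unroll factor rather than disabling unrolling, might look like the sketch below. The function name scale_pairs, the extra parameter n, and the factor of 2 are illustrative assumptions, not values taken from the Advisor report. Compilers that do not recognize the directive typically ignore it, so the loop still computes the same result.

```c
/* Sketch only: cap the unroll factor at 2 instead of disabling unrolling.
   scale_pairs and the factor 2 are hypothetical choices for illustration. */
void scale_pairs(const int a[], int b[], const int c[], int d[], int n)
{
    #pragma unroll(2)
    for (int i = 0; i < n; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}
```

Whether a smaller factor helps depends on the trip count and the vector length, so it is worth re-running the analysis after the change.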

Read More:

Getting Started with Intel Compiler Pragmas and Directives and Vectorization Resources for Intel® Advisor Users

Recommendation: Use a smaller vector length

The compiler chose a vector length, but the trip count might be smaller than that vector length. To fix: Specify a smaller vector length using a directive: #pragma simd vectorlength.

Example: Specify vector length using #pragma simd vectorlength(4)

void f(int a[], int b[], int c[], int d[]) 
{
    #pragma simd vectorlength(4)
    for (int i = 1; i < 100; i++) 
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    } 
}

Read More:

Getting Started with Intel Compiler Pragmas and Directives and Vectorization Resources for Intel® Advisor Users

Recommendation: Align data

One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.

Dynamic Data:

To align dynamic data, replace malloc() and free() with _mm_malloc() and _mm_free(). To tell the compiler the data is aligned, use __assume_aligned() before the source loop. Also consider using #include <aligned_new> to enable automatic allocation of aligned data.

Static Data:

To align static data, use __declspec(align()). To tell the compiler the data is aligned, use __assume_aligned() before the source loop.

Example - Dynamic Data:

Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:

float *array;
array = (float *)_mm_malloc(ARRAY_SIZE*sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);
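
Because __assume_aligned() is a promise to the compiler rather than a check, it can be useful to verify the address before relying on it. The following is a minimal sketch under stated assumptions: it uses the C11 function aligned_alloc() as a portable stand-in for _mm_malloc(), and the helper names is_aligned and alloc_floats_64 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p starts on a multiple of boundary bytes, 0 otherwise. */
int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}

/* Hypothetical helper: allocate room for n floats on a 64-byte boundary.
   aligned_alloc (C11) stands in for _mm_malloc here; release with free(). */
float *alloc_floats_64(size_t n)
{
    size_t bytes = n * sizeof(float);
    /* Round the size up to a multiple of 64, as C11 aligned_alloc expects. */
    size_t padded = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, padded);
}
```

A caller could assert is_aligned(array, 64) in debug builds just before the loop that carries the alignment assumption.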

Example - Static Data:

Align static data using a 64-byte boundary:

__declspec(align(64)) float array[ARRAY_SIZE];

Read More:

Data Alignment to Assist Vectorization and Vectorization Resources for Intel® Advisor Users

Recommendation: Add data padding

The trip count is not a multiple of vector length. To fix: Do one of the following:

Increase the size of objects and add iterations so the trip count is a multiple of vector length.
Increase the size of static and automatic objects, and use a compiler option to add data padding.

Windows OS	Linux OS
/Qopt-assume-safe-padding	-qopt-assume-safe-padding

Note: These compiler options apply only to Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Option -qopt-assume-safe-padding is the replacement compiler option for -opt-assume-safe-padding, which is deprecated.

When you use one of these compiler options, the compiler does not add any padding for static and automatic objects. Instead, it assumes that code can access up to 64 bytes beyond the end of the object, wherever the object appears in your application. To satisfy this assumption, you must increase the size of static and automatic objects in your application.
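
The arithmetic behind the first option (growing an object so the trip count becomes a multiple of the vector length) can be sketched as follows. The function names are hypothetical, the vector length is expressed in elements, and the calculation ignores any peeled iterations, so treat it as a rough estimate rather than what the compiler will actually generate.

```c
#include <stddef.h>

/* Round the trip count n up to the next multiple of the vector length vl.
   Assumes vl > 0. */
size_t padded_trip_count(size_t n, size_t vl)
{
    return (n + vl - 1) / vl * vl;
}

/* Iterations that do not fill a whole vector and would fall to a
   remainder loop, ignoring peeling. Assumes vl > 0. */
size_t remainder_iterations(size_t n, size_t vl)
{
    return n % vl;
}
```

For example, with 100 iterations and a vector length of 8 elements, 4 iterations are left over, and padding the arrays and the loop bound to 104 removes the remainder. The extra elements must hold values that are safe to process.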

Optional: Specify the trip count, if it is not constant, using a directive: #pragma loop_count

Read More:

Utilizing Full Vectors and Use of Option -qopt-assume-safe-padding, Getting Started with Intel Compiler Pragmas and Directives, and Vectorization Resources for Intel® Advisor Users

Recommendation: Collect trip counts data

The Survey Report lacks trip counts data that might generate more precise recommendations. To fix: Run a Trip Counts analysis.

Recommendation: Force vectorized remainder

The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: #pragma simd vecremainder or #pragma vector vecremainder.

Example: Force the compiler to vectorize the remainder loop using #pragma simd vecremainder

void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i; 
    #pragma simd vecremainder 
    for (i=0; i<n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    } 
}

Read More:

Getting Started with Intel Compiler Pragmas and Directives and Vectorization Resources for Intel® Advisor Users