All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
Recommendation: Specify the expected loop trip count | Confidence: | %level% |
The compiler cannot statically detect the trip count. To fix: Identify the expected number of iterations using a directive: !DIR$ LOOP COUNT.
Example: Iterate through a loop a minimum of three, maximum of ten, and average of five times:

    !DIR$ LOOP COUNT MIN(3), MAX(10), AVG(5)
    do i = 1, m
        b(i) = a(i) + 1
        d(i) = c(i) + 1
    enddo
Recommendation: Disable unrolling | Confidence: | %level% |
The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling or decrease the unroll factor using a directive: !DIR$ NOUNROLL or !DIR$ UNROLL.
Example: Disable automatic loop unrolling using !DIR$ NOUNROLL

    !DIR$ NOUNROLL
    do i = 1, m
        b(i) = a(i) + 1
        d(i) = c(i) + 1
    enddo
Recommendation: Use a smaller vector length | Confidence: | %level% |
The compiler chose a vector length, but the trip count might be smaller than that vector length. To fix: Specify a smaller vector length using a directive: !DIR$ SIMD VECTORLENGTH.
Example: Specify vector length using !DIR$ SIMD VECTORLENGTH(4)

    !DIR$ SIMD VECTORLENGTH(4)
    do i = 1, m
        b(i) = a(i) + 1
        d(i) = c(i) + 1
    enddo
Recommendation: Align data | Confidence: | %level% |
One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned. To align data, use the !DIR$ ATTRIBUTES ALIGN directive (the Fortran counterpart of __declspec(align()) in C/C++). To tell the compiler the data is aligned, use the !DIR$ ASSUME_ALIGNED directive (the counterpart of __assume_aligned()) before the source loop.
Recommendation: Add data padding | Confidence: | %level% |
The trip count is not a multiple of vector length. To fix: Do one of the following:
Note: These compiler options apply only to Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Option -qopt-assume-safe-padding is the replacement compiler option for -opt-assume-safe-padding, which is deprecated.
When you use one of these compiler options, the compiler does not add any padding for static and automatic objects. Instead, it assumes that code can access up to 64 bytes beyond the end of the object, wherever the object appears in your application. To satisfy this assumption, you must increase the size of static and automatic objects in your application.
Optional: Specify the trip count, if it is not constant, using a directive: !DIR$ LOOP COUNT
Recommendation: Collect trip counts data | Confidence: | %level% |
The Survey Report lacks trip counts data that might generate more precise recommendations. To fix: Run a Trip Counts analysis.
Recommendation: Force vectorized remainder | Confidence: | %level% |
The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: !DIR$ SIMD VECREMAINDER or !DIR$ VECTOR VECREMAINDER.
Example: Force the compiler to vectorize the remainder loop using !DIR$ SIMD VECREMAINDER

    subroutine add(A, N, X)
        integer N, X
        real A(N)
        !DIR$ SIMD VECREMAINDER
        do i=x+1, n
            a(i) = a(i) + a(i-x)
        enddo
    end
There are multiple data types within loops. Utilize hardware vectorization support more effectively by avoiding data type conversion.
Recommendation: Use the smallest data type | Confidence: | %level% |
The source loop contains data types of different widths. To fix: Use the smallest data type that gives the needed precision to use the entire vector register width.
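Example: If only 16 bits are needed, using a 16-bit integer (INTEGER(KIND=2) in Intel Fortran) rather than a 32-bit integer (INTEGER(KIND=4)) can make the difference between eight-way and four-way SIMD parallelism, respectively, in a 128-bit vector register.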
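To make the lane arithmetic concrete, the following sketch computes how many elements of a given width fit in one vector register and shows the same element-wise add written for 16-bit and 32-bit integers. The module name `narrow_types` and the routines `lanes_per_register`, `add_int16`, and `add_int32` are hypothetical names chosen for illustration, and the 128-bit register width in the usage note is an assumption about the target hardware.

```fortran
module narrow_types
  use iso_fortran_env, only: int16, int32
  implicit none
contains
  ! Number of elements of element_bits bits that fit in a register
  ! of register_bits bits (register_bits is an assumed hardware width).
  pure integer function lanes_per_register(element_bits, register_bits) result(lanes)
    integer, intent(in) :: element_bits, register_bits
    lanes = register_bits / element_bits
  end function lanes_per_register

  ! Element-wise add on 16-bit integers.
  subroutine add_int16(a, b, c, n)
    integer, intent(in) :: n
    integer(int16), intent(in)  :: a(n), b(n)
    integer(int16), intent(out) :: c(n)
    integer :: i
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine add_int16

  ! The same loop on 32-bit integers uses half as many lanes per register.
  subroutine add_int32(a, b, c, n)
    integer, intent(in) :: n
    integer(int32), intent(in)  :: a(n), b(n)
    integer(int32), intent(out) :: c(n)
    integer :: i
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine add_int32
end module narrow_types
```

With a 128-bit register, `lanes_per_register(storage_size(0_int16), 128)` evaluates to 8 and `lanes_per_register(storage_size(0_int32), 128)` evaluates to 4, which matches the eight-way versus four-way figures above.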
User-defined functions in the loop body are preventing the compiler from vectorizing the loop.
Recommendation: Enable inline expansion | Confidence: | %level% |
Inlining of user-defined functions is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified or the 2 argument to enable inlining of any function at compiler discretion.
Recommendation: Vectorize user function(s) inside loop | Confidence: | %level% |
Some user-defined function(s) are not vectorized or inlined by the compiler. To fix: Do one of the following:
Example:

    real function f (x)
        !$OMP DECLARE SIMD(f)
        real, intent(in), value :: x
        f = x + 1
    end function f

    !$OMP SIMD
    do k = 1, N
        a(k) = f(real(k))   ! convert k to match the real dummy argument
    enddo
Recommendation: Convert to Fortran SIMD-enabled functions | Confidence: | %level% |
Passing an array/array section to an ELEMENTAL function/subroutine is creating a dependency that prevents vectorization. To fix:
Example:

Original code:

    elemental subroutine callee(t,q,r)
        real, intent(in) :: t, q
        real, intent(out) :: r
        r = t + q
    end subroutine callee
    ...
    do k = 1,nlev
        call callee(a(:,k), b(:,k), c(:,k))
    end do
    ...

Revised code:

    subroutine callee(t,q,r)
        !$OMP DECLARE SIMD(callee)
        real, intent(in) :: t, q
        real, intent(out) :: r
        r = t + q
    end subroutine callee
    ...
    do k = 1,nlev
        !$OMP SIMD
        do i = 1,n
            call callee(a(i,k), b(i,k), c(i,k))
        end do
    end do
    ...
User-defined functions in the loop body are not vectorized.
Recommendation: Enable inline expansion | Confidence: | %level% |
Inlining of user-defined functions is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified or the 2 argument to enable inlining of any function at compiler discretion.
Recommendation: Vectorize serialized function(s) inside loop | Confidence: | %level% |
Some user-defined function(s) are not vectorized or inlined by the compiler. To fix: Do one of the following:
Example:

    real function f (x)
        !$OMP DECLARE SIMD(f)
        real, intent(in), value :: x
        f = x + 1
    end function f

    !$OMP SIMD
    do k = 1, N
        a(k) = f(real(k))   ! convert k to match the real dummy argument
    enddo
Math functions in the loop body are preventing the compiler from effectively vectorizing the loop. Improve performance by enabling vectorized math call(s).
Recommendation: Enable inline expansion | Confidence: | %level% |
Inlining is disabled by compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with the 1 argument to enable inlining when an inline keyword or attribute is specified or the 2 argument to enable inlining of any function at compiler discretion.
Recommendation: Use the Intel short vector math library for vector intrinsics | Confidence: | %level% |
Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:
Example:
gfortran PROGRAM.FOR -O2 -ftree-vectorize -funsafe-math-optimizations -mveclibabi=svml -L/opt/intel/lib/intel64 -lm -lsvml -Wl,-rpath=/opt/intel/lib/intel64

    program main
        parameter (N=100000000)
        real*8 angles(N), results(N)
        integer i
        call srand(86456)
        do i=1,N
            angles(i) = rand()
        enddo
        ! the loop will be auto-vectorized
        do i=1,N
            results(i) = cos(angles(i))
        enddo
    end
Recommendation: Use a Glibc library with vectorized SVML functions | Confidence: | %level% |
Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:
Note: Also use the -I/path/to/glibc/install/include and -L/path/to/glibc/install/lib compiler options if you have multiple Glibc libraries installed on the host.
Example:
gfortran PROGRAM.FOR -O2 -fopenmp -ffast-math -lrt -lm -mavx2

    program main
        parameter (N=100000000)
        real*8 angles(N), results(N)
        integer i
        call srand(86456)
        do i=1,N
            angles(i) = rand()
        enddo
        !$OMP SIMD
        do i=1,N
            results(i) = cos(angles(i))
        enddo
    end
Recommendation: Vectorize math function calls inside loops | Confidence: | %level% |
Your application calls serialized versions of math functions when you use the precise floating point model. To fix: Do one of the following:
CAUTION: This may reduce floating point accuracy.
Example:

    subroutine add(A, N, X)
        integer N, X
        real A(N)
        !$OMP SIMD
        do i=x+1, n
            a(i) = a(i) + a(i-x)
        enddo
    end
Recommendation: Change the floating point model | Confidence: | %level% |
Your application calls serialized versions of math functions when you use the strict floating point model. To fix: Do one of the following:
Windows* OS | Linux* OS |
---|---|
/fp:fast | -fp-model fast |
/fp:precise /Qfast-transcendentals | -fp-model precise -fast-transcendentals |
CAUTION: This may reduce floating point accuracy.
Example:
ifort program.for -O2 -qopenmp -fp-model precise -fast-transcendentals

    ! COLLAPSE requires perfectly nested loops, so each loop gets its own SIMD directive
    !$OMP SIMD
    do i = 1, N
        a(i) = b(i) * c(i)
    enddo
    !$OMP SIMD
    do j = 1, N
        d(j) = e(j) * f(j)
    enddo
System function call(s) in the loop body are preventing the compiler from vectorizing the loop.
Recommendation: Remove system function call(s) inside loop | Confidence: | %level% |
Typically system function or subroutine calls cannot be vectorized; even a print statement is sufficient to prevent vectorization. To fix: Avoid using system function calls in loops.
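As a minimal sketch of this fix, the module below contrasts a loop that prints on every iteration with a version that keeps the compute loop free of calls and reports afterwards. The module name `io_hoisting` and both routine names are hypothetical, and whether the second compute loop is actually vectorized depends on the compiler and its options.

```fortran
module io_hoisting
  implicit none
contains
  ! Before: the print statement in the loop body is a system call in the hot loop.
  subroutine scale_with_print(a, b, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n)
    real, intent(out)   :: b(n)
    integer :: i
    do i = 1, n
      b(i) = 2.0 * a(i)
      print *, 'b(', i, ') = ', b(i)
    end do
  end subroutine scale_with_print

  ! After: the compute loop contains only arithmetic; reporting runs in its own loop.
  subroutine scale_then_print(a, b, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n)
    real, intent(out)   :: b(n)
    integer :: i
    do i = 1, n
      b(i) = 2.0 * a(i)
    end do
    do i = 1, n
      print *, 'b(', i, ') = ', b(i)
    end do
  end subroutine scale_then_print
end module io_hoisting
```

Both routines produce the same values in `b` and print the same lines; only the placement of the I/O differs.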
OpenMP* function call(s) in the loop body are preventing the compiler from effectively vectorizing the loop.
Recommendation: Move OpenMP call(s) outside the loop body | Confidence: | %level% |
OpenMP calls prevent automatic vectorization when the compiler cannot move the calls outside the loop body, such as when OpenMP calls are not invariant. To fix:
Example:

Original code:

    !$OMP PARALLEL DO PRIVATE(tid, nthreads)
    do k = 1, N
        tid = omp_get_thread_num()        ! this call inside loop prevents vectorization
        nthreads = omp_get_num_threads()  ! this call inside loop prevents vectorization
        ...
    enddo

Revised code:

    !$OMP PARALLEL PRIVATE(tid, nthreads)
    ! Move OpenMP calls here
    tid = omp_get_thread_num()
    nthreads = omp_get_num_threads()
    !$OMP DO
    do k = 1, N
        ...
    enddo
    !$OMP END DO NOWAIT
    !$OMP END PARALLEL
Recommendation: Remove OpenMP lock functions | Confidence: | %level% |
Locking objects slows loop execution. To fix: Rewrite the code without OpenMP lock functions. For example, allocating separate arrays for each thread and then merging them after a parallel section may improve speed (but consume more memory).
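One way to get per-thread arrays that are merged at the end, without writing the merge by hand, is an OpenMP array reduction. The sketch below counts key occurrences that way; the module name `lock_free_histogram`, the routine `histogram`, and the assumption that every key lies in 1..nbins are illustrative choices, not taken from the tool. When compiled without OpenMP support the directives are treated as comments and the loop runs serially.

```fortran
module lock_free_histogram
  implicit none
contains
  ! Count how often each key occurs. Keys are assumed to lie in 1..nbins.
  subroutine histogram(keys, nbins, hist)
    integer, intent(in)  :: keys(:)
    integer, intent(in)  :: nbins
    integer, intent(out) :: hist(nbins)
    integer :: i

    hist = 0
    ! REDUCTION gives every thread a private copy of hist and sums the
    ! copies when the loop ends, so no lock protects the shared counters.
    !$omp parallel do reduction(+:hist)
    do i = 1, size(keys)
      hist(keys(i)) = hist(keys(i)) + 1
    end do
    !$omp end parallel do
  end subroutine histogram
end module lock_free_histogram
```

The trade-off is the one noted above: each thread holds its own copy of `hist`, so memory use grows with the thread count.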
Indirect function call(s) in the loop body are preventing the compiler from vectorizing the loop. Indirect calls, sometimes called indirect jumps, get the callee address from a register or memory; direct calls get the callee address from an argument. Even if you force loop vectorization, indirect calls remain serialized.
Recommendation: Remove indirect call(s) inside loop | Confidence: | %level% |
Indirect function or subroutine calls cannot be vectorized. To fix: Avoid using indirect calls in loops.
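A common way to avoid an indirect call in a hot loop is to select the operation once, outside the loop, and call the chosen function directly inside it. The sketch below shows both forms; the module `direct_dispatch`, the operation codes, and all routine names are hypothetical and serve only to illustrate the pattern.

```fortran
module direct_dispatch
  implicit none
  integer, parameter :: op_twice = 1, op_square = 2

  abstract interface
    pure real function unary_op(x)
      real, intent(in) :: x
    end function unary_op
  end interface
contains
  pure real function twice(x)
    real, intent(in) :: x
    twice = 2.0 * x
  end function twice

  pure real function square(x)
    real, intent(in) :: x
    square = x * x
  end function square

  ! Indirect version: the callee is a dummy procedure resolved at run time.
  subroutine apply_indirect(f, a, b)
    procedure(unary_op) :: f
    real, intent(in)  :: a(:)
    real, intent(out) :: b(:)
    integer :: i
    do i = 1, size(a)
      b(i) = f(a(i))
    end do
  end subroutine apply_indirect

  ! Direct version: dispatch once, then each loop calls a known function
  ! that the compiler can inline.
  subroutine apply_direct(op, a, b)
    integer, intent(in) :: op
    real, intent(in)  :: a(:)
    real, intent(out) :: b(:)
    integer :: i
    select case (op)
    case (op_twice)
      do i = 1, size(a)
        b(i) = twice(a(i))
      end do
    case (op_square)
      do i = 1, size(a)
        b(i) = square(a(i))
      end do
    case default
      b = a
    end select
  end subroutine apply_direct
end module direct_dispatch
```

The direct version duplicates the loop per operation, which costs some code size in exchange for loops whose callee is known at compile time.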
Recommendation: Improve branch prediction | Confidence: | %level% |
For 64-bit applications, branch prediction performance can be negatively impacted when the branch target is more than 4 GB away from the branch. This is more likely to happen when the application is split into shared libraries. To fix: Do the following:
The compiler assumed there is an anti-dependency (Write after read - WAR) or a true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.
Recommendation: Confirm dependency is real | Confidence: | %level% |
There is no confirmation that a real (proven) dependency is present in the loop. To confirm: Run a Dependencies analysis.
Recommendation: Resolve dependency | Confidence: | %level% |
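The Dependencies analysis shows there is a real (proven) dependency in the loop. To fix: Do one of the following:

    ! SAFELEN(4): at most 4 consecutive iterations are executed concurrently
    !$OMP SIMD SAFELEN(4)
    do i = 1, N-4, 4
        a(i+4) = b(i) * c
    enddo

    ! REDUCTION(+:SUMX): partial sums are accumulated per SIMD lane and combined at the end
    !$OMP SIMD REDUCTION(+:SUMX)
    do k = 1, size2
        sumx = sumx + x(k) * b(k)
    enddo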
Recommendation: Enable vectorization | Confidence: | %level% |
The Dependencies analysis shows there is no real dependency in the loop for the given workload. Tell the compiler it is safe to vectorize using the restrict keyword or a directive:
Directive | Outcome |
---|---|
!DIR$ SIMD or !$OMP SIMD | Ignores all dependencies in the loop |
!DIR$ IVDEP | Ignores only vector dependencies (which is safest) |
Example:

    !DIR$ IVDEP
    do i = 1, N-4, 4
        a(i+4) = b(i) * c
    enddo
Possible register spilling was detected and all vector registers are in use. This may negatively impact performance, because the spilled variable must be loaded to and unloaded from main memory. Improve performance by decreasing vector register pressure.
Recommendation: Decrease unroll factor | Confidence: | %level% |
The current directive unroll factor increases vector register pressure. To fix: Decrease unroll factor using a directive: !DIR$ NOUNROLL or !DIR$ UNROLL.
Example:

    ! Request a small unroll factor (2 is an illustrative value)
    !DIR$ UNROLL(2)
    do i = 1, m
        b(i) = a(i) + 1
        d(i) = c(i) + 1
    enddo
Recommendation: Split loop into smaller loops | Confidence: | %level% |
Possible register spilling along with high vector register pressure is preventing effective vectorization. To fix: Use the directive !DIR$ DISTRIBUTE POINT or rewrite your code to distribute the source loop. This can decrease register pressure as well as enable software pipelining and improve both instruction and data cache use.
Example:

    !DIR$ DISTRIBUTE POINT
    do i = 1, m
        b(i) = a(i) + 1
        ....
        c(i) = a(i) + b(i)   ! Compiler will decide
                             ! where to distribute.
                             ! Data dependencies are
                             ! observed
        ....
        d(i) = c(i) + 1
    enddo

    do i = 1, m
        b(i) = a(i) + 1
        ....
        !DIR$ DISTRIBUTE POINT
        call sub(a, n)       ! Distribution will start here,
                             ! ignoring all loop-carried
                             ! dependencies
        c(i) = a(i) + b(i)
        ....
        d(i) = c(i) + 1
    enddo
Inefficient memory access patterns may result in significant vector code execution slowdown or block automatic vectorization by the compiler. Improve performance by investigating.
There is a high percentage of memory instructions with irregular (variable or random) stride accesses. Improve performance by investigating and handling accordingly.
Recommendation: Use SoA instead of AoS | Confidence: | %level% |
An array is the most common type of data structure containing a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can hinder effective vector processing. To fix: Rewrite code to organize data using SoA instead of AoS.
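The difference between the two layouts can be sketched with a hypothetical particle data set. All type and routine names below are assumptions made for illustration. In the AoS version, consecutive `x` values are separated in memory by the other fields of the derived type, while in the SoA version each field is its own contiguous array.

```fortran
module particle_layouts
  implicit none

  ! AoS: one derived-type element per particle.
  type :: particle_aos
    real :: x, y, z
  end type particle_aos

  ! SoA: one contiguous array per field.
  type :: particles_soa
    real, allocatable :: x(:), y(:), z(:)
  end type particles_soa
contains
  subroutine shift_x_aos(p, dx)
    type(particle_aos), intent(inout) :: p(:)
    real, intent(in) :: dx
    integer :: i
    do i = 1, size(p)
      p(i)%x = p(i)%x + dx    ! non-unit stride: y and z sit between x values
    end do
  end subroutine shift_x_aos

  subroutine shift_x_soa(p, dx)
    type(particles_soa), intent(inout) :: p
    real, intent(in) :: dx
    integer :: i
    do i = 1, size(p%x)
      p%x(i) = p%x(i) + dx    ! unit stride: x values are adjacent in memory
    end do
  end subroutine shift_x_soa
end module particle_layouts
```

Both routines compute the same result; the SoA loop reads and writes adjacent elements, which is the access pattern vector loads and stores handle best.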
Recommendation: Use the Fortran 2008 CONTIGUOUS attribute | Confidence: | %level% |
The loop is multi-versioned for unit and non-unit strides in assumed-shape arrays or pointers, but marked versions of the loop have unit stride access only. The CONTIGUOUS attribute specifies the target of a pointer or an assumed-shape array is contiguous. It can make it easier to enable optimizations that rely on the memory layout of an object occupying a contiguous block of memory. Note: The results are indeterminate and could result in wrong answers and segmentation faults if the user assertion is wrong and the data is not contiguous at runtime.
Example:

    real, pointer, contiguous :: ptr(:)
    real, contiguous :: arrayarg(:, :)
Recommendation: Reorder loops | Confidence: | %level% |
This loop has less efficient memory access patterns than a nearby outer loop. To fix: Run a Memory Access Patterns analysis on the outer loop. If the memory access patterns are more efficient for the outer loop, reorder the loops if possible.
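Because Fortran stores arrays in column-major order, the first index should normally vary fastest. The following sketch, with hypothetical module and routine names, shows a loop nest whose inner loop strides across columns and the reordered nest whose inner loop walks down a column with unit stride.

```fortran
module loop_order
  implicit none
contains
  ! Inner loop varies the second index: consecutive iterations touch
  ! elements that are size(a, 1) elements apart in memory.
  subroutine scale_strided(a, s)
    real, intent(inout) :: a(:, :)
    real, intent(in)    :: s
    integer :: i, j
    do i = 1, size(a, 1)
      do j = 1, size(a, 2)
        a(i, j) = s * a(i, j)
      end do
    end do
  end subroutine scale_strided

  ! Reordered: the inner loop varies the first index, which is contiguous
  ! for a contiguous actual argument.
  subroutine scale_unit_stride(a, s)
    real, intent(inout) :: a(:, :)
    real, intent(in)    :: s
    integer :: i, j
    do j = 1, size(a, 2)
      do i = 1, size(a, 1)
        a(i, j) = s * a(i, j)
      end do
    end do
  end subroutine scale_unit_stride
end module loop_order
```

Reordering is safe here because the iterations are independent; loops with dependencies between iterations need to be checked before they are interchanged.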
Your current hardware supports the AVX2 instruction set architecture (ISA), which enables the use of fused multiply-add (FMA) instructions. Improve performance by utilizing FMA instructions.
Recommendation: Target the AVX2 ISA | Confidence: | %level% |
Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, no AVX2-specific code executed for this loop. To fix: Use the xCORE-AVX2 compiler option to generate AVX2-specific code, or the axCORE-AVX2 compiler option to enable multiple, feature-specific, auto-dispatch code generation, including AVX2.
Recommendation: Target a specific ISA instead of using the xHost option | Confidence: | %level% |
Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, no AVX2-specific code executed for this loop. To fix: Instead of using the xHost compiler option, which limits optimization opportunities by the host ISA, use the axCORE-AVX2 compiler option to compile for machines with and without AVX2 support, or the xCORE-AVX2 compiler option to compile for machines with AVX2 support only.
Recommendation: Explicitly enable FMA generation when using the strict floating-point model | Confidence: | %level% |
Static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, but the strict floating-point model disables FMA instruction generation by default. To fix: Override this behavior using the fma compiler option.
Recommendation: Force vectorization if possible | Confidence: | %level% |
The loop contains FMA instructions (so vectorization could be beneficial) but is not vectorized. To fix: Review corresponding compiler diagnostics to check if vectorization enforcement is possible and profitable.
Vector declaration defaults for your SIMD-enabled functions may result in extra computations or ineffective memory access patterns. Improve performance by overriding defaults.
Recommendation: Target a specific processor type | Confidence: | %level% |
The default instruction set architecture (ISA) for SIMD-enabled functions is inefficient for your host processor because it could result in extra memory operations between registers. To fix: Add a PROCESSOR clause to your vector declaration. Specifically, add PROCESSOR(cpuid) to your !$OMP DECLARE SIMD directive.
Recommendation: Specify the value of the underlying reference as linear | Confidence: | %level% |
In Fortran applications, by default, scalar arguments are passed by reference. Therefore, in SIMD-enabled functions, arguments are passed as a short vector of addresses instead of a single address. The compiler then gathers data from the vector of addresses to create a short vector of values for use in subsequent vector arithmetic. This gather activity negatively impacts performance. To fix: Add a LINEAR clause with a REF modifier (introduced in OpenMP* 4.5) to your vector declaration. Specifically, add LINEAR (REF(linear-list[: linear-step])) to your !$OMP DECLARE SIMD directive.
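A minimal sketch of the LINEAR(REF(...)) clause follows, assuming a compiler that supports OpenMP 4.5 or later. The module `simd_linear_ref` and the routines `add_one` and `add_one_all` are hypothetical names. The arrays are declared with explicit shape so that consecutive elements are contiguous, which is what the clause asserts; without an OpenMP compiler option the directives are treated as comments.

```fortran
module simd_linear_ref
  implicit none
contains
  real function add_one(x)
    real, intent(in) :: x
    ! x is passed by reference. LINEAR(REF(x)) states that its address
    ! advances by one element per SIMD lane, so the compiler can use a
    ! plain vector load instead of gathering from a vector of addresses.
    !$omp declare simd(add_one) linear(ref(x))
    add_one = x + 1.0
  end function add_one

  subroutine add_one_all(a, b, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n)
    real, intent(out)   :: b(n)
    integer :: k
    !$omp simd
    do k = 1, n
      b(k) = add_one(a(k))   ! consecutive k means consecutive addresses
    end do
  end subroutine add_one_all
end module simd_linear_ref
```

The clause is a promise to the compiler: if the function is called with arguments whose addresses are not evenly spaced by one element, the results are undefined.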