Intel® C++ Compiler 16.0 User and Reference Guide

Vectorizing for Intel® Graphics Technology

This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.

Vectorization is critical for performance on processor graphics. The compiler generates efficient code with minimal programming effort, particularly when you use a subset of the Intel® Cilk™ Plus language extensions.

Although you can often rely on the compiler's automatic vectorization, using explicit vectorization, such as #pragma simd or array notation, is recommended.

#pragma simd guides the compiler to vectorize the loop. The compiler generates a warning if it cannot vectorize the loop.

#pragma simd now supports outer loop vectorization, in which the marked loop is vectorized as a whole and can include multiple nested loops. A vectorized loop can contain many useful coding patterns, including access to structures and arrays, and calls to vector functions.

Vector or SIMD-enabled functions significantly influence outer loop vectorization. A vector function is a regular function that can be invoked either as a scalar function for a single iteration of a loop, or for multiple iterations of the vectorized loop in parallel. To make a function elemental, annotate it with __declspec(vector) (Windows* and Linux*) or __attribute__((vector)) (Linux only), which guides the compiler to generate both a scalar and a vector form of the function. In the vector form, all arguments of the function become vectorized, unless they are qualified in __declspec(vector)/__attribute__((vector)) as either linear or uniform. Both linear and uniform are used for optimization: they state certain conventions between the caller and the vector form of the SIMD-enabled function, which enables you to pass a single scalar value instead of a vector of values.
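The effect of these clauses can be sketched in portable C++ (the names below are hypothetical; the actual vector form is generated by the compiler): a uniform argument is shared by all lanes, a linear argument advances by a fixed step per lane, and an unqualified argument receives one value per lane.

```cpp
#include <array>

// Scalar form: one iteration of the loop.
int scale_at(const int *data, int i, int factor) {
    return data[i] * factor;
}

// Conceptual 4-lane "vector form": data is uniform (one pointer shared by
// all lanes), i is linear with step 1 (lane k sees i + k), and factor is a
// regular vectorized argument (one value per lane).
std::array<int, 4> scale_at_vec(const int *data, int i,
                                const std::array<int, 4> &factor) {
    std::array<int, 4> out{};
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = scale_at(data, i + lane, factor[lane]);
    return out;
}
```

The vector form produces the same results as four consecutive scalar calls, which is exactly the contract the compiler relies on when it vectorizes a loop that calls the function.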

Array Notation is a powerful extension that enables writing compact data parallel code. With vectorization enabled, the compiler implements array notations with vector code.

The basic form of an array notation expression is a section operator that is similar to a subscript operator but indicates that the operation will be applied to a section of an array rather than a single element, as follows:

section_operator ::= [ lower_bound : length : stride ]

where lower_bound, length, and stride are integer expressions, representing the following set of integer values:

lower_bound, (lower_bound + stride), …, lower_bound + (length - 1) * stride

For example, A[2:8:2] refers to 8 elements of array A with indices 2, 4, … , 16.
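The semantics of such a section can be sketched as a plain loop (a sketch of what the section means, not of the code the compiler generates); for example, copying the section A[2:8:2] into consecutive elements of another array:

```cpp
// Equivalent of B[0:8] = A[2:8:2]; — copy 8 elements of A starting at
// index 2 with stride 2 into consecutive elements of B.
void copy_section(const int *A, int *B) {
    for (int j = 0; j < 8; ++j)      // length = 8
        B[j] = A[2 + j * 2];         // lower_bound = 2, stride = 2
}
```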

Array Notations provide other interesting features such as access to implicit index variables and reduction functions.
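For instance, __sec_implicit_index(0) yields the index along rank 0 of a section, and __sec_reduce_add sums the elements of a section. Their scalar equivalents are roughly the following (a sketch of the semantics in portable C++):

```cpp
// a[0:n] = __sec_implicit_index(0); — fill a with 0, 1, 2, ...
void fill_index(int *a, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = i;
}

// sum = __sec_reduce_add(a[0:n]); — reduction over the whole section
int reduce_add(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```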

Example: Crossfade

The code below is a very simple image filter that merges two images. It is produced from trivial scalar code by adding an offload pragma and replacing the for loop with _Cilk_for.

bool CrossFade::execute_offload (int do_offload)
{
   // We create temporary copies of the "this" object members used in the
   // offload because:
   // - for pointer-typed members, offloading the member directly does not
   //   work: pointer marshalling across offload boundaries is not
   //   supported, and the target cannot use CPU-side pointer values
   // - for any object member, copying avoids double indirection on the
   //   target side
   unsigned char * inputArray1 = m_inputArray1, * inputArray2 = m_inputArray2,
   * outputArray = m_outputArray;
   int arrayWidth = m_arrayWidth, arrayHeight = m_arrayHeight, arraySize = m_arraySize;

   unsigned a1 = 256 - m_blendFactor;
   unsigned a2 = m_blendFactor;

   // pragma offload lists all data to be shared (or copied) and must be
   // followed by a parallel loop, expressed as _Cilk_for or an array
   // notation statement, to indicate that the loop is parallel and needs
   // to be parallelized on the target.
   #pragma offload target(gfx) if (do_offload) \
     pin(inputArray1, inputArray2, outputArray: length(arraySize))
   _Cilk_for (int i=0; i<arraySize; i++){
     outputArray[i] = (inputArray1[i] * a1 + inputArray2[i] * a2) >> 8;
   }
   return true;
}
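The kernel uses 8-bit fixed-point weights: because a1 + a2 == 256, the expression (p1*a1 + p2*a2) >> 8 is an integer approximation of linear interpolation between the two pixels. A scalar sketch of one pixel (hypothetical helper name):

```cpp
// One pixel of the crossfade; blendFactor is in [0, 256].
unsigned char blend_pixel(unsigned char p1, unsigned char p2,
                          unsigned blendFactor) {
    unsigned a1 = 256 - blendFactor;  // weight of the first image
    unsigned a2 = blendFactor;        // weight of the second image
    return (unsigned char)((p1 * a1 + p2 * a2) >> 8);  // >> 8 divides by 256
}
```

A blendFactor of 0 returns the first pixel unchanged, 256 returns the second, and 128 gives the midpoint.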

Example: Tiled Matrix Multiplication

This example multiplies matrices with explicit tiling, using Array Notation expressions to express the data parallelism concisely.

bool MatmultLocalsAN::execute_offload(int do_offload)
{
   // similarly to the previous example, we create temporary copies
   // of "this" object members
   int m = m_height, n = m_width, k = m_common;
   float (* A)[k] = (float (*)[k])m_matA;
   float (* B)[n] = (float (*)[n])m_matB;
   float (* C)[n] = (float (*)[n])m_matC;

   // note that although A, B, C are pointers to arrays,
   // the length is specified in elements of the pointed-to arrays
   // and not in units of the size of the pointed-to arrays
   #pragma offload target(gfx) if (do_offload) \
         pin(A: length(m*k)), pin(B: length(k*n)), pin(C: length(m*n))
   // Perfectly nested parallel loops can be collapsed by the compiler
   _Cilk_for (int r = 0; r < m; r += TILE_m) {
      _Cilk_for (int c = 0; c < n; c += TILE_n) {
         // these arrays will be allocated in the
         // Intel® Graphics Technology Register File (GRF),
         // resulting in very efficient code
         float atile[TILE_m][TILE_k], btile[TILE_n], ctile[TILE_m][TILE_n];
         // Array Notation syntax to initialize ctile.
         // Will produce a sequence of vector operations;
         // unroll to generate direct GRF accesses
         #pragma unroll
         ctile[:][:] = 0.0f;
         for (int t = 0; t < k; t += TILE_k) {
            // generates a series of vector loads
            #pragma unroll
            atile[:][:] = A[r:TILE_m][t:TILE_k];
            // unroll to generate direct GRF accesses
            #pragma unroll
            for (int rc = 0; rc < TILE_k; rc++) {
               // generates a vector load
               btile[:] = B[t+rc][c:TILE_n];
               #pragma unroll
               for (int rt = 0; rt < TILE_m; rt++) {
                  // generates vector operations
                  ctile[rt][:] += atile[rt][rc] * btile[:];
               }
            }
         }
         // generates a series of vector stores
         #pragma unroll
         C[r:TILE_m][c:TILE_n] = ctile[:][:];
      }
   }
   return true;
}
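The tiling scheme above can be sketched in portable scalar C++ (tile sizes and the function name are illustrative; the real code keeps the tiles in the GRF and runs the two outer loops in parallel on the target):

```cpp
#include <vector>

// Tiled matrix multiply, C (m x n) = A (m x k) * B (k x n), row-major.
// m, n, and k are assumed to be multiples of the tile sizes.
void matmul_tiled(const std::vector<float> &A, const std::vector<float> &B,
                  std::vector<float> &C, int m, int n, int k) {
    const int TILE_m = 2, TILE_n = 2, TILE_k = 2;  // illustrative sizes
    for (int r = 0; r < m; r += TILE_m)
        for (int c = 0; c < n; c += TILE_n) {
            float ctile[2][2] = {};                // ctile[:][:] = 0.0f;
            for (int t = 0; t < k; t += TILE_k)
                for (int rc = 0; rc < TILE_k; rc++)
                    for (int rt = 0; rt < TILE_m; rt++)
                        for (int ct = 0; ct < TILE_n; ct++)
                            ctile[rt][ct] += A[(r + rt) * k + (t + rc)]
                                           * B[(t + rc) * n + (c + ct)];
            for (int rt = 0; rt < TILE_m; rt++)    // C[...] = ctile[:][:];
                for (int ct = 0; ct < TILE_n; ct++)
                    C[(r + rt) * n + (c + ct)] = ctile[rt][ct];
        }
}
```

Each tile of C is accumulated entirely in a small local array before being written back, which is the property that lets the offloaded version hold ctile in registers.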

The following example demonstrates using #pragma simd and vector functions with linear and uniform arguments:

__declspec(target(gfx))
__declspec(vector(uniform(in1), linear(i)))
// Note that the pointer-typed in1 argument is declared uniform:
// in1 allows access to the whole array in the vector function.
// i is declared linear, enabling the compiler to generate more
// efficient code.
// in2v and the return value are generated as vectors of integers.
int vfunction(int * in1, int i, int in2v)
{
    return in1[i - 1] + in2v * in1[i] + in1[i + 1];
}

int main (int argc, char* argv[]) 
{
    const int size = 4096;
    const int chunkSize = 32;
    const int padding = chunkSize;
    int in1[size], in2[size], out[size];

    // initial values
    in1[:] = __sec_implicit_index(0);
    in2[:] = size - __sec_implicit_index(0);

    #pragma offload target(gfx) 
    _Cilk_for (int i = padding; i < size - padding; i+=chunkSize)
    {
        #pragma simd
        for (int j = 0; j < chunkSize; j++)
            out[i + j] = vfunction(in1, i + j, in2[i + j]);
    }

    // usage or output of the out array follows
    return 0;
}

Vectorization Considerations

Using SIMD-enabled functions and #pragma simd facilitates transitioning from scalar to vectorized code. The simple syntax both marks the corresponding code as vectorizable and enables you to convey additional optimization hints. The following factors may affect performance:

See Also