Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
Vectorization is critical for performance on processor graphics. The compiler can generate efficient vector code with minimal programming effort, particularly through a subset of the Intel® Cilk™ Plus language extensions.
Although you can often rely on the compiler's automatic vectorization, using explicit vectorization, such as #pragma simd or array notation, is recommended.
#pragma simd guides the compiler to vectorize the loop. The compiler generates a warning if it cannot vectorize the loop.
#pragma simd now supports outer loop vectorization, whereby a loop is vectorized vertically and can include multiple nested loops. A vectorized loop can contain many useful coding patterns, including access to structures and arrays, and calls to vector functions.
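As a minimal sketch of this pattern (the function name, loop bounds, and array names are illustrative, not from the examples below):

void row_sums(int height, int width, const float image[][256], float rowSum[])
{
    // The pragma is applied to the outer loop, so several values of
    // row are processed in parallel; the nested loop over col executes
    // inside each vector lane
    #pragma simd
    for (int row = 0; row < height; row++) {
        float sum = 0.0f;                  // becomes a vector of partial sums
        for (int col = 0; col < width; col++)
            sum += image[row][col];
        rowSum[row] = sum;
    }
}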
Vector (SIMD-enabled) functions significantly influence outer loop vectorization. A vector function is a regular function that can be invoked either as a scalar function, for a single iteration of a loop, or in parallel, for multiple iterations of a vectorized loop. To make a function SIMD-enabled, annotate it with __declspec(vector) (Windows* and Linux*) or __attribute__((vector)) (Linux* only), which guides the compiler to generate both a scalar and a vector form of the function. In the vector form, all arguments of the function are vectorized unless they are qualified in __declspec(vector)/__attribute__((vector)) as linear or uniform. Both qualifiers are optimization hints that indicate certain conventions between the caller and the vector form of the SIMD-enabled function, enabling you to pass a single scalar value instead of a vector of values:
linear indicates that the argument can only contain values linearly incremented across iterations, with a stride known at compile time; the default stride is 1.
uniform indicates that the argument has the same value for all iterations, so a single value can be broadcast to all of them.
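As a minimal sketch (the function and argument names here are illustrative; a complete example appears later in this topic), a SIMD-enabled function using both qualifiers could be declared as follows:

__declspec(vector(uniform(data), linear(idx)))
float scale(const float * data, int idx, float x)
{
    // data is a single pointer shared by all lanes (uniform),
    // idx advances by the default stride of 1 per lane (linear),
    // and x and the return value are generated as vectors of floats
    return x * data[idx];
}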
Array Notation is a powerful extension that enables writing compact data parallel code. With vectorization enabled, the compiler implements array notations with vector code.
The basic form of an array notation expression is a section operator that is similar to a subscript operator but indicates that the operation will be applied to a section of an array rather than a single element, as follows:
section_operator ::= [ lower_bound : length : stride ]
where lower_bound, length, and stride are of integer type and designate the following set of integer values:
lower_bound, (lower_bound + stride), …, lower_bound + (length - 1) * stride
For example, A[2:8:2] refers to 8 elements of array A with indices 2, 4, … , 16.
Array Notations provide other interesting features such as access to implicit index variables and reduction functions.
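A short sketch combining these two features (the array name and function wrapper are illustrative):

void implicit_index_and_reduction()
{
    const int N = 64;
    float A[N];
    // __sec_implicit_index(0) yields the implicit index along rank 0
    // of the section, so A[i] is assigned 1.0f / (i + 1)
    A[:] = 1.0f / (1 + __sec_implicit_index(0));
    // __sec_reduce_add sums all elements of the section
    float total = __sec_reduce_add(A[:]);
    (void)total;
}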
The code below is a very simple image filter that merges two images. It is produced from trivial scalar code by adding an offload pragma and changing the serial loop to _Cilk_for.
bool CrossFade::execute_offload(int do_offload)
{
    // We create temporary copies of the "this" object members used in
    // the offload because:
    // - for pointer-typed members it will not work correctly, since
    //   pointer marshalling across offload boundaries is not supported:
    //   the target does not know what to do with CPU-style pointer values
    // - for any object members, copies avoid double indirection on the
    //   target side, which is more efficient
    unsigned char * inputArray1 = m_inputArray1,
                  * inputArray2 = m_inputArray2,
                  * outputArray = m_outputArray;
    int arrayWidth  = m_arrayWidth,
        arrayHeight = m_arrayHeight,
        arraySize   = m_arraySize;
    unsigned a1 = 256 - m_blendFactor;
    unsigned a2 = m_blendFactor;

    // The offload pragma lists all data to be shared (or copied) and is
    // followed by a required parallel loop, expressed as _Cilk_for or an
    // array notation statement, to indicate that the loop after it is
    // parallel and needs to be parallelized on the target.
    #pragma offload target(gfx) if (do_offload) \
        pin(inputArray1, inputArray2, outputArray : length(arraySize))
    _Cilk_for (int i = 0; i < arraySize; i++) {
        outputArray[i] = (inputArray1[i] * a1 + inputArray2[i] * a2) >> 8;
    }
    return true;
}
The following code implements matrix multiplication with explicit tiling, using Array Notation expressions for a concise expression of data parallelism.
bool MatmultLocalsAN::execute_offload(int do_offload)
{
    // Similarly to the previous example, we create temporary copies
    // of "this" object members
    int m = m_height, n = m_width, k = m_common;
    float (* A)[k] = (float (*)[])m_matA;
    float (* B)[n] = (float (*)[])m_matB;
    float (* C)[n] = (float (*)[])m_matC;

    // Note that although A, B, C are pointers to an array, the length
    // is specified in elements of the pointed-to arrays and not in
    // units of the size of the pointed-to arrays
    #pragma offload target(gfx) if (do_offload) \
        pin(A: length(m*k)), pin(B: length(k*n)), pin(C: length(m*n))
    // Perfectly nested parallel loops can be collapsed by the compiler
    _Cilk_for (int r = 0; r < m; r += TILE_m) {
        _Cilk_for (int c = 0; c < n; c += TILE_n) {
            // These arrays will be allocated in the Intel® Graphics
            // Technology Register File (GRF), resulting in very
            // efficient code
            float atile[TILE_m][TILE_k], btile[TILE_n], ctile[TILE_m][TILE_n];

            // Array Notation syntax to initialize ctile.
            // Will produce a sequence of vector operations.
            // Unroll to generate direct GRF accesses
            #pragma unroll
            ctile[:][:] = 0.0f;

            for (int t = 0; t < k; t += TILE_k) {
                // generates a series of vector loads
                #pragma unroll
                atile[:][:] = A[r:TILE_m][t:TILE_k];
                // unroll to generate direct GRF accesses
                #pragma unroll
                for (int rc = 0; rc < TILE_k; rc++) {
                    // generates a vector load
                    btile[:] = B[t+rc][c:TILE_n];
                    #pragma unroll
                    for (int rt = 0; rt < TILE_m; rt++) {
                        // generates vector operations
                        ctile[rt][:] += atile[rt][rc] * btile[:];
                    }
                }
            }
            // generates a series of vector stores
            #pragma unroll
            C[r:TILE_m][c:TILE_n] = ctile[:][:];
        }
    }
    return true;
}
The following example demonstrates using #pragma simd and vector functions with linear and uniform arguments:
__declspec(target(gfx))
__declspec(vector(uniform(in1), linear(i)))
// Note that the pointer-typed in1 argument is declared uniform:
// in1 allows access to the whole array in the vector function.
// i is declared linear, enabling the compiler to generate more
// efficient code.
// in2v and the return value are generated as vectors of ints.
int vfunction(int * in1, int i, int in2v)
{
    return in1[i - 1] + in2v * in1[i] + in1[i + 1];
}

int main(int argc, char* argv[])
{
    const int size = 4096;
    const int chunkSize = 32;
    const int padding = chunkSize;
    int in1[size], in2[size], out[size];

    // initial values
    in1[:] = __sec_implicit_index(0);
    in2[:] = size - __sec_implicit_index(0);

    #pragma offload target(gfx)
    _Cilk_for (int i = padding; i < size - padding; i += chunkSize) {
        #pragma simd
        for (int j = 0; j < chunkSize; j++)
            out[i + j] = vfunction(in1, i + j, in2[i + j]);
    }
    // usage or output of the out array follows
    return 0;
}
Using SIMD-enabled functions and #pragma simd facilitates transitioning from scalar to vectorized code. The simple syntax both marks the corresponding code as vectorizable and enables you to convey additional optimization hints. The following factors may affect performance:
There is no support for function-scoped uniform data in SIMD-enabled functions. Uniform data shared by multiple vectorized iterations must be passed via a function argument annotated as uniform. This prevents allocating such uniform data in the GRF unless the function is inlined. However, the information from __declspec(vector) is not used when the function is inlined, so in that case you can guide vectorization by annotating the vectorized loop containing the inlined function body with #pragma simd.
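For example, a lookup table shared by all vectorized iterations would be passed through a uniform argument rather than defined in the function (a sketch; the names are illustrative):

__declspec(vector(uniform(table), linear(i)))
int lookup(const int * table, int i)
{
    // table cannot be a function-scoped uniform variable; passing it
    // as a uniform argument broadcasts the single pointer to all lanes
    return table[i];
}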
Arrays and structure-typed variables defined inside a vectorized context, either a vectorized loop or a SIMD-enabled function, are replicated by the vector length, which increases register pressure and may require you to explicitly request a smaller vector length to avoid spilling. Currently such arrays and structures are vectorized in AoS (array-of-structures) style; for example, int arr[32] becomes int arr[vector_length][32]. Consequently, accesses to such arrays and structures are converted to gather/scatter vector accesses, which, in the case of GRF arrays, are often de-vectorized. Alternatively, the compiler may scalarize accesses to structures before vectorization, converting each field into a separate temporary scalar that is vectorized separately, so AoS-style vectorization will not necessarily hurt the performance of your kernels. If it does cause problems, consider rewriting your kernel with an explicit SoA (structure-of-arrays) data layout via multi-dimensional arrays, using Array Notations for a compact representation of the vector memory accesses.
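The difference between the two layouts can be sketched as follows (the function, structure, and variable names are illustrative):

// AoS style: a structure-typed local inside a vectorized loop is
// replicated by the vector length and accessed via gather/scatter
struct Point { float x, y, z; };

void aos_kernel(float out[], int n)
{
    #pragma simd
    for (int i = 0; i < n; i++) {
        Point p = { out[i], 2.0f, 3.0f };  // replicated per vector lane
        out[i] = p.x * p.y + p.z;
    }
}

// SoA style: one array per field, with Array Notation providing a
// compact representation of the vector memory accesses
void soa_kernel(const float xs[], const float ys[], const float zs[],
                float out[], int n)
{
    out[0:n] = xs[0:n] * ys[0:n] + zs[0:n];
}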
Using #pragma simd or Array Notations directs the compiler to vectorize code more explicitly, which is especially useful if you are not sure that auto-vectorization did the best job for your code.
Short vector types of 64-bit and 128-bit length are not currently supported by the vectorizer for Intel® Graphics Technology-based coprocessors. This means that the compiler will not vectorize code operating on the following:
1-byte data types if vector length is 4, 8 or 16 elements
2-byte data types if vector length is 4 or 8 elements
4-byte data types if vector length is 4 elements
Unless an unsupported vector length is explicitly requested, the compiler will try a supported one. If an unsupported length is requested, the compiler produces scalar code and, in the case of #pragma simd or vector functions, generates warnings.
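For example, for 4-byte float data you could explicitly request a supported vector length (a sketch; the function and array names are illustrative):

void vec_add(const float * a, const float * b, float * c, int n)
{
    // 16 floats per vector is 512 bits, which avoids the unsupported
    // 64-bit and 128-bit short-vector cases listed above;
    // vectorlength(4) would request the unsupported 128-bit case
    #pragma simd vectorlength(16)
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}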