Intel® C++ Compiler 16.0 User and Reference Guide
SIMD-enabled functions (formerly called elemental functions) are a general language construct to express a data parallel algorithm. A SIMD-enabled function is written as a regular C/C++ function, and the algorithm within describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element or it can be called in a data parallel context to operate on many elements. In Intel® Cilk™ Plus, the data parallel context is provided as an array.
If you are using SIMD-enabled functions and need to link a compiler object file with an object file from a previous version of the compiler (for example, 13.1), you need to use the [Q]vecabi compiler option, specifying the legacy keyword. The default value (compat) is compatible with the GCC vector function support of both Intel® Cilk™ Plus and OpenMP* 4.0.
When you write a SIMD-enabled function, the compiler generates a short vector form of the function, which can perform your function's operation on multiple arguments in a single invocation. The short vector version may be able to perform multiple operations as fast as the regular implementation performs a single one by utilizing the vector instruction set architecture (ISA) in the CPU. In addition, when invoked from a cilk_for or #pragma omp construct, the compiler may assign different copies of the SIMD-enabled functions to different threads (or workers), executing them concurrently. The end result is that your data parallel operation executes on the CPU utilizing both the parallelism available in the multiple cores and the parallelism available in the vector ISA.
If the short vector function is called inside a parallel loop, a cilk_for loop, or an auto-parallelized loop that is vectorized, you can achieve both vector-level and thread-level parallelism.
In order for the compiler to generate the short vector function, you need to provide an indication in your code.
Windows*:
Use the __declspec(vector (clauses)) declaration, as follows:
__declspec(vector (clauses)) return_type simd_enabled_function_name(arguments)
Linux* and OS X*:
Use the __attribute__((vector (clauses))) declaration, as follows:
__attribute__((vector (clauses))) return_type simd_enabled_function_name(arguments)
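As a minimal sketch, a SIMD-enabled addition function could be declared as follows. The SIMD_ENABLED macro here is a hypothetical convenience wrapper (not part of the compiler) that selects the Windows* or Linux*/OS X* spelling and expands to nothing on compilers without vector-function support, so the sketch builds anywhere:

```c
#include <assert.h>

/* Hypothetical convenience macro: picks the right spelling of the
 * vector declaration for the host compiler. On compilers without
 * vector-function support it expands to nothing, so the function
 * still builds and runs as an ordinary scalar function. */
#if defined(__INTEL_COMPILER) && defined(_WIN32)
#  define SIMD_ENABLED __declspec(vector)
#elif defined(__INTEL_COMPILER)
#  define SIMD_ENABLED __attribute__((vector))
#else
#  define SIMD_ENABLED
#endif

/* Written with scalar syntax: the body describes the operation on one
 * element. With the vector declaration in effect, the compiler also
 * generates a short vector version of this function. */
SIMD_ENABLED
float ef_add(float a, float b)
{
    return a + b;
}
```

The same scalar body serves both as the single-element function and as the blueprint for the compiler-generated short vector version.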
The clauses for the vector declaration take the following values:

processor(cpuid)
    Where cpuid is one of the processor identifiers supported by the compiler.

vectorlength(n)
    Where n is the vector length: an integer power of two (2, 4, 8, or 16). The vectorlength clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n times the scalar function execution.

linear(list_item[, list_item...])
    The linear clause tells the compiler that for each consecutive invocation of the routine in a serial execution, the value of param is incremented by step, where param is a formal parameter of the specified function or the C++ keyword this. The linear clause can be used on parameters that are scalar (non-array, non-structured types), pointers, or C++ references. step is a compile-time integer constant expression, which defaults to 1 if omitted. If more than one step is specified for a particular parameter, a compile-time error occurs. Multiple linear clauses are merged as a union.

uniform(param[, param]...)
    Where param is a formal parameter of the specified function or the C++ keyword this. The uniform clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. Multiple uniform clauses are merged as a union.

[no]mask
    The mask clause tells the compiler to generate a masked vector version of the routine; the nomask clause tells it to generate an unmasked version.
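Several clauses are often combined on one declaration. The following sketch (scale_at is a hypothetical function, not from the guide) marks the array base pointer as uniform across lanes, the index as linear with step 1, and requests an 8-wide short vector version; the guard keeps the code building on compilers without the extension:

```c
#include <assert.h>

/* Hypothetical example combining clauses: x is the same for every lane
 * (uniform), i advances by 1 per consecutive invocation (linear), and
 * an 8-wide short vector version is requested (vectorlength). The
 * guard makes this an ordinary scalar function elsewhere. */
#if defined(__INTEL_COMPILER) && !defined(_WIN32)
__attribute__((vector(uniform(x), linear(i:1), vectorlength(8))))
#endif
float scale_at(const float *x, int i)
{
    /* Because i is linear with step 1, consecutive lanes read
     * consecutive elements: a unit-stride access. */
    return 2.0f * x[i];
}
```

Declaring x uniform and i linear lets the compiler emit one vector load per invocation instead of a gather.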
Write the code inside your function using existing C/C++ syntax and relevant built-in functions (see the section on __intel_simd_lane() below).
Typically, the invocation of a SIMD-enabled function provides arrays wherever scalar arguments are specified as formal parameters. Use the array notation syntax available in Intel® Cilk™ Plus to provide the arrays succinctly. Alternatively, you can invoke the function from a _Cilk_for loop.
The following two invocations provide vector-level parallelism by having the compiler issue special vector instructions.
a[:] = ef_add(b[:],c[:]); //operates on the whole extent of the arrays a, b, c
a[0:n:s] = ef_add(b[0:n:s],c[0:n:s]); //use the full array notation construct to also specify n as an extent and s as a stride
To invoke the SIMD-enabled function in a data parallel context and use multiple cores and processors, use _Cilk_for:
_Cilk_for (j = 0; j < n; ++j) { a[j] = ef_add(b[j],c[j]); }
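For readers without the Intel® Cilk™ Plus runtime, the result computed by either invocation form is equivalent to the following serial sketch (ef_add is assumed to be the SIMD-enabled addition function from the surrounding examples, shown here as a plain scalar function so the sketch builds anywhere):

```c
#include <assert.h>

/* Assumed SIMD-enabled function from the surrounding text, written as
 * an ordinary scalar function for portability. */
static float ef_add(float x, float y) { return x + y; }

/* Semantically, a[:] = ef_add(b[:], c[:]) and the _Cilk_for invocation
 * both compute this element-wise result; the Cilk Plus forms merely
 * expose the vector and thread parallelism to the compiler and
 * runtime. */
static void ef_add_all(float *a, const float *b, const float *c, int n)
{
    for (int j = 0; j < n; ++j)
        a[j] = ef_add(b[j], c[j]);
}
```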
Only the calling code using the _Cilk_for calling syntax is able to use all available parallelism. Using array notation syntax, or calling the SIMD-enabled function from a regular for loop, invokes the short vector function in each iteration and utilizes the vector parallelism, but the invocations are made in a serial loop without utilizing multiple cores. To use all available parallelism, explore the Intel® Cilk™ Plus keywords (for example, cilk_for or cilk_spawn) or OpenMP*.
When called from within a vectorized loop, the __intel_simd_lane() built-in function will return a number between 0 and vectorlength - 1 that reflects the current "lane id" within the SIMD vector. __intel_simd_lane() will return zero if the loop is not vectorized. Calling __intel_simd_lane() outside of an explicit vector programming construct is discouraged. It may prevent auto-vectorization and such a call often results in the function returning zero instead of a value between 0 and vectorlength-1.
To see how __intel_simd_lane() can be used, consider the following example:
void accumulate(float *a, float *b, float *c, float d){
    *a += sin(d);
    *b += cos(d);
    *c += log(d);
}

for (i = low; i < high; i++){
    accumulate(&suma, &sumb, &sumc, d[i]);
}
A first-run conversion to Intel® Cilk™ Plus vector code without __intel_simd_lane() might look like this:
#define VL 16
__declspec(vector(uniform(a,b,c), linear(i)))
void accumulate(float *a, float *b, float *c, float d, int i){
    a[i & (VL-1)] += sin(d);
    b[i & (VL-1)] += cos(d);
    c[i & (VL-1)] += log(d);
}

float a[VL] = {0.0f};
float b[VL] = {0.0f};
float c[VL] = {0.0f};
#pragma omp simd safelen(VL)
for (i = low; i < high; i++){
    accumulate(a, b, c, d[i], i);
}
for (i = 0; i < VL; i++){
    suma += a[i];
    sumb += b[i];
    sumc += c[i];
}
The gather-scatter type memory addressing caused by the references to arrays a, b, and c in the SIMD-enabled function accumulate() significantly hurts performance, making the whole conversion useless. To avoid this penalty, you may use the __intel_simd_lane() built-in function as follows:
__declspec(vector(uniform(a,b,c), aligned(a,b,c)))
void accumulate(float *a, float *b, float *c, float d){
    // No need to take a "loop index". No need to know VL.
    a[__intel_simd_lane()] += sin(d);
    b[__intel_simd_lane()] += cos(d);
    c[__intel_simd_lane()] += log(d);
}

#define VL 16 // actual SIMD code may use a vector length of 4, but that is okay
float a[VL] = {0.0f};
float b[VL] = {0.0f};
float c[VL] = {0.0f};
#pragma omp simd safelen(VL)
for (i = low; i < high; i++){
    // If low is known to be zero at compile time, "i & (VL-1)"
    // would accomplish what __intel_simd_lane() is intended for,
    // but only on the caller side.
    accumulate(a, b, c, d[i]);
}
for (i = 0; i < VL; i++){
    suma += a[i];
    sumb += b[i];
    sumc += c[i];
}
With use of __intel_simd_lane() the references to the arrays in accumulate() will have unit-stride.
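The per-lane accumulator pattern above can be emulated in plain C to see why the final reduction loop is needed. In this sketch, simd_lane is a hypothetical stand-in for __intel_simd_lane() (a real vectorized loop supplies the lane id from the hardware vector position):

```c
#include <assert.h>

#define VL 16

/* Hypothetical stand-in for __intel_simd_lane(): in this scalar
 * emulation the lane id is derived from the loop index, which is what
 * "i & (VL-1)" does on the caller side in the first conversion. */
static int simd_lane(int i) { return i & (VL - 1); }

/* Sum n values into VL per-lane partial sums, then reduce. Each lane
 * writes only its own slot, so in a real short vector version the
 * stores are unit-stride and independent across lanes. */
static float lane_sum(const float *d, int n)
{
    float partial[VL] = {0.0f};
    for (int i = 0; i < n; i++)
        partial[simd_lane(i)] += d[i];

    float total = 0.0f;
    for (int lane = 0; lane < VL; lane++)
        total += partial[lane];
    return total;
}
```

The trailing loop over the VL slots corresponds to the final reduction loop in the examples above.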
Limitations
The following language constructs are not allowed within SIMD-enabled functions:
The goto statement
The switch statement with 16 or more case statements
Operations on classes and structs (other than member selection)
The _Cilk_spawn keyword and any update to the Intel® Cilk™ Plus pedigree
Expressions with array notations