Intel® C++ Compiler 16.0 User and Reference Guide
SIMD-enabled functions (formerly called elemental functions) are a general language construct to express a data parallel algorithm. A SIMD-enabled function is written as a regular C/C++ function, and the algorithm within describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element or it can be called in a data parallel context to operate on many elements. In Intel® Cilk™ Plus, the data parallel context is provided as an array.
If you are using SIMD-enabled functions and need to link a compiler object file with an object file from a previous version of the compiler (for example, 13.1), you need to use the [Q]vecabi compiler option, specifying the legacy keyword. The default value (compat) is compatible with the GCC vector function support of both Intel® Cilk™ Plus and OpenMP* 4.0.
When you write a SIMD-enabled function, the compiler generates a short vector form of the function, which can perform your function's operation on multiple arguments in a single invocation. The short vector version may be able to perform multiple operations as fast as the regular implementation performs a single one by utilizing the vector instruction set architecture (ISA) in the CPU. In addition, when invoked from a cilk_for or #pragma omp construct, the compiler may assign different copies of the SIMD-enabled functions to different threads (or workers), executing them concurrently. The end result is that your data parallel operation executes on the CPU utilizing both the parallelism available in the multiple cores and the parallelism available in the vector ISA.
If the short vector function is called inside a parallel loop, a cilk_for loop, or an auto-parallelized loop that is vectorized, you can achieve both vector-level and thread-level parallelism.
In order for the compiler to generate the short vector function, you need to provide an indication in your code.
Windows*:
Use the __declspec(vector (clauses)) declaration, as follows:
__declspec(vector (clauses)) return_type simd_enabled_function_name(arguments)
Linux* and OS X*:
Use the __attribute__((vector (clauses))) declaration, as follows:
__attribute__((vector (clauses))) return_type simd_enabled_function_name(arguments)
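As a minimal sketch, a SIMD-enabled addition function could be declared as follows. The SIMD_ENABLED macro here is a hypothetical convenience wrapper (not part of the compiler) that selects the Windows* or Linux*/OS X* spelling and expands to nothing on compilers without vector-function support, so the sketch builds anywhere:

```c
#include <assert.h>

/* Hypothetical convenience macro: picks the right spelling of the
 * vector declaration for the host compiler. On compilers without
 * vector-function support it expands to nothing, so the function
 * still builds and runs as an ordinary scalar function. */
#if defined(__INTEL_COMPILER) && defined(_WIN32)
#  define SIMD_ENABLED __declspec(vector)
#elif defined(__INTEL_COMPILER)
#  define SIMD_ENABLED __attribute__((vector))
#else
#  define SIMD_ENABLED
#endif

/* Written with scalar syntax: the body describes the operation on one
 * element. With the vector declaration in effect, the compiler also
 * generates a short vector version of this function. */
SIMD_ENABLED
float ef_add(float a, float b)
{
    return a + b;
}
```

The same scalar body serves both as the single-element function and as the blueprint for the compiler-generated short vector version.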
The clauses for the vector declaration take the following values:

processor(cpuid)
    Where cpuid is one of the processor identifiers supported by the compiler.

vectorlength(n)
    Where n is the vector length: an integer power of two (2, 4, 8, or 16). The vectorlength clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n times the scalar function execution.

linear(list_item[, list_item...])
    The linear clause tells the compiler that for each consecutive invocation of the routine in a serial execution, the value of param is incremented by step, where param is a formal parameter of the specified function or the C++ keyword this. The linear clause can be used on parameters that are scalar (non-array, non-structured types), pointers, or C++ references. step is a compile-time integer constant expression, which defaults to 1 if omitted. If more than one step is specified for a particular parameter, a compile-time error occurs. Multiple linear clauses are merged as a union.

uniform(param[, param]...)
    Where param is a formal parameter of the specified function or the C++ keyword this. The uniform clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. Multiple uniform clauses are merged as a union.

[no]mask
    The mask clause tells the compiler to generate a masked vector version of the routine; the nomask clause tells it to generate an unmasked version.
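Several clauses are often combined on one declaration. The following sketch (scale_at is a hypothetical function, not from the guide) marks the array base pointer as uniform across lanes, the index as linear with step 1, and requests an 8-wide short vector version; the guard keeps the code building on compilers without the extension:

```c
#include <assert.h>

/* Hypothetical example combining clauses: x is the same for every lane
 * (uniform), i advances by 1 per consecutive invocation (linear), and
 * an 8-wide short vector version is requested (vectorlength). The
 * guard makes this an ordinary scalar function elsewhere. */
#if defined(__INTEL_COMPILER) && !defined(_WIN32)
__attribute__((vector(uniform(x), linear(i:1), vectorlength(8))))
#endif
float scale_at(const float *x, int i)
{
    /* Because i is linear with step 1, consecutive lanes read
     * consecutive elements: a unit-stride access. */
    return 2.0f * x[i];
}
```

Declaring x uniform and i linear lets the compiler emit one vector load per invocation instead of a gather.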
Write the code inside your function using existing C/C++ syntax and relevant built-in functions (see the section on __intel_simd_lane() below).
Typically, the invocation of a SIMD-enabled function provides arrays wherever scalar arguments are specified as formal parameters. Use the array notation syntax available in Intel® Cilk™ Plus to provide the arrays succinctly. Alternatively, you can invoke the function from a _Cilk_for loop.
The following two invocations provide vector-level parallelism by having the compiler issue special vector instructions.
a[:] = ef_add(b[:],c[:]); //operates on the whole extent of the arrays a, b, c
a[0:n:s] = ef_add(b[0:n:s],c[0:n:s]); //use the full array notation construct to also specify n as an extent and s as a stride
To invoke the SIMD-enabled function in a data parallel context and use multiple cores and processors, use _Cilk_for:
_Cilk_for (j = 0; j < n; ++j) { a[j] = ef_add(b[j],c[j]); }
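For readers without the Intel® Cilk™ Plus runtime, the result computed by either invocation form is equivalent to the following serial sketch (ef_add is assumed to be the SIMD-enabled addition function from the surrounding examples, shown here as a plain scalar function so the sketch builds anywhere):

```c
#include <assert.h>

/* Assumed SIMD-enabled function from the surrounding text, written as
 * an ordinary scalar function for portability. */
static float ef_add(float x, float y) { return x + y; }

/* Semantically, a[:] = ef_add(b[:], c[:]) and the _Cilk_for invocation
 * both compute this element-wise result; the Cilk Plus forms merely
 * expose the vector and thread parallelism to the compiler and
 * runtime. */
static void ef_add_all(float *a, const float *b, const float *c, int n)
{
    for (int j = 0; j < n; ++j)
        a[j] = ef_add(b[j], c[j]);
}
```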
Only the calling code using the _Cilk_for calling syntax is able to use all available parallelism. Using array notation syntax, or calling the SIMD-enabled function from a regular for loop, invokes the short vector function in each iteration and utilizes the vector parallelism, but the invocations are made in a serial loop without utilizing multiple cores. To use all available parallelism, explore the Intel® Cilk™ Plus keywords (for example, cilk_for or cilk_spawn) or OpenMP*.
When called from within a vectorized loop, the __intel_simd_lane() built-in function will return a number between 0 and vectorlength - 1 that reflects the current "lane id" within the SIMD vector. __intel_simd_lane() will return zero if the loop is not vectorized. Calling __intel_simd_lane() outside of an explicit vector programming construct is discouraged. It may prevent auto-vectorization and such a call often results in the function returning zero instead of a value between 0 and vectorlength-1.
To see how __intel_simd_lane() can be used, consider the following example:
void accumulate(float *a, float *b, float *c, float d){
    *a += sin(d);
    *b += cos(d);
    *c += log(d);
}

for (i = low; i < high; i++){
    accumulate(&suma, &sumb, &sumc, d[i]);
}
A first-run conversion to Intel® Cilk™ Plus vector code without __intel_simd_lane() might look like this:
#define VL 16
__declspec(vector(uniform(a,b,c), linear(i)))
void accumulate(float *a, float *b, float *c, float d, int i){
    a[i & (VL-1)] += sin(d);
    b[i & (VL-1)] += cos(d);
    c[i & (VL-1)] += log(d);
}

float a[VL] = {0.0f};
float b[VL] = {0.0f};
float c[VL] = {0.0f};
#pragma omp simd safelen(VL)
for (i = low; i < high; i++){
    accumulate(a, b, c, d[i], i);
}
for (i = 0; i < VL; i++){
    suma += a[i];
    sumb += b[i];
    sumc += c[i];
}
The gather-scatter type memory addressing caused by the references to arrays a, b, and c in the SIMD-enabled function accumulate() significantly hurts performance, making the whole conversion useless. To avoid this penalty, you may use the __intel_simd_lane() built-in function as follows:
__declspec(vector(uniform(a,b,c), aligned(a,b,c)))
void accumulate(float *a, float *b, float *c, float d){
    // No need to take a "loop index". No need to know VL.
    a[__intel_simd_lane()] += sin(d);
    b[__intel_simd_lane()] += cos(d);
    c[__intel_simd_lane()] += log(d);
}

#define VL 16 // actual SIMD code may use a vector length of 4, but that is okay
float a[VL] = {0.0f};
float b[VL] = {0.0f};
float c[VL] = {0.0f};
#pragma omp simd safelen(VL)
for (i = low; i < high; i++){
    // If low is known to be zero at compile time, "i & (VL-1)"
    // would accomplish what __intel_simd_lane() is intended for,
    // but only on the caller side.
    accumulate(a, b, c, d[i]);
}
for (i = 0; i < VL; i++){
    suma += a[i];
    sumb += b[i];
    sumc += c[i];
}
With use of __intel_simd_lane() the references to the arrays in accumulate() will have unit-stride.
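The per-lane accumulator pattern above can be emulated in plain C to see why the final reduction loop is needed. In this sketch, simd_lane is a hypothetical stand-in for __intel_simd_lane() (a real vectorized loop supplies the lane id from the hardware vector position):

```c
#include <assert.h>

#define VL 16

/* Hypothetical stand-in for __intel_simd_lane(): in this scalar
 * emulation the lane id is derived from the loop index, which is what
 * "i & (VL-1)" does on the caller side in the first conversion. */
static int simd_lane(int i) { return i & (VL - 1); }

/* Sum n values into VL per-lane partial sums, then reduce. Each lane
 * writes only its own slot, so in a real short vector version the
 * stores are unit-stride and independent across lanes. */
static float lane_sum(const float *d, int n)
{
    float partial[VL] = {0.0f};
    for (int i = 0; i < n; i++)
        partial[simd_lane(i)] += d[i];

    float total = 0.0f;
    for (int lane = 0; lane < VL; lane++)
        total += partial[lane];
    return total;
}
```

The trailing loop over the VL slots corresponds to the final reduction loop in the examples above.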
Limitations
The following language constructs are not allowed within SIMD-enabled functions:
The goto statement
The switch statement with 16 or more case statements
Operations on classes and structs (other than member selection)
The _Cilk_spawn keyword and any update to the Intel® Cilk™ Plus pedigree
Expressions with array notations