Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
First-level parallel loop nests, those marked with _Thread_group, may appear as the parallel loop of functions declared with the attribute __declspec(target(gfx_kernel)), or within #pragma offload target(gfx) blocks.
Only the following constructs are allowed inside the first-level parallel loop nest (a sketch of these constructs follows below):
SLM data declaration with optional initializer
Assignments to __thread_group_local data
Second-level parallel loop nests
Calls to the thread barrier intrinsic
Serial code (see the dedicated section below for the definition of serial code and the associated restrictions)
The chunk size is guaranteed to be 1 for all dimensions, so each thread group executes exactly one iteration of the first-level loop nest.
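For illustration, here is a minimal, hypothetical kernel skeleton that uses each of the allowed constructs. The kernel name, the TILE constant, and the computation itself are placeholders chosen for this sketch, not part of any API:

#define TILE 256   // placeholder compile-time tile size

__declspec(target(gfx_kernel))
void allowed_constructs_sketch(float *data, int n) {
    _Cilk_for _Thread_group (int tg = 0; tg < n; tg += TILE) {
        // SLM data declaration (an initializer is optional)
        __thread_group_local float slm_buf[TILE];

        // serial code: local variable computed from the loop index
        int base = tg;

        // assignment to __thread_group_local data
        slm_buf[0] = 0.0f;

        // second-level parallel loop nest: cache a tile of data in SLM
        _Cilk_for (int i = 0; i < TILE; i++)
            slm_buf[i] = data[base + i];

        // call to the thread barrier intrinsic: wait until the copy is done
        _gfx_gpgpu_thread_barrier();

        // another second-level parallel loop nest using the cached tile
        _Cilk_for (int i = 0; i < TILE; i++)
            data[base + i] = slm_buf[i] * 2.0f;
    }
}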
Serial code is any code inside a first-level parallel loop nest that is not syntactically included in any second-level parallel loop nest. For example, lines 4-5 in the following kernel:
01 __declspec(target(gfx_kernel)) void slm_enabled_kernel(int *data, int param) {
02     _Cilk_for _Thread_group (...) {
03         ...
04         int lo = param/2; //serial code
05         int up = param*2; //serial code
06
07         _Cilk_for (int j = lo; j < up; j++) {
08             ...
09         }
10     }
11 }
Serial code, of the kind described, is executed by the master thread of each thread group. When a parallel construct is encountered, such as the nested _Cilk_for loop following the serial code, the master thread splits execution among the other threads in the group.
Here is an excerpt from matrix multiplication code that uses SLM and illustrates the serial code requirement:
01 _Cilk_for _Thread_group (int tg_y = 0; tg_y < Y; tg_y += SLM_TILE_Y) {
02     _Cilk_for _Thread_group (int tg_x = 0; tg_x < X; tg_x += SLM_TILE_X) {
03         // declare "supertiles" of each matrix to be allocated in SLM
04         __thread_group_local float slm_atile[SLM_TILE_Y][SLM_TILE_K];
05         __thread_group_local float slm_btile[SLM_TILE_K][SLM_TILE_X];
06         __thread_group_local float slm_ctile[SLM_TILE_Y][SLM_TILE_X];
07
08         // initialize the result supertile (in parallel)
09         _Cilk_for (int i = 0; i < SLM_TILE_Y; i++)
10             _Cilk_for (int j = 0; j < SLM_TILE_X; j++)
11                 slm_ctile[i][j] = 0.0;
12
13         // calculate the dot product of current A's supertile row and
14         // B's supertile column:
15         for (int super_k = 0; super_k < K; super_k += SLM_TILE_K) {
16             // Parallel execution
17             // cache A's and B's "supertiles" in SLM (in parallel)
18             slm_atile[:][:] = A[tg_y:SLM_TILE_Y][super_k:SLM_TILE_K];
19             slm_btile[:][:] = B[super_k:SLM_TILE_K][tg_x:SLM_TILE_X];
20
21             // all threads wait till the copy is done
22             _gfx_gpgpu_thread_barrier();
23
24             // parallel execution
25             // now multiply the supertiles as usual matrices using the tiled
26             // matrix multiplication algorithm (in parallel)
27             _Cilk_for (int t_y = 0; t_y < SLM_TILE_Y; t_y += TILE_Y) {
28                 _Cilk_for (int t_x = 0; t_x < SLM_TILE_X; t_x += TILE_X) {
Lines 1-2 are the first-level parallel loop nest, the for loop at line 15 is the serial code, and lines 18-19 and 27-28 are second-level parallel constructs. The serial code is a loop over supertiles that calculates their dot product. This calculation is done by every thread group, and this loop is not parallelized among the threads within a thread group.
Every thread in a group executes the same serial code. The code is not allowed to produce results that differ between threads within the same thread group and that are visible outside the current thread. The serial code restrictions are the following (a sketch follows this list):
Only local variables and formal parameters (for async kernels) or #pragma offload parameters (for offload blocks) can be accessed; so, for example, access to static variables or thread group local variables is not allowed
You can offload only perfect loop nests to the processor graphics. This is also true for two-level parallelism, where the first-level nest must be perfect. This implies that the local variables mentioned above are those declared inside the first-level nest.
Function calls are not allowed.
Memory updates, such as those through pointer parameter dereference, are not allowed.
Local variables used in second-level parallel nests but defined outside the second-level parallel loops are treated as firstprivate. If such a variable is live after the loop nest, that is, its value is used after the nest, then no updates of the variable are allowed within the loop nest.
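Here is a minimal, hypothetical sketch of these restrictions. The kernel name, the N constant, and the commented-out helper function are illustrative assumptions only:

#define N 128   // placeholder tile size

static int g_count;   // a static variable, shown only to illustrate the restriction below

__declspec(target(gfx_kernel))
void serial_code_rules(float *data, int len, int scale) {
    _Cilk_for _Thread_group (int tg = 0; tg < len; tg += N) {
        __thread_group_local float slm[N];

        int lo = tg;            // OK: local variable computed from the loop index
        int hi = lo + N;        // OK: serial code may read locals and formal parameters
        // g_count++;           // not allowed: access to a static variable
        // data[tg] = 0.0f;     // not allowed: memory update through a pointer parameter
        // lo = helper(tg);     // not allowed: function call in serial code

        // lo and hi are firstprivate in the second-level nests below,
        // so they are not updated inside the nests
        _Cilk_for (int i = lo; i < hi; i++)
            slm[i - lo] = data[i] * scale;

        _gfx_gpgpu_thread_barrier();

        _Cilk_for (int i = lo; i < hi; i++)
            data[i] = slm[i - lo];
    }
}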
Second-level parallel loop nests are subject to the following restrictions (a sketch follows this list):
A second-level parallel loop nest can be a perfect _Cilk_for loop nest
The loops within the nest must be perfectly nested
Second-level parallel loop nests must be textually nested within the first-level nest. They cannot reside in a function called from the first-level nest.
At least one second-level parallel loop nest must be present.
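The following hypothetical sketch contrasts allowed and disallowed second-level nests; the kernel name, BLOCK constant, and the commented-out init_block function are placeholders, not part of any API:

#define BLOCK 16   // placeholder block size

__declspec(target(gfx_kernel))
void second_level_rules(float *out, int rows, int cols) {
    _Cilk_for _Thread_group (int rb = 0; rb < rows; rb += BLOCK) {
        // OK: a perfect _Cilk_for nest, textually inside the first-level nest
        _Cilk_for (int i = 0; i < BLOCK; i++)
            _Cilk_for (int j = 0; j < cols; j++)
                out[(rb + i) * cols + j] = 0.0f;

        // Not allowed: an imperfect nest (a statement between the two loops)
        // _Cilk_for (int i = 0; i < BLOCK; i++) {
        //     float bias = i * 0.5f;                  // breaks perfect nesting
        //     _Cilk_for (int j = 0; j < cols; j++)
        //         out[(rb + i) * cols + j] = bias;
        // }

        // Not allowed: the parallel loop may not reside in a called function
        // init_block(out, rb, cols);
    }
}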
The following syntax restrictions and semantics apply to thread group local data:
You must declare the variables locally, immediately nested in the first-level parallel loop nest
__thread_group_local is always mapped to SLM, so the total size of the data cannot exceed the available SLM
Lifetime:
__thread_group_local data is allocated upon the start of a thread group, immediately before any of the group's threads start execution, and deallocated upon thread group end, immediately after the last thread finishes execution.
Initializers are allowed; their values are assigned to the SLM data (a sketch follows this list)
Initializers are executed by the master thread only.
Without an initializer, the initial value is undefined.
Variables that can be declared __thread_group_local are limited to scalars, arrays of scalars, and PODs (plain old data structures).
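Here is a minimal sketch of the declaration, initializer, and lifetime rules above. The kernel name, the N constant, and the conservative barrier before the reads are assumptions made for this sketch:

#define N 64   // placeholder tile size

__declspec(target(gfx_kernel))
void slm_lifetime_sketch(float *out, int len) {
    _Cilk_for _Thread_group (int tg = 0; tg < len; tg += N) {
        // allocated when the thread group starts; deallocated when the group ends
        __thread_group_local float weights[N] = { 1.0f };  // initializer is executed by the master thread
        __thread_group_local float scratch[N];             // no initializer: contents are undefined

        _Cilk_for (int i = 0; i < N; i++)
            scratch[i] = out[tg + i];

        // conservative barrier so every thread sees the initialized and cached SLM data (assumption)
        _gfx_gpgpu_thread_barrier();

        _Cilk_for (int i = 0; i < N; i++)
            out[tg + i] = scratch[i] * weights[0];
    }
}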
The following table shows examples of valid and invalid uses of __thread_group_local:

| Kind | Example | Restrictions |
|---|---|---|
| Variable declaration | _Cilk_for _Thread_group (...) { _Cilk_for _Thread_group (...) { __thread_group_local int slm_data[N][M]; ... } } | OK. Valid SLM data declaration. |
| Variable declaration | __thread_group_local int slm_data[N][M]; _Cilk_for _Thread_group (...) { _Cilk_for _Thread_group (...) { ... } } | Invalid SLM data declaration. Must be nested within the thread group _Cilk_for nest. |
| Pointer declaration | _Cilk_for _Thread_group (...) { float * __thread_group_local p; ... } | OK. Declaration of a pointer allocated in SLM. |
| Pointer declaration | float __thread_group_local *p; | OK. Declaration of a pointer to SLM-allocated data. |
| Class object | class c1 { int i1; c1() {i1 = 10;} }; ... __thread_group_local c1 obj; | Invalid; may be supported in the future with a language extension. |
| Return type | float __thread_group_local * bar(...) | OK. The return type is a pointer to SLM data. |
| Return type | float __thread_group_local bar(...) | OK. Type qualifiers on an rvalue do not make sense; they are allowed but ignored. |
| Structure field | struct s1 { __thread_group_local float *fld; ... }; | Declaration is OK. Not usable in some contexts. |
| Structure field | struct s1 { ... __thread_group_local float fld1; }; | Not allowed. The entire variable must be __thread_group_local. |
| Structure | struct __thread_group_local s1 { float tgl[N][M]; }; | OK. SLM data is generally expected to be arrays, but any data is allowed. |
| Parallel loops | __declspec(target(gfx)) void foo(...) { _Cilk_for (...) {...} } ... #pragma offload target(gfx) _Cilk_for _Thread_group(...) { ... foo(...); } | Pragmatically this is diagnosed as an error if foo is not inlined, but it is OK from a language perspective. |