Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
First-level parallel loop nests, those marked with _Thread_group, may appear as the parallel loop of functions declared with the attribute __declspec(target(gfx_kernel)), or within #pragma offload target(gfx) blocks.
Only the following constructs are allowed inside the first-level parallel loop nest (a sketch of these constructs follows below):
SLM data declaration with optional initializer
Assignments to __thread_group_local data
Second-level parallel loop nests
Calls to the thread barrier intrinsic
Serial code (see the dedicated section below for the definition of serial code and the associated restrictions)
The chunk size is guaranteed to be 1 for all dimensions, so each thread group executes exactly one iteration of the first-level loop nest.
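For illustration, here is a minimal, hypothetical kernel skeleton that uses each of the allowed constructs. The kernel name, the TILE constant, and the computation itself are placeholders chosen for this sketch, not part of any API:

#define TILE 256   // placeholder compile-time tile size

__declspec(target(gfx_kernel))
void allowed_constructs_sketch(float *data, int n) {
    _Cilk_for _Thread_group (int tg = 0; tg < n; tg += TILE) {
        // SLM data declaration (an initializer is optional)
        __thread_group_local float slm_buf[TILE];

        // serial code: local variable computed from the loop index
        int base = tg;

        // assignment to __thread_group_local data
        slm_buf[0] = 0.0f;

        // second-level parallel loop nest: cache a tile of data in SLM
        _Cilk_for (int i = 0; i < TILE; i++)
            slm_buf[i] = data[base + i];

        // call to the thread barrier intrinsic: wait until the copy is done
        _gfx_gpgpu_thread_barrier();

        // another second-level parallel loop nest using the cached tile
        _Cilk_for (int i = 0; i < TILE; i++)
            data[base + i] = slm_buf[i] * 2.0f;
    }
}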
Serial code is any code inside a first-level parallel loop nest that is not syntactically included in any second-level parallel loop nest. For example, lines 4-5 in the following kernel:
01 __declspec(target(gfx_kernel)) void slm_enabled_kernel(int *data, int param) {
02     _Cilk_for _Thread_group (...) {
03         ...
04         int lo = param/2; //serial code
05         int up = param*2; //serial code
06
07         _Cilk_for (int j = lo; j < up; j++) {
08             ...
09         }
10     }
11 }
Serial code, of the kind described, is executed by the master thread of each thread group. When a parallel construct is encountered, such as the nested _Cilk_for loop following the serial code, the master thread splits execution among the other threads in the group.
Here is an excerpt from matrix multiplication code that uses SLM and illustrates the serial code requirement:
01 _Cilk_for _Thread_group (int tg_y = 0; tg_y < Y; tg_y += SLM_TILE_Y) {
02     _Cilk_for _Thread_group (int tg_x = 0; tg_x < X; tg_x += SLM_TILE_X) {
03         // declare "supertiles" of each matrix to be allocated in SLM
04         __thread_group_local float slm_atile[SLM_TILE_Y][SLM_TILE_K];
05         __thread_group_local float slm_btile[SLM_TILE_K][SLM_TILE_X];
06         __thread_group_local float slm_ctile[SLM_TILE_Y][SLM_TILE_X];
07
08         // initialize the result supertile (in parallel)
09         _Cilk_for (int i = 0; i < SLM_TILE_Y; i++)
10             _Cilk_for (int j = 0; j < SLM_TILE_X; j++)
11                 slm_ctile[i][j] = 0.0;
12
13         // calculate the dot product of current A's supertile row and
14         // B's supertile column:
15         for (int super_k = 0; super_k < K; super_k += SLM_TILE_K) {
16             // Parallel execution
17             // cache A's and B's "supertiles" in SLM (in parallel)
18             slm_atile[:][:] = A[tg_y:SLM_TILE_Y][super_k:SLM_TILE_K];
19             slm_btile[:][:] = B[super_k:SLM_TILE_K][tg_x:SLM_TILE_X];
20
21             // all threads wait till the copy is done
22             _gfx_gpgpu_thread_barrier();
23
24             // parallel execution
25             // now multiply the supertiles as usual matrices using the tiled
26             // matrix multiplication algorithm (in parallel)
27             _Cilk_for (int t_y = 0; t_y < SLM_TILE_Y; t_y += TILE_Y) {
28                 _Cilk_for (int t_x = 0; t_x < SLM_TILE_X; t_x += TILE_X) {
Lines 1-2 are the first-level parallel loop nest, the for loop at line 15 is the serial code, and lines 18-19 and 27-28 are second-level parallel constructs. The serial code is a loop over supertiles that calculates their dot product. This calculation is done by every thread group, and this loop is not parallelized among the threads within a thread group.
Every thread in a group executes the same serial code. The code is not allowed to produce results that differ between threads within the same thread group and that are visible outside the current thread. The serial code restrictions are the following (a sketch follows this list):
Only local variables and formal parameters (for async kernels) or #pragma offload parameters (for offload blocks) can be accessed; so, for example, access to static variables or thread group local variables is not allowed
You can offload only perfect loop nests to the processor graphics. This is also true for two-level parallelism, where the first-level nest must be perfect. This implies that the local variables mentioned above are those declared inside the first-level nest.
Function calls are not allowed.
Memory updates, such as those through pointer parameter dereference, are not allowed.
Local variables used in second-level parallel nests but defined outside the second-level parallel loops are treated as firstprivate. If such a variable is live after the loop nest, that is, its value is used after the nest, then no updates of the variable are allowed within the loop nest.
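Here is a minimal, hypothetical sketch of these restrictions. The kernel name, the N constant, and the commented-out helper function are illustrative assumptions only:

#define N 128   // placeholder tile size

static int g_count;   // a static variable, shown only to illustrate the restriction below

__declspec(target(gfx_kernel))
void serial_code_rules(float *data, int len, int scale) {
    _Cilk_for _Thread_group (int tg = 0; tg < len; tg += N) {
        __thread_group_local float slm[N];

        int lo = tg;            // OK: local variable computed from the loop index
        int hi = lo + N;        // OK: serial code may read locals and formal parameters
        // g_count++;           // not allowed: access to a static variable
        // data[tg] = 0.0f;     // not allowed: memory update through a pointer parameter
        // lo = helper(tg);     // not allowed: function call in serial code

        // lo and hi are firstprivate in the second-level nests below,
        // so they are not updated inside the nests
        _Cilk_for (int i = lo; i < hi; i++)
            slm[i - lo] = data[i] * scale;

        _gfx_gpgpu_thread_barrier();

        _Cilk_for (int i = lo; i < hi; i++)
            data[i] = slm[i - lo];
    }
}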
Second-level parallel loop nests are subject to the following restrictions (a sketch follows this list):
A second-level parallel loop nest can be a perfect _Cilk_for loop nest
The loops within the nest must be perfectly nested
Second-level parallel loop nests must be textually nested within the first-level nest. They cannot reside in a function called from the first-level nest.
At least one second-level parallel loop nest must be present.
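The following hypothetical sketch contrasts allowed and disallowed second-level nests; the kernel name, BLOCK constant, and the commented-out init_block function are placeholders, not part of any API:

#define BLOCK 16   // placeholder block size

__declspec(target(gfx_kernel))
void second_level_rules(float *out, int rows, int cols) {
    _Cilk_for _Thread_group (int rb = 0; rb < rows; rb += BLOCK) {
        // OK: a perfect _Cilk_for nest, textually inside the first-level nest
        _Cilk_for (int i = 0; i < BLOCK; i++)
            _Cilk_for (int j = 0; j < cols; j++)
                out[(rb + i) * cols + j] = 0.0f;

        // Not allowed: an imperfect nest (a statement between the two loops)
        // _Cilk_for (int i = 0; i < BLOCK; i++) {
        //     float bias = i * 0.5f;                  // breaks perfect nesting
        //     _Cilk_for (int j = 0; j < cols; j++)
        //         out[(rb + i) * cols + j] = bias;
        // }

        // Not allowed: the parallel loop may not reside in a called function
        // init_block(out, rb, cols);
    }
}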
The following syntax restrictions and semantics apply to thread group local data:
You must declare the variables locally, immediately nested in the first-level parallel loop nest
__thread_group_local is always mapped to SLM, so the total size of the data cannot exceed the available SLM
Lifetime:
__thread_group_local data is allocated upon the start of a thread group, immediately before any of the group's threads start execution, and deallocated upon thread group end, immediately after the last thread finishes execution.
Initializers are allowed; their values are assigned to the SLM data (a sketch follows this list)
Initializers are executed by the master thread only.
Without an initializer, the initial value is undefined.
Variables that can be declared __thread_group_local are limited to scalars, arrays of scalars, and PODs (plain old data structures).
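Here is a minimal sketch of the declaration, initializer, and lifetime rules above. The kernel name, the N constant, and the conservative barrier before the reads are assumptions made for this sketch:

#define N 64   // placeholder tile size

__declspec(target(gfx_kernel))
void slm_lifetime_sketch(float *out, int len) {
    _Cilk_for _Thread_group (int tg = 0; tg < len; tg += N) {
        // allocated when the thread group starts; deallocated when the group ends
        __thread_group_local float weights[N] = { 1.0f };  // initializer is executed by the master thread
        __thread_group_local float scratch[N];             // no initializer: contents are undefined

        _Cilk_for (int i = 0; i < N; i++)
            scratch[i] = out[tg + i];

        // conservative barrier so every thread sees the initialized and cached SLM data (assumption)
        _gfx_gpgpu_thread_barrier();

        _Cilk_for (int i = 0; i < N; i++)
            out[tg + i] = scratch[i] * weights[0];
    }
}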
The following table shows examples of valid and invalid uses of __thread_group_local:

| Kind | Example | Restrictions |
|---|---|---|
| Variable declaration | _Cilk_for _Thread_group (...) { _Cilk_for _Thread_group (...) { __thread_group_local int slm_data[N][M]; ... } } | OK. Valid SLM data declaration. |
| Variable declaration | __thread_group_local int slm_data[N][M]; _Cilk_for _Thread_group (...) { _Cilk_for _Thread_group (...) { ... } } | Invalid SLM data declaration. Must be nested within the thread group _Cilk_for nest. |
| Pointer declaration | _Cilk_for _Thread_group (...) { float * __thread_group_local p; ... } | OK. Declaration of a pointer allocated in SLM. |
| Pointer declaration | float __thread_group_local *p; | OK. Declaration of a pointer to SLM-allocated data. |
| Class object | class c1 { int i1; c1() {i1 = 10;} }; ... __thread_group_local c1 obj; | Invalid; may be supported in the future with a language extension. |
| Return type | float __thread_group_local * bar(...) | OK. The return type is a pointer to SLM data. |
| Return type | float __thread_group_local bar(...) | OK. Type qualifiers on an rvalue do not make sense; they are allowed but ignored. |
| Structure field | struct s1 { __thread_group_local float *fld; ... }; | Declaration is OK. Not usable in some contexts. |
| Structure field | struct s1 { ... __thread_group_local float fld1; }; | Not allowed. The entire variable must be __thread_group_local. |
| Structure | struct __thread_group_local s1 { float tgl[N][M]; }; | OK. SLM data is generally expected to be arrays, but any data is allowed. |
| Parallel loops | __declspec(target(gfx)) void foo(...) { _Cilk_for (...) {...} } ... #pragma offload target(gfx) _Cilk_for _Thread_group(...) { ... foo(...); } | Pragmatically this is diagnosed as an error if foo is not inlined, but it is OK from a language perspective. |