Intel® C++ Compiler 16.0 User and Reference Guide

Programming Restrictions for Shared Local Memory

This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.

First-level Parallel Loop Nests

First-level parallel loop nests, those marked with _Thread_group, may appear as the parallel loop of functions declared with the attribute __declspec(target(gfx_kernel)).

Only the following constructs are allowed inside the first-level parallel loop nest:

- Serial code
- Second-level parallel loop nests
- Thread group local data declarations

Each of these constructs is described in the sections below. Chunk size is guaranteed to be 1 for all dimensions, so each thread group executes exactly one iteration of the first-level loop nest, as the sketch below illustrates.
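
Here is a minimal sketch of such a kernel (the function zero_rows and its parameters are hypothetical): one thread group is created per iteration of the first-level nest, and each group executes exactly one iteration.

__declspec(target(gfx_kernel)) void zero_rows(float *out, int rows, int cols) {
    // First-level parallel loop nest: chunk size is 1, so each
    // thread group executes exactly one "row" iteration.
    _Cilk_for _Thread_group (int row = 0; row < rows; row++) {
        // Second-level parallel loop nest: the columns of the row
        // are divided among the threads of the thread group.
        _Cilk_for (int col = 0; col < cols; col++) {
            out[row * cols + col] = 0.0f;
        }
    }
}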

Serial Code

Serial code is any code inside a first-level parallel loop nest that is not syntactically included in any second-level parallel loop nest. For example, lines 4-5:

01 __declspec(target(gfx_kernel)) void slm_enabled_kernel(int *data, int param) {
02   _Cilk_for _Thread_group (...) {
03     ...
04     int lo = param/2; //serial code
05     int up = param*2; //serial code
06 
07     _Cilk_for (int j = lo; j < up; j++) {
08       ...
09     }
10   }
11 }

Serial code, of the kind described, is executed by the master thread of each thread group. When a parallel construct is encountered, such as the nested _Cilk_for loop after the serial code, the master thread splits execution among the other threads in the group.

Here is an excerpt from matrix multiplication code that uses SLM and illustrates the serial code requirement:

01  _Cilk_for _Thread_group (int tg_y = 0; tg_y < Y; tg_y += SLM_TILE_Y) {     
02      _Cilk_for _Thread_group (int tg_x = 0; tg_x < X; tg_x += SLM_TILE_X) {
03          // declare "supertiles" of each matrix to be allocated in SLM
04          __thread_group_local float slm_atile[SLM_TILE_Y][SLM_TILE_K];
05          __thread_group_local float slm_btile[SLM_TILE_K][SLM_TILE_X];
06          __thread_group_local float slm_ctile[SLM_TILE_Y][SLM_TILE_X];
07  
08          // initialize the result supertile (in parallel)
09          _Cilk_for (int i = 0; i < SLM_TILE_Y; i++)
10           _Cilk_for (int j = 0; j < SLM_TILE_X; j++)
11            slm_ctile[i][j] = 0.0;
12  
13          // calculate the dot product of current A's supertile row and
14          // B's supertile column:
15          for (int super_k = 0; super_k < K; super_k += SLM_TILE_K) {
16              // Parallel execution 
17              // cache A's and B's "supertiles" in SLM (in parallel)
18              slm_atile[:][:] = A[tg_y:SLM_TILE_Y][super_k:SLM_TILE_K];
19              slm_btile[:][:] = B[super_k:SLM_TILE_K][tg_x:SLM_TILE_X];
20              
21              // all threads wait till copy is done
22              _gfx_gpgpu_thread_barrier();
23              
24              // parallel execution     
25              // now multiply the supertiles as usual matrices using tiled
26              // matrix multiplication algorithm (in parallel)
27              _Cilk_for (int t_y = 0; t_y < SLM_TILE_Y; t_y += TILE_Y) {
28                  _Cilk_for (int t_x = 0; t_x < SLM_TILE_X; t_x += TILE_X) {

Lines 1-2 are the first-level parallel loop nest, line 15 is the serial code, and lines 18-19 and lines 27-28 execute in parallel: lines 27-28 open a second-level parallel loop nest (as do lines 9-10), and the array-section copies on lines 18-19 are likewise distributed across the thread group. The serial code is the loop over supertiles that computes their dot product. Every thread group performs this calculation, and the loop is not parallelized among the threads within a thread group.
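
The excerpt stops at the top of the second-level multiply nest. For completeness, here is a sketch of one possible remainder of the loop body; the result matrix C and the closing of the braces are assumptions based on the usual tiled multiplication pattern shown above:

                // multiply one TILE_Y x TILE_X tile of the supertiles
                for (int k = 0; k < SLM_TILE_K; k++)
                    for (int y = t_y; y < t_y + TILE_Y; y++)
                        for (int x = t_x; x < t_x + TILE_X; x++)
                            slm_ctile[y][x] +=
                                slm_atile[y][k] * slm_btile[k][x];
            }   // end second-level _Cilk_for (t_x)
        }       // end second-level _Cilk_for (t_y)

        // all threads wait till the supertile product is accumulated
        _gfx_gpgpu_thread_barrier();
    }           // end serial loop over super_k

    // write the result supertile back to memory (in parallel)
    C[tg_y:SLM_TILE_Y][tg_x:SLM_TILE_X] = slm_ctile[:][:];
    }           // end first-level _Cilk_for _Thread_group (tg_x)
}               // end first-level _Cilk_for _Thread_group (tg_y)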

Every thread in a group executes the same serial code. Consequently, serial code is restricted: it must not produce results that differ between threads within the same thread group when those results can be visible outside the current thread.
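
For example (a hypothetical sketch; groups, chunk, counters, and data are illustrative names), an externally visible update in serial code is safe only when every thread of the group produces the identical effect:

_Cilk_for _Thread_group (int g = 0; g < groups; g++) {
    int lo = g * chunk;    // OK: identical in every thread of the group
    int up = lo + chunk;   // OK: identical in every thread of the group

    // NOT OK: this statement executes once per thread, not once per
    // group, so the externally visible result depends on the number
    // of threads in the group:
    // counters[g] += 1;

    _Cilk_for (int i = lo; i < up; i++)
        data[i] = 0;
}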

Second-level Parallel Loop Nests

Second-level parallel loop nests are _Cilk_for loop nests, without the _Thread_group modifier, nested inside a first-level parallel loop nest. Their iterations are distributed among the threads of a thread group.
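
A second-level parallel loop nest typically cooperates with thread group local data; here is a minimal sketch (the names in, out, G, and N are hypothetical, with N a compile-time constant):

_Cilk_for _Thread_group (int g = 0; g < G; g++) {
    __thread_group_local float tile[N];

    // second-level parallel loop nest: iterations are divided
    // among the threads of the current thread group
    _Cilk_for (int i = 0; i < N; i++)
        tile[i] = in[g * N + i];

    // make the SLM tile visible to all threads before it is read
    _gfx_gpgpu_thread_barrier();

    _Cilk_for (int i = 0; i < N; i++)
        out[g * N + i] = tile[i] + tile[N - 1 - i];
}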

Thread Group Local Data

The following syntax restrictions and semantics apply to thread group local data:

Examples

Kind: Variable declaration
Example:
    _Cilk_for _Thread_group (...) {
      _Cilk_for _Thread_group (...) {
        __thread_group_local int slm_data[N][M];
        ...
      }
    }
Restrictions: OK. Valid SLM data declaration.

Kind: Variable declaration
Example:
    __thread_group_local int slm_data[N][M];
    _Cilk_for _Thread_group (...) {
      _Cilk_for _Thread_group (...) {
        ...
      }
    }
Restrictions: Invalid SLM data declaration. The declaration must be nested within the thread group _Cilk_for nest.

Kind: Pointer declaration
Example:
    _Cilk_for _Thread_group (...) {
      float * __thread_group_local p;
      ...
    }
Restrictions: OK. Declaration of a pointer allocated in SLM.

Kind: Pointer declaration
Example:
    float __thread_group_local *p;
Restrictions: OK. Declaration of a pointer to SLM-allocated data.

Kind: Class object
Example:
    class c1 {
        int i1;
        c1() {i1 = 10;}
    };
    ...
    __thread_group_local c1 obj;
Restrictions: Invalid; may be supported in the future with a language extension.

Kind: Return type
Example:
    float __thread_group_local * bar(...)
Restrictions: OK. The return type is a pointer to SLM data.

Kind: Return type
Example:
    float __thread_group_local bar(...)
Restrictions: OK. Type qualifiers on an rvalue do not make sense; they are allowed but ignored.

Kind: Structure field
Example:
    struct s1 {
        __thread_group_local float *fld;
        ...
    };
Restrictions: The declaration is OK, but it is not usable in some contexts.

Kind: Structure field
Example:
    struct s1 {
        ...
        __thread_group_local float fld1;
    };
Restrictions: Not allowed. The entire variable must be __thread_group_local.

Kind: Structure
Example:
    struct __thread_group_local s1 {
        float tgl[N][M];
    };
Restrictions: OK. SLM data is generally expected to be arrays, but any data type is allowed.

Kind: Parallel loops
Example:
    __declspec(target(gfx)) void foo(...) {
        _Cilk_for (...) {...}
    }
    ...
    #pragma offload target(gfx)
    _Cilk_for _Thread_group(...) {
        ...
        foo(...);
    }
Restrictions: In practice, this is diagnosed as an error if foo is not inlined, but it is OK from a language perspective.
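
Putting several of the valid cases above together, here is a minimal sketch of SLM data declaration and use (GY, GX, and TILE are illustrative compile-time constants):

_Cilk_for _Thread_group (int gy = 0; gy < GY; gy++) {
    _Cilk_for _Thread_group (int gx = 0; gx < GX; gx++) {
        // valid: SLM array declared inside the thread-group nest
        __thread_group_local float slm_buf[TILE];

        // valid: pointer to SLM-allocated data
        float __thread_group_local *p = slm_buf;

        // second-level parallel loop nest initializes the SLM data
        _Cilk_for (int i = 0; i < TILE; i++)
            p[i] = 0.0f;

        // all threads wait until initialization is complete
        _gfx_gpgpu_thread_barrier();
    }
}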