Intel® C++ Compiler 16.0 User and Reference Guide

Shared Local Memory Programming Extensions for Processor Graphics

Two-Level Parallelism

With two-level loop nest parallelism, the parallel loop nest that constitutes the offloaded kernel can textually encompass a number of other parallel loop nests.

The threads executing the kernel are also structured in two levels. The entire multi-dimensional thread space is divided into thread groups of equal size and shape, where each thread group spans a contiguous rectangular chunk of the thread space and is further divided into individual threads.

Each thread group has a master thread.

The compiler and offload runtime map iterations of the first-level parallel nest and the second-level parallel nests to the thread space created at runtime to execute the loop nests in parallel. Parallelization occurs as follows:

  1. Iterations of the first-level nest are distributed among thread groups, with each group executing exactly one iteration. When a group executes a first-level nest iteration, every thread of the group executes the code within the nest in parallel. This code must include one or more second-level parallel loop nests; any code in between the second-level parallel loop nests is called serial code.

  2. The compiler transforms the second-level nests so that their iterations are distributed among the threads of a group and executed in parallel. The serial code is not transformed and is executed as-is by every thread of a group. Constraints on the serial code, described later, ensure that every thread has the same program state when executing a serial interval, so all local variables have the same value in all threads.

A thread barrier pauses execution of all threads in a group until all threads in that group reach the barrier.

There is no implicit thread barrier at the end of a second-level nest, and you must explicitly insert a thread barrier if program logic requires it. The function _gfx_gpgpu_thread_barrier inserts a thread barrier.
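
For illustration, the following sketch shows one place where an explicit barrier is needed. The function and variable names (scale_and_blend, a, b, groups, n), the pin clauses, and the loop bodies are hypothetical assumptions, and the _Cilk_for _Thread_group syntax used here is described in the Syntax section below.

// Sketch only: the barrier intrinsic is declared by the compiler's gfx
// offload support headers (not shown here).
void scale_and_blend(float *a, float *b, int groups, int n) {
#pragma offload target(gfx) pin(a, b : length(groups * n))
    _Cilk_for _Thread_group (int g = 0; g < groups; g++) {
        // First second-level nest: its iterations are distributed among
        // the threads of the group executing first-level iteration g.
        _Cilk_for (int i = 0; i < n; i++) {
            a[g * n + i] += 1.0f;
        }

        // No implicit barrier follows the nest above, so synchronize
        // before any thread reads elements written by other threads.
        _gfx_gpgpu_thread_barrier();

        // Second second-level nest: reading a neighboring element that
        // another thread updated is now safe.
        _Cilk_for (int i = 0; i < n; i++) {
            b[g * n + i] = a[g * n + i] + a[g * n + (i + 1) % n];
        }
    }
}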

Thread Group Local Data

A special kind of variable is available: the thread group local (TGL) variable. Each thread group has its own instance of a TGL variable, allocated before the thread group starts and destroyed after the thread group finishes. TGL variables are shared across the threads of a group, so updates to a TGL variable are immediately visible to the other threads. A data race is therefore possible, for example when two threads access the same element of a TGL array. Usually, however, threads access TGL data in phases, where in each phase every thread accesses its own portion of the TGL data and does not conflict (race) with other threads. Thread barriers between phases are a cheap way to avoid races in this scenario.
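
The sketch below illustrates this phased pattern under assumed names and a fixed block size of 256 (the __thread_group_local qualifier and the _Thread_group keyword are described under Syntax below): each thread first writes only its own elements of a TGL array, and after a barrier reads elements that other threads wrote.

// Hypothetical sketch: each group reverses its own 256-element block of
// 'data' in place, staging the block through a TGL array.
void reverse_blocks(float *data, int groups) {
#pragma offload target(gfx) pin(data : length(groups * 256))
    _Cilk_for _Thread_group (int g = 0; g < groups; g++) {
        // One instance of 'stage' per thread group, shared by its threads.
        __thread_group_local float stage[256];

        // Phase 1: each thread writes only its own elements -- no race.
        _Cilk_for (int i = 0; i < 256; i++) {
            stage[i] = data[g * 256 + i];
        }

        // Barrier between phases: phase 2 reads elements that other
        // threads wrote in phase 1.
        _gfx_gpgpu_thread_barrier();

        // Phase 2: each thread still writes its own output element but
        // reads a TGL element produced by a different thread.
        _Cilk_for (int i = 0; i < 256; i++) {
            data[g * 256 + i] = stage[255 - i];
        }
    }
}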

Example

With this extension, the general informal definition of what can be offloaded to the processor graphics is a first-level parallel loop nest containing a sequence of TGL data declarations, second-level parallel loop nests, thread barriers and serial code in any order. For example:

<first-level parallel loop nest> {
  <slm data declaration>
  <second-level parallel loop nest>
  <slm data declaration>
  <serial code>
  <barrier>
  <second-level parallel loop nest>
}
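
One possible concrete instance of this skeleton is sketched below; all names, sizes, and pragma clauses are hypothetical, and the components appear in a different (but still valid) order. The plain for loop is the serial code, which every thread of the group executes with identical state.

// Hypothetical sketch: each group centers its own 256-element block of
// 'in' around that block's mean.
void center_blocks(const float *in, float *out, int groups) {
#pragma offload target(gfx) pin(in, out : length(groups * 256))
    _Cilk_for _Thread_group (int g = 0; g < groups; g++) {
        __thread_group_local float block[256];        // TGL data declaration

        _Cilk_for (int i = 0; i < 256; i++) {         // second-level nest
            block[i] = in[g * 256 + i];
        }

        _gfx_gpgpu_thread_barrier();                  // barrier

        // Serial code: every thread of the group runs it with the same
        // state, so every thread computes the same 'mean'.
        float mean = 0.0f;
        for (int i = 0; i < 256; i++) {
            mean += block[i];
        }
        mean /= 256.0f;

        _Cilk_for (int i = 0; i < 256; i++) {         // second-level nest
            out[g * 256 + i] = block[i] - mean;
        }
    }
}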

Syntax

The following keywords are used to implement SLM:

_Thread_group

Marks loops constituting first-level parallel loop nests, whose iteration space is distributed among thread groups. Add this keyword to every loop of the first-level parallel loop nest.

Loop nests not marked with this keyword are second-level loop nests.

Example: 2D first-level loop nest and 2D thread group space:

#pragma offload target(gfx)
_Cilk_for _Thread_group (...) {
    _Cilk_for _Thread_group (...) {
        ... // loop nest body
    }
}

__thread_group_local

An address space type qualifier that designates a named address space corresponding to a memory region that is local to a thread group. You can use this qualifier to:

  • Declare a TGL variable or array. The compiler places this variable in thread group local storage and ensures that all threads in a thread group use the same location to access it, making it shared between the threads of a group. For example, the following declares the array arr as shared between threads in a group:

    __thread_group_local int arr[100];
  • Declare a pointer to data allocated in thread group local storage. The compiler might generate different code for dereferencing such pointers than for pointers in the generic address space. For example, the following declares a pointer to thread group local data (a fragment combining both uses follows this keyword description):

    float __thread_group_local *p;

This address space always maps to shared local memory.
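
As a sketch of how the two forms combine (names are hypothetical, and the enclosing offload pragma and surrounding declarations are omitted), a TGL pointer can be initialized with the address of a TGL array and dereferenced inside second-level nests:

_Cilk_for _Thread_group (int g = 0; g < groups; g++) {
    __thread_group_local float buf[64];   // TGL array, one instance per group

    // 'p' points into the group's shared local memory; dereferencing it
    // may be compiled differently from a generic-address-space pointer.
    float __thread_group_local *p = buf;

    _Cilk_for (int i = 0; i < 64; i++) {
        p[i] = 0.0f;                      // equivalent to writing buf[i]
    }
}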
