Intel® C++ Compiler 16.0 User and Reference Guide

Initiating an Offload on Intel® Graphics Technology

This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.

The code inside _Cilk_for loops or _Cilk_for loop nests that follow #pragma offload target(gfx), and in functions qualified with the target(gfx) or target(gfx_kernel) attribute, is compiled for both the target and the CPU. The target attribute is applied using __declspec(target(gfx)) (Windows* and Linux*) or __attribute__((target(gfx))) (Linux* only). A function qualified with target(gfx_kernel) also gets both host and target versions, but the target version cannot be called from an offload region. Rather, it must be passed as an argument to the asynchronous offload API, which is discussed in Asynchronous Offloading.
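For example, a function can be called from within offloaded code if it is qualified with the target attribute. The following is a minimal sketch; the function and variable names are illustrative only:

__declspec(target(gfx))                      // target-qualified, so compiled for both CPU and target
float scale(float x, float factor)
{
     return x * factor;
}

void scale_all(float *v, int n, float factor)
{
     #pragma offload target(gfx) pin(v: length(n))
     _Cilk_for (int i = 0; i < n; i++) {
          v[i] = scale(v[i], factor);        // allowed here because scale() is target-qualified
     }
}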

You can place #pragma offload target(gfx) only before a parallel loop, a perfect parallel loop nest, or an Intel® Cilk™ Plus array notation statement. The parallel loop must be expressed using a _Cilk_for loop.
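When the offloaded construct is an array notation statement, the pragma directly precedes that statement. A minimal sketch, with assumed array names and sizes:

void add_arrays(float *a, const float *b, const float *c, int n)
{
     #pragma offload target(gfx) pin(a, b, c: length(n))
     a[0:n] = b[0:n] + c[0:n];               // single array notation statement following the pragma
}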

#pragma offload can contain the following clauses when programming for Intel® Graphics Technology:

Note

Using pin substantially reduces the cost of offloading: instead of copying data to or from memory accessible by the target, the pin clause arranges for the host and the target to share the same physical memory area, which is much faster. For kernels that perform substantial work on a relatively small data set, such as O(N²) computation, this optimization is not important.

However, pinning makes the OS lock the pinned memory pages, rendering them non-swappable, so excessive pinning may degrade overall system performance.
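To illustrate the difference, the sketch below writes the same loop twice, once with copy semantics (in/out) and once with pin; the function and variable names are illustrative:

void double_elements_copy(const float *src, float *dst, int n)
{
     // n elements of src are copied to target-accessible memory before the
     // offload, and n elements of dst are copied back to the host afterwards.
     #pragma offload target(gfx) in(src: length(n)) out(dst: length(n))
     _Cilk_for (int i = 0; i < n; i++)
          dst[i] = 2.0f * src[i];
}

void double_elements_pinned(const float *src, float *dst, int n)
{
     // The pages backing src and dst are shared between the host and the
     // target; no copies are made, but the pages stay locked while pinned.
     #pragma offload target(gfx) pin(src, dst: length(n))
     _Cilk_for (int i = 0; i < n; i++)
          dst[i] = 2.0f * src[i];
}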

Although by default the compiler builds an application that runs on both the host CPU and target, you can also compile the same source code to run on just the CPU, using the negative form of the [Q]offload compiler option.
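For example, assuming the option spellings shown below for this compiler version, the same source could be built with and without offload support (the file name is illustrative):

rem Windows*: default build compiles for both the CPU and the target
icl /c sample_offload.c

rem Windows*: CPU-only build using the negative form of /Qoffload (spelling assumed)
icl /Qoffload- /c sample_offload.c

# Linux*: default build compiles for both the CPU and the target
icc -c sample_offload.c

# Linux*: CPU-only build using the negative form of -qoffload (spelling assumed)
icc -qno-offload -c sample_offload.c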

Example: Offloading to the Target

unsigned parArrayRHist[256][256],
     parArrayGHist[256][256], parArrayBHist[256][256];

#pragma offload target(gfx) if (do_offload) \
     pin(inputImage: length(imageSize)) \
     out(parArrayRHist, parArrayGHist, parArrayBHist)

     _Cilk_for (int ichunk = 0; ichunk < chunkCount; ichunk++) {
          …
     }

In the example above, the generated CPU code and the runtime do the following:

- Evaluate the if clause; when do_offload is false, the loop runs on the CPU and no offload takes place.
- Pin the imageSize elements of inputImage so that the host and the target share the same physical memory, rather than copying the image to the target.
- Run the iterations of the _Cilk_for loop on the target.
- Copy the contents of parArrayRHist, parArrayGHist, and parArrayBHist back to the host when the offload completes.

Example: Offloading Using Perfectly Nested _Cilk_for Loops

float (* A)[k] = (float (*)[k])matA;
float (* B)[n] = (float (*)[n])matB;
float (* C)[n] = (float (*)[n])matC;

#pragma offload target(gfx) if (do_offload) \
     pin(A: length(m*k)), pin(B: length(k*n)), pin(C: length(m*n))

     _Cilk_for (int r = 0; r < m; r += TILE_m) {
          _Cilk_for (int c = 0; c < n; c += TILE_n) {
               …
          }
     }

In the example above:

- The pointers matA, matB, and matC are cast to pointers to variable-length array types, so the loop body can index A, B, and C as two-dimensional arrays.
- The pin clauses share m*k, k*n, and m*n elements of A, B, and C between the host and the target instead of copying them.
- Because the two _Cilk_for loops form a perfect loop nest, the iterations of both loops are parallelized, with each offloaded iteration handling one TILE_m x TILE_n tile.

See Also