Allocating Variables and Arrays Efficiently

This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.

The Intel® Graphics Technology Register File (GRF) is a register file with flexible addressing modes, allowing for both direct register naming and indirect access of GRF sub-regions.

At compile time, the compiler tries to allocate variables and arrays that have automatic storage (function or block scope) on the GRF whenever possible.

The following conditions must be true to enable GRF allocation of a variable, including arrays:

The size of the variable is less than 3 KB and is known at compile time.
Pointers to the variable do not escape to functions that the compiler does not inline. You can still pass pointers to such local arrays to other functions and keep the GRF performance benefit, provided the compiler inlines those functions.
Pointers to different variables do not merge into a single pointer variable.

If any of these conditions is not met, the variable or array is allocated in the stack memory area.
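
As a rough illustration of these conditions, the following C sketch contrasts a local array that meets all three with a pair of arrays that does not. The names TILE, sum_tile, grf_friendly, and stack_bound are hypothetical, and the offload constructs are omitted so that the code compiles with any C compiler. Whether a given array actually lands on the GRF remains the compiler's decision.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 64   /* hypothetical size: 64 four-byte ints = 256 bytes */

/* Small helper. A local array passed here stays a GRF candidate only if
   the compiler inlines the call (the pointer-escape condition). */
static inline int sum_tile(const int *p, size_t n)
{
    int s = 0;
    assert(p != NULL);
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}

/* Pattern that is meant to satisfy all three conditions. */
int grf_friendly(int seed)
{
    int tile[TILE];                    /* small, size known at compile time */
    for (size_t i = 0; i < TILE; ++i)
        tile[i] = seed + (int)i;
    return sum_tile(tile, TILE);       /* pointer reaches only an inlinable helper */
}

/* Pattern that breaks the pointer-merging condition. */
int stack_bound(int seed, int pick_first)
{
    int a[TILE], b[TILE];
    for (size_t i = 0; i < TILE; ++i) {
        a[i] = seed + (int)i;
        b[i] = seed - (int)i;
    }
    /* Pointers to two different arrays merge into p, so a and b are
       likely to be placed in stack memory instead of the GRF. */
    const int *p = pick_first ? a : b;
    return sum_tile(p, TILE);
}
```

Both functions compute ordinary sums; the difference lies only in how the compiler is able to place the local arrays.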

GRF access is very efficient because of its low access latency and short instruction sequences. GRF-allocated arrays may be particularly useful for caching uniform data. However, keep the following in mind:

The most efficient code results from fully unrolling loops containing references to GRF arrays, so that all references to the arrays are known at compile time. The resulting code contains only named registers and no indirect access to the GRF. The target compiler might unroll some of the loops, but in most cases you need to apply #pragma unroll(N) to explicitly request unrolling of a loop.
Due to various hardware restrictions, the compiler may fail to vectorize indirect accesses to the GRF. Often, the simplest and highest-performing solution is to ensure that loops referring to GRF arrays are unrolled, which avoids indirect accesses.
Performance penalties can occur because of unaligned vector accesses, so try to make all vector accesses to GRF arrays 32-byte-aligned. The compiler tries to follow this recommendation during vectorization, but you should also check alignment yourself, especially when operating on short arrays. For example, for an Array Notation section such as intArr[i : VL] with 4-byte elements, ensure that i is divisible by 8, because eight 4-byte elements span 32 bytes.
Allocating too many arrays and other automatic variables on the GRF can push register pressure at some points in the code beyond the GRF size, which is 4 KB per thread.
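
The unrolling and alignment recommendations above can be sketched in plain C as follows. This is a minimal sketch under several assumptions: the names VL, TAPS, UNROLL_CHUNKS, UNROLL_LANES, and apply_taps are hypothetical, the Array Notation sections are replaced with ordinary loops so that the code compiles with any C compiler, and the unroll pragma is emitted only for compilers assumed to accept #pragma unroll(N).

```c
#include <assert.h>
#include <stddef.h>

#define VL   8    /* eight 4-byte ints fill one 32-byte chunk */
#define TAPS 16   /* hypothetical table size, a multiple of VL */

/* Emit #pragma unroll(N) only where it is assumed to be supported. */
#if defined(__INTEL_COMPILER) || defined(__clang__)
#define UNROLL_CHUNKS _Pragma("unroll(2)")   /* TAPS / VL iterations */
#define UNROLL_LANES  _Pragma("unroll(8)")   /* VL iterations */
#else
#define UNROLL_CHUNKS   /* other compilers: leave unrolling to the optimizer */
#define UNROLL_LANES
#endif

int apply_taps(const int in[TAPS])
{
    int taps[TAPS];   /* small fixed-size local table: a GRF candidate */
    int total = 0;

    assert(in != NULL);

    /* Both loop levels have constant trip counts, so full unrolling
       turns every taps[...] index into a compile-time constant. */
    UNROLL_CHUNKS
    for (int i = 0; i < TAPS; i += VL) {     /* i is always divisible by 8 */
        UNROLL_LANES
        for (int j = 0; j < VL; ++j)
            taps[i + j] = i + j + 1;
    }

    UNROLL_CHUNKS
    for (int i = 0; i < TAPS; i += VL) {     /* each chunk starts 32-byte-aligned
                                                relative to the array start */
        UNROLL_LANES
        for (int j = 0; j < VL; ++j)
            total += taps[i + j] * in[i + j];
    }
    return total;
}
```

Stepping the outer loops by VL keeps each chunk's starting index divisible by 8, which mirrors the intArr[i : VL] recommendation, while the constant trip counts give the compiler the chance to replace all array references with named registers.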

Although the Intel® Graphics Technology driver includes a just-in-time (JIT) compiler that supports spilling, spilling degrades performance and can also exceed the limit that the JIT compiler and the driver impose on the spill memory area. When a spill occurs, the JIT compiler emits a warning. If the limit for the spill area is exceeded, a runtime error is generated. Look for these warnings and errors when running your application, and consider the GRF size when selecting local array sizes and vector lengths.
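
One way to keep the GRF size in view while choosing array sizes and vector lengths is to check the sizes at compile time. The following C11 sketch is illustrative only: the type names, the sizes, and the choice to reserve half of the GRF as headroom for compiler temporaries are assumptions made for this example, not documented limits.

```c
#include <assert.h>   /* static_assert (C11) and assert */

#define GRF_BYTES_PER_THREAD (4 * 1024)   /* GRF size per thread, from this topic */
#define GRF_VAR_LIMIT_BYTES  (3 * 1024)   /* per-variable size limit, from this topic */

#define VL        8    /* hypothetical vector length */
#define TILE_ROWS 4    /* hypothetical tile shape */
#define TILE_COLS 16

typedef struct { float data[TILE_ROWS][TILE_COLS]; } tile_t;   /* local tile */
typedef struct { int idx[VL]; } lanes_t;                       /* per-lane indices */

/* Total bytes of the locals a hypothetical kernel would keep live:
   two tiles plus one set of lane indices. */
enum { LOCAL_BYTES = 2 * sizeof(tile_t) + sizeof(lanes_t) };

/* Each variable must stay under the per-variable limit. */
static_assert(sizeof(tile_t) < GRF_VAR_LIMIT_BYTES,
              "tile_t is too large to be a GRF candidate");

/* Assumed headroom: keep locals within half of the GRF so that
   temporaries are less likely to cause spills. */
static_assert(LOCAL_BYTES <= GRF_BYTES_PER_THREAD / 2,
              "local arrays leave too little GRF headroom");

/* Expose the budget so it can be logged or inspected in a debug build. */
unsigned grf_local_bytes(void)
{
    return (unsigned)LOCAL_BYTES;
}
```

A check like this does not replace watching for the JIT compiler's spill warnings, because actual register pressure also depends on temporaries that the compiler introduces.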

Note

By default, the compiler generates an intermediate form of code for Intel® Graphics Technology. At application runtime, the Intel® Graphics Technology driver's just-in-time (JIT) compiler produces executable code from the intermediate form.

You can also compile executable code directly using the compiler option -mgpu-arch (Linux*) or /Qgpu-arch (Windows*).
