Optimizing Iterative Offload

This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.

Some algorithms require in-order iterative execution of parallelized code. Consider the outermost for loop in the following example:

for (int k = 0; k < numNodes; k++) {
   #pragma offload target(gfx) …
   _Cilk_for (unsigned int y = 0; y < numNodes; ++y) {
      …
   }
}

The offload block is executed numNodes times, and no CPU-side execution depends on the results of any intermediate iteration of the k loop. For such patterns, the compiler and runtime can hoist loop-invariant offload logic out of the loop (here, the k loop) and enqueue multiple offload tasks without waiting for each separate offload task to complete, which substantially speeds up execution. The requirements are as follows:

The #pragma offload block should be manually or automatically inlined into the loop.
Other than incrementing the loop index, the loop should not contain any host-side computation.
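
The following is a minimal sketch of a loop that meets both requirements. It is not taken from the original example: the function name relax_all_pairs, the INF marker, the ROW_LOOP macro, and the loop body (an all-pairs shortest-path relaxation) are hypothetical and chosen only for illustration. The pin clause and the __INTEL_OFFLOAD guard are assumptions as well; verify both against the compiler documentation before relying on them. With compilers that do not support offload, the sketch falls back to a plain serial loop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: only an Intel compiler with offload enabled understands
   _Cilk_for and the gfx offload pragma; other compilers run serially. */
#if defined(__INTEL_COMPILER) && defined(__INTEL_OFFLOAD)
#define ROW_LOOP _Cilk_for
#else
#define ROW_LOOP for
#endif

/* Hypothetical "no edge" marker; small enough that INF + INF fits in an int. */
#define INF 1000000

/* Relaxes paths through each intermediate node k, in order.
   dist is a row-major numNodes x numNodes matrix, updated in place. */
void relax_all_pairs(int *dist, int numNodes)
{
    assert(dist != NULL && numNodes > 0);

    for (int k = 0; k < numNodes; k++) {
        /* Requirement 1: the offload block sits directly in the loop body
           rather than behind a function call that might not be inlined. */
#if defined(__INTEL_COMPILER) && defined(__INTEL_OFFLOAD)
        #pragma offload target(gfx) pin(dist: length(numNodes * numNodes)) /* clause is an assumption */
#endif
        ROW_LOOP (int y = 0; y < numNodes; ++y) {
            for (int x = 0; x < numNodes; ++x) {
                int via_k = dist[y * numNodes + k] + dist[k * numNodes + x];
                if (via_k < dist[y * numNodes + x])
                    dist[y * numNodes + x] = via_k;
            }
        }
        /* Requirement 2: no host-side computation here; the only host work
           per iteration is the k++ in the loop header. */
    }
}
```

Reading dist on the host inside the k loop, for example to test for early convergence, would add host-side computation that depends on an intermediate iteration, so such a loop would no longer match the pattern described above.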