Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® 64 and IA-32 architectures targeting Intel® Graphics Technology.
Some algorithms require in-order iterative execution of parallelized code. Consider the outermost for loop in the following example:
for (int k = 0; k < numNodes; k++) {
    #pragma offload target(gfx) …
    _Cilk_for (unsigned int y = 0; y < numNodes; ++y) {
        …
    }
}
The offload block is executed numNodes times, and no CPU-side execution depends on the results of any intermediate iteration of the k loop. The compiler and runtime can optimize such patterns further by hoisting loop-invariant offload logic out of the loop (the k loop in this example) and by enqueueing multiple offload tasks without waiting for each one to complete, which substantially speeds up execution. The requirements are as follows:
The #pragma offload block should be manually or automatically inlined into the loop.
Other than incrementing the loop index, the loop should not contain any host-side computation.
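As an illustration of a loop shaped to meet both requirements, here is a minimal sketch. The function name relax_all, the dist array, the path-relaxation body, and the PARALLEL_FOR macro are hypothetical and not taken from this guide. The pin clause is an assumption about how the array would be shared with the target. The macro exists only so that the sketch also builds serially with compilers that lack _Cilk_for, where the unknown offload pragma is ignored.

```cpp
#include <cassert>

// Hypothetical helper macro: use _Cilk_for with the Intel compiler and fall
// back to a plain serial for loop elsewhere.
#ifdef __INTEL_COMPILER
#define PARALLEL_FOR _Cilk_for
#else
#define PARALLEL_FOR for
#endif

// Hypothetical example: relax path lengths stored in a flattened
// numNodes x numNodes array. Iteration k reads results written by earlier
// iterations, so the k loop must run in order.
void relax_all(int* dist, int numNodes) {
    // Host-side check placed before the loop, not inside it.
    assert(dist != nullptr && numNodes > 0);

    for (int k = 0; k < numNodes; k++) {
        // The offload block is written directly in the loop body (manually
        // inlined). The pin clause is an assumption for this sketch.
        #pragma offload target(gfx) pin(dist: length(numNodes * numNodes))
        PARALLEL_FOR (int y = 0; y < numNodes; ++y) {
            for (int x = 0; x < numNodes; ++x) {
                int via = dist[y * numNodes + k] + dist[k * numNodes + x];
                if (via < dist[y * numNodes + x]) {
                    dist[y * numNodes + x] = via;
                }
            }
        }
        // No host-side computation here: the only CPU-side work in the
        // k loop is incrementing k.
    }
}
```

By contrast, adding host-side work inside the k loop, such as printing or accumulating a value from dist after each offload, would make the CPU depend on an intermediate iteration, so under the requirements above the loop would no longer qualify.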