Intel® C++ Compiler 16.0 User and Reference Guide

cilk grainsize

Specifies the grain size for one cilk_for loop.

Syntax

#pragma cilk grainsize = expression

Arguments

expression

Specifies the grain size of a loop. The expression is evaluated at run time.

The result of a grain size less than zero is undefined.

If the grainsize pragma is omitted or a grainsize of zero is specified, the system calculates a default value that works well for most loops.

Description

This pragma allows you to specify the grain size for a single instance of a cilk_for loop.

The grain size is the maximum number of iterations that the cilk_for loop will spawn as a single chunk. The cilk_for statement divides the loop into chunks (grains) containing one or more loop iterations. Each grain is executed serially and is spawned as a chunk during the execution of the loop.
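The chunking described above can be sketched in plain C++. This is a minimal illustration, not the runtime's implementation: it assumes the range is halved recursively until each piece holds no more than the grain size, and the names split_into_grains and Grain are hypothetical. The sketch records the grains in order rather than spawning them.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Half-open iteration range [first, last). Hypothetical helper type.
using Grain = std::pair<std::size_t, std::size_t>;

// Illustrative only: recursively halve [first, last) until each piece holds
// at most grain_size iterations. In a real cilk_for the two halves could run
// in parallel; here they are simply appended to `out` in loop order.
void split_into_grains(std::size_t first, std::size_t last,
                       std::size_t grain_size, std::vector<Grain>& out) {
    if (grain_size == 0) grain_size = 1;  // guard for this sketch only
    if (last - first <= grain_size) {
        // The loop body would run serially over this grain.
        if (first < last) out.emplace_back(first, last);
        return;
    }
    const std::size_t mid = first + (last - first) / 2;
    split_into_grains(first, mid, grain_size, out);
    split_into_grains(mid, last, grain_size, out);
}
```

Under these assumptions, splitting 10 iterations with a grain size of 3 yields the ranges [0,2), [2,5), [5,7), and [7,10). No grain exceeds the grain size, but grains may be smaller than it, which matches the description of the grain size as a maximum.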

For loops where individual iterations are particularly large or where the amount of work varies widely between iterations, use this pragma to reduce the grain size, possibly down to 1. A small grain size increases parallelism and improves load balancing at the cost of increased scheduling and function call overhead. In general, the larger and more unbalanced the loop iterations, the smaller the grain size should be.

For loops whose body is short, use a larger grain size to reduce the scheduling overhead. For grain sizes larger than 1000 to 2000 iterations, the overhead of the cilk_for statement becomes inconsequential, even when the amount of work per iteration is very small, so the benefit of increasing the grain size beyond these numbers is negligible. Using a grain size that is too large reduces parallelism and impedes load balancing. In particular, do not try to divide the loop so that there is one grain per worker (CPU): doing so completely defeats the scheduler's attempts at load balancing and will almost certainly result in lost performance.
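To see why one grain per worker can hurt, consider a toy model in plain C++. It is not the Intel Cilk Plus scheduler: it assumes grains are handed out in loop order to whichever worker becomes idle first, and it ignores scheduling overhead entirely. The function name simulated_finish_time and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model (an assumption, not the real scheduler): each grain goes to the
// worker that becomes idle first. Returns the time at which the last worker
// finishes, measured in the same units as iteration_cost.
long simulated_finish_time(const std::vector<long>& iteration_cost,
                           std::size_t workers, std::size_t grain) {
    if (workers == 0) workers = 1;  // guards for this sketch only
    if (grain == 0) grain = 1;
    std::vector<long> busy_until(workers, 0);
    for (std::size_t begin = 0; begin < iteration_cost.size(); begin += grain) {
        const std::size_t end = std::min(begin + grain, iteration_cost.size());
        // A grain runs serially, so its cost is the sum of its iterations.
        const long grain_cost = std::accumulate(iteration_cost.begin() + begin,
                                                iteration_cost.begin() + end, 0L);
        // The least-loaded worker picks up the next grain.
        auto next = std::min_element(busy_until.begin(), busy_until.end());
        *next += grain_cost;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With iteration costs 1 through 8 and two workers, a grain size of 4 (one grain per worker) finishes at time 26 in this model, because one worker receives all of the expensive iterations. A grain size of 1 finishes at time 20. The model omits per-grain overhead, so it shows only the load-balancing side of the trade-off.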

The default grainsize, when this pragma is not used, works well for most loops. If you do choose to change the grain size, be sure to carry out performance testing to ensure that you have made the loop faster, not slower.

If you do not specify a grain size or if you specify a grain size of zero, the system calculates the default value as if the following pragma were used:

#pragma cilk grainsize = min(2048, ceil(N / (8 * p)))

where N is the number of loop iterations and p is the number of workers.

This formula provides good results for most loops. For loops with fewer iterations than eight times the number of workers, the grain size is set to 1 and each loop iteration may be run in parallel. For loops with more iterations than 16,384 times the number of workers (8 × 2048), the grain size is set to 2048.
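The default formula can be worked through with a small helper in plain C++. This is a sketch for checking the arithmetic, not code from the runtime. It assumes N is the iteration count and p is the worker count, and the name default_grain_size is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Mirrors min(2048, ceil(N / (8 * p))) using integer arithmetic.
// `iterations` plays the role of N and `workers` the role of p.
std::size_t default_grain_size(std::size_t iterations, std::size_t workers) {
    if (workers == 0) workers = 1;  // guard for this sketch only
    const std::size_t denom = 8 * workers;
    // Integer ceiling division: ceil(iterations / denom).
    const std::size_t ceil_div = (iterations + denom - 1) / denom;
    // Clamp to at least 1 so an empty loop does not yield a grain of 0
    // (an assumption of this sketch, not stated in the formula).
    return std::min<std::size_t>(2048, std::max<std::size_t>(1, ceil_div));
}
```

For example, with 4 workers the helper returns 1 for 32 iterations, 32 for 1000 iterations, and 2048 for 1,000,000 iterations, which agrees with the two thresholds described above.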

Note

The formula for calculating the default value is not part of the Intel® Cilk™ Plus specification.

Example: Setting grain size according to the number of workers

#pragma cilk grainsize = n/(410*__cilkrts_get_nworkers())

This example sets the grain size according to the number of workers, so the grain size will be smaller on systems with more cores (and hence more workers):