Intel® C++ Compiler 16.0 User and Reference Guide
If a program has sufficient parallelism and burdened parallelism but still doesn't achieve good speedup, other factors could be affecting performance. Here are a few common factors, some of which are discussed elsewhere.
cilk_for Grainsize Setting. If the grain size is too large, the program's logical parallelism decreases. If the grain size is too small, the overhead associated with each spawn can outweigh the benefits of parallelism. The Intel compiler and runtime system use a default formula to calculate the grain size, and the default works well under most circumstances. If your program uses cilk_for and the default does not perform well, experiment with different grain sizes to tune performance.
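The grain size trade-off described above can be sketched as follows. This is a minimal illustration, not code from the guide: the function names scale and chunk_count and the grain parameter are hypothetical. The sketch assumes that the __cilk macro is defined when Intel Cilk Plus is enabled; when it is not, the loop falls back to a serial for so that the example still compiles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef __cilk                 // assumed to be defined when Cilk Plus is enabled
#include <cilk/cilk.h>
#endif

// Multiply every element of v by k.
// `grain` is a hypothetical tuning parameter: the maximum number of
// iterations the runtime may group into one serially executed chunk.
void scale(std::vector<double>& v, double k, int grain) {
    assert(grain > 0);
    const std::size_t n = v.size();
#ifdef __cilk
    // Override the default grain size for the loop that follows.
    #pragma cilk grainsize = grain
    cilk_for (std::size_t i = 0; i < n; ++i) {
        v[i] *= k;
    }
#else
    (void)grain;              // serial fallback: grain has no effect
    for (std::size_t i = 0; i < n; ++i) {
        v[i] *= k;
    }
#endif
}

// Lower bound on the number of chunks for n iterations when each chunk
// holds at most `grain` iterations: ceil(n / grain).
std::size_t chunk_count(std::size_t n, std::size_t grain) {
    assert(grain > 0);
    return (n + grain - 1) / grain;
}
```

For example, chunk_count(1000000, 1) is 1000000, so a grain size of 1 pays the per-chunk overhead a million times, while a grain size equal to the iteration count leaves a single chunk and no parallelism. A reasonable tuning approach is to time scale with several grain values between those extremes.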
Lock contention. Locks serialize the code they protect, which generally reduces program parallelism and therefore affects performance. Lock usage can be analyzed using performance and profiling tools.
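One common way to reduce lock contention is to acquire the lock less often. The sketch below uses standard C++ (std::mutex) rather than any Cilk-specific facility, and all names in it (SharedTotal, add_per_element, add_batched) are hypothetical. Both functions are written to be called by parallel workers, each with its own slice of data. The acquisitions field exists only to make the difference visible.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// A running total shared by all workers, protected by a mutex.
struct SharedTotal {
    std::mutex m;
    long total = 0;
    long acquisitions = 0;    // instrumentation only, guarded by m
};

// Naive version: takes the lock once per element, so workers that call
// this concurrently contend for the lock on every iteration.
void add_per_element(SharedTotal& s, const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> hold(s.m);
        s.total += data[i];
        ++s.acquisitions;
    }
}

// Batched version: sums into a local variable without locking, then
// takes the lock once to publish the partial result.
void add_batched(SharedTotal& s, const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long local = 0;
    for (std::size_t i = 0; i < n; ++i) {
        local += data[i];
    }
    std::lock_guard<std::mutex> hold(s.m);
    s.total += local;
    ++s.acquisitions;
}
```

Both versions produce the same total, but for a slice of n elements the naive version acquires the lock n times and the batched version acquires it once. Whether this matters in a given program is best confirmed with a profiling tool.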
Cache efficiency and memory bandwidth. Covered later in this section.
False sharing. Covered later in this section.
Atomic operations. Atomic operations, provided by compiler intrinsics, lock cache lines. Therefore, these operations can impact performance in the same way that lock contention does. Also, because an entire cache line is locked, atomic operations on different variables that share a cache line can cause false sharing.
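The interaction between atomic operations and false sharing can be illustrated with per-worker counters. The sketch below uses standard C++ std::atomic rather than compiler intrinsics, and it assumes a 64-byte cache line, which is typical but not guaranteed. The names kCacheLine, kSlots, PaddedCounter, Counters, bump, and total are hypothetical. Aligning each counter to its own cache line keeps an atomic update by one worker from locking the line that holds another worker's counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines
constexpr std::size_t kSlots = 4;       // hypothetical number of workers

// One counter per cache line. Without alignas, several 8-byte counters
// would be packed into the same line and would falsely share it.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter occupies exactly one assumed cache line");
static_assert(alignof(PaddedCounter) == kCacheLine,
              "each counter starts on an assumed cache-line boundary");

struct Counters {
    PaddedCounter slot[kSlots];
};

// Called by worker number `worker`: atomically increments only that
// worker's counter, `times` times.
void bump(Counters& c, std::size_t worker, long times) {
    assert(worker < kSlots);
    for (long i = 0; i < times; ++i) {
        c.slot[worker].value.fetch_add(1, std::memory_order_relaxed);
    }
}

// Sum of all per-worker counters; intended to run after the workers finish.
long total(const Counters& c) {
    long sum = 0;
    for (const PaddedCounter& p : c.slot) {
        sum += p.value.load(std::memory_order_relaxed);
    }
    return sum;
}
```

The padding trades memory for isolation: each counter now uses 64 bytes instead of 8. If all workers must update a single shared atomic variable instead, padding does not help, and the cost resembles that of lock contention.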