Intel® C++ Compiler 16.0 User and Reference Guide
Modern computers use multiple levels of cache, divided into cache lines, to speed access to memory. False sharing is a common problem in shared memory parallel processing. It occurs when two or more cores hold copies of the same cache line and at least one of them writes to it, even though the cores are working on different data within that line.
If one core writes to a variable in a cache line, that entire cache line is invalidated on all other cores. Even though another core may not be using that data (reading or writing), it may be using another element of data on the same cache line. The second core will need to reload the line before it can access its own data again.
The cache hardware ensures data coherency, but at a potentially high performance cost if false sharing is frequent. A good technique to identify false sharing problems is to catch unexpected sharp increases in last-level cache misses using hardware counters or other performance tools.
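To make "the same cache line" concrete, the following sketch computes which line an address falls in. It is an illustration only: the 64-byte line size is an assumption (typical of current x86 processors, but not guaranteed), and the names kCacheLineSize, cache_line_of, same_cache_line, and lines_spanned are hypothetical helpers, not part of the compiler or any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Check the target processor's documentation.
constexpr std::size_t kCacheLineSize = 64;

// Index of the cache line that contains address p.
inline std::uintptr_t cache_line_of(const volatile void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize;
}

// True if both addresses fall in the same cache line, which makes
// false sharing possible when different cores access them.
inline bool same_cache_line(const volatile void* a, const volatile void* b) {
    return cache_line_of(a) == cache_line_of(b);
}

// Number of cache lines touched by an object of `bytes` bytes starting at p.
inline std::size_t lines_spanned(const volatile void* p, std::size_t bytes) {
    assert(bytes > 0);
    const std::uintptr_t first = reinterpret_cast<std::uintptr_t>(p);
    return (first + bytes - 1) / kCacheLineSize - first / kCacheLineSize + 1;
}
```

Two variables that different cores update are candidates for false sharing whenever same_cache_line returns true for their addresses.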
As a simple example, consider the following:
    …
    int a, b, c;
    cilk_spawn f(&b);
    c++;
    …
The compiler will likely place the variables a, b, and c in the same cache line. If it does, false sharing can occur between variables b and c when the spawned function f accesses b while the parent increments c. Use the avoid_false_share attribute/declspec here; otherwise, restructure the declaration using alignment directives or by placing the variables in structs with field alignment to avoid cache line sharing.
Here's the same code using the avoid_false_share attribute:
    …
    int a;
    __attribute__((avoid_false_share)) int b; // For Windows* use __declspec(avoid_false_share) int b;
    int c;
    cilk_spawn f(&b);
    c++;
    …
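Where the attribute is not available, the alternative mentioned above (variables in a struct with field alignment) might look like the following sketch in standard C++11. It assumes a 64-byte cache line. The names Locals, kCacheLineSize, and run_example, and the body given for f, are hypothetical, and the spawn is shown as an ordinary call so that the fragment compiles without Cilk Plus support.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// b and c each start on their own cache-line boundary, so a write to one
// never invalidates the line that holds the other.
struct Locals {
    int a;
    alignas(kCacheLineSize) int b;
    alignas(kCacheLineSize) int c;
};

static_assert(offsetof(Locals, b) % kCacheLineSize == 0,
              "b must start on a cache-line boundary");
static_assert(offsetof(Locals, c) - offsetof(Locals, b) >= kCacheLineSize,
              "b and c must not share a cache line");

// Hypothetical stand-in for the function f in the example above.
void f(int* p) {
    assert(p != nullptr);
    ++*p;
}

int run_example() {
    Locals v{};
    f(&v.b);   // in the Cilk Plus version: cilk_spawn f(&v.b);
    v.c++;     // the parent updates c, which lives on a different line
    return v.b + v.c;
}
```

The cost of this layout is space: the struct grows from 12 bytes to three full cache lines, so it is best reserved for data that different strands really do update concurrently.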
In another simple example, consider a spawned function with a for loop that increments array values. The array is volatile to force the compiler to generate store instructions rather than hold values in registers or optimize the loop away.
    #include <cilk/cilk.h>

    volatile int x[32];

    void f(volatile int *p)
    {
        for (int i = 0; i < 100000000; i++)
        {
            ++p[0];
            ++p[16];
        }
    }

    int main()
    {
        cilk_spawn f(&x[0]);
        cilk_spawn f(&x[1]);
        cilk_spawn f(&x[2]);
        cilk_spawn f(&x[3]);
        cilk_sync;
        return 0;
    }
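The x[] elements are four bytes wide, and a 64-byte cache line would hold 16 elements. Each strand updates p[0] and p[16], so the four strands write to x[0] through x[3] and to x[16] through x[19]. If x starts on a cache line boundary, all four strands contend for the same two cache lines. There are no data races, and the results will be correct when the loop completes. However, cache line contention as the individual strands update adjacent array elements can degrade performance, sometimes significantly.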
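One possible way to remove the contention in this example is to give each strand its own cache line. The sketch below is not the guide's method: it swaps the Cilk Plus keywords for standard C++11 std::thread so that it builds with any conforming compiler, assumes a 64-byte cache line, and uses the hypothetical names PaddedCounters, kStrands, and run_padded. The two counters in each slot play the roles of p[0] and p[16].

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;
constexpr int kStrands = 4;

// One strand's pair of counters, aligned and padded to fill a whole line.
struct alignas(kCacheLineSize) PaddedCounters {
    volatile int first = 0;   // plays the role of p[0]
    volatile int second = 0;  // plays the role of p[16]
};

static_assert(sizeof(PaddedCounters) == kCacheLineSize,
              "each slot should occupy exactly one cache line");

// Same work as f in the example above, but on a private cache line.
void f(PaddedCounters* p, int iterations) {
    assert(p != nullptr);
    for (int i = 0; i < iterations; i++) {
        p->first = p->first + 1;    // volatile forces a load and a store
        p->second = p->second + 1;
    }
}

// Runs kStrands workers, each on its own slot, and returns the sum of
// all counters once every worker has finished.
long run_padded(int iterations) {
    PaddedCounters slots[kStrands];
    std::thread workers[kStrands];
    for (int s = 0; s < kStrands; ++s)
        workers[s] = std::thread(f, &slots[s], iterations);
    for (auto& w : workers)
        w.join();  // plays the role of cilk_sync

    long total = 0;
    for (const auto& slot : slots)
        total += slot.first + slot.second;
    return total;
}
```

The results are the same as in the original layout, since neither version has a data race. Only the placement of the counters changes, at the cost of 64 bytes per strand instead of 8.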