Good cache efficiency is important for serial programs, and it becomes even more important for parallel programs running on multicore machines. The cores contend for bus bandwidth, which limits how quickly data can be transferred between memory and the processors. Therefore, consider cache efficiency and data locality, both spatial and temporal, when designing and implementing parallel programs. For example code that considers these issues, see the matrix and matrix_multiply examples cited in Optimize the Serial Program.
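As a minimal sketch of the idea (not the source of the cited examples), the matrix multiplication kernel below orders its loops so that the inner loop walks both B and C with stride 1, assuming row-major matrices stored in flat std::vector buffers:

#include <cilk/cilk.h>
#include <vector>

// Computes C += A * B for n-by-n row-major matrices stored in flat vectors.
// The i-k-j loop order keeps the inner loop's accesses to B and C sequential
// in memory, so each cache line is fully used before it is evicted.
void multiply(int n, const std::vector<double>& A,
              const std::vector<double>& B, std::vector<double>& C)
{
    cilk_for (int i = 0; i < n; ++i) {      // each worker owns rows of C
        for (int k = 0; k < n; ++k) {
            double a = A[i*n + k];          // reused across the whole j loop
            for (int j = 0; j < n; ++j) {
                C[i*n + j] += a * B[k*n + j];   // stride-1 accesses
            }
        }
    }
}

Swapping the k and j loops (the textbook i-j-k order) would make the accesses to B stride n, touching a new cache line on every iteration; on multicore machines that wasted bandwidth is paid by every core at once.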
A simple way to identify bandwidth problems is to run multiple copies of the serial program simultaneously, one for each core on your system. If the average running time of the simultaneous copies is much greater than the running time of a single copy run alone, the program is probably saturating some system bandwidth. The cause could be memory bandwidth limits or, perhaps, disk or network I/O bandwidth limits.
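This check can be scripted in any convenient way; the sketch below approximates it in C++ by running one copy of a memory-bound kernel per core in separate threads (threads standing in for separate processes) and comparing the elapsed time against a single-copy baseline. The kernel and buffer sizes are illustrative assumptions, not part of the shipped examples:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// A deliberately memory-bound "serial program": streams through a large array.
static void kernel(std::vector<double>& v)
{
    for (int pass = 0; pass < 10; ++pass)
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = v[i] * 1.0001 + 1.0;
}

// Runs 'copies' instances of the kernel concurrently, each on private data,
// and returns the elapsed wall-clock time. Because the copies run
// simultaneously, the elapsed time approximates the running time of each copy.
static double timed_run(int copies)
{
    std::vector<std::vector<double>> data(
        copies, std::vector<double>(std::size_t(1) << 23, 1.0)); // ~64 MB each
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int c = 0; c < copies; ++c)
        threads.emplace_back(kernel, std::ref(data[c]));
    for (auto& t : threads) t.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    int cores = (int)std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;              // fall back if count is unknown
    double one  = timed_run(1);
    double many = timed_run(cores);
    // If 'many' is much larger than 'one', the copies are contending for
    // memory bandwidth rather than running independently.
    std::printf("1 copy: %.3f s; %d copies: %.3f s each (concurrent)\n",
                one, cores, many);
}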
These bandwidth performance effects are frequently system-specific. For example, when running the matrix example on a specific system with two cores (call it "S2C"), the "iterative parallel" version may be considerably slower than the "iterative sequential" version (4.431 seconds compared to 1.435 seconds). On other systems, however, the iterative parallel version may show nearly linear speedup when tested with as many as 16 cores and workers. Here are the results on S2C:
1) Naive, Iterative Algorithm. Sequential and Parallel.
Running Iterative Sequential version...
Iterative Sequential version took 1.435 seconds.
Running Iterative Parallel version...
Iterative Parallel version took 4.431 seconds.
Parallel Speedup: 0.323855
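The numbers above come from the matrix example's own timing code. As a sketch of the same measurement pattern (not the example's actual source), one can time the sequential loop, time the cilk_for loop, and report the ratio of the two as the speedup; the elementwise kernel below is an illustrative stand-in:

#include <cilk/cilk.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 24;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)             // iterative sequential version
        c[i] = a[i] * b[i];
    auto t1 = std::chrono::steady_clock::now();

    cilk_for (int i = 0; i < n; ++i)        // iterative parallel version
        c[i] = a[i] * b[i];
    auto t2 = std::chrono::steady_clock::now();

    double ts = std::chrono::duration<double>(t1 - t0).count();
    double tp = std::chrono::duration<double>(t2 - t1).count();
    std::printf("Iterative Sequential version took %.3f seconds.\n", ts);
    std::printf("Iterative Parallel version took %.3f seconds.\n", tp);
    std::printf("Parallel Speedup: %f\n", ts / tp); // < 1 means a slowdown
}

Because this kernel does almost no arithmetic per byte moved, it is bandwidth-bound, and on a system like S2C the reported speedup can fall below 1 for exactly the reasons described above.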
There are many reasons, often complex and hard to predict, why memory bandwidth is better on one system than on another, including DRAM speed, the number of memory channels, the cache and page-table architecture, and the number of CPUs on a single die. Be aware that such effects are possible and can cause unexpected and inconsistent performance results. This situation is inherent to parallel programs and is not unique to Intel® Cilk™ Plus programs.