Intel® C++ Compiler 16.0 User and Reference Guide

Optimize the Serial Program

The first step is to ensure that the C/C++ serial program has good performance and that normal optimization methods, including compiler optimization, have already been used.

As one simple, and limited, illustration of the importance of serial program optimization, consider the matrix_multiply example, which orders its loops to minimize cache line misses: with the matrices stored in row-major order, the innermost j loop accesses consecutive elements of A and C. The resulting code is:

cilk_for (unsigned int i = 0; i < n; ++i) {
    for (unsigned int k = 0; k < n; ++k) {
        for (unsigned int j = 0; j < n; ++j) {
            A[i*n + j] += B[i*n + k] * C[k*n + j];
        }
    }
}

In multiple performance tests, this organization has resulted in a significant performance advantage compared to the same program with the two inner loops (the k and j loops) interchanged. This performance difference shows up in both the serial and the Intel® Cilk™ Plus parallel programs. The matrix example has a similar loop structure. Be aware, however, that such performance improvements are not guaranteed on all systems, because numerous architectural factors can affect performance.
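
To make the two loop orders easier to compare, here is a minimal serial sketch of both variants. It is an illustration under stated assumptions, not the guide's own code. The function names multiply_ikj and multiply_ijk are hypothetical. The element type (double) and the use of std::vector are assumptions, since the original does not state them. The outer cilk_for is replaced by a plain for so that the sketch builds with any standard C++ compiler.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper names; matrices are n x n, row-major, stored flat.

// i-k-j order (the organization shown in the guide): the innermost j loop
// reads C[k*n + j] and updates A[i*n + j] at consecutive addresses.
void multiply_ikj(std::vector<double>& A, const std::vector<double>& B,
                  const std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            for (std::size_t j = 0; j < n; ++j) {
                A[i * n + j] += B[i * n + k] * C[k * n + j];
            }
        }
    }
}

// i-j-k order (k and j loops interchanged): the innermost k loop reads
// C[k*n + j] with a stride of n elements, so successive reads of C
// usually fall on different cache lines.
void multiply_ijk(std::vector<double>& A, const std::vector<double>& B,
                  const std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < n; ++k) {
                A[i * n + j] += B[i * n + k] * C[k * n + j];
            }
        }
    }
}
```

Both functions accumulate the same products into each element of A, so they compute the same result and differ only in memory access pattern. To see whether the difference matters on a given system, one option is to time each function on the same inputs (for example with std::chrono::steady_clock) at a matrix size that does not fit in cache.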