Intel® C++ Compiler 16.0 User and Reference Guide
The sequence of steps to create a parallel program using Intel® Cilk™ Plus is as follows:
Typically, you start with a serial C/C++ program that implements the basic functions or algorithms that you want to parallelize. Ensure that the serial program is correct. Any bugs in the serial program will occur in the parallel program, but they will be more difficult to identify and fix.
Identify the program regions that will benefit from parallel operation. Operations that are relatively long-running and which can be performed independently are prime candidates.
Use the three keywords to identify tasks that can execute in parallel:
_Cilk_spawn(or, cilk_spawn, if your program includes <cilk/cilk.h>) indicates a call to a function (a child) that can proceed in parallel with the caller (the parent).
_Cilk_sync (or, cilk_sync, if your program includes <cilk/cilk.h>) indicates that all spawned children must complete before proceeding.
_Cilk_for (or, cilk_for, if your program includes <cilk/cilk.h>) identifies a loop for which all iterations can execute in parallel.
Windows*: Use either the icl command-line tool or compile within Microsoft Visual Studio*. If using Visual Studio*, make sure that you have selected Use Intel C++ from the project context menu.
Linux*: Use the icc command.
OS X*: Use the icc (EDG compiler) /icl (CLANG compiler) command.
Run the program. If there are no Race Conditions, the parallel program produces the same result as the serial program.
Correct any race conditions with reducers or locks, or recode to resolve conflicts.
A walk-through of this process follows, using a sort program as an example.
Start with a Serial Program
The following demonstrates how to use write an Intel® Cilk™ Plus program by parallelizing a simple implementation of Quicksort.
The function name parallel_qsort avoids confusion with the Standard C Library qsort function. Some lines in the example are removed here, but line numbers are preserved.
9 #include <algorithm> 10 11 #include <iostream> 12 #include <iterator> 13 #include <functional> 14 15 16 // Sort the range between begin and end. "end" is one 17 // past the final element in the range. This is pure 18 // C++ code before Intel® Cilk™ Plus conversion. 19 20 21 void parallel_qsort(int * begin, int * end) 22 { 23 if (begin != end) { 24 --end; // Exclude last element (pivot) 25 int * middle = std::partition(begin, end, 26 std::bind2nd(std::less<int>(),*end)); 27 28 std::swap(*end,*middle); // pivot to middle 29 parallel_qsort(begin, middle); 30 parallel_qsort(++middle, ++end); // Exclude pivot 31 } 32 } 33 34 // A simple test harness 35 int qmain(int n) 36 { 37 int *a = new int[n]; 38 39 for (int i = 0; i < n; ++i) 40 a[i] = i; 41 42 std::random_shuffle(a, a + n); 43 std::cout << "Sorting " << n << " integers" << std::endl; 44 45 parallel_qsort(a, a + n); 48 49 // Confirm that a is sorted and that each element // contains the index. 50 for (int i = 0; i < n-1; ++i) { 51 if ( a[i] >= a[i+1] || a[i] != i ) { 52 std::cout << "Sort failed at location i=" << i << " a[i] = " 53 << a[i] << " a[i+1] = " << a[i+1] << std::endl; 54 delete[] a; 55 return 1; 56 } 57 } 58 std::cout << "Sort succeeded." << std::endl; 59 delete[] a; 60 return 0; 61 } 62 63 int main(int argc, char* argv[]) 64 { 65 int n = 10*1000*1000; 66 if (argc > 1) 67 n = std::atoi(argv[1]); 68 69 return qmain(n); 70 }
Parallelism using _Cilk_spawn
You are now ready to introduce parallelism into the qsort program.
The _Cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the _Cilk_spawn statement (the parent). The keyword allows but does not require parallel operation. Intel® Cilk™ Plus dynamically determines which operations are executed in parallel when multiple processors are available. The _Cilk_sync statement indicates that the function may not continue until all preceding _Cilk_spawn requests in the same function have completed. _Cilk_sync does not affect parallel strands spawned in other functions.
21 void parallel_qsort(int * begin, int * end) 22 { 23 if (begin != end) { 24 --end; // Exclude last element (pivot) 25 int * middle = std::partition(begin, end, 26 std::bind2nd(std::less<int>(),*end)); 27 28 std::swap(*end,*middle); // pivot to middle 29 _Cilk_spawn parallel_qsort(begin, middle); 30 parallel_qsort(++middle, ++end); // Exclude pivot 31 _Cilk_sync; 32 } 33 }
The previous example can be rewritten to use a simpler form of the keywords. Include the header file <cilk/cilk.h>, which defines macros that provide names in lowercase, without the initial underscore. The following shows the simpler naming convention of cilk_spawn and cilk_sync. This naming convention is used throughout this section.
19 #include <cilk/cilk.h> 21 void parallel_qsort(int * begin, int * end) 22 { 23 if (begin != end) { 24 --end; // Exclude last element (pivot) 25 int * middle = std::partition(begin, end, 26 std::bind2nd(std::less<int>(),*end)); 27 28 std::swap(*end, *middle); // pivot to middle 29 cilk_spawn parallel_qsort(begin, middle); 30 parallel_qsort(++middle, ++end); // Exclude pivot 31 cilk_sync; 32 } 33 }
In either example, the statement in line 29 spawns a recursive invocation of parallel_qsort that can execute asynchronously. Thus, when parallel_qsort is called again in line 30, the call at line 29 may not have completed. The cilk_sync statement at line 31 indicates that this function will not continue until all cilk_spawn requests in the same function have completed.
There is an implicit cilk_sync at the end of every function that waits until all tasks spawned in the function have returned, so the cilk_sync at line 31 is redundant, but included here for clarity.
The above change implements a typical divide and conquer strategy for parallelizing recursive algorithms. At each level of recursion, two-way parallelism occurs; the parent strand (line 29) continues executing the current function, while a child strand executes the other recursive call. This recursion can expose quite a lot of parallelism.
Execute and Test
With these changes, you can now build and execute the Intel® Cilk™ Plus version of the qsort program. Build and run the program as done in the previous example:
Linux* and OS X*: icc qsort.cpp -o qsort
OS X*: icc qsort.cpp -o qsort (with EDG compiler)/icl qsort.cpp -o qsort (with CLANG compiler)
Windows* Command Line: icl qsort.cpp
Windows Visual Studio*: Build the Release configuration.
Run qsort from the command line; for example, on Windows*:
>qsort Sorting 10000000 integers Sort succeeded.
Observe speed up on a multicore system
By default, an Intel® Cilk™ Plus program queries the operating system and uses all available cores. (Additionally, on systems that support simultaneous multithreading on each core, Intel® Cilk™ Plus uses all available hardware threads.) You can control the number of workers by setting the CILK_NWORKERS environment variable.
Run qsort using one and then two cores. On a system with two or more cores, you should expect to see that the second run takes about half as long as the first run.
Linux* and OS X*:
$CILK_NWORKERS=1 ./qsort Sorting 10000000 integers Sort succeeded. $CILK_NWORKERS=2 ./qsort Sorting 10000000 integers Sort succeeded.
Windows* Command Line:
>set CILK_NWORKERS=1 >qsort Sorting 10000000 integers Sort succeeded. >set CILK_NWORKERS=2 >qsort Sorting 10000000 integers Sort succeeded.