Intel® C++ Compiler 16.0 User and Reference Guide
This topic discusses how to use the OpenMP* library functions and environment variables and discusses some guidelines for enhancing performance with OpenMP*.
OpenMP* provides specific function calls, and environment variables. See the following topics to refresh you memory about the primary functions and environment variable used in this topic:
To use the function calls, include the omp.h header file . This file is installed in the INCLUDE directory during the compiler installation, and compile the application using the[Q]openmp option.
The following example, which demonstrates how to use the OpenMP* functions to print the alphabet, also illustrates several important concepts:
Example |
---|
#include <stdio.h> #include <omp.h> int main(void) { int i; omp_set_num_threads(4); #pragma omp parallel private(i) { // OMP_NUM_THREADS is not a multiple of 26, // which can be considered a bug in this code. int LettersPerThread = 26 / omp_get_num_threads(); int ThisThreadNum = omp_get_thread_num(); int StartLetter = 'a'+ThisThreadNum*LettersPerThread; int EndLetter = 'a'+ThisThreadNum*LettersPerThread+LettersPerThread; for (i=StartLetter; i<EndLetter; i++) { printf("%c", i); } } printf("\n"); return 0; } |
Debugging threaded applications is a complex process because debuggers change the run-time performance, which can mask race conditions. Even print statements can mask issues, because they use synchronization and operating system functions. OpenMP* itself also adds some complications, because it introduces additional structure by distinguishing private variables and shared variables, and inserts additional code. A debugger that supports OpenMP* can help you to examine variables and step through threaded code. You can use Intel® Inspector to detect many hard-to-find threading errors analytically. Sometimes, a process of elimination can help identify problems without resorting to sophisticated debugging tools.
Remember that most mistakes are race conditions. Most race conditions are caused by shared variables that really should have been declared private. Start by looking at the variables inside the parallel regions and make sure that the variables are declared private when necessary. Next, check functions called within parallel constructs. By default, variables declared on the stack are private, but the C/C++ keyword static changes the variable to be placed on the global heap and therefore shared for OpenMP* loops.
The default(none) clause, shown below, can be used to help find those hard-to-spot variables. If you specify default(none), then every variable must be declared with a data-sharing attribute clause.
Example |
---|
#pragma omp parallel for default(none) private(x,y) shared(a,b) |
Another common mistake is using uninitialized variables. Remember that private variables do not have initial values upon entering a parallel construct. Use the firstprivate and lastprivate clauses to initialize them only when necessary, because doing so adds extra overhead.
If you still can't find the bug, then consider the possibility of reducing the scope. Try a binary-hunt. Force parallel sections to be serial again with if(0) on the parallel construct or commenting out the pragma altogether. Another method is to force large chunks of a parallel region to be critical sections. Pick a region of the code that you think contains the bug and place it within a critical section. Try to find the section of code that suddenly works when it is within a critical section and fails when it is not. Now look at the variables, and see if the bug is apparent. If that still doesn't work, try setting the entire program to run in serial by setting the compiler-specific environment variable KMP_LIBRARY=serial.
If the code is still not working, and you are not using any OpenMP* API function calls, compile it without the [Q]openmp option to make sure the serial version works. If you are using OpenMP* API function calls, use the [Q]openmp-stubs option.
OpenMP* threaded application performance is largely dependent upon the following things:
The underlying performance of the single-threaded code.
CPU utilization, idle threads, and load balancing.
The percentage of the application that is executed in parallel by multiple threads.
The amount of synchronization and communication among the threads.
The overhead needed to create, manage, destroy, and synchronize the threads, made worse by the number of single-to-parallel or parallel-to-single transitions called fork-join transitions.
Performance limitations of shared resources such as memory, bus bandwidth, and CPU execution units.
Memory conflicts caused by shared memory or falsely shared memory.
Performance always begins with a properly constructed parallel algorithm or application. For example, parallelizing a bubble-sort, even one written in hand-optimized assembly language, is not a good place to start. Keep scalability in mind; creating a program that runs well on two CPUs is not as efficient as creating one that runs well on n CPUs. With OpenMP*, the number of threads is chosen by the compiler, so programs that work well regardless of the number of threads are highly desirable. Producer/consumer architectures are rarely efficient, because they are made specifically for two threads.
Once the algorithm is in place, make sure that the code runs efficiently on the targeted Intel® architecture; a single-threaded version can be a big help. Turn off the [Q]openmp option to generate a single-threaded version, or build with the [Q]openmp-stubs option, and run the single-threaded version through the usual set of optimizations.
Once you have gotten the single-threaded performance, it is time to generate the multi-threaded version and start doing some analysis.
Optimizations are really a combination of patience, experimentation, and practice. Make little test programs that mimic the way your application uses the computer resources to get a feel for what things are faster than others. Be sure to try the different scheduling clauses for the parallel sections of code. If the overhead of a parallel region is large compared to the compute time, you may want to use an if clause to execute the section serially.