Worksharing Using OpenMP*

To get the maximum performance benefit from a processor with multi-core and Intel® Hyper-Threading Technology (Intel® HT Technology), an application needs to be executed in parallel. Parallel execution requires threads, and threading an application is not a simple thing to do; using OpenMP* can make the process a lot easier. Using the OpenMP* pragmas, most loops with no loop-carried dependencies can be threaded with one simple statement. This topic explains how to start using OpenMP* to parallelize loops, which is also called worksharing.

Options that use OpenMP* are available for both Intel® and non-Intel microprocessors, but these options may perform additional optimizations on Intel® microprocessors than they perform on non-Intel microprocessors. The list of major, user-visible OpenMP* constructs and features that may perform differently on Intel® microprocessors than on non-Intel microprocessors includes: locks (internal and user visible), the SINGLE construct, barriers (explicit and implicit), parallel loop scheduling, reductions, memory allocation, and thread affinity and binding.

Most loops can be threaded by inserting one pragma immediately prior to the loop. Further, by leaving the details to the Intel® C++ Compiler and OpenMP*, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for maximum performance. The maximum performance of OpenMP* is realized when it is used to thread hotspots, the most time-consuming loops in your application.

The power and simplicity of OpenMP* is demonstrated by looking at an example. The following loop converts a 32-bit RGB (red, green, blue) pixel to an 8-bit gray-scale pixel. One pragma, which has been inserted immediately before the loop, is all that is needed for parallel execution.

Example
#pragma omp parallel for for (i=0; i < numPixels; i++) { pGrayScaleBitmap[i] = (unsigned BYTE) (pRGBBitmap[i].red * 0.299 + pRGBBitmap[i].green * 0.587 + pRGBBitmap[i].blue * 0.114); }

First, the example uses worksharing, which is the general term used in OpenMP* to describe distribution of work across threads. When worksharing is used with the for construct, as shown in the example, the iterations of the loop are distributed among multiple threads so that each loop iteration is executed exactly once with different iterations executing if there is more than one available threads. Since there is no explicit numthreads clause, OpenMP* determines the number of threads to create and how to best create, synchronize, and destroy them. OpenMP* places the following five restrictions on which loops can be threaded:

The loop variable must be of type signed or unsigned integer, random access iterator, or pointer.
The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_expression of a compatible type.
The third expression or increment portion of the for loop must be either addition or subtraction by a loop invariant value.
If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
The loop body must be single-entry-single-exit, meaning no jumps are permitted from inside to outside the loop, with the exception of the exit statement that terminates the whole application. If the statements goto or break are used, the statements must jump within the loop, not outside it. Similarly, for exception handling, exceptions must be caught within the loop.

Although these restrictions might sound somewhat limiting, non-conforming loops can frequently be rewritten to follow these restrictions.

Basics of Compilation

Using the OpenMP* pragmas requires an OpenMP-compatible compiler and thread-safe libraries. Adding the [Q]openmp option to the compiler instructs the compiler to pay attention to the OpenMP* pragmas and to insert threads. If you omit the [Q]openmp option, the compiler will ignore OpenMP* pragmas, which provides a very simple way to generate a single-threaded version without changing any source code.

For conditional compilation, the compiler defines the _OPENMP macro. If needed, the macro can be tested as shown in the following example.

Example
#ifdef _OPENMP fn(); #endif

A Few Simple Examples

The following examples illustrate how simple OpenMP* is to use. In common practice, additional issues need to be addressed, but these examples illustrate a good starting point.

In the first example, the following loop clips an array to the range from 0 to 255.

Example
// clip an array to 0 <= x <= 255 for (i=0; i < numElements; i++) { if (array[i] < 0) array[i] = 0; else if (array[i] > 255) array[i] = 255; }

You can thread it using a single OpenMP* pragma; insert the pragma immediately prior to the loop:

Example
#pragma omp parallel for for (i=0; i < numElements; i++) { if (array[i] < 0) array[i] = 0; else if (array[i] > 255) array[i] = 255; }

In the second example, the loop generates a table of square roots for the numbers from 0 to 100.

Example
double value; double roots[100]; for (value = 0.0; value < 100.0; value ++) { roots[(int)value] = sqrt(value); }

Thread the loop by changing the loop variable to a signed integer or unsigned integer and inserting a #pragma omp parallel pragma.

Example
int value; double roots[100]; #pragma omp parallel for for (value = 0; value < 100; value ++) { roots[value] = sqrt((double)value); }

Avoiding Data Dependencies and Race Conditions

When a loop meets all five loop restrictions (listed above) and the compiler threads the loop, the loop still might not work correctly due to the existence of data dependencies.

Data dependencies exist when different iterations of a loop (more specifically a loop iteration that is executed on a different thread) read or write the same location in shared memory. Consider the following example that calculates factorials.

Example
// Each loop iteration writes a value that a different iteration reads. #pragma omp parallel for for (i=2; i < 10; i++) { factorial[i] = i * factorial[i-1]; }

The compiler will thread this loop, but the threading will fail because at least one of the loop iterations is data-dependent upon a different iteration. This situation is referred to as a race condition. Race conditions can only occur when using shared resources (like memory) and parallel execution. To address this problem either rewrite the loop or pick a different algorithm, one that does not contain the race condition.

Race conditions are difficult to detect because, for a given case or system, the threads might win the race in the order that happens to make the program function correctly. Because a program works once does not mean that the program will work under all conditions. Testing your program on various machines, some with Intel® Hyper-Threading Technology and some with multiple physical processors, is a good starting point to help identify race conditions.

Traditional debuggers are useless for detecting race conditions because they cause one thread to stop the race while the other threads continue to significantly change the runtime behavior; however, thread checking tools can help.

Managing Shared and Private Data

Nearly every loop (in real applications) reads from or writes to memory; it's your responsibility, as the developer, to instruct the compiler what memory should be shared among the threads and what memory should be kept private. When memory is identified as shared, all threads access the same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop ends, the private copies are destroyed. By default, all variables are shared except for the loop variable, which is private.

Memory can be declared as private in two ways:

Declare the variable inside the loop-really inside the parallel OpenMP* pragma-without the static keyword.
Specify the private clause on an OpenMP* pragma.

The following loop fails to function correctly because the variable temp is shared. It should be private.

Example
// Variable temp is shared among all threads, so while one thread // is reading variable temp another thread might be writing to it #pragma omp parallel for for (i=0; i < 100; i++) { temp = array[i]; array[i] = do_something(temp); }

Example

// Variable temp is shared among all threads, so while one thread 
// is reading variable temp another thread might be writing to it 
#pragma omp parallel for 
for (i=0; i < 100; i++) {
  temp = array[i];
  array[i] = do_something(temp); 
}

The following two examples both declare the variable temp as private memory, which solves the problem.

Example
#pragma omp parallel for for (i=0; i < 100; i++) { int temp; // variables declared within a parallel construct // are, by definition, private temp = array[i]; array[i] = do_something(temp); }

The temp variable can also be made private in the following way:

Example
#pragma omp parallel for private(temp) for (i=0; i < 100; i++) { temp = array[i]; array[i] = do_something(temp); }

Every time you use OpenMP* to parallelize a loop, you should carefully examine all memory references, including the references made by called functions. Variables declared within a parallel construct are defined as private except when they are declared with the static declarator, because static variables are not allocated on the stack.

Reductions

Loops that accumulate a value are fairly common, and OpenMP* has a specific clause to accommodate them. Consider the following loop that calculates the sum of an array of integers.

Example
sum = 0; for (i=0; i < 100; i++) { sum += array[i]; // this variable needs to be shared to generate // the correct results, but private to avoid // race conditions from parallel execution }

The variable sum in the previous loop must be shared to generate the correct result, but it also must be private to permit access by multiple threads. OpenMP* provides the reduction clause that is used to efficiently combine the mathematical reduction of one or more variables in a loop. The following example demonstrates how the loop can use the reduction clause to generate the correct results.

Example
sum = 0; #pragma omp parallel for reduction(+:sum) for (i=0; i < 100; i++) { sum += array[i]; }

In the case of the example listed above, the reduction provides private copies of the variable sum for each thread, and when the threads exit, it adds the values together and places the result in the one global copy of the variable.

The following table lists the possible reduction operations, along with their initial values (mathematical identity values).

Operation	`private` Variable Initialization Value
+ (addition)	0
- (subtraction)	0
* (multiplication)	1
& (bitwise and)	~0
\| (bitwise or)	0
^ (bitwise exclusive or)	0
&& (conditional and)	1
\|\| (conditional or)	0

Multiple reductions in a loop are possible by specifying comma-separated variables and operations on a given parallel construct. Reduction variables must meet the following requirements:

can be listed in just one reduction.
cannot be declared constant.
cannot be declared private in the parallel construct.

Load Balancing and Loop Scheduling

Load balancing, the equal division of work among threads, is among the most important attributes for parallel application performance. Load balancing is extremely important, because it ensures that the processors are busy most, if not all, of the time. Without a balanced load, some threads may finish significantly before others, leaving processor resources idle and wasting performance opportunities.

Within loop constructs, poor load balancing is often caused by variations in compute time among loop iterations. It is usually easy to determine the variability of loop iteration compute time by examining the source code. In most cases, you will see that loop iterations consume a uniform amount of time. When that is not true, it may be possible to find a set of iterations that consume similar amounts of time. For example, sometimes the set of all even iterations consumes about as much time as the set of all odd iterations. Similarly, it might be the case that the set of the first half of the loop consumes about as much time as the second half. In contrast, it might be impossible to find sets of loop iterations that have a uniform execution time. Regardless of the case, you should provide this extra loop scheduling information to OpenMP* so it can better distribute the iterations of the loop across the threads (and therefore processors) for optimum load balancing.

If you know that all loop iterations consume roughly the same amount of time, the OpenMP* schedule clause should be used to distribute the iterations of the loop among the threads in roughly equal amounts via the scheduling policy. In addition, you need to minimize the chances of memory conflicts that may arise because of false sharing due to using large chunks. This behavior is possible because loops generally touch memory sequentially, so splitting up the loop in large chunks— like the first half and second half when using two threads— will result in the least chance for overlapping memory. While this may be the best choice for memory issues, it may be bad for load balancing. Unfortunately, the reverse is also true; what might be best for load balancing may be bad for memory performance. You must strike a balance between optimal memory usage and optimal load balancing by measuring the performance to see what method produces the best results.

Use the following general form on the parallel construct to schedule an OpenMP* loop:

Example
#pragma omp parallel for schedule(kind [, chunk size])

Four different loop scheduling types (kinds) can be provided to OpenMP*, as shown in the following table. The optional parameter (chunk), when specified, must be a positive integer.

Kind	Description
static	Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, chunk size is `loop_count`/`number_of_threads`. Set chunk to 1 to interleave the iterations.
dynamic	Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.
guided	Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies them minimum size chunk to use. By default the chunk size is approximately `loop_count`/`number_of_threads`.
auto	When schedule (auto) is specified, the decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.
runtime	Uses the OMP_SCHEDULE environment variable to specify which one of the three loop-scheduling types should be used. `OMP_SCHEDULE` is a string formatted exactly the same as would appear on the parallel construct.

Assume that you want to parallelize the following loop.

Example
for (i=0; i < NumElements; i++) { array[i] = StartVal; StartVal++; }

As written, the loop contains a data dependency, making it impossible to parallelize without a change. The new loop, shown below, fills the array in the same manner, but without data dependencies. The new loop benefits from using the SIMD instructions generated by the compiler.

Example
#pragma omp parallel for for (i=0; i < NumElements; i++) { array[i] = StartVal + i; }

Observe that the code is not 100% identical because the value of variable StartVal is not incremented. As a result, when the parallel loop is finished, the variable will have a value different from the one produced by the serial version. If the value of StartVal is needed after the loop, the additional statement, shown below, is needed.

Example
// This works and is identical to the serial version. #pragma omp parallel for for (i=0; i < NumElements; i++) { array[i] = StartVal + i; } StartVal += NumElements;

OpenMP* Tasking Model

The OpenMP* tasking model enables you to parallelize a large range of applications. You can use several OpenMP* pragmas for tasking.

The omp task Pragma

The omp task pragma has the following syntax:

#pragma omp task [clause[[,] clause] ...] new-line

structured-block

where clause is one of the following:

if(scalar-expression)
final (scalar expression)
untied
default(shared | none)
mergeable
private(list)
firstprivate(list)
shared(list)

The #pragma omp task defines an explicit task region as shown in the following example:

Example
void test1(LIST head) { #pragma intel omp parallel shared(head) { #pragma omp single { LIST p = head; while (p != NULL) { #pragma omp task firstprivate(p) { do_work1(p); } p = p->next; } } }

The binding thread set of the task region is the current parallel team. A task region binds to the innermost enclosing PARALLEL region. When a thread encounters a task construct, a task is generated from the structured block enclosed in the construct. The encountering thread may immediately execute the task, or defer its execution. A task construct may be nested inside an outer task, but the task region of the inner task is not a part of the task region of the outer task.

Using #pragma omp task clauses

The#pragma omp task takes an optional comma-separated list of clauses. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply. The example below shows a way to generate N tasks with one thread and execute them with the threads in the parallel team:

Example
#pragma omp parallel shared(data) { #pragma omp single private(i) { for (i=0, i<N; i++) { #pragma omp task firstprivate(i shared(data) { do_work(data(i)); } } } }

Task scheduling

When a thread reaches a task scheduling point, it may perform a task switch, beginning or resuming execution of a different task bound to the current team. Task scheduling points are implied at the following locations:

the point immediately following the generation of an explicit task.
after the last instruction of a task region.
in a taskwait region.
in implicit and explicit barrier regions.

When a thread encounters a task scheduling point it may do one of the following:

begin execution of a tied task bound to the current team.
resume any suspended task region, bound to the current team, to which it is tied.
begin execution of an untied task bound to the current team.
resume any suspended untied task region bound to the current team.

If more than one of the above choices is available, it is unspecified as to which will be chosen.

Task scheduling constraints

An explicit task whose construct contained an if clause whose if clause expression evaluated to false is executed immediately after generation of the task.
Other scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread, and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant of every task in the set. A program relying on any other assumption about task scheduling is non-conforming.

Note

Task scheduling points dynamically divide task regions into parts. Each part is executed from start to finish without interruption. Different parts of the same task region are executed in the order in which they are encountered. In the absence of task synchronization constructs, the order in which a thread executes parts of different schedulable tasks is unspecified.

A correct program must behave correctly and consistently with all conceivable scheduling sequences that are compatible with the rules above.

The omp taskwait Pragma

The #pragma omp taskwait specifies a wait on the completion of child tasks generated since the beginning of the current task. A taskwait region binds to the current task region. The binding thread set of the taskwait region is the encountering thread.

The taskwait region includes an implicit task scheduling point in the current task region. The current task region is suspended at the task scheduling point until execution of all its child tasks generated before the taskwait region are completed.

Example
#pragma omp task { ... #pragma omp task { do_work1(); } #pragma omp task { ... #pragma omp task { do_work2(); } ... } #pragma omp taskwait ... }

The omp taskyield Pragma

The #pragma omp taskyield specifies that the current task can be suspended at that point and the thread may switch to the execution of a different task. You can use this pragma to provide an explicit task scheduling point at a particular point in the task.