Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
Identify serial code in your OpenMP application and estimate potential gain from optimization for the detected inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction and atomic operations.
Intel® VTune™ Amplifier provides OpenMP analysis data in the HPC Performance Characterization(VTune Amplifier XE only), Hotspots, and Hotspots by CPU Usage viewpoints. Consider the following steps for the OpenMP analysis:
Start your analysis with understanding the CPU utilization of your analysis target. If you are using the HPC Performance Characterization viewpoint, focus on the CPU Utilization section of the Summary window that shows the number of used logical CPUs and estimates the efficiency (in percent) of this CPU utilization. Poor CPU utilization is flagged as a performance issue.
Other viewpoints provide the CPU Usage Histogram that displays the Elapsed time of your application, broken down by CPU utilization levels. The histogram shows only useful utilization so the CPU cycles that were spent by the application burning CPU in spin loops (active wait) are not counted. You can adjust sliders from the default levels if you intentionally use a number of OpenMP working threads less than the number of available hardware threads.
If the bars are close to Ideal utilization, you might need to look deeper, at algorithm or microarchitecture tuning opportunities, to find performance improvements. If not, explore the OpenMP Analysis section of the Summary window for inefficiencies in parallelization of the application:
This section of the Summary window shows the Collection Time as well as the duration of serial (outside of any parallel region) and parallel portions of the program. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible.
Use the OpenMP Region Duration histogram in the Summary window to analyze instances of an OpenMP region, explore the time distribution of instance durations and identify Fast/Good/Slow region instances. Initial distribution of region instances by Fast/Good/Slow categories is done as a ratio of 20/40/20 between min and max region time values. Adjust the thresholds as needed.
Use this data for further detailed analysis in the grid views with OpenMP Region/OpenMP Region Duration Type/... grouping levels.
To analyze the serially executed code, switch to the Bottom-up window, select the /OpenMP Region/Thread/Function grouping, filter the view by the OMP Master Thread of the [Serial - outside any region] row and click the Effective Time by Utilization column header to sort the data by CPU time utilization:
To estimate the efficiency of CPU utilization in the parallel part of the code, use the Potential Gain metric. This metric estimates the difference in the Elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving parallel execution.
The Summary window provides a detailed table listing the top five parallel regions with the highest Potential Gain metric values. For each parallel region defined by the pragma #omp parallel, this metric is a sum of potential gains of all instances of the parallel region.
If Potential Gain for a region is significant, you can go deeper and select the link on a region name to navigate to the Bottom-up window employing the /OpenMP Region/OpenMP Barrier-to-Barrier Segment/.. dominant grouping that provides detailed analysis of inefficiency metrics like Imbalance by barriers.
Intel OpenMP runtime from Intel Parallel Studio instruments barriers for the VTune Amplifier. VTune Amplifier introduces a notion of barrier-to-barrier OpenMP region segment that spans from a region fork point or previous barrier to the barrier that defines the segment.
In the example above, there are four barrier-to-barrier segments defined as a user barrier, implicit single barrier, implicit omp for loop barrier and region join barrier.
For the cases when an OpenMP region contains multiple barriers either implicit with parallel loops or #pragma single sections, or explicit with user barriers, analyze the impact of a particular construct or a barrier to inefficiency metrics.
A barrier type is embedded to the segment name, for example: loop, single, reduction, and others. It also emits additional information for parallel loops with implicit barriers like loop scheduling, chunk size and min/max/average of the loop iteration counts that is useful to understand imbalance or scheduling overhead nature. The loop iteration count information is also helpful to identify problems with underutilization of worker threads with small number of iterations that can be a result of outer loop parallelization. Consider inner loop parallelization or "collapse" clause to saturate the working threads in this case.
Analyze the Potential Gain column data that shows a breakdown of Potential Gain in the region by representing the cost (in elapsed time) of the inefficiencies with a normalization by the number of OpenMP threads. Elapsed time cost helps decide whether you need to invest into addressing a particular type of inefficiency. VTune Amplifier can recognize the following types of inefficiencies:
Imbalance: threads are finishing their work in different time and waiting on a barrier. If imbalance time is significant, try dynamic type of scheduling. Intel OpenMP runtime library from Intel Parallel Studio Composer Edition reports precise imbalance numbers and the metrics do not depend on statistical accuracy as other inefficiencies that are calculated based on sampling.
Lock Contention: threads are waiting on contended locks or "ordered" parallel loops. If the time of lock contention is significant, try to avoid synchronization inside a parallel construct with reduction operations, thread local storage usage, or less costly atomic operations for synchronization.
Creation: overhead on a parallel work arrangement. If the time for parallel work arrangement is significant, try to make parallelism more coarse-grain by moving parallel regions to an outer loop.
Scheduling: OpenMP runtime scheduler overhead on a parallel work assignment for working threads. If scheduling time is significant, which often happens for dynamic types of scheduling, you can use a "dynamic" schedule with a bigger chunk size or "guided" type of schedule.
Atomics: OpenMP runtime overhead on performing atomic operations.
Reduction: time spent on reduction operations.
Tasking: time spent allocating and completing tasks. If the time is significant, consider increasing task granularity to reduce overhead.
If the Potential Gain column is not expandable for earlier versions of Intel OpenMP runtime, analyze the corresponding CPU Time metric breakdown.
To analyze the source of a performance-critical OpenMP parallel region, double-click the region identifier in the grid, sorted by the OpenMP Region/.. grouping level. VTune Amplifier opens the source view at the beginning of the selected OpenMP region in the pseudo function created by the Intel compiler.
By default, the Intel compiler does not add a source file name to region names, so the unknown string shows up in the OpenMP parallel region name. To get the source file name in the region name, use the -parallel-source-info=2 option during compilation.
For MPI analysis result including more than one process with OpenMP regions, the Summary window shows a section with top processes laying on a critical path of MPI execution with Serial Time and OpenMP Potential Gain aggregated per process:
Clicking the process name links leads you to the Bottom-up window grouped by /Process/OpenMP Region/.. where you can get more details on OpenMP inefficiencies for MPI ranks.
To explore an impact from multiple synchronization objects inside a region (for example, #pragma omp critical), use the Locks and Waits predefined analysis type. The Locks and Waits analysis is based on synchronization object function tracing, so big contention on synchronization objects can cause significant runtime overhead because of the analysis. Try to avoid synchronization inside regions using OpenMP reduction or thread local storage where possible.