Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help

OpenMP* Analysis from the Command Line

Use the Intel® VTune™ Amplifier command line interface to analyze the performance of OpenMP* applications compiled with the Intel® Compiler.

Prerequisites:

OpenMP is a fork-join parallel model: an OpenMP program starts with a single master thread executing serial code. When a parallel region is encountered, that thread forks into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for work coordination. However, it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code.
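The fork-join pattern described above can be sketched outside OpenMP. The following Python analogy is an illustration only: it uses the standard-library `concurrent.futures` thread pool in place of an OpenMP runtime, and the names `NUM_WORKERS`, `square_chunk`, and `run` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # hypothetical team size, not taken from the document


def square_chunk(chunk):
    """Work done by one worker inside the 'parallel region'."""
    return [x * x for x in chunk]


def run():
    # Serial code: only the main ("master") thread runs here.
    data = list(range(16))
    chunks = [data[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

    # Fork: the pool hands the chunks to worker threads.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        partial = list(pool.map(square_chunk, chunks))
    # Join: leaving the `with` block waits for all workers,
    # which plays the role of the barrier at the end of a region.

    # Serial code again: the main thread combines the results.
    return sum(sum(p) for p in partial)


if __name__ == "__main__":
    print(run())
```

A real OpenMP program expresses the same structure with compiler directives; the sketch only shows the alternation of serial code, fork, join, and serial code.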

Ideally, parallelized applications have working threads doing useful work from the beginning to the end of execution, utilizing 100% of available CPU core processing time. In practice, useful CPU utilization is likely to be lower because working threads spend time waiting, either actively spinning (for performance, when a short wait is expected) or waiting passively without consuming CPU. There are several major reasons why working threads wait instead of doing useful work:

VTune Amplifier, together with Intel Composer XE 2013 Update 2 or higher, helps you understand how an application utilizes available CPUs and identify causes of CPU underutilization.

Configuring and Running an Analysis

Use the following syntax to run the OpenMP analysis from the command line:

amplxe-cl <-action> <analysis_type> -knob analyze-openmp=true [[--] <target>]

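where <-action> is the action to perform (for example, -collect), <analysis_type> is the analysis type to run, -knob analyze-openmp=true enables the OpenMP analysis, and <target> is the application to analyze.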

Note

Only the following analysis types support OpenMP analysis: Basic Hotspots, Advanced Hotspots, Concurrency, HPC Performance Characterization, General Exploration, Memory Access, and any Custom Analysis. The HPC Performance Characterization analysis has the analyze-openmp knob enabled by default.

The HPC Performance Characterization analysis is recommended for analyzing OpenMP applications. Unlike other analysis types, it generates a summary report with OpenMP metrics and descriptions of detected performance issues.

Example

The following command runs the HPC Performance Characterization analysis (OpenMP analysis is enabled by default):

$ amplxe-cl -collect hpc-performance -- myApp
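When the collection is launched from a script, the command line can be assembled as an argument list. The sketch below follows only the syntax shown in this topic; the helper name `build_collect_command` and its parameters are hypothetical, and the code only builds and prints the command rather than running it.

```python
import shlex


def build_collect_command(analysis_type, target, target_args=(),
                          enable_openmp_knob=True):
    """Build an amplxe-cl collect command as a list of arguments.

    Hypothetical helper: it mirrors the syntax
    amplxe-cl <-action> <analysis_type> -knob analyze-openmp=true [[--] <target>]
    with -collect as the action.
    """
    cmd = ["amplxe-cl", "-collect", analysis_type]
    if enable_openmp_knob:
        # Not needed for hpc-performance, where the knob is on by default.
        cmd += ["-knob", "analyze-openmp=true"]
    cmd += ["--", target, *target_args]
    return cmd


# Reproduces the example above: the knob is omitted because
# HPC Performance Characterization enables it by default.
cmd = build_collect_command("hpc-performance", "myApp",
                            enable_openmp_knob=False)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the collection on a system where amplxe-cl is installed and on the PATH.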

Viewing Summary Report Data

When the data collection is complete, VTune Amplifier automatically generates the summary report. Similar to the Summary window available in the GUI, the summary report provides overall performance data for your target.

Use the following syntax to generate the Summary report from a preexisting result:

$ amplxe-cl -report summary -result-dir <result_path>

For HPC Performance Characterization analysis, the command-line summary report provides an issue description for metrics that exceed the predefined threshold. If you want to skip issues in the summary report, do one of the following:

Explore the OpenMP Analysis section of the summary report for inefficiencies in parallelization of the application:

Serial Time: 0.069s (0.3%)
Parallel Region Time: 23.113s (99.7%)
    Estimated Ideal Time: 14.010s (60.4%)
    OpenMP Potential Gain: 9.103s (39.3%)
     | The time wasted on load imbalance or parallel work arrangement is
     | significant and negatively impacts the application performance and
     | scalability. Explore OpenMP regions with the highest metric values.
     | Make sure the workload of the regions is enough and the loop schedule
     | is optimal.

This section shows the Collection Time as well as the duration of serial (outside of any parallel region) and parallel portions of the program. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible.
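The effect of serial sections on scaling can be estimated with Amdahl's Law, speedup = 1 / (s + (1 - s) / N), where s is the serial fraction of the single-threaded run time and N is the number of threads. The sketch below is a minimal illustration; the function name `amdahl_speedup` and the serial fractions in the loop are assumptions chosen for the example, not values from the report.

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Upper bound on speedup according to Amdahl's Law.

    serial_fraction: share of single-threaded run time that stays serial (0..1).
    n_threads: number of threads executing the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


# Hypothetical serial fractions, evaluated for 24 threads.
for s in (0.01, 0.05, 0.10):
    bound = amdahl_speedup(s, 24)
    print(f"serial fraction {s:.2f}: at most {bound:.1f}x on 24 threads, "
          f"never more than {1 / s:.0f}x on any thread count")
```

Even a serial fraction of a few percent caps the achievable speedup well below the thread count, which is why serial time matters most on high thread-count machines.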

Estimating Potential Gain

To estimate the efficiency of CPU utilization in the parallel part of the code, use the Potential Gain metric. This metric estimates the difference in the Elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving parallel execution.
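A simplified model can make the metric concrete. The sketch below assumes that a region's elapsed time equals the busy time of its slowest thread and that the ideal time is the same total work spread evenly over all threads with zero runtime overhead; this is an illustration of the idea, not the tool's actual computation. The function name and the per-thread times are hypothetical, while the last three values are taken from the sample summary report above.

```python
def potential_gain(per_thread_busy_s):
    """Return (actual, ideal, gain) for one parallel region.

    Simplified model (an assumption): the region lasts as long as its
    slowest thread; ideally the same work is split evenly across threads.
    """
    actual = max(per_thread_busy_s)
    ideal = sum(per_thread_busy_s) / len(per_thread_busy_s)
    return actual, ideal, actual - ideal


# Hypothetical imbalanced region: one thread has twice the work of the others.
actual, ideal, gain = potential_gain([4.0, 2.0, 2.0, 2.0])
print(f"actual {actual:.3f}s, ideal {ideal:.3f}s, potential gain {gain:.3f}s")

# Values from the sample summary report: gain = parallel time - ideal time.
parallel_region_time = 23.113
estimated_ideal_time = 14.010
summary_gain = round(parallel_region_time - estimated_ideal_time, 3)
print(f"summary potential gain {summary_gain:.3f}s")
```

In the hypothetical region, balancing the work would shorten the region from 4.0 s to 2.5 s. For the sample report, the difference between Parallel Region Time and Estimated Ideal Time matches the reported OpenMP Potential Gain.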

Use the hotspots report to identify the hottest program units. Use the following command to list the top five parallel regions with the highest Potential Gain metric values:

$ amplxe-cl -report hotspots -result-dir r001hpc -group-by=region -sort-desc="OpenMP Potential Gain:Self" -column="OpenMP Potential Gain:Self" -limit 5

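where -result-dir specifies the result to report on, -group-by=region groups the data by OpenMP region, -sort-desc sorts the rows in descending order of the specified column, -column selects the columns to display, and -limit 5 restricts the output to the first five rows.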

The command above produces the following output:

OpenMP Region                                                     OpenMP Potential Gain
----------------------------------------------------------------  ---------------------
compute_rhs_$omp$parallel:24@/root/work/apps/OMP/SP/rhs.f:17:433                 3.417s
x_solve_$omp$parallel:24@/root/work/apps/OMP/SP/x_solve.f:27:315                 0.920s
z_solve_$omp$parallel:24@/root/work/apps/OMP/SP/z_solve.f:31:321                 0.913s
y_solve_$omp$parallel:24@/root/work/apps/OMP/SP/y_solve.f:27:310                 0.806s
pinvr_$omp$parallel:24@/root/work/apps/OMP/SP/pinvr.f:20:41                      0.697s
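Text output of this form can be post-processed in a script. The sketch below parses the sample output shown above into (region, seconds) pairs; the function name `parse_gain_report` is hypothetical, and the parser assumes exactly this two-column layout with a dashed separator line and values ending in "s".

```python
REPORT = """\
OpenMP Region                                                     OpenMP Potential Gain
----------------------------------------------------------------  ---------------------
compute_rhs_$omp$parallel:24@/root/work/apps/OMP/SP/rhs.f:17:433                 3.417s
x_solve_$omp$parallel:24@/root/work/apps/OMP/SP/x_solve.f:27:315                 0.920s
z_solve_$omp$parallel:24@/root/work/apps/OMP/SP/z_solve.f:31:321                 0.913s
y_solve_$omp$parallel:24@/root/work/apps/OMP/SP/y_solve.f:27:310                 0.806s
pinvr_$omp$parallel:24@/root/work/apps/OMP/SP/pinvr.f:20:41                      0.697s
"""


def parse_gain_report(text):
    """Parse a two-column hotspots report into (region, gain_seconds) tuples."""
    rows = []
    seen_separator = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if not seen_separator:
            # Skip the header; data rows start after the dashed line.
            seen_separator = line.lstrip().startswith("---")
            continue
        # The last whitespace-separated field is the gain, e.g. "3.417s".
        region, gain = line.rsplit(None, 1)
        rows.append((region.strip(), float(gain.rstrip("s"))))
    return rows


rows = parse_gain_report(REPORT)
total = sum(gain for _, gain in rows)
for region, gain in rows:
    print(f"{gain:6.3f}s  {gain / total:5.1%}  {region}")
```

The percentages printed here are shares of the top five regions only, not of the whole application.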

If the Potential Gain for a region is significant, you can go deeper and analyze inefficiency metrics, such as Imbalance, broken down by barrier. Use the following command:

$ amplxe-cl -report hotspots -result-dir r001hpc -group-by=region,barrier -sort-desc="OpenMP Potential Gain:Self" -column="OpenMP Potential Gain" -limit 5

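where -group-by=region,barrier groups the data by OpenMP region and then by barrier-to-barrier segment, and -column="OpenMP Potential Gain" displays the columns whose names contain that string.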

The command above produces the following output:


OpenMP Region                                                    OpenMP Barrier-to-Barrier Segment                                   OpenMP Potential Gain  OpenMP Potential Gain:Imbalance  OpenMP Potential Gain:Lock Contention  OpenMP Potential Gain:Creation  OpenMP Potential Gain:Scheduling  OpenMP Potential Gain:Reduction  OpenMP Potential Gain:Atomics  OpenMP Potential Gain:Tasking  OpenMP Potential Gain:Other
---------------------------------------------------------------  ------------------------------------------------------------------  ---------------------  -------------------------------  -------------------------------------  ------------------------------  --------------------------------  -------------------------------  -----------------------------  -----------------------------  ---------------------------
compute_rhs_$omp$parallel:24@/root/work/OMP/SP/rhs.f:17:433       compute_rhs_$omp$loop_barrier_segment@/root/work/OMP/SP/rhs.f:285                 0.985s                           0.982s                                     0s             0s                                0.000s                               0s                             0s                             0s                 0.003s
x_solve_$omp$parallel:24@/home/root/work/OMP/SP/x_solve.f:27:315  x_solve_$omp$loop_barrier_segment@/root/work/OMP/SP/x_solve.f:315                 0.920s                           0.904s                                 0.012s         0.000s                            0.000s                               0s                             0s                             0s                 0.004s
z_solve_$omp$parallel:24@/root/work/OMP/SP/z_solve.f:31:321       z_solve_$omp$loop_barrier_segment@/root/work/OMP/SP/z_solve.f:321                 0.913s                           0.910s                                 0.000s         0.000s                            0.000s                               0s                             0s                             0s                 0.003s
y_solve_$omp$parallel:24@/root/work/OMP/SP/y_solve.f:27:310       y_solve_$omp$loop_barrier_segment@/root/work/OMP/SP/y_solve.f:310                 0.806s                           0.803s                                 0.000s         0.000s                                0s                               0s                             0s                             0s                 0.002s

Analyze the OpenMP Potential Gain columns, which break down the Potential Gain of a region by type of inefficiency. Each column reports the cost of an inefficiency in elapsed time, normalized by the number of OpenMP threads. The elapsed time cost helps you decide whether addressing a particular type of inefficiency is worth the investment. As the column names in the report above show, VTune Amplifier recognizes the following types of inefficiencies: Imbalance, Lock Contention, Creation, Scheduling, Reduction, Atomics, Tasking, and Other.
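The breakdown can be read programmatically to find which inefficiency dominates a region. The sketch below uses the x_solve row of the sample output above; the function names are hypothetical, and the input to `elapsed_cost` is an invented value that only illustrates the normalization by thread count.

```python
# Breakdown columns for the x_solve row of the sample report (seconds).
x_solve = {
    "Imbalance": 0.904,
    "Lock Contention": 0.012,
    "Creation": 0.000,
    "Scheduling": 0.000,
    "Reduction": 0.0,
    "Atomics": 0.0,
    "Tasking": 0.0,
    "Other": 0.004,
}
x_solve_total = 0.920  # "OpenMP Potential Gain" column for the same row


def dominant_inefficiency(breakdown):
    """Return (name, share) of the largest component of the breakdown."""
    name = max(breakdown, key=breakdown.get)
    return name, breakdown[name] / sum(breakdown.values())


def elapsed_cost(summed_thread_time_s, n_threads):
    """Normalize time summed over all threads to an elapsed-time cost."""
    return summed_thread_time_s / n_threads


name, share = dominant_inefficiency(x_solve)
print(f"{name} accounts for {share:.1%} of {x_solve_total:.3f}s")

# Hypothetical input: 24.0 s of waiting summed over 24 threads
# corresponds to 1.0 s of elapsed time.
print(f"{elapsed_cost(24.0, 24):.3f}s")
```

Because reported values are rounded to milliseconds, the components of a row may differ from its total by about a millisecond.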

Limitations

VTune Amplifier supports the analysis of parallel OpenMP regions with the following limitations:

