Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
HPC Performance Characterization analysis helps identify how effectively your compute-intensive application uses CPU, memory, and floating-point operation hardware resources. The HPC Performance Characterization analysis type can be used as a starting point for understanding the performance aspects of your application. Additional scalability metrics are available for applications that use Intel OpenMP* or Intel MPI runtime libraries. The analysis can be run from within the VTune Amplifier GUI or from the command line.
During HPC Performance Characterization analysis, the Intel® VTune™ Amplifier data collector profiles your application using event-based sampling collection. OpenMP analysis metrics for the Intel OpenMP runtime library are based on User API instrumentation enabled in the runtime library.
Typically, the collector gathers data for a specified application, but it can also collect system-wide performance data with limited detail if required.
FPU and GFLOPS metrics are supported on 3rd Generation Intel Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors. Limited support is available for Intel® Xeon Phi™ processors formerly code named Knights Landing. The metrics are not currently available on 4th Generation Intel processors. Expand the Details section on the analysis configuration pane to view the processor family available on your system.
To use the HPC Performance Characterization analysis, explore:
Configuration options (knobs)
To configure options for the HPC Performance Characterization analysis:
Click the New Analysis button on the Intel® VTune™ Amplifier toolbar.
The New Amplifier Result tab opens with the Analysis Type tab active.
Select the Compute-Intensive Application Analysis > HPC Performance Characterization analysis type from the analysis tree on the left pane.
The HPC Performance Characterization pane opens on the right.
Configure the following options:
Option | Description |
---|---|
CPU sampling interval, ms field | Specify an interval (in milliseconds) between CPU samples. Possible values: 0.01 to 1000. The default value is 1. |
Collect stacks check box | Enable advanced collection of call stacks and thread context switches. The default value is false. |
Analyze memory bandwidth check box | Collect the data required to compute memory bandwidth. The default value is true. |
Evaluate max DRAM bandwidth check box | Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The default value is true. |
Analyze OpenMP regions check box | Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. The default value is true. |
Details button | Expand/collapse a section listing the default non-editable settings used for this analysis type. To modify these settings for the analysis, create a custom configuration by right-clicking the analysis entry in the analysis tree and selecting Copy from Current from the context menu. VTune Amplifier creates an editable copy of this analysis type configuration and places it under the Custom Analysis branch in the analysis tree. |
You may generate the command line for this configuration using the Command Line... button at the bottom.
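For scripted runs, the options above can be mapped onto a command line. The following is a minimal sketch, not output produced by VTune Amplifier. It assumes the `amplxe-cl` command-line tool, the `hpc-performance` analysis type name, and the knob names listed in the code; verify all of them against the output of the Command Line... button for your installed version. The helper name `build_hpc_command` and the sample application path are hypothetical.

```python
import shlex


def build_hpc_command(app_args, sampling_interval_ms=1.0, collect_stacks=False,
                      memory_bandwidth=True, max_dram_bandwidth=True,
                      analyze_openmp=True):
    """Return an argument list for an HPC Performance Characterization run.

    The defaults mirror the default values documented for the GUI options.
    """
    # Documented range for the CPU sampling interval: 0.01 to 1000 ms.
    if not 0.01 <= sampling_interval_ms <= 1000:
        raise ValueError("sampling interval must be between 0.01 and 1000 ms")

    def flag(value):
        return "true" if value else "false"

    # ASSUMPTION: these knob names are not taken from this help page.
    # Confirm them with the Command Line... button before relying on them.
    knobs = {
        "sampling-interval": f"{sampling_interval_ms:g}",
        "enable-stack-collection": flag(collect_stacks),
        "collect-memory-bandwidth": flag(memory_bandwidth),
        "dram-bandwidth-limits": flag(max_dram_bandwidth),
        "analyze-openmp": flag(analyze_openmp),
    }

    cmd = ["amplxe-cl", "-collect", "hpc-performance"]
    for name, value in knobs.items():
        cmd += ["-knob", f"{name}={value}"]
    # Everything after "--" is the profiled application and its arguments.
    return cmd + ["--"] + list(app_args)


if __name__ == "__main__":
    # Hypothetical application; prints the command instead of running it.
    print(shlex.join(build_hpc_command(["./my_app", "--size", "1024"])))
```

Building the command as a list rather than a single string keeps the result safe to pass to `subprocess.run` without shell quoting concerns.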
You can choose to view HPC Performance Characterization analysis results in any of the following viewpoints:
Viewpoint | Description |
---|---|
HPC Performance Characterization | Helps understand how effectively your application uses CPU, memory, and floating-point operation resources. Use this view to identify scalability issues for Intel OpenMP and MPI runtimes as well as next steps to increase memory and FPU efficiency. |
Hardware Events | Displays statistics of monitored hardware events: estimated count and/or the number of samples collected. Use this view to identify code regions (modules, functions, code lines, and so on) with the highest activity for an event of interest. |
Hardware Issues | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware performance counters. Hover over the highlighted metric values in the grid to read why the extreme value might represent a performance problem. |
Hotspots | Helps identify hotspots: code regions in the application that consume a lot of CPU time. |
Memory Usage | Helps understand how effectively your application uses memory resources and identify potential memory access related issues such as excessive access to remote memory on NUMA platforms, hitting the DRAM or Intel® QuickPath Interconnect (Intel QPI) bandwidth limit, and others. It provides various performance metrics for both the application code and memory objects such as arrays. |
General Exploration | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware events. The Summary window reports the overall metrics for the entire execution along with explanations of the metrics. From the Bottom-up and Top-down Tree windows you can locate the hardware issues in your application. Cells are highlighted when potential opportunities to improve performance are detected. Hover over the highlighted metrics in the grid to see explanations of the issues. |
Each viewpoint consists of the following windows/panes:
Summary window displays statistics on the overall application execution, identifying CPU time and processor utilization.
Bottom-up window displays functions in the bottom-up tree, CPU time and CPU utilization per function.
Top-down Tree window displays functions in the call tree, performance metrics for a function only (Self value) and for a function and its children together (Total value).
Caller/Callee window displays parent and child functions of the selected focus function.
Use the HPC Performance Characterization viewpoint to review the following:
CPU Utilization: Look for scalability problems involving the use of serial time versus parallel time. Identify hotspot functions by CPU utilization.
Memory Bound: Evaluate whether the application is memory bound. To understand deeper problems, run the Memory Access Analysis to identify specific memory objects causing issues.
FPU Utilization: Determine whether floating-point loops are bandwidth bound or vectorized. For bandwidth-bound loops or functions, run the Memory Access Analysis to find ways to reduce bandwidth consumption. For vectorization optimization opportunities, use Intel Advisor to run a vectorization analysis.
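To illustrate why serial time matters when reviewing CPU utilization, the following sketch applies Amdahl's law. This is a general scalability model, not a metric computed by VTune Amplifier. It assumes you have estimated the fraction of single-threaded run time spent in serial code; the function name and the sample numbers are hypothetical.

```python
def amdahl_speedup(serial_fraction, threads):
    """Best-case speedup over a single-threaded run under Amdahl's law.

    serial_fraction: share of single-threaded run time that cannot be
                     parallelized (0.0 to 1.0).
    threads:         number of threads used for the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Hypothetical example: 10% serial code on 16 and on 64 threads.
    for n in (16, 64):
        print(f"{n} threads: speedup limit {amdahl_speedup(0.10, n):.2f}x")
```

Even a small serial fraction caps the achievable speedup, which is why a large serial time in the Summary window is usually worth investigating before tuning the parallel regions themselves.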
Use the Analyzing an OpenMP* and MPI Application tutorial to review basic steps for tuning a hybrid application. The tutorial is available from the Intel Developer Zone at https://software.intel.com/en-us/itac-vtune-mpi-openmp-tutorial-lin.