Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help

Analyzing MPI Applications to Improve Performance

Parallel High Performance Computing (HPC) applications often rely on the multi-node architectures of modern clusters. Performance tuning of such applications must involve analysis of cross-node application behavior as well as single-node performance analysis. Intel® Parallel Studio Cluster Edition includes performance analysis tools such as MPI Performance Snapshot, Intel Trace Analyzer and Collector, and Intel VTune™ Amplifier that provide important insights for MPI application performance analysis.

Note

The version of the Intel MPI library included with the Intel Parallel Studio Cluster Edition makes an important switch to use the Hydra process manager by default for mpirun. This provides high scalability across a large number of nodes.

This topic focuses on how to use the VTune Amplifier command line tool to analyze an MPI application. Refer to the Additional Resources section below to learn more about other analysis tools.

When you start analyzing hybrid codes that combine parallel MPI processes with threading for more efficient use of computing resources, use the VTune Amplifier for single-node analysis, including threading analysis. HPC Performance Characterization analysis is a good starting point to understand CPU utilization, memory access, and vectorization efficiency aspects and to define a tuning strategy that addresses performance gaps. The CPU Utilization section contains the MPI Imbalance metric, which is calculated for MPICH-based MPI implementations. Further steps might include the Intel Trace Analyzer and Collector to look at MPI communication efficiency, Memory Access analysis to dig deeper into memory issues, General Exploration analysis to explore microarchitecture issues, or Intel Advisor to dive into vectorization tuning specifics.

Follow these basic steps to analyze MPI applications for imbalance issues with the VTune Amplifier:

  1. Configure installation for MPI analysis.

  2. Configure and run MPI analysis with the VTune Amplifier.

  3. Resolve symbols for MPI modules.

  4. View collected data.

Explore additional information on MPI analysis in the sections below.

Configuring Installation for MPI Analysis

For MPI application analysis on a Linux* cluster, you may enable the Per-user Hardware Event-based Sampling mode when installing the Intel Parallel Studio Cluster Edition. This option ensures that the VTune Amplifier collects data only for the current user. Once enabled by the administrator during installation, this mode cannot be turned off by a regular user, which is intentional: it precludes individual users from observing performance data for the whole node, including the activities of other users.

After installation, you can use the respective -vars.sh files to set up the appropriate environment (PATH, MANPATH) in the current terminal session.
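For example, a minimal setup sketch for a bash session, assuming a typical default installation layout (the paths below are assumptions and may differ on your cluster):

$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
$ source /opt/intel/impi/<version>/intel64/bin/mpivars.sh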

Configuring MPI Analysis with the VTune Amplifier

To collect performance data for an MPI application with the VTune Amplifier, use the command line interface (amplxe-cl). The collection configuration can be completed with the help of the Analysis Target configuration options in the VTune Amplifier user interface. For more information, see Arbitrary Targets Configuration.

Usually, MPI jobs are started using an MPI launcher such as mpirun, mpiexec, srun, aprun, etc. The examples provided use mpirun. A typical MPI job uses the following syntax:

mpirun [options] <program> [<args>]

In this case, amplxe-cl is launched as <program> for the MPI launcher, and your application is passed as an argument to the VTune Amplifier command. As a result, launching an MPI application with the VTune Amplifier uses the following syntax:

mpirun [options] amplxe-cl [options] <program> [<args>]

Some options of mpirun and amplxe-cl must be specified or are highly recommended, while others can be left at their default settings. A typical command uses the following syntax:

mpirun -n <n> -l amplxe-cl -quiet -collect <analysis_type> -trace-mpi -result-dir <my_result> my_app [<my_app_options>]

The mpirun options include:

-n <n> to set the number of MPI processes to launch.
-l to prefix each line of the application output with the MPI rank that produced it.

The amplxe-cl options include:

-quiet (or -q) to suppress non-essential diagnostic messages.
-collect <analysis_type> to specify the analysis type to run.
-trace-mpi to add per-node suffixes to result directories and associate the collected data with MPI ranks.
-result-dir <my_result> to specify the directory where the collected data is stored.

If an MPI application is launched on multiple nodes, the VTune Amplifier creates a separate result directory for each compute node in the current directory, named my_result.<hostname1>, my_result.<hostname2>, ... my_result.<hostnameN>, each encapsulating the data for all the ranks running on that node. For example, the Advanced Hotspots analysis run on 4 nodes collects data on each compute node:

> mpirun -n 16 -ppn 4 -l amplxe-cl -collect advanced-hotspots -trace-mpi -result-dir my_result -- my_app.a

Data for each process is stored in the result directory for the node it ran on:

my_result.host_name1 (rank 0-3)
my_result.host_name2 (rank 4-7)
my_result.host_name3 (rank 8-11)
my_result.host_name4 (rank 12-15)

If you want to profile particular ranks (for example, outlier ranks identified by MPI Performance Snapshot), use selective rank profiling: use a multi-binary MPI run and apply VTune Amplifier profiling only to the ranks of interest. This significantly reduces the amount of data to process and analyze. The following example collects Memory Access analysis data for 2 out of 16 processes, profiling one rank per node:

$ export VTUNE_CL="amplxe-cl -collect memory-access -trace-mpi -result-dir my_result"
$ mpirun -host myhost1 -n 7 my_app.a : -host myhost1 -n 1 $VTUNE_CL -- my_app.a : -host myhost2 -n 7 my_app.a : -host myhost2 -n 1 $VTUNE_CL -- my_app.a

Alternatively, you can create a configuration file with the following content:

# config.txt configuration file
-host myhost1 -n 7 ./a.out
-host myhost1 -n 1 amplxe-cl -quiet -collect memory-access -trace-mpi -result-dir my_result ./a.out
-host myhost2 -n 7 ./a.out
-host myhost2 -n 1 amplxe-cl -quiet -collect memory-access -trace-mpi -result-dir my_result ./a.out

To run the collection using the configuration file, use the following command:

> mpirun -configfile ./config.txt

If you use Intel MPI version 5.0.2 or later, you can use the -gtool option of the Intel MPI process launcher for easier selective rank profiling:

> mpirun -n <n> -gtool "amplxe-cl -collect <analysis type> -r <my_result>:<rank_set>" <my_app> [my_app_options]

where <rank_set> specifies the range of ranks involved in the tool execution. Separate ranks with a comma or use the "-" symbol for a set of contiguous ranks.

For example:

> mpirun -gtool "amplxe-cl -collect memory-access -result-dir my_result:7,5" my_app.a

Examples:

  1. This example runs the HPC Performance Characterization analysis type (based on the sampling driver), which is recommended as a starting point:

    > mpirun -n 4 amplxe-cl -result-dir my_result -collect hpc-performance -- my_app [my_app_options]

  2. This example collects the Advanced Hotspots data for two out of 16 processes run on myhost2 in the job distributed across the hosts:

    > mpirun -host myhost1 -n 8 ./a.out : -host myhost2 -n 6 ./a.out : -host myhost2 -n 2 amplxe-cl -result-dir foo -c advanced-hotspots ./a.out

    As a result, the VTune Amplifier creates a result directory foo.myhost2 in the current directory (given that process ranks 14 and 15 were assigned to the second node in the job).

  3. As an alternative to the previous example, you can create a configuration file with the following content:

    # config.txt configuration file
    -host myhost1 -n 8 ./a.out
    -host myhost2 -n 6 ./a.out
    -host myhost2 -n 2 amplxe-cl -quiet -collect advanced-hotspots -result-dir foo ./a.out

    and run the data collection as:

    > mpirun -configfile ./config.txt

    to achieve the same result as in the previous example: foo.myhost2 result directory is created.

  4. This example runs the Memory Access analysis with memory object profiling for all ranks on all nodes:

    > mpirun -n 16 -ppn 4 amplxe-cl -r my_result -collect memory-access -knob analyze-mem-objects=true -- my_app [my_app_options]

  5. This example runs Advanced Hotspots analysis on ranks 1, 4-6, 10:

    > mpirun -gtool "amplxe-cl -r my_result -collect advanced-hotspots:1,4-6,10" -n 16 -ppn 4 my_app [my_app_options]

Note

The examples above use the mpirun command as opposed to mpiexec and mpiexec.hydra, while real-world jobs might use the mpiexec* commands. mpirun is a higher-level command that dispatches to mpiexec or mpiexec.hydra depending on the current default and the options passed. All the listed examples work for the mpiexec* commands as well as for mpirun.
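For reference, an equivalent collection launched through mpiexec.hydra directly might look as follows; this is a sketch assuming the Hydra process manager and the same analysis options as in the examples above:

> mpiexec.hydra -n 16 -ppn 4 amplxe-cl -collect hpc-performance -trace-mpi -result-dir my_result -- my_app [my_app_options]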

Resolving Symbols for MPI Modules

After data collection, the VTune Amplifier automatically finalizes the data (resolves symbols and converts them to the database). Finalization happens on the same compute node where the command line collection was executed, so the VTune Amplifier automatically locates binary and symbol files. In cases where you need to point to symbol files stored elsewhere, adjust the search settings using the -search-dir option:

> mpirun -np 128 amplxe-cl -q -collect hotspots -search-dir /home/foo/syms ./a.out

Viewing Collected Data

Once the result is collected, you can open it in the graphical or command line interface of the VTune Amplifier.

To view the results in the command line interface:

Use the -report option. To get the list of all available VTune Amplifier reports, enter amplxe-cl -help report.
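For example, a quick sketch that prints the summary report for one of the per-node results created in the examples above (the result directory name follows the naming convention shown earlier):

> amplxe-cl -report summary -result-dir my_result.host_name1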

To view the results in the graphical interface:

Click the menu button and select Open > Result... and browse to the required result file (*.amplxe).

Tip

You may copy a result to another system and view it there (for example, to open a result collected on a Linux* cluster on a Windows* workstation).
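One possible workflow sketch, assuming the amplxe-cl -archive action is available in your version; the destination host and path in the copy step are placeholders:

> amplxe-cl -archive -result-dir my_result.host_name1
> scp -r my_result.host_name1 user@workstation:/path/to/results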

VTune Amplifier classifies MPI functions as system functions, similar to Intel Threading Building Blocks (Intel TBB) and OpenMP* functions. This approach helps you focus on your code rather than on MPI internals. You can use the Call Stack Mode filter bar combo box in the VTune Amplifier GUI or the call-stack-mode CLI option to display the system functions and thus view and analyze the internals of the MPI implementation (see the command line sketch after the note below). The call stack mode User functions+1 is especially useful for finding the MPI functions that consumed most of the CPU time (Basic Hotspots analysis) or waited the most (Locks and Waits analysis). For example, in the call chain main() -> foo() -> MPI_Bar() -> MPI_Bar_Impl() -> ..., MPI_Bar() is the actual MPI API function you use, and the deeper functions are MPI implementation details. The call stack modes behave as follows:

Note

VTune Amplifier prefixes the profile version of MPI functions with P, for example: PMPI_Init.
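For example, a command line sketch that shows MPI implementation internals one level below the user code; the user-plus-one value is an assumption about the CLI spelling of the User functions+1 mode:

> amplxe-cl -report hotspots -call-stack-mode user-plus-one -result-dir my_result.host_name1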

VTune Amplifier provides Intel TBB and OpenMP support. It is recommended to use these thread-level parallel solutions in addition to MPI-style parallelism to maximize CPU resource usage across the cluster, and to use the VTune Amplifier to analyze the performance of that level of parallelism. The MPI, OpenMP, and Intel TBB features in the VTune Amplifier are functionally independent, so all the usual features of OpenMP and Intel TBB support apply when you look into a result collected for an MPI process. For hybrid OpenMP and MPI applications, the VTune Amplifier displays a summary table listing top MPI ranks with OpenMP metrics sorted by MPI Busy Wait from low to high values. The lower the communication time, the longer a process was on the critical path of MPI application execution. For deeper analysis, explore Interpreting OpenMP* Analysis Data for the MPI processes lying on the critical path.

Example:

This example displays a performance report for functions and modules, which is available for any analysis type. Note that the commands open per-node result directories (result_dir.host1, result_dir.host2) and group data by process (MPI rank) encapsulated in the per-node result:

> amplxe-cl -R hotspots -group-by process,function -r result_dir.host1

> amplxe-cl -R hotspots -group-by process,module -r result_dir.host2

MPI Implementations Support

You can use the VTune Amplifier to analyze both the Intel MPI library implementation and other MPI implementations, but be aware of the following specifics:

MPI System Modules Recognized by the VTune Amplifier

VTune Amplifier uses the following regular expressions in the Perl syntax to classify MPI implementation modules:

Note

This list is provided for reference only. It may change from version to version without any additional notification.

Analysis Limitations

Additional Resources

For more details on analyzing MPI applications, see the Intel Parallel Studio Cluster Edition and online MPI documentation at http://software.intel.com/en-US/articles/intel-mpi-library-documentation/. For information on installing VTune Amplifier in a cluster environment, see the Intel VTune Amplifier XE Installation Guide for Linux.

There are also other resources available online that discuss usage of the VTune Amplifier with other Parallel Studio Cluster Edition tools:

See Also