Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
The General Exploration analysis type uses hardware event-based sampling collection. This analysis is a good starting point to triage hardware issues in your application. Once you have used Basic Hotspots or Advanced Hotspots analysis to determine hotspots in your code, you can perform General Exploration analysis to understand how efficiently your code is passing through the core pipeline. During General Exploration analysis, the VTune Amplifier collects a complete list of events for analyzing a typical client application. It calculates a set of predefined ratios used for the metrics and facilitates identifying hardware-level performance problems.
To use the General Exploration analysis, explore how the analysis works, how to configure and run it, and how to view the collected data, as described in the sections below.
The General Exploration analysis strategy varies by microarchitecture. For modern microarchitectures starting with Intel microarchitecture code name Ivy Bridge, the General Exploration analysis is based on the Top-Down Microarchitecture Analysis Method using the Top-Down Characterization methodology, which is a hierarchical organization of event-based metrics that identifies the dominant performance bottlenecks in an application.
Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and decoded into the operations that constitute them, and the back-end, where the required computation is performed. Each cycle, the front-end generates up to four of these operations. It places them into pipeline slots that then move through the back-end. Thus, for a given execution duration in clock cycles, it is easy to determine the maximum number of pipeline slots containing useful work that can be retired in that duration. The actual number of retired pipeline slots containing useful work, though, rarely equals this maximum. This can be due to several factors: some pipeline slots cannot be filled with useful work, either because the front-end could not fetch or decode instructions in time (Front-end bound execution) or because the back-end was not prepared to accept more operations of a certain kind (Back-end bound execution). Moreover, even pipeline slots that do contain useful work may not retire due to bad speculation. Front-end bound execution may be due to a large code working set, poor code layout, or microcode assists. Back-end bound execution may be due to long-latency operations or other contention for execution resources. Bad speculation is most frequently due to branch misprediction.
Each cycle, each core can fill up to four of its pipeline slots with useful operations. Therefore, for any time interval, it is possible to determine the maximum number of pipeline slots that could have been filled and issued during that interval. This analysis performs this estimate and breaks all pipeline slots into four categories (see the sketch following the list):
Pipeline slots containing useful work that issued and retired (Retired)
Pipeline slots containing useful work that issued and cancelled (Bad speculation)
Pipeline slots that could not be filled with useful work due to problems in the front-end (Front-end Bound)
Pipeline slots that could not be filled with useful work due to a backup in the back-end (Back-end Bound)
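For illustration, the sketch below shows how such a level-1 breakdown can be computed from raw event counts. It assumes the event names and formulas published for the Top-Down Microarchitecture Analysis Method and a four-slot-wide pipeline; the exact events, formulas, and slot width differ between microarchitectures, and VTune Amplifier applies the correct per-CPU formulas for you, so treat this only as a conceptual example.

```c
/* Minimal sketch of the Top-Down level-1 pipeline slot breakdown.
 * Event names follow the published Top-Down method; the raw counts
 * below are hypothetical values for some sampling interval. */
#include <stdio.h>

int main(void) {
    double clocks         = 1.0e9;  /* CPU_CLK_UNHALTED.THREAD      */
    double uops_issued    = 2.8e9;  /* UOPS_ISSUED.ANY              */
    double uops_retired   = 2.5e9;  /* UOPS_RETIRED.RETIRE_SLOTS    */
    double fe_undelivered = 0.4e9;  /* IDQ_UOPS_NOT_DELIVERED.CORE  */
    double recovery_cyc   = 0.05e9; /* INT_MISC.RECOVERY_CYCLES     */

    /* Four pipeline slots are available per core per cycle. */
    double slots = 4.0 * clocks;

    double retiring        = uops_retired / slots;
    double bad_speculation = (uops_issued - uops_retired
                              + 4.0 * recovery_cyc) / slots;
    double frontend_bound  = fe_undelivered / slots;
    double backend_bound   = 1.0 - (retiring + bad_speculation + frontend_bound);

    printf("Retiring:        %.1f%%\n", 100.0 * retiring);
    printf("Bad Speculation: %.1f%%\n", 100.0 * bad_speculation);
    printf("Front-End Bound: %.1f%%\n", 100.0 * frontend_bound);
    printf("Back-End Bound:  %.1f%%\n", 100.0 * backend_bound);
    return 0;
}
```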
To use General Exploration analysis, first determine which top-level category dominates for hotspots of interest. You can then dive into the dominating category by expanding its column. There, you can find many issues that may contribute to that category.
You can also run the General Exploration analysis on other microarchitectures that are NOT covered by the Top-Down Method in the VTune Amplifier:
Intel Microarchitecture Code Name Sandy Bridge: Analysis for this microarchitecture is already partially based on the Top-Down Method, and the VTune Amplifier provides a hierarchical analysis of the hardware metrics based on the following categories: Filled Pipeline Slots and Unfilled Pipeline Slots (Stalls).
Intel® Xeon Phi™ Coprocessor (VTune Amplifier XE only): Perform General Exploration analysis to understand how efficiently your code is utilizing the Intel Xeon Phi coprocessor architecture. The Intel Xeon Phi coprocessor is ideally suited for highly parallel applications that feature a high ratio of computation to data access. It is composed of up to 61 CPU cores connected on-die via a bi-directional ring bus. Each core is capable of switching between up to 4 hardware threads in a round-robin manner, resulting in a total of up to 244 hardware threads available. Each core consists of an in-order, dual-issue x86 pipeline, a local L1 and L2 cache, and a separate vector processing unit (VPU). Being an in-order machine, the coprocessor can be sensitive to stalls on memory access, so round-robin scheduling of the threads and aggressive compiler-generated software prefetching are used to mitigate that. It is also important that each hardware thread use the available vectorization width as much as possible. To help you investigate possible issues, the General Exploration analysis type can collect the following groups of metrics:
L1 cache usage and estimated maximum latency for L1 cache misses
L2 cache usage. The data for additional L2 cache events should be used with caution since they include cache misses from software prefetch instructions.
Vectorization usage
TLB usage efficiency
All of the metrics in this analysis type measure activity within one Intel Xeon Phi coprocessor. The metrics that you get after collecting one or several groups have preset thresholds. When a metric value falls outside its threshold, the cell corresponding to that hotspot turns pink, hinting that further investigation may be warranted. Memory bandwidth can be calculated as an additional performance metric using a separate analysis. For details on tuning methodology and metrics, see the Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors article.
Intel Microarchitectures Code Name Nehalem and Westmere: During General Exploration analysis on these microarchitectures, the VTune Amplifier collects metrics that help identify such hardware-level performance problems as:
Front-end stalls and their causes
Stalls at execution and retirement, particularly those caused by various high-latency loads, wasted work due to branch misprediction, or long-latency instructions.
For a detailed tuning methodology behind the General Exploration analysis and some of the complexities associated with this analysis, see Understanding How General Exploration Works in Intel® VTune™ Amplifier.
For architecture-specific Tuning Guides, visit https://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers.
To configure options for the General Exploration analysis:
Click the New Analysis button on the Intel® VTune™ Amplifier toolbar.
The New Amplifier Result tab opens with the Analysis Type window active.
From the analysis tree on the left pane, select Microarchitecture Analysis > General Exploration.
The analysis configuration pane opens on the right.
For detailed information on events collected for General Exploration on a particular microarchitecture, refer to the Intel Processor Event Reference.
Configure the following options:
| Option | Description |
|---|---|
| Analyze memory bandwidth check box | Collect the data required to compute memory bandwidth. The default value is false. |
| Evaluate max DRAM bandwidth check box | Evaluate the maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The default value is true. |
| Analyze OpenMP regions check box | Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead from scheduling, reduction, and atomic operations. The default value is false. |
| Analyze user tasks, events, and counters check box | Analyze the tasks, events, and counters specified in your code via the ITT API (see the sketch after this table). This option causes higher overhead and increases the result size. The default value is false. |
| Details button | Expand/collapse a section listing the default non-editable settings used for this analysis type. To modify these settings, create a custom configuration by right-clicking the analysis entry in the analysis tree and selecting Copy from Current from the context menu. VTune Amplifier creates an editable copy of this analysis type configuration and places it under the Custom Analysis branch in the analysis tree. |
For the Intel Xeon Phi coprocessor targets, the following additional options are available:
| Option | Description |
|---|---|
| Analyze general cache usage check box | Analyze data locality. Good data accessibility makes your code efficient and helps you benefit from vectorization. This analysis includes only L1 metrics because the L2 and FILL events on the Intel Xeon Phi coprocessor count demand loads and stores as well as multiple types of prefetches. Since other events do not count all of the prefetches accurately, the formulas cannot be adjusted to calculate real demand L2 hits or misses. |
| Analyze additional L2 cache events check box | Extend the cache analysis by adding the L2_DATA_READ/WRITE_MISS_CACHE_FILL and L2_DATA_READ/WRITE_MISS_MEM_FILL events for insight into L2 data locality. The ratio of these events may indicate remote cache accesses, which have latency as high as memory accesses and should be avoided if possible. |
| Analyze TLB misses check box | The Translation Lookaside Buffer (TLB) is a cache used for mapping virtual addresses to physical ones. Analyze the rate of TLB misses to identify performance problems. |
| Analyze vectorization usage check box | Identify whether the level of vectorization usage (data-level parallelism) in your program is sufficient. |
You may generate a command line for this configuration using the Command Line... button at the bottom.
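For reference, a command line for this analysis typically looks like the sketch below. The application path is a placeholder and the knob shown for memory bandwidth collection is an assumption that may vary between product versions, so prefer the exact command produced by the Command Line... button.

```
amplxe-cl -collect general-exploration -knob collect-memory-bandwidth=true -- ./myApplication
```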
You can choose to view General Exploration analysis results in any of the following viewpoints:
| Viewpoint | Description |
|---|---|
| General Exploration | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware events. The Summary window reports the overall metrics for the entire execution along with explanations of the metrics. From the Bottom-up and Top-down Tree windows you can locate the hardware issues in your application. Cells are highlighted when potential opportunities to improve performance are detected. Hover over the highlighted metrics in the grid to see explanations of the issues. |
| Hardware Events | Displays statistics of monitored hardware events: estimated count and/or the number of samples collected. Use this view to identify code regions (modules, functions, code lines, and so on) with the highest activity for an event of interest. |
| Hardware Issues | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware performance counters. Hover over the highlighted metric values in the grid to read why the extreme value might represent a performance problem. |
| Hotspots | Helps identify hotspots: code regions in the application that consume a lot of CPU time. |
| Memory Usage | Helps understand how effectively your application uses memory resources and identify potential memory access related issues such as excessive accesses to remote memory on NUMA platforms, hitting the DRAM or Intel® QuickPath Interconnect (Intel QPI) bandwidth limit, and others. It provides various performance metrics for both the application code and memory objects (arrays). |
Viewpoints may include the following windows:
Summary window displays statistics on the overall application execution.
Bottom-up pane displays performance data per metric (event ratio/event count/sample count) for each hotspot function.
Top-down Tree window displays hotspot functions in the call tree, performance metrics for a function only (Self value) and for a function and its children together (Total value).
Caller/Callee window displays parent and child functions of the selected focus function. This window is available only if stack collection was enabled during analysis configuration.
Event Count window displays an estimated count of PMU events selected for the analysis.
Sample Count window displays the actual number of samples collected for a processor event.
Uncore Event Count window displays a count of uncore events selected for the analysis. If there are no uncore events, the upper pane of the window is empty.
Platform window provides details on tasks specified in your code with the Task API, Ftrace*/Systrace* event tasks, OpenCL™ API tasks, and so on. If the corresponding platform metrics are collected, the Platform window displays over-time data such as GPU usage on a software queue, CPU time usage, OpenCL™ kernel data, and GPU performance per the Overview group of GPU hardware metrics, Memory Bandwidth, and CPU Frequency.