Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
Use the Intel® VTune™ Amplifier's Memory Access analysis to identify memory-related issues, such as NUMA problems and bandwidth-limited accesses, and to attribute performance events to memory objects (data structures). Attribution to memory objects relies on instrumenting memory allocations and de-allocations and on resolving static and global variables from symbol information.
The Memory Access analysis type uses hardware event-based sampling to collect data for the following metrics:
Loads and Stores metrics that show the total number of loads and stores
LLC Miss Count metric that shows the total number of last-level cache misses
Local DRAM Access Count metric that shows the total number of LLC misses serviced by the local memory
Remote DRAM Access Count metric that shows the number of accesses to the remote socket memory
Remote Cache Access Count metric that shows the number of accesses to the remote socket cache
Memory Bound metric that shows a fraction of cycles spent waiting due to demand load or store instructions
L1 Bound metric that shows how often the machine was stalled without missing the L1 data cache
L2 Bound metric that shows how often the machine was stalled on L2 cache
L3 Bound metric that shows how often the CPU was stalled on L3 cache, or contended with a sibling core
L3 Latency metric that shows a fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)
DRAM Bound metric that shows how often the CPU was stalled on the main memory (DRAM)
Remote / Local DRAM Ratio metric that is defined by the ratio of remote DRAM loads to local DRAM loads
Local DRAM metric that shows how often the CPU was stalled on loads from the local memory
Remote DRAM metric that shows how often the CPU was stalled on loads from the remote memory
Remote cache metric that shows how often the CPU was stalled on loads from the remote cache in other sockets
Average Latency metric that shows an average load latency in cycles
The list of metrics may vary depending on your microarchitecture.
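As a rough illustration of how the Remote / Local DRAM Ratio metric above relates to the access counts, the following minimal sketch computes the ratio from raw counts. The `MemoryAccessCounts` container, its field names, and the sample numbers are hypothetical and are not part of any VTune Amplifier API; access counts stand in for the load counts that the metric is defined on.

```python
from dataclasses import dataclass


@dataclass
class MemoryAccessCounts:
    """Hypothetical container for raw counts taken from an analysis result."""
    local_dram: int   # LLC misses serviced by the local memory
    remote_dram: int  # accesses to the remote socket memory


def remote_local_dram_ratio(counts: MemoryAccessCounts) -> float:
    """Return remote DRAM accesses divided by local DRAM accesses."""
    if counts.local_dram == 0:
        # The ratio is undefined without local accesses; report infinity
        # when there was remote traffic and 0.0 when there was none at all.
        return float("inf") if counts.remote_dram > 0 else 0.0
    return counts.remote_dram / counts.local_dram


# Made-up numbers, used only to exercise the function.
sample = MemoryAccessCounts(local_dram=800_000, remote_dram=200_000)
print(f"Remote / Local DRAM Ratio: {remote_local_dram_ratio(sample):.2f}")
```

A ratio well above zero suggests that a noticeable share of DRAM traffic crosses sockets, which is the kind of NUMA issue this analysis is meant to surface.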
Many of the collected events used in the Memory Access analysis are precise, which simplifies understanding the data access pattern. Off-core traffic is divided into local DRAM and remote DRAM accesses. Typically, you should focus on minimizing remote DRAM accesses, which usually have a high cost.
To configure options for the Memory Access analysis:
Click the New Analysis toolbar button.
The Analysis Type window opens.
From the left pane, select Microarchitecture Analysis > Memory Access.
The Memory Access configuration pane opens on the right.
Configure the following options:
CPU sampling interval, ms field | Specify an interval (in milliseconds) between CPU samples. Possible values: 0.01 to 1000. The default value is 1.
Analyze dynamic memory objects check box | Enable the instrumentation of dynamic memory allocation/de-allocation and map hardware events to such memory objects. This option may cause additional runtime overhead due to the instrumentation of all system memory allocation/de-allocation APIs. The default value is false.
Minimal dynamic memory object size to track, in bytes spin box | Specify a minimal size of dynamic memory allocations to analyze. This option helps reduce runtime overhead of the instrumentation. The default value is 1024.
Evaluate max DRAM bandwidth check box | Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The default value is true.
Analyze OpenMP regions check box | Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction and atomic operations. The default value is false.
Details button | Expand/collapse a section listing the default non-editable settings used for this analysis type. To modify these settings for the analysis, create a custom configuration by right-clicking the analysis entry in the analysis tree and selecting Copy from Current from the context menu. VTune Amplifier creates an editable copy of this analysis type configuration and places it under the Custom Analysis branch in the analysis tree.
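The options above can also be captured in a small script, which helps keep runs reproducible. The sketch below is illustrative only: the `MemoryAccessOptions` class and `build_command` helper are hypothetical, the defaults and the 0.01 to 1000 ms range come from the option descriptions above, and the `amplxe-cl` knob names are assumptions that you should verify against the command-line help of your VTune Amplifier version before use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryAccessOptions:
    """Hypothetical mirror of the GUI options, with the documented defaults."""
    sampling_interval_ms: float = 1.0
    analyze_mem_objects: bool = False
    mem_object_size_min_bytes: int = 1024
    evaluate_max_dram_bandwidth: bool = True
    analyze_openmp: bool = False

    def validate(self) -> None:
        # The documented range for the CPU sampling interval is 0.01-1000 ms.
        if not 0.01 <= self.sampling_interval_ms <= 1000:
            raise ValueError("CPU sampling interval must be between 0.01 and 1000 ms")
        if self.mem_object_size_min_bytes < 0:
            raise ValueError("minimal memory object size cannot be negative")


def build_command(opts: MemoryAccessOptions, app: List[str]) -> List[str]:
    """Build an argument list for a command-line collection run.

    The knob names below are assumptions, not confirmed by this help topic.
    """
    opts.validate()
    knobs = [
        ("sampling-interval", f"{opts.sampling_interval_ms:g}"),
        ("analyze-mem-objects", str(opts.analyze_mem_objects).lower()),
        ("mem-object-size-min-thres", str(opts.mem_object_size_min_bytes)),
        ("dram-bandwidth-limits", str(opts.evaluate_max_dram_bandwidth).lower()),
        ("analyze-openmp", str(opts.analyze_openmp).lower()),
    ]
    cmd = ["amplxe-cl", "-collect", "memory-access"]
    for name, value in knobs:
        cmd += ["-knob", f"{name}={value}"]
    return cmd + ["--", *app]


# "./my_app" is a placeholder for the application to profile.
options = MemoryAccessOptions(analyze_mem_objects=True)
print(" ".join(build_command(options, ["./my_app"])))
```

Validating the values before launching the collection catches out-of-range settings early, instead of after a failed or misconfigured run.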
Limitations:
Memory objects analysis can be configured for Linux* targets only and only for processors based on Intel microarchitecture code name Sandy Bridge or later.
Memory objects analysis is not supported for the driverless event-based sampling analysis based on the Perf* tool.
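To make the idea behind memory objects analysis more concrete, the following conceptual sketch shows one way sampled data addresses could be mapped back to allocations and counted per allocation site. It is not VTune Amplifier's implementation: the allocation records, addresses, and site labels are invented, and the code assumes that live allocations do not overlap. The `min_size` parameter plays the role of the minimal dynamic memory object size option.

```python
import bisect
from collections import Counter
from typing import List, Optional, Tuple

# (start address, size in bytes, allocation site); all values are hypothetical.
Allocation = Tuple[int, int, str]


def build_index(allocations: List[Allocation], min_size: int = 1024):
    """Keep allocations of at least min_size bytes, sorted by start address."""
    tracked = sorted(a for a in allocations if a[1] >= min_size)
    starts = [a[0] for a in tracked]
    return starts, tracked


def find_object(address: int, starts: List[int], tracked: List[Allocation]) -> Optional[str]:
    """Return the allocation site whose address range contains address, if any."""
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0:
        start, size, site = tracked[i]
        if address < start + size:
            return site
    return None


allocations = [
    (0x1000, 4096, "matrix.c:42"),
    (0x3000, 512, "util.c:10"),   # below the size threshold, so not tracked
    (0x8000, 2048, "grid.c:7"),
]
sampled_addresses = [0x1004, 0x1FF8, 0x3010, 0x8100, 0x9000]

starts, tracked = build_index(allocations)
per_object = Counter(
    find_object(addr, starts, tracked) or "[unattributed]"
    for addr in sampled_addresses
)
for site, samples in per_object.most_common():
    print(f"{site}: {samples} samples")
```

Raising the size threshold shrinks the index and the amount of bookkeeping per allocation, which mirrors why tracking only larger objects reduces instrumentation overhead.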
By default, the Memory Access analysis data shows up in the Memory Usage viewpoint that includes the following windows:
Summary window displays statistics on the overall application execution, including the application-level bandwidth utilization histogram.
Bottom-up window displays performance data per metric for each hotspot object. If you enable the Analyze dynamic memory objects option for data collection, the Bottom-up window also displays memory allocation call stacks in the grid and Call Stack pane. Use the Memory Object grouping level, preceded by the Function level, to view memory objects identified by the source location of the allocation call.
Platform window provides details on tasks specified in your code with the Task API, Ftrace*/Systrace* event tasks, OpenCL™ API tasks, and so on. If corresponding platform metrics are collected, the Platform window displays over-time data such as GPU usage on a software queue, CPU time usage, OpenCL™ kernels data, GPU performance per the Overview group of GPU hardware metrics, Memory Bandwidth, and CPU Frequency.
Memory Access analysis is supported on the following platforms:
2nd Generation Intel® Core™ processors
Intel® Xeon® processor families, or later
3rd Generation Intel® Atom™ processor family, or later
If you need to analyze older processors, you can create a custom analysis and choose events related to memory accesses. However, you will be limited to memory-related events available on those processors. For information about memory access events per processor, see the VTune Amplifier tuning guides.