What's New in VTune Amplifier for Systems

VTune Amplifier 2017 Update 1 for Systems

Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions
Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view
Summary view of the General Exploration analysis extended to explicitly display the measure for the hardware metrics: Clockticks vs. Pipeline Slots
Command line summary report for the HPC Performance Characterization analysis extended to show metrics for CPU, Memory, and FPU performance aspects, including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use the new report-knob show-issues option.
Support for the Average Latency metric in the Memory Access analysis based on the driverless collection
PREVIEW: New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel Iris™ Graphics. This group combines metrics from the Overview and Compute Basic presets and lets you see all detected GPU stalled/idle issues in the same view.
GPU Hotspots analysis extended to detect hottest computing tasks bound by GPU L3 bandwidth
New Render and GPGPU Packet Stage grouping levels that help analyze the CPU activity while the GPU Execution Units were either idle or executing some code
QPI Bandwidth data in the Memory Usage viewpoint displayed as data and non-data to better differentiate the QPI activity for applications that are not transmitting any data between sockets
Analysis support for Java* applications executed with OpenJDK
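
As an illustration of the report-knob show-issues option mentioned above for the HPC Performance Characterization summary report, the following minimal sketch builds a command line that hides issue descriptions. The helper name summary_report_cmd and the result directory r000hpc are hypothetical, and the knob value syntax show-issues=false is an assumption; check the command line reference for your version before relying on it.

```python
import shlex


def summary_report_cmd(result_dir, show_issues=True):
    """Return an argv list for an amplxe-cl summary report (illustrative helper)."""
    cmd = ["amplxe-cl", "-report", "summary", "-result-dir", result_dir]
    if not show_issues:
        # Assumed value syntax for the show-issues report knob.
        cmd += ["-report-knob", "show-issues=false"]
    return cmd


if __name__ == "__main__":
    # "r000hpc" is a hypothetical result directory name.
    print(shlex.join(summary_report_cmd("r000hpc", show_issues=False)))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting concerns; shlex.join is used only to print a copyable form.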

VTune Amplifier 2017 for Systems

Support for Intel® Xeon Phi™ processor codenamed Knights Landing and Intel® Xeon® Processor E5 v4 Family (formerly codenamed Broadwell EP), including General Exploration, Memory Access (including high bandwidth analysis), and HPC Performance Characterization analyses
Disk Input and Output analysis that monitors utilization of the disk subsystem, CPU, and PCIe buses, and helps identify long latency of I/O requests and imbalance between I/O and compute operations. Use the Analyzing Input/Output Waits tutorial for a hands-on exercise with the sample code.
Memory Access analysis improvements:
- Optimized workflow for identifying top memory objects/functions with high bandwidth utilization per domain (DRAM, QPI, and so on) starting from the Summary Bandwidth Utilization section with a direct navigation to more details in the Bottom-up window
- Automatic detection of maximum system DRAM bandwidth characteristics. This option helps you understand how your application utilizes the available DRAM bandwidth.
- Better representation of global memory objects, which now includes a variable name and, in case of structure/class types, field names
- Support for custom memory allocators via the Memory Allocation API, which helps correctly identify memory objects
- Driverless event-based sampling collection for uncore events enabled for the Memory Access analysis
- Identifying False Sharing tutorial providing a hands-on exercise for running Memory Access analysis for a sample application to identify and remove false sharing issues
HPC workloads profiling improvements:
- HPC Performance Characterization analysis that explores the following performance aspects of the application scalability: CPU utilization with parallel efficiency for MPI and OpenMP*, memory access efficiency and FPU utilization with basic vectorization metrics
- MPI analysis extended with the event-based sampling collection supported for multiple ranks per node with an arbitrary MPI launcher and natural syntax. Arbitrary targets command line configuration extended with MPI launcher options. You can now use the Copy Command Line to Clipboard dialog box to automatically generate a command line for MPI analysis from the GUI.
- An option enabling/disabling the OpenMP* regions analysis added to selected analysis configurations
- An option controlling result finalization, -finalization-mode, that enables you to perform a full finalization on the target, defer the finalization, or skip it. The deferred finalization mode is especially useful on target platforms with a single-thread performance lower than on the host. In this mode, the VTune Amplifier calculates a binary checksum to match the binaries for finalization on the host machine.
- Analyzing an OpenMP and MPI Application web-based tutorial providing a hands-on exercise to identify memory utilization inefficiencies and load imbalance for a sample hybrid application
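
The -finalization-mode option described above can be sketched as a small command line builder. This is an illustration under stated assumptions, not the product's documented syntax: the mode names full, deferred, and none are inferred from the three behaviors listed above, the analysis type name hpc-performance is assumed, and the helper collect_cmd, the result directory, and the target ./my_app are hypothetical.

```python
import shlex

# Assumed mode names, mapped from the behaviors described in the text:
# full finalization on the target, deferred finalization, or no finalization.
FINALIZATION_MODES = ("full", "deferred", "none")


def collect_cmd(analysis, target_argv, result_dir, finalization_mode="full"):
    """Return an argv list for an amplxe-cl collection (illustrative helper)."""
    if finalization_mode not in FINALIZATION_MODES:
        raise ValueError("unknown finalization mode: %r" % (finalization_mode,))
    return [
        "amplxe-cl",
        "-collect", analysis,
        "-finalization-mode=" + finalization_mode,
        "-result-dir", result_dir,
        "--",  # everything after this separator is the target command line
        *target_argv,
    ]


if __name__ == "__main__":
    # Defer finalization on a slow target; finalize later on the host.
    cmd = collect_cmd("hpc-performance", ["./my_app"], "r001hpc", "deferred")
    print(shlex.join(cmd))
```

Validating the mode before building the list catches a mistyped value on the host, before a long collection is started on the target.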
Support for more languages:
- Python* applications profiling with Basic Hotspots analysis running via the Launch Application or Attach to Process modes
- Go* applications profiling with hardware event-based analysis types
GPU analysis improvements:
- GPU Hotspots analysis targeted for GPU-bound applications and providing options to analyze execution of OpenCL™ kernels and Intel Media SDK tasks
- GPU analysis Summary introducing a set of metrics to estimate GPU utilization per engine, identify stalled or idle Execution Units and explore the most typical problems with low occupancy or frequent sampler accesses
- Navigation from the Hottest GPU computing tasks summary to the details provided in the Graphics tab
- Support for the Attach to Process target analysis for Intel Media SDK and OpenCL™ programs
- Detection of the OpenCL™ 2.0 Shared Virtual Memory (SVM) usage types per kernel instance
Usability improvements:
- Support for the Attach to Process target analysis with the event-based sampling for low privileged Java* daemons on Linux*
- Event selection mechanism for custom hardware event-based sampling analysis extended with filtering options
- Arbitrary target GUI configuration to generate a command line for performance analysis on a system that is not accessible from the current host
- UI improvements for the grid views and identification of performance issues
- Data collection limit extended with the Ring Buffer mode that, once you set the expected collection time, restricts the analysis to the last seconds before the target run or collection is terminated
- Better microarchitecture issues localization in the Source/Assembly view that enables easier navigation to source lines or instructions that contributed the most to a certain performance issue
- Improved identification of vPMU configuration issues. Additional information about setting up VTune Amplifier with a virtual machine is available from Using VTune Amplifier with a Virtual Machine.
Intel Performance Snapshot (Preview) introducing the following tools as part of the VTune Amplifier:
- Application Performance Snapshot tool provides a quick look at your application performance and helps you understand whether your application will benefit from tuning. It identifies how effectively your application uses the hardware platform and displays basic performance enhancement opportunities.
- Storage Performance Snapshot tool analyzes your system's storage, CPU, memory, and network usage and displays basic performance enhancement opportunities for systems using Intel hardware.
Note

A PREVIEW FEATURE may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases. Please send your feedback to parallel.studio.support@intel.com.
Support for Fedora* 23 and 24, Ubuntu* 15.10 and 16.04
Support for Linux* kernel up to 4.4
