Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
Intel® VTune™ Amplifier provides low-overhead user-mode sampling and tracing analysis and hardware event-based sampling analysis of JIT-compiled code executed with Oracle* JDK or OpenJDK*. Analysis of interpreted Java methods is limited.
You can use hardware event-based sampling collection, which monitors hardware events in the CPU pipeline, to identify coding pitfalls that prevent instructions from executing efficiently. The hardware performance metrics can be displayed against application modules, functions, and Java source code lines. You can also run hardware event-based sampling collection with stacks when you need to find the call path for a function called in a driver or middleware layer of your system.
Use the following syntax to configure Java analysis from the command line:
$ amplxe-cl -collect <analysis_type> [-[no-]follow-child] [-mrte-mode=<mrte_mode_value>] [<-knob> <knob_name=knob_option>] [--] <target>
where
To see all knobs available for a predefined analysis type, enter:
$ amplxe-cl -help collect <analysis_type>
To see knobs for a custom analysis type, enter:
$ amplxe-cl -help collect-with <analysis_type>
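The syntax above can be assembled in a small wrapper when you script several collections. The following is a minimal sketch only: the function name build_collect_cmd is hypothetical, it uses just the options that appear in the syntax line, and it prints the command instead of running it so that you can review it first.

```shell
# Hypothetical helper: assembles an amplxe-cl collection command line and
# prints it (dry run). Pass "" for an option you do not want to set.
# Usage: build_collect_cmd <analysis_type> <mrte_mode> <knob_name=knob_option> <target...>
build_collect_cmd() {
  analysis_type=$1; mrte_mode=$2; knob=$3
  shift 3
  cmd="amplxe-cl -collect $analysis_type"
  if [ -n "$mrte_mode" ]; then cmd="$cmd -mrte-mode=$mrte_mode"; fi
  if [ -n "$knob" ]; then cmd="$cmd -knob $knob"; fi
  # Everything after "--" is the target and its arguments.
  printf '%s -- %s\n' "$cmd" "$*"
}

# Prints the command for a plain hotspots collection on a java target.
build_collect_cmd hotspots "" "" java -cp /home/Design/Java/mixed_call MixedCall 3 2
```

Printing rather than executing keeps the sketch safe to try on a machine without VTune Amplifier installed; to run the collection, execute the printed line in your shell.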
Example 1: Running Java Analysis
The following command line runs the Advanced Hotspots analysis on a java command:
$ amplxe-cl -collect advanced-hotspots -- java -Xcomp -Djava.library.path=native_lib/ia32 -cp /home/Design/Java/mixed_call MixedCall 3 2
Example 2: Running Analysis for Embedded Java Command
You may embed your java command in a batch file or executable script before running the analysis. For example, create a run.sh file with the following command:
java -Xcomp -Djava.library.path=native_lib/ia32 -cp /home/Design/Java/mixed_call MixedCall 3 1
The following command line runs the Basic Hotspots analysis on the batch file that contains the embedded java command:
$ amplxe-cl -collect hotspots -- run.sh
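As a minimal sketch, the run.sh file from this example could be created as follows. The shebang line and the chmod step are assumptions added here so that the script can be launched directly; the java command itself is the one shown above.

```shell
# Create run.sh containing the java command from this example.
cat > run.sh <<'EOF'
#!/bin/sh
# Launches the sample Java target with the options shown in the example.
java -Xcomp -Djava.library.path=native_lib/ia32 -cp /home/Design/Java/mixed_call MixedCall 3 1
EOF

# Make the script executable so it can be started as an analysis target.
chmod +x run.sh
```

Depending on your environment, you may need to pass the script with an explicit path (for example, ./run.sh) if the current directory is not searched for executables.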
Example 3: Attaching Analysis to Java Process
If your Java application needs to run for some time or cannot be launched at the start of the analysis, you can attach VTune Amplifier to the running Java process. To do this, specify the following analysis target: --target-process java.
The dynamic attach mechanism is supported only with the Java Development Kit (JDK).
The following example attaches the Advanced Hotspots analysis to a running Java process:
$ amplxe-cl -collect advanced-hotspots --target-process java
VTune Amplifier automatically generates the summary report when data collection completes. Similar to the Summary window available in the GUI, the command line report provides overall performance data for your Java target.
For more information on analyzing the summary report data, refer to the Summary Report section.
Examples
The following example generates the summary report for the Basic Hotspots analysis result. For user-mode sampling and tracing analysis results, the summary report includes Collection and Platform information, CPU information, and a summary of the basic metrics.
Collection and Platform Info
----------------------------
Parameter r002hs
------------------------ -------------------------------------------------------------------------------
Application Command Line /tmp/java_mixed_call/src/run.sh
Operating System 3.16.0-30-generic NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Computer Name 10.125.21.55
Result Size 11560723
Collection start time 13:55:00 05/02/2016 UTC
Collection stop time 13:55:10 05/02/2016 UTC
CPU
---
Parameter r001hs
----------------- -------------------------------------------------
Name 3rd generation Intel® Core™ Processor family
Frequency 3492067692
Logical CPU Count 8
Summary
-------
Elapsed Time: 10.183
CPU Time: 19.200
Average CPU Usage: 1.885
This example generates the summary report for the Advanced Hotspots analysis result. For hardware event-based sampling analysis results, the summary report includes Collection and Platform information, CPU information, a summary of the basic metrics, and an event summary.
Collection and Platform Info
----------------------------
Parameter r002ah
------------------------ ---------------------------------------------------------------------------------
Operating System 3.16.0-30-generic NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Result Size 171662827
Collection start time 10:44:34 15/04/2016 UTC
Collection stop time 10:44:50 15/04/2016 UTC
CPU
---
Parameter r002ah
----------------- -------------------------------------------------
Name 4th generation Intel® Core™ Processor family
Frequency 2494227445
Logical CPU Count 4
Summary
-------
Elapsed Time: 15.463
CPU Time: 6.392
Average CPU Usage: 0.379
CPI Rate: 1.318
Event summary
-------------
Hardware Event Type Hardware Event Count:Self Hardware Event Sample Count:Self Events Per Sample
-------------------------- ------------------------- -------------------------------- -----------------
INST_RETIRED.ANY 13014608235 8276 1900000
CPU_CLK_UNHALTED.THREAD 17158609921 8207 1900000
CPU_CLK_UNHALTED.REF_TSC 15942400300 5163 1900000
BR_INST_RETIRED.NEAR_TAKEN 1228364727 4648 200003
CALL_COUNT 213650621 75413 1
ITERATION_COUNT 370567815 84737 1
LOOP_ENTRY_COUNT 162943310 70069 1
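The CPI Rate in the Summary above can be cross-checked against the event summary: cycles per instruction is the number of unhalted core clockticks divided by the number of retired instructions. A minimal sketch using awk, with the two event counts copied from the table above (the variable names are illustrative):

```shell
# Event counts copied from the "Event summary" table above.
clockticks=17158609921      # CPU_CLK_UNHALTED.THREAD
instructions=13014608235    # INST_RETIRED.ANY

# CPI = clockticks / instructions retired, rounded to three decimals.
cpi=$(LC_ALL=C awk -v c="$clockticks" -v i="$instructions" 'BEGIN { printf "%.3f", c / i }')
echo "CPI Rate: $cpi"
```

The result agrees with the CPI Rate value of 1.318 reported in the Summary.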
Use the hotspots command line report as a starting point for identifying program units (for example: functions, modules, or objects) that take the most processor time (Hotspots analysis), underutilize available CPUs (Concurrency analysis), have long waits (Locks and Waits analysis), and so on.
By default, the report lists the hottest program units in descending order, starting with the most performance-critical unit. The command-line reports provide the same data that is displayed in the default GUI analysis viewpoints.
Examples
This example generates the hotspots report for the Basic Hotspots analysis result. The command specifies no grouping, so the data is grouped by function, as the Function column in the output shows. Because no result directory is specified either, VTune Amplifier uses the latest analysis result.
$ amplxe-cl -report hotspots
Function CPU Time CPU Time:Effective Time CPU Time:Effective Time:Idle CPU Time:Effective Time:Poor CPU Time:Effective Time:Ok CPU Time:Effective Time:Ideal CPU Time:Effective Time:Over CPU Time:Spin Time CPU Time:Overhead Time Module Function (Full) Source File Start Address
------------------ -------- ----------------------- ---------------------------- ---------------------------- -------------------------- ----------------------------- ---------------------------- ------------------ ---------------------- ---------------- ------------------ ----------- -------------
[libmixed_call.so] 17.180s 17.180s 0s 17.180s 0s 0s 0s 0s 0s libmixed_call.so [libmixed_call.so] [Unknown] 0
[libjvm.so] 1.698s 1.698s 0.020s 1.678s 0s 0s 0s 0s 0s libjvm.so [libjvm.so] [Unknown] 0
[libpthread.so.0] 0.136s 0.136s 0s 0.136s 0s 0s 0s 0s 0s libpthread.so.0 [libpthread.so.0] [Unknown] 0
[libtpsstool.so] 0.052s 0.052s 0s 0.052s 0s 0s 0s 0s 0s libtpsstool.so [libtpsstool.so] [Unknown] 0
...
The following example generates the hotspots report for the specified Advanced Hotspots analysis result, sets the number of items to include in the report to 3, and groups the report data by application module.
$ amplxe-cl -report hotspots -limit 3 -r r002ah -group-by module
Module CPU Time CPU Time:Effective Time CPU Time:Effective Time:Idle CPU Time:Effective Time:Poor CPU Time:Effective Time:Ok CPU Time:Effective Time:Ideal CPU Time:Effective Time:Over CPU Time:Spin Time CPU Time:Overhead Time Instructions Retired CPI Rate Wait Rate CPU Frequency Ratio Context Switch Time Context Switch Time:Wait Time Context Switch Time:Inactive Time Context Switch Count Context Switch Count:Preemption Context Switch Count:Synchronization Module Path
---------------- -------- ----------------------- ---------------------------- ---------------------------- -------------------------- ----------------------------- ---------------------------- ------------------ ---------------------- -------------------- -------- --------- ------------------- ------------------- ----------------------------- --------------------------------- -------------------- ------------------------------- ------------------------------------ -----------
libmixed_call.so 15.294s 15.294s 0.419s 14.871s 0.004s 0s 0s 0s 0s 21,148,958,284 1.907 0.000 1.149 1.401s 0s 1.401s 26,769 26,769 0 /tmp/java_mixed_call/src/libmixed_call.so
libjvm.so 0.582s 0.582s 0.033s 0.547s 0.002s 0s 0s 0s 0s 792,807,896 1.513 0.437 0.899 0.047s 0.005s 0.042s 462 451 11 /tmp/java_mixed_call/src/libmjvm.so
...
...
To get the maximum performance out of your Java application, consider writing and compiling performance-critical modules of your Java project in native languages, such as C or even assembly. This helps your application make full use of powerful CPU resources such as vector computing (implemented via SIMD units and instruction sets). In this case, compute-intensive functions become hotspots in the profiling results, which is expected because they do most of the work. However, you may be interested not only in the hotspot functions themselves, but also in the locations in the Java code from which these functions were called via the JNI interface. Tracing such cross-runtime calls in mixed-language algorithm implementations can be a challenge.
Use the callstacks report to display full stack data for each hotspot function and identify the impact of each stack on the function CPU or Wait time.
To display a list of available groupings for a callstacks report, enter amplxe-cl -report callstacks -r <result_dir> group-by=?.
Example
The following command line generates the callstacks report for the specified Basic Hotspots analysis result.
Function Function Stack CPU Time Module Function (Full) Source File Start Address
------------------ ------------------------- -------- -------------------- ------------------------------ -------------- --------------
[libmixed_call.so] 17.180s libmixed_call.so [libmixed_call.so] [Unknown] 0
[libmixed_call.so] 8.600s libmixed_call.so [libmixed_call.so] [Unknown] 0
MixedCall::CallNativeFunc 0s [Compiled Java code] MixedCall::CallNativeFunc(int) MixedCall.java 0x7fb63937eec0
MixedCall::foo4 0s [Compiled Java code] MixedCall::foo4(int) MixedCall.java 0x7fb6393831e3
MixedCall::foo3 0s [Compiled Java code] MixedCall::foo3(int) MixedCall.java 0x7fb63938046c
MixedCall::foo2 0s [Compiled Java code] MixedCall::foo2(int) MixedCall.java 0x7fb63938046c
MixedCall::foo1 0s [Compiled Java code] MixedCall::foo1(int) MixedCall.java 0x7fb63938046c
MixedCall::run 0s [Compiled Java code] MixedCall::run() MixedCall.java 0x7fb63938009b
...
VTune Amplifier provides an advanced profiling option for optimizing Java applications for the CPU microarchitecture of your platform. Although Java and JVM technology is intended to free a developer from hardware architecture-specific coding, once Java code is optimized for the current Intel microarchitecture, it will most probably keep this advantage on future generations of CPUs.
VTune Amplifier counts hardware events during the hardware event-based sampling collection to help you understand how your Java application utilizes available hardware resources. Use the hw-events report type to display hardware event counts per application function, in descending order by default. To display a list of available groupings for a hw-events report, enter amplxe-cl -report hw-events -r <result_dir> group-by=?.
Example
This example generates the hw-events report for the specified Advanced Hotspots analysis result.
Function Hardware Event Count:INST_RETIRED.ANY Hardware Event Count:CPU_CLK_UNHALTED.THREAD Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC Context Switch Time Context Switch Time:Wait Time Context Switch Time:Inactive Time Context Switch Count Context Switch Count:Preemption Context Switch Count:Synchronization Module Function (Full) Source File Start Address
------------------ ------------------------------------- -------------------------------------------- --------------------------------------------- ------------------- ----------------------------- --------------------------------- -------------------- ------------------------------- ------------------------------------ ------------------ ------------------ ----------- -------------
[libmixed_call.so] 21,148,958,284 40,338,264,445 35,096,009,324 1.401s 0s 1.401s 26,769 26,769 0 [libmixed_call.so] [libmixed_call.so] [Unknown] 0
[libjvm.so] 792,807,896 1,199,773,286 1,335,034,092 0.047s 0.005s 0.042s 462 451 11 [libjvm.so] [libjvm.so] [Unknown] 0
...
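The raw counts in this report can be related to the derived CPI Rate and CPU Frequency Ratio columns of the hotspots report shown earlier for the same modules. The following sketch assumes the usual definitions (CPI is CPU_CLK_UNHALTED.THREAD divided by INST_RETIRED.ANY; the frequency ratio is CPU_CLK_UNHALTED.THREAD divided by CPU_CLK_UNHALTED.REF_TSC) and works on the first four columns of the two rows above, pasted in as text:

```shell
# Two rows copied from the hw-events report above (first four columns only).
rows='[libmixed_call.so] 21,148,958,284 40,338,264,445 35,096,009,324
[libjvm.so] 792,807,896 1,199,773,286 1,335,034,092'

ratios=$(printf '%s\n' "$rows" | LC_ALL=C awk '{
  gsub(/,/, "")   # strip thousands separators; awk re-splits the fields
  # $2 = INST_RETIRED.ANY, $3 = CPU_CLK_UNHALTED.THREAD, $4 = CPU_CLK_UNHALTED.REF_TSC
  printf "%s CPI=%.3f FreqRatio=%.3f\n", $1, $3 / $2, $3 / $4
}')
printf '%s\n' "$ratios"
```

For these two rows the computed values match the CPI Rate and CPU Frequency Ratio columns that the hotspots report shows for libmixed_call.so and libjvm.so.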
VTune Amplifier supports analysis of Java applications with some limitations:
System-wide profiling is not supported for managed code.
For the sake of performance, the JVM interprets some rarely called methods instead of compiling them. VTune Amplifier does not recognize interpreted Java methods and marks such calls as !Interpreter in the restored call stack.
To display such functions in stacks by name, force the JVM to compile them with the -Xcomp option; they then show up as [Compiled Java code] methods in the results. However, the timing characteristics may change noticeably if many small or rarely used functions are called during execution.
When you open source code for a hotspot, VTune Amplifier may attribute events or time statistics to the wrong piece of code. This happens because of JDK Java VM specifics. For a loop, the performance metric may slip upward. Often the information is attributed to the first line of the hot method's source code.
Consider events and time mapping to the source code lines as approximate.
For the Basic Hotspots analysis type, the VTune Amplifier may display only a part of the call stack. To view the complete stack on Linux, use additional command line JDK Java VM options that change behavior of the Java VM:
Use the -Xcomp JDK Java VM command line option, which enables JIT compilation for better stack walking quality.
On Linux* x86, use the client JDK Java VM instead of the server Java VM: either specify -client explicitly, or simply do not specify the -server JDK Java VM command line option.
On Linux x64, specify -XX:-UseLoopCounter command line option that switches off on-the-fly substitution of the interpreted method with the compiled version.
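The options listed above can be selected per platform in a launch script. This is a minimal sketch: the function name stack_friendly_jvm_opts is hypothetical, and it assumes that the machine architecture reported by uname -m matches the bitness of the JVM you run, which is not guaranteed (for example, a 32-bit JVM on a 64-bit system).

```shell
# Hypothetical helper: prints the JVM options suggested above for a given
# machine architecture string (as reported by `uname -m`).
stack_friendly_jvm_opts() {
  case "$1" in
    x86_64|amd64)        echo "-Xcomp -XX:-UseLoopCounter" ;;  # Linux x64
    i386|i486|i586|i686) echo "-Xcomp -client" ;;              # Linux x86
    *)                   echo "-Xcomp" ;;                      # other: generic option only
  esac
}

# Print the java command line that would be used on this machine.
opts=$(stack_friendly_jvm_opts "$(uname -m)")
echo "java $opts -cp /home/Design/Java/mixed_call MixedCall 3 2"
```

Keep in mind that these options change the behavior of the Java VM, so timing results collected with them may differ from a run without them.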
Java application profiling is supported for the Basic Hotspots, Advanced Hotspots, and Microarchitecture analysis types. Support for the Concurrency and Locks and Waits analysis types is limited because some embedded Java synchronization primitives (which do not call operating system synchronization objects) cannot be recognized by VTune Amplifier. As a result, some of the timing metrics may be distorted.
There are no dedicated libraries supplying a user API for collection control in the Java source code. However, you may want to try applying the native API by wrapping the __itt calls with JNI calls.