Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
This section provides a reference for the hardware events that can be monitored for the CPU(s):
The following performance-monitoring events are supported:
Number of actual bank conflicts
Number of taken and not taken branches, including: conditional branches, jumps, calls, returns, software interrupts, and interrupt returns
Number of branch mispredictions that occurred on BTB hits. BTB misses are not considered branch mispredicts because no prediction exists for them yet.
Number of instruction reads; whether the read is cacheable or noncacheable
The number of cycles (commonly known as clockticks) where any thread on a core is active. A core is active if any thread on that core is not halted. This event is counted at the core level; at any given time, all the hardware threads running on the same core will have the same value.
Number of dirty lines (all) that are written back, regardless of the cause
Counts misses in the L1 TLB, at the hardware thread level. TLB misses can be caused by demand data loads and stores, or by data prefetches.
Number of memory data reads which hit the internal data cache (L1). Cache accesses resulting from prefetch instructions are included.
Number of memory read accesses that miss the internal data cache, whether the access is cacheable or noncacheable. Cache accesses resulting from prefetch instructions are included.
Counts demand data loads and stores that missed the L1 cache, at the hardware thread level. This event does not include misses for cachelines that were in the process of being prefetched into L1. This event does not count data cache misses due to hardware or software prefetches.
Counts demand data loads and stores, at the hardware thread level. This event could also be referred to as L1 data cache accesses. This event does not count data cache accesses due to hardware or software prefetches. It does include VPU loads generated by instructions like vgather/vloadunpack/etc. VPU_DATA_READ and VPU_DATA_WRITE are subsets of this event.
Number of memory data writes which hit the internal data cache (L1).
Number of memory write accesses that miss the internal data cache, whether the access is cacheable or noncacheable
Counts the number of cycles where an instruction was in the execution stage, except in the FP or VPU execution units. Counts at the hardware thread level.
Number of cycles where the front-end could not advance. Any multi-cycle instructions which delay pipeline advance and apply backpressure to the front-end will be included, e.g. read-modify-write instructions. Includes cycles when the front-end did not have instructions available to issue.
Number of taken INTR and NMI interrupts
Counts hardware prefetches that missed the L2 data cache. This event counts at the hardware thread level.
Counts the number of instructions executed by a hardware thread. This event includes INSTRUCTIONS_EXECUTED_V_PIPE and VPU_INSTRUCTIONS_EXECUTED.
Counts the number of instructions executed on the alternate pipeline, called the V-pipe. Two instructions can be executed every clock cycle, one on the U-pipe, and one on the V-pipe. The V-pipe cannot execute all instruction types, and will execute instructions only when pairing rules are met. This event can be used to see the extent of instruction pairing on a workload. It is included in INSTRUCTIONS_EXECUTED. It counts at the hardware thread level.
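As a rough illustration of the pairing note above, the sketch below is not part of VTune; the variable names simply stand for the collected INSTRUCTIONS_EXECUTED and INSTRUCTIONS_EXECUTED_V_PIPE counts. Since at most one of the two instructions issued per cycle goes down the V-pipe, the computed ratio approaches 0.5 when a workload pairs fully.

    /* Illustrative only: variable names stand for the collected values of
     * INSTRUCTIONS_EXECUTED and INSTRUCTIONS_EXECUTED_V_PIPE. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long instructions_executed        = 1000000ULL; /* sample count */
        unsigned long long instructions_executed_v_pipe =  300000ULL; /* sample count */

        /* V-pipe instructions are a subset of all executed instructions.
         * When every cycle issues a U-pipe/V-pipe pair, half of the executed
         * instructions flow down the V-pipe, so this ratio approaches 0.5. */
        double pairing_fraction =
            (double)instructions_executed_v_pipe / (double)instructions_executed;

        printf("V-pipe fraction: %.2f (0.5 = fully paired)\n", pairing_fraction);
        return 0;
    }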
Counts demand data loads and stores that missed the L1 cache, but did hit a prefetch buffer. This means the cacheline was already in the process of being prefetched into L1. This is a second type of miss and is not included in DATA_READ_MISS_OR_WRITE_MISS. It is counted at the hardware thread level. This event does not count data cache misses due to hardware or software prefetches.
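One possible way to combine the demand access, demand miss, and prefetch-buffer-hit counts described above into an approximate L1 demand hit rate is sketched below. This is an illustrative calculation only; the variable names are placeholders for the collected event values, not VTune identifiers.

    #include <stdio.h>

    int main(void)
    {
        unsigned long long demand_accesses  = 5000000ULL; /* demand loads/stores (L1 data cache accesses) */
        unsigned long long demand_misses    =  200000ULL; /* demand misses (DATA_READ_MISS_OR_WRITE_MISS) */
        unsigned long long inflight_pf_hits =   50000ULL; /* demand misses that hit a prefetch buffer */

        /* The two kinds of miss are counted separately, so add them before
         * dividing by the total number of demand accesses. */
        double total_misses = (double)(demand_misses + inflight_pf_hits);
        double hit_rate = 1.0 - total_misses / (double)demand_accesses;

        printf("Approximate L1 demand hit rate: %.3f\n", hit_rate);
        return 0;
    }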
Counts software prefetches that are intended for the local L1 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.
Counts software prefetches that missed the local L1 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.
Number of data vprefetch2 requests seen by the L1. This is not necessarily the same number as seen by the L2 because this count includes requests that are dropped by the core. A vprefetch2 can be dropped by the core if the requested address matches another outstanding request.
Counts software prefetches that are intended for the local L2 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.
Counts software prefetches that missed the local L2 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.
Counts data loads that missed the local L2 cache, but were serviced by a remote L2 cache on the same Intel Xeon Phi coprocessor. This event counts at the hardware thread level. It includes L2 prefetches that missed the local L2 cache and so is not useful for determining demand cache fills.
Counts data loads that missed the local L2 cache, and were serviced from memory (on the same Intel Xeon Phi coprocessor). This event counts at the hardware thread level. It includes L2 prefetches that missed the local L2 cache and so is not useful for determining demand cache fills or standard metrics like L2 Hit/Miss Rate.
Counts data Reads for Ownership (due to a store operation) that missed the local L2 cache, but were serviced by a remote L2 cache on the same Intel Xeon Phi coprocessor. This event counts at the hardware thread level.
Counts data Reads for Ownership (due to a store operation) that missed the local L2 cache, and were serviced from memory (on the same Intel Xeon Phi coprocessor). This event counts at the hardware thread level.
Counts data loads that hit a cacheline in Exclusive state in the local L2 cache. This event counts at the hardware thread level. It includes L2 prefetches and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand accesses.
Counts data loads that hit a cacheline in Modified state in the local L2 cache. This event counts at the hardware thread level. It includes L2 prefetches and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand accesses.
Counts data loads that hit a cacheline in Shared state in the local L2 cache. This event counts at the hardware thread level. It includes L2 prefetches and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand accesses.
Counts data loads that missed the local L2 cache, at the hardware thread level. It includes L2 prefetches that missed the local L2 cache and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand misses.
Number of strongly ordered streaming vector stores that missed the L2 and were sent to the ring.
Counts the number of modified cachelines evicted from the L2 Data cache. These result in a memory write operation, also known as an explicit L2 write-back. This event counts at the hardware core level; at any given time, every executing hardware thread on the core has the same value for this counter.
Number of weakly ordered streaming vector stores that missed the L2 and were sent to the ring.
Number of write accesses that hit the L2 cache (L2 write hit)
Counts misses in the L2 TLB, at the hardware thread level. TLB misses can be caused by demand data loads and stores, or by data prefetches.
Number of data memory reads or writes that are paired in both pipes of the pipeline
The number of cycles microcode is executing. While microcode is executing, all other threads are stalled.
Number of address generation interlock (AGI) stalls. An AGI occurring in both the U- and V-pipelines in the same clock signals this event twice.
Number of pipeline flushes that occur
Number of address generation interlock (AGI) stalls due to vscatter* and vgather* instructions.
Counts incoming snoops that hit a modified cacheline in a hardware thread's local L2. These result in a cache-to-cache transfer: the line will be evicted from the local L2, written back to memory (also called an implicit write-back), and the line will be loaded exclusively into the requesting core's cache. This event counts at the hardware core level; at any given time, every executing hardware thread on the core has the same value for this counter.
Number of incoming snoops that hit in the L2 cache
Number of normal reads sent to channel 0
Number of normal writes sent to channel 0
Number of normal reads sent to channel 1
Number of normal writes sent to channel 1
Number of read transactions that were issued. In general, each read transaction reads one 64-byte cacheline. If there are alignment issues, reads that span multiple cachelines are each counted individually.
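As an illustration of the alignment note above (assuming the 64-byte cacheline size stated there), the sketch below computes how many cachelines, and therefore how many transactions, a single access of a given address and size generates:

    #include <stdio.h>

    /* Number of 64-byte cachelines touched by an access of 'size' bytes
     * starting at 'addr'; each touched line contributes one transaction. */
    static unsigned long long lines_touched(unsigned long long addr,
                                            unsigned long long size)
    {
        unsigned long long first = addr / 64;
        unsigned long long last  = (addr + size - 1) / 64;
        return last - first + 1;
    }

    int main(void)
    {
        /* A 64-byte access that is 64-byte aligned touches one line... */
        printf("%llu\n", lines_touched(0x1000, 64)); /* prints 1 */
        /* ...but the same access misaligned by 8 bytes spans two lines
         * and is counted as two transactions. */
        printf("%llu\n", lines_touched(0x1008, 64)); /* prints 2 */
        return 0;
    }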
VPU L1 data cache read miss. Counts the number of occurrences.
Number of write transactions that were issued. In general, each write transaction writes one 64-byte cacheline. If there are alignment issues, writes that span multiple cachelines are each counted individually.
VPU L1 data cache write miss. Counts the number of occurrences.
Increments by 1 for every element to which an executed VPU instruction applies. For example, if a VPU instruction executes with a mask register containing 1, it applies to only one element and so this event increments by 1. If a VPU instruction executes with a mask register containing 0xFF, this event is incremented by 8. Counts at the hardware thread level.
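A minimal sketch of the mask arithmetic described above follows; the helper is purely illustrative and simply counts the set bits in a mask register value, which is the amount by which this event increments for a masked VPU instruction:

    #include <stdio.h>

    /* Count the set bits in a mask; each set bit is one active element. */
    static unsigned elements_active(unsigned mask)
    {
        unsigned count = 0;
        while (mask) {
            count += mask & 1u;
            mask >>= 1;
        }
        return count;
    }

    int main(void)
    {
        printf("%u\n", elements_active(0x01)); /* mask 1    -> event increments by 1 */
        printf("%u\n", elements_active(0xFF)); /* mask 0xFF -> event increments by 8 */
        return 0;
    }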
Counts the number of VPU instructions executed by a hardware thread. This event is a subset of INSTRUCTIONS_EXECUTED.
Counts the number of VPU instructions that paired and executed in the v-pipe.
VPU stall on register dependency. Counts the number of occurrences. Dependencies include RAW, WAW, and WAR.