Events for Intel® Xeon Phi™ Coprocessor (Code Name: Knights Corner)

This section provides a reference for the hardware events that can be monitored on the following CPU:

Intel® Xeon Phi™ coprocessor

The following performance-monitoring events are supported:

BANK_CONFLICTS

Number of actual bank conflicts

BRANCHES

Number of taken and not taken branches, including: conditional branches, jumps, calls, returns, software interrupts, and interrupt returns

BRANCHES_MISPREDICTED

Number of branch mispredictions that occurred on BTB hits. BTB misses are not considered branch mispredicts because no prediction exists for them yet.
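
As a rough illustration, BRANCHES and BRANCHES_MISPREDICTED can be combined into a misprediction ratio. The sketch below assumes the raw counts have already been collected by a profiling tool; the helper name `branch_mispredict_ratio` and the sample counts are hypothetical and not part of this reference. Because mispredictions are counted only on BTB hits, the ratio does not reflect branches that missed the BTB.

```python
def branch_mispredict_ratio(branches: int, mispredicted: int) -> float:
    """Fraction of counted branches that were mispredicted (hypothetical helper)."""
    if branches <= 0:
        return 0.0  # no branches counted, so no meaningful ratio
    return mispredicted / branches


# Hypothetical sample counts for BRANCHES and BRANCHES_MISPREDICTED.
ratio = branch_mispredict_ratio(branches=1_000_000, mispredicted=25_000)
print(f"{ratio:.3f}")  # 0.025
```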

CODE_READ

Number of instruction reads, whether the read is cacheable or noncacheable

CPU_CLK_UNHALTED

The number of cycles (commonly known as clockticks) where any thread on a core is active. A core is active if any thread on that core is not halted. This event is counted at the core level; at any given time, all the hardware threads running on the same core will have the same value.

DATA_CACHE_LINES_WRITTEN_BACK

Number of dirty lines (all) that are written back, regardless of the cause

DATA_PAGE_WALK

Counts misses in the L1 TLB, at the hardware thread level. TLB Misses could have been caused by either demand data loads and stores or data prefetches.

DATA_READ

Number of memory data reads which hit the internal data cache (L1). Cache accesses resulting from prefetch instructions are included.

DATA_READ_MISS

Number of memory read accesses that miss the internal data cache, whether the access is cacheable or noncacheable. Cache accesses resulting from prefetch instructions are included.

DATA_READ_MISS_OR_WRITE_MISS

Counts demand data loads and stores that missed the L1 cache, at the hardware thread level. This event does not include misses for cachelines that were in the process of being prefetched into L1. This event does not count data cache misses due to hardware or software prefetches.

DATA_READ_OR_WRITE

Counts demand data loads and stores, at the hardware thread level. This event could also be referred to as L1 data cache accesses. This event does not count data cache accesses due to hardware or software prefetches. It does include VPU loads generated by instructions like vgather/vloadunpack/etc. VPU_DATA_READ and VPU_DATA_WRITE are subsets of this event.

DATA_WRITE

Number of memory data writes which hit the internal data cache (L1).

DATA_WRITE_MISS

Number of memory write accesses that miss the internal data cache, whether the access is cacheable or noncacheable

EXEC_STAGE_CYCLES

Counts the number of cycles where an instruction was in execution stage, except in the FP or VPU execution units. Counts at the hardware thread level.

FE_STALLED

Number of cycles where the front-end could not advance. Any multi-cycle instructions which delay pipeline advance and apply backpressure to the front-end will be included, e.g. read-modify-write instructions. Includes cycles when the front-end did not hav

HARDWARE_INTERRUPTS

Number of taken INTR and NMI interrupts

HWP_L2MISS

Counts hardware prefetches that missed the L2 data cache. This event counts at the hardware thread level.

INSTRUCTIONS_EXECUTED

Counts the number of instructions executed by a hardware thread. This event includes INSTRUCTIONS_EXECUTED_V_PIPE and VPU_INSTRUCTIONS_EXECUTED.
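
One possible use of this count is a cycles-per-instruction (CPI) estimate for a core. The sketch below is an assumption about how the counts might be combined, not a formula given in this reference: because CPU_CLK_UNHALTED is counted at the core level while INSTRUCTIONS_EXECUTED is counted per hardware thread, it sums the per-thread instruction counts of one core before dividing. The helper name `core_cpi` and the sample values are hypothetical.

```python
from typing import Sequence


def core_cpi(core_clk_unhalted: int, thread_instructions: Sequence[int]) -> float:
    """Core-level CPI: core clockticks divided by instructions from all its threads."""
    total_instructions = sum(thread_instructions)
    if total_instructions == 0:
        return float("inf")  # cycles elapsed but nothing was executed
    return core_clk_unhalted / total_instructions


# Hypothetical counts: one core-level CPU_CLK_UNHALTED value and
# INSTRUCTIONS_EXECUTED for each of four hardware threads on that core.
cpi = core_cpi(8_000_000, [1_000_000, 1_000_000, 1_000_000, 1_000_000])
print(cpi)  # 2.0
```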

INSTRUCTIONS_EXECUTED_V_PIPE

Counts the number of instructions executed on the alternate pipeline, called the V-pipe. Two instructions can be executed every clock cycle, one on the U-pipe, and one on the V-pipe. The V-pipe cannot execute all instruction types, and will execute instructions only when pairing rules are met. This event can be used to see the extent of instruction pairing on a workload. It is included in INSTRUCTIONS_EXECUTED. It counts at the hardware thread level.
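
To gauge the extent of pairing mentioned above, the V-pipe count can be expressed as a share of all executed instructions. This is a minimal sketch under the assumption that both counts come from the same hardware thread and the same sampling interval; the helper name `pairing_ratio` and the sample values are hypothetical. Since a V-pipe instruction executes alongside a U-pipe instruction, a fully paired workload would be expected to approach 0.5 under this reading.

```python
def pairing_ratio(instructions_executed: int, v_pipe_instructions: int) -> float:
    """Share of executed instructions that ran on the V-pipe (hypothetical helper)."""
    if instructions_executed <= 0:
        return 0.0
    return v_pipe_instructions / instructions_executed


# Hypothetical counts for INSTRUCTIONS_EXECUTED and INSTRUCTIONS_EXECUTED_V_PIPE.
print(pairing_ratio(1_000_000, 300_000))  # 0.3
```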

L1_DATA_HIT_INFLIGHT_PF1

Counts demand data loads and stores that missed the L1 cache, but did hit a prefetch buffer. This means the cacheline was already in the process of being prefetched into L1. This is a second type of miss and is not included in DATA_READ_MISS_OR_WRITE_MISS. It is counted at the hardware thread level. This event does not count data cache misses due to hardware or software prefetches.
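
Because the two kinds of L1 demand miss are counted by separate events, a combined miss ratio has to add them. The sketch below assumes DATA_READ_OR_WRITE is a suitable denominator (it counts demand L1 accesses and, like both miss events, excludes prefetches); this combination is an illustration, not a metric defined in this reference, and the helper name and sample counts are hypothetical.

```python
def l1_demand_miss_ratio(accesses: int, misses: int, inflight_pf_hits: int) -> float:
    """(DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1) / DATA_READ_OR_WRITE."""
    if accesses <= 0:
        return 0.0
    return (misses + inflight_pf_hits) / accesses


# Hypothetical per-thread counts.
ratio = l1_demand_miss_ratio(
    accesses=2_000_000,       # DATA_READ_OR_WRITE
    misses=40_000,            # DATA_READ_MISS_OR_WRITE_MISS
    inflight_pf_hits=10_000,  # L1_DATA_HIT_INFLIGHT_PF1
)
print(f"{ratio:.3f}")  # 0.025
```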

L1_DATA_PF1

Counts software prefetches that are intended for the local L1 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.

L1_DATA_PF1_MISS

Counts software prefetches that missed the local L1 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.

L1_DATA_PF2

Number of data vprefetch2 requests seen by the L1. This is not necessarily the same number as seen by the L2 because this count includes requests that are dropped by the core. A vprefetch2 can be dropped by the core if the requested address matches anothe

L2_DATA_PF2

Counts software prefetches that are intended for the local L2 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.

L2_DATA_PF2_MISS

Counts software prefetches that missed the local L2 cache. May include both L1 and L2 prefetches. This event counts at the hardware thread level.

L2_DATA_READ_MISS_CACHE_FILL

Counts data loads that missed the local L2 cache, but were serviced by a remote L2 cache on the same Intel Xeon Phi coprocessor. This event counts at the hardware thread level. It includes L2 prefetches that missed the local L2 cache and so is not useful for determining demand cache fills.

L2_DATA_READ_MISS_MEM_FILL

Counts data loads that missed the local L2 cache, and were serviced from memory (on the same Intel Xeon Phi coprocessor). This event counts at the hardware thread level. It includes L2 prefetches that missed the local L2 cache and so is not useful for determining demand cache fills or standard metrics like L2 Hit/Miss Rate.

L2_DATA_WRITE_MISS_CACHE_FILL

Counts data Reads for Ownership (due to a store operation) that missed the local L2 cache, but were serviced by a remote L2 cache on the same Intel Xeon Phi coprocessor. This event counts at the hardware thread level.

L2_DATA_WRITE_MISS_MEM_FILL

Counts data Reads for Ownership (due to a store operation) that missed the local L2 cache, and were serviced from memory (on the same Intel Xeon Phi coprocessor). This event counts at the hardware thread level.

L2_READ_HIT_E

Counts data loads that hit a cacheline in Exclusive state in the local L2 cache. This event counts at the hardware thread level. It includes L2 prefetches and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand accesses.

L2_READ_HIT_M

Counts data loads that hit a cacheline in Modified state in the local L2 cache. This event counts at the hardware thread level. It includes L2 prefetches and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand accesses.

L2_READ_HIT_S

Counts data loads that hit a cacheline in Shared state in the local L2 cache. This event counts at the hardware thread level. It includes L2 prefetches and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand accesses.

L2_READ_MISS

Counts data loads that missed the local L2 cache, at the hardware thread level. It includes L2 prefetches that missed the local L2 cache and so is not useful for determining standard metrics like L2 Hit/Miss rate that are normally based on demand misses.

L2_STRONGLY_ORDERED_STREAMING_VSTORES_MISS

Number of strongly ordered streaming vector stores that missed the L2 and were sent to the ring.

L2_VICTIM_REQ_WITH_DATA

Counts the number of modified cachelines evicted from the L2 Data cache. These result in a memory write operation, also known as an explicit L2 write-back. This event counts at the hardware core level; at any given time, every executing hardware thread on the core has the same value for this counter.

L2_WEAKLY_ORDERED_STREAMING_VSTORE_MISS

Number of weakly ordered streaming vector stores that missed the L2 and were sent to the ring.

L2_WRITE_HIT

Number of writes that hit the L2 cache

LONG_DATA_PAGE_WALK

Counts misses in the L2 TLB, at the hardware thread level. TLB Misses could have been caused by either demand data loads and stores or data prefetches.
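
The two page-walk events can be compared to see how often an L1 TLB miss also misses the L2 TLB. The sketch below is an illustrative assumption rather than a metric defined here: it divides DATA_PAGE_WALK by LONG_DATA_PAGE_WALK, and also relates DATA_PAGE_WALK to DATA_READ_OR_WRITE. Note that the page-walk events may include prefetch-induced misses while DATA_READ_OR_WRITE counts demand accesses only, so the second ratio is approximate. Helper names and sample counts are hypothetical.

```python
def l1_tlb_misses_per_l2_tlb_miss(data_page_walk: int, long_data_page_walk: int) -> float:
    """DATA_PAGE_WALK / LONG_DATA_PAGE_WALK; higher means the L2 TLB absorbs more misses."""
    if long_data_page_walk == 0:
        return float("inf")  # no L2 TLB misses were counted
    return data_page_walk / long_data_page_walk


def l1_tlb_miss_ratio(data_page_walk: int, data_read_or_write: int) -> float:
    """Approximate L1 TLB misses per demand data access."""
    if data_read_or_write <= 0:
        return 0.0
    return data_page_walk / data_read_or_write


# Hypothetical per-thread counts.
print(l1_tlb_misses_per_l2_tlb_miss(20_000, 2_000))  # 10.0
print(l1_tlb_miss_ratio(20_000, 2_000_000))          # 0.01
```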

MEMORY_ACCESSES_IN_BOTH_PIPES

Number of data memory reads or writes that are paired in both pipes of the pipeline

MICROCODE_CYCLES

The number of cycles microcode is executing. While microcode is executing, all other threads are stalled.

PIPELINE_AGI_STALLS

Number of address generation interlock (AGI) stalls. An AGI occurring in both the U- and V-pipelines in the same clock signals this event twice.

PIPELINE_FLUSHES

Number of pipeline flushes that occur

PIPELINE_SG_AGI_STALLS

Number of address generation interlock (AGI) stalls due to vscatter* and vgather* instructions.

SNP_HITM_L2

Counts incoming snoops that hit a modified cacheline in a hardware thread's local L2. These result in a cache-to-cache transfer: the line will be evicted from the local L2, written back to memory (also called an implicit write-back), and the line will be loaded exclusively into the requesting core's cache. This event counts at the hardware core level; at any given time, every executing hardware thread on the core has the same value for this counter.

SNP_HIT_L2

Number of snoops that hit in the L2 cache

UNC_F_CH0_NORMAL_READ

This counts the number of normal reads sent to channel 0

UNC_F_CH0_NORMAL_WRITE

This counts the number of normal writes sent to channel 0

UNC_F_CH1_NORMAL_READ

This counts the number of normal reads sent to channel 1

UNC_F_CH1_NORMAL_WRITE

This counts the number of normal writes sent to channel 1
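
The four channel events can be summed to approximate memory traffic. The sketch below rests on an assumption that is not stated in this reference: that each normal read or write transfers one 64-byte cacheline. It also assumes the counts and the elapsed time cover the same interval. The helper names, the `BYTES_PER_TRANSACTION` constant, and the sample values are hypothetical.

```python
BYTES_PER_TRANSACTION = 64  # ASSUMPTION: one 64-byte cacheline per normal read/write


def memory_bytes(ch0_read: int, ch0_write: int, ch1_read: int, ch1_write: int) -> int:
    """Estimated bytes moved, from the UNC_F_CH{0,1}_NORMAL_{READ,WRITE} counts."""
    return (ch0_read + ch0_write + ch1_read + ch1_write) * BYTES_PER_TRANSACTION


def bandwidth_gb_per_s(total_bytes: int, elapsed_seconds: float) -> float:
    """Average bandwidth in GB/s (10**9 bytes per second) over the interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes / elapsed_seconds / 1e9


# Hypothetical counts collected over a 0.5 second interval.
total = memory_bytes(1_000_000, 500_000, 1_000_000, 500_000)
print(total)                                   # 192000000
print(f"{bandwidth_gb_per_s(total, 0.5):.3f}")  # 0.384
```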

VPU_DATA_READ

Number of read transactions that were issued. In general, each read transaction reads one 64-byte cacheline. If there are alignment issues, reads against multiple cachelines are each counted individually.

VPU_DATA_READ_MISS

VPU L1 data cache read miss. Counts the number of occurrences.

VPU_DATA_WRITE

Number of write transactions that were issued. In general, each write transaction writes one 64-byte cacheline. If there are alignment issues, writes against multiple cachelines are each counted individually.

VPU_DATA_WRITE_MISS

VPU L1 data cache write miss. Counts the number of occurrences.

VPU_ELEMENTS_ACTIVE

Increments by 1 for every element to which an executed VPU instruction applies. For example, if a VPU instruction executes with a mask register containing 1, it applies to only one element and so this event increments by 1. If a VPU instruction executes with a mask register containing 0xFF, this event is incremented by 8. Counts at the hardware thread level.
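
The two examples above are consistent with the increment being the number of set bits in the mask register. A minimal sketch of that reading, assuming each set mask bit selects exactly one element; the helper name `active_elements` is hypothetical:

```python
def active_elements(mask: int) -> int:
    """Number of set bits in a mask value, i.e. elements the instruction applies to."""
    return bin(mask).count("1")


print(active_elements(0x1))   # 1
print(active_elements(0xFF))  # 8
```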

VPU_INSTRUCTIONS_EXECUTED

Counts the number of VPU instructions executed by a hardware thread. This event is a subset of INSTRUCTIONS_EXECUTED.
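
Dividing VPU_ELEMENTS_ACTIVE by this count gives the average number of elements each VPU instruction operated on, a figure sometimes called vectorization intensity. The sketch below is an illustration of that ratio rather than a definition taken from this reference; the helper name and sample counts are hypothetical.

```python
def vectorization_intensity(elements_active: int, vpu_instructions: int) -> float:
    """Average active elements per VPU instruction (hypothetical helper)."""
    if vpu_instructions <= 0:
        return 0.0  # no VPU instructions were executed
    return elements_active / vpu_instructions


# Hypothetical counts for VPU_ELEMENTS_ACTIVE and VPU_INSTRUCTIONS_EXECUTED.
print(vectorization_intensity(12_000_000, 1_000_000))  # 12.0
```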

VPU_INSTRUCTIONS_EXECUTED_V_PIPE

Counts the number of VPU instructions that paired and executed in the V-pipe.

VPU_STALL_REG

VPU stall on register dependency. Counts the number of occurrences. Dependencies include RAW, WAW, and WAR.