Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
This section provides a reference for the hardware events that can be monitored for the CPU(s):
The following performance-monitoring events are supported:
Any uop executed by the Divider. (This includes all divide uops, sqrt, ...)
Note that a whole rep string only counts AVX_INST.ALL once.
Counts the total number of times the front end is resteered, mainly when the Branch Prediction Unit (BPU) cannot provide a correct prediction and this is corrected by other branch-handling mechanisms at the front end.
Speculative and retired branches
Speculative and retired macro-conditional branches
Speculative and retired macro-unconditional branches excluding calls and indirects
Speculative and retired direct near calls
Speculative and retired indirect branches excluding calls and returns
Speculative and retired indirect return branches.
Not taken macro-conditional branches
Taken speculative and retired macro-conditional branches
Taken speculative and retired macro-conditional branch instructions excluding calls and indirects
Taken speculative and retired direct near calls
Taken speculative and retired indirect branches excluding calls and returns
Taken speculative and retired indirect calls
Taken speculative and retired indirect branches with return mnemonic
All (macro) branch instructions retired.
All (macro) branch instructions retired.
Conditional branch instructions retired.
Conditional branch instructions retired.
Far branch instructions retired.
Direct and indirect near call instructions retired.
Direct and indirect near call instructions retired.
Direct and indirect macro near call instructions retired (captured in ring 3).
Direct and indirect macro near call instructions retired (captured in ring 3).
Return instructions retired.
Return instructions retired.
Taken branch instructions retired.
Taken branch instructions retired.
Not taken branch instructions retired.
Speculative and retired mispredicted macro conditional branches
Speculative and retired mispredicted macro conditional branches
Mispredicted indirect branches excluding calls and returns
Not taken speculative and retired mispredicted macro conditional branches
Taken speculative and retired mispredicted macro conditional branches
Taken speculative and retired mispredicted indirect branches excluding calls and returns
Taken speculative and retired mispredicted indirect calls
Taken speculative and retired mispredicted indirect branches with return mnemonic
All mispredicted macro branch instructions retired.
This event counts all mispredicted branch instructions retired. This is a precise event.
Mispredicted conditional branch instructions retired.
Mispredicted conditional branch instructions retired.
Number of near branch instructions retired that were mispredicted and taken.
Number of near branch instructions retired that were mispredicted and taken.
Unhalted core cycles when the thread is in ring 0
Number of intervals between processor halts while thread is in ring 0
Unhalted core cycles when thread is in rings 1, 2, or 3
Counts XClk pulses when this thread is unhalted and the other thread is halted.
Reference cycles when the thread is unhalted (counts at 100 MHz rate)
Reference cycles when at least one thread on the physical core is unhalted (counts at 100 MHz rate)
This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state.
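As an aside, such a reference-cycles count can also be read outside VTune. The following is a minimal Linux sketch using the perf_event_open interface (an assumption of this example, not part of VTune); PERF_COUNT_HW_REF_CPU_CYCLES is the kernel's generic event that typically maps to this counter.

    /* Minimal sketch: count reference (unhalted) cycles for this thread
     * via the Linux perf interface. Assumes a Linux kernel with perf
     * support; error handling is abbreviated. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_REF_CPU_CYCLES; /* reference cycles */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* this thread (pid 0), any CPU, no group */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (volatile int i = 0; i < 1000000; i++) ;  /* region of interest */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("ref cycles: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }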
This event counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling.
Core cycles when at least one thread on the physical core is not in halt state
Thread cycles when thread is not in halt state
Core cycles when at least one thread on the physical core is not in halt state
Cycles with pending L1 cache miss loads.
Cycles with pending L2 cache miss loads.
Cycles with pending memory loads.
This event counts cycles during which no instructions were executed in the execution stage of the pipeline.
Execution stalls due to L1 data cache misses
Execution stalls due to L2 cache misses.
This event counts cycles during which no instructions were executed in the execution stage of the pipeline and there were memory instructions pending (waiting for data).
Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles.
Load misses in all DTLB levels that cause page walks
DTLB demand load misses with low part of linear-to-physical address translation missed
Load operations that miss the first DTLB level but hit the second and do not cause page walks
This event counts load operations from a 2M page that miss the first DTLB level but hit the second and do not cause page walks.
This event counts load operations from a 4K page that miss the first DTLB level but hit the second and do not cause page walks.
A demand load miss in all translation lookaside buffer (TLB) levels causes a page walk that completes (any page size).
A demand load miss in all translation lookaside buffer (TLB) levels causes a page walk that completes (2M/4M).
A demand load miss in all translation lookaside buffer (TLB) levels causes a page walk that completes (4K).
This event counts cycles when the page miss handler (PMH) is servicing page walks caused by DTLB load misses.
Store misses in all DTLB levels that cause page walks
DTLB store misses with low part of linear-to-physical address translation missed
Store operations that miss the first TLB level but hit the second and do not cause page walks
This event counts store operations from a 2M page that miss the first DTLB level but hit the second and do not cause page walks.
This event counts store operations from a 4K page that miss the first DTLB level but hit the second and do not cause page walks.
Store misses in all DTLB levels that cause completed page walks
Store misses in all DTLB levels that cause completed page walks (2M/4M)
Store misses in all DTLB levels that cause completed page walks (4K)
This event counts cycles when the page miss handler (PMH) is servicing page walks caused by DTLB store misses.
Cycle count for an Extended Page Table walk.
Cycles with any input/output SSE or FP assist
Number of SIMD FP assists due to input values
Number of SIMD FP assists due to output values
Number of X87 assists due to input value.
Number of X87 assists due to output value.
Number of times an HLE execution aborted due to any reasons (multiple categories may count as one).
Number of times an HLE execution aborted due to various memory events (e.g., read/write capacity and conflicts).
Number of times an HLE execution aborted due to uncommon conditions
Number of times an HLE execution aborted due to HLE-unfriendly instructions
Number of times an HLE execution aborted due to incompatible memory type
Number of times an HLE execution aborted due to none of the previous 4 categories (e.g. interrupts)
Number of times an HLE execution aborted due to any reasons (multiple categories may count as one).
Number of times an HLE execution successfully committed
Number of times an HLE execution started.
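To make the started/committed/aborted distinction concrete, here is a hedged sketch of an elided spinlock using GCC's HLE memory-order hints (assumes GCC with -mhle on HLE-capable hardware; the function names are illustrative). The XACQUIRE-prefixed exchange is what starts an HLE execution; the XRELEASE-prefixed store commits it.

    #include <immintrin.h>  /* _mm_pause */

    static volatile int lock;

    static void hle_lock(void) {
        /* XACQUIRE-prefixed exchange: starts an HLE elided execution. */
        while (__atomic_exchange_n(&lock, 1,
                                   __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
            while (lock)
                _mm_pause();  /* spin read-only until the lock looks free */
    }

    static void hle_unlock(void) {
        /* XRELEASE-prefixed store: commits the elided region on success. */
        __atomic_store_n(&lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
    }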
Number of Instruction Cache, Streaming Buffer and Victim Cache reads, both cacheable and noncacheable, including UC fetches
Cycles where a code fetch is stalled due to L1 instruction-cache miss.
Cycles where a code fetch is stalled due to L1 instruction-cache miss.
This event counts Instruction Cache (ICACHE) misses.
Cycles Decode Stream Buffer (DSB) is delivering 4 Uops
Cycles Decode Stream Buffer (DSB) is delivering any Uop
Cycles MITE is delivering 4 Uops
Cycles MITE is delivering any Uop
Cycles when uops are being delivered to Instruction Decode Queue (IDQ) from Decode Stream Buffer (DSB) path
Uops delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path
Instruction Decode Queue (IDQ) empty cycles
Uops delivered to Instruction Decode Queue (IDQ) from MITE path
Cycles when uops are being delivered to Instruction Decode Queue (IDQ) from MITE path
Uops delivered to Instruction Decode Queue (IDQ) from MITE path
This event counts cycles during which the microcode sequencer assisted the Front-end in delivering uops. Microcode assists are used for complex instructions or scenarios that can't be handled by the standard decoder. Using other instructions, if possible, will usually improve performance.
Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy
Deliveries to Instruction Decode Queue (IDQ) initiated by Decode Stream Buffer (DSB) while Microcode Sequencer (MS) is busy
Uops initiated by Decode Stream Buffer (DSB) that are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy
Uops initiated by MITE and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy
Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer
This event counts uops delivered by the Front-end with the assistance of the microcode sequencer. Microcode assists are used for complex instructions or scenarios that can't be handled by the standard decoder. Using other instructions, if possible, will usually improve performance.
This event counts the number of undelivered (unallocated) uops from the Front-end to the Resource Allocation Table (RAT) while the Back-end of the processor is not stalled. The Front-end can allocate up to 4 uops per cycle, so this event can increment 0-4 times per cycle depending on the number of unallocated uops. This event is counted on a per-core basis.
This event counts the number of cycles during which the Front-end allocated exactly zero uops to the Resource Allocation Table (RAT) while the Back-end of the processor is not stalled. This event is counted on a per-core basis.
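These two events feed the usual front-end-bound estimate. The formulation below is the common Top-down calculation (an assumption of this example, not stated in this reference): undelivered slots divided by the 4-slots-per-cycle allocation width mentioned above.

    /* Fraction of issue slots the front end failed to fill while the
     * back end could have accepted uops (0.0 .. 1.0). */
    double frontend_bound(double idq_uops_not_delivered_core,
                          double cpu_clk_unhalted_thread) {
        return idq_uops_not_delivered_core / (4.0 * cpu_clk_unhalted_thread);
    }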
Counts cycles in which the front end (FE) delivered 4 uops or the Resource Allocation Table (RAT) was stalling the FE.
Cycles per thread when 3 or more uops are not delivered to Resource Allocation Table (RAT) when backend of the machine is not stalled
Cycles with less than 2 uops delivered by the front end.
Cycles with less than 3 uops delivered by the front end.
Stall cycles because IQ is full
This event counts cycles where the decoder is stalled on an instruction with a length changing prefix (LCP).
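For illustration, a hedged inline-assembly sketch of the classic LCP case: the 66h operand-size prefix combined with a 16-bit immediate changes the instruction's length and can stall the legacy decoder (GCC/Clang on x86-64; the helper name is hypothetical).

    static inline void lcp_example(unsigned short *p) {
        /* addw with a 16-bit immediate encodes as 66 81 /0 imm16:
         * a length-changing prefix on the legacy decode path. */
        __asm__ volatile ("addw $0x1234, %0" : "+m" (*p));
    }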
This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers. INST_RETIRED.ANY is counted by a designated fixed counter, leaving the programmable counters available for other events. Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.
Number of instructions retired. General Counter - architectural event
Precise instruction-retired event with hardware support to reduce the effect of the PEBS shadow in the IP distribution
This is a non-precise version (that is, it does not use PEBS) of the event that counts FP operations retired. For X87 FP operations that have no exceptions, counting also includes flows that have several X87 uops, or flows that use X87 uops in exception handling.
This event counts the number of cycles spent waiting for a recovery after an event such as a processor nuke, JEClear, assist, or HLE/RTM abort.
Core cycles the allocator was stalled due to recovery from earlier clear event for any thread running on the physical core (e.g. misprediction or memory nuke)
Flushing of Instruction TLB (ITLB) pages; includes 4K/2M/4M pages.
Misses at all ITLB levels that cause page walks
Operations that miss the first ITLB level but hit the second and do not cause any page walks
Code misses that miss the DTLB and hit the STLB (2M)
Code misses that miss the DTLB and hit the STLB (4K)
Misses in all ITLB levels that cause completed page walks
A code miss in all TLB levels causes a page walk that completes (2M/4M).
A code miss in all TLB levels causes a page walk that completes (4K).
This event counts cycles when the page miss handler (PMH) is servicing page walks caused by ITLB misses.
This event counts when new data lines are brought into the L1 Data cache, which cause other lines to be evicted from the cache.
Cycles a demand request was blocked due to Fill Buffer unavailability
L1D miss outstanding duration in cycles
Cycles with L1D load misses outstanding.
Cycles with L1D load misses outstanding from any thread on physical core
Number of times a request needed a fill buffer (FB) entry but no entry was available for it; that is, FB unavailability was the dominant reason for blocking the request. A request includes cacheable/uncacheable demand loads, stores, or SW prefetches. Hardware prefetches (HWP) are excluded.
Not rejected writebacks that hit L2 cache
This event counts the number of L2 cache lines brought into the L2 cache. Lines are filled into the L2 cache when there was an L2 miss.
L2 cache lines in E state filling L2
L2 cache lines in I state filling L2
L2 cache lines in S state filling L2
Clean L2 cache lines evicted by demand
Dirty L2 cache lines evicted by demand
L2 code requests
Demand Data Read requests
Demand requests that miss L2 cache
Demand requests to L2 cache
Requests from L2 hardware prefetchers
RFO requests to L2 cache
L2 cache hits when fetching instructions, code reads.
L2 cache misses when fetching instructions
Demand Data Read requests that hit L2 cache
Demand Data Read requests that miss L2, no rejects
L2 prefetch requests that hit L2 cache
L2 prefetch requests that miss L2 cache
All requests that miss L2 cache
All L2 requests
RFO requests that hit L2 cache
RFO requests that miss L2 cache
L2 or L3 HW prefetches that access L2 cache
Transactions accessing L2 pipe
L2 cache accesses when fetching instructions
Demand Data Read requests that access L2 cache
L1D writebacks that access L2 cache
L2 fill requests that access L2 cache
L2 writebacks that access L2 cache
RFO requests that access L2 cache
The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use
This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load. The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceding smaller uncompleted store. The penalty for blocked store forwarding is that the load must wait for the store to write its value to the cache before it can be issued.
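A minimal sketch of the blocked pattern, assuming a compiler that lowers the memcpy below to a plain 4-byte load (names are illustrative): the wide load overlaps the narrower, still-uncompleted store, so forwarding fails and the load waits.

    #include <stdint.h>
    #include <string.h>

    uint32_t store_forward_block(uint8_t *buf) {
        uint32_t v;
        buf[1] = 0x5A;              /* narrow 1-byte store...              */
        memcpy(&v, buf, sizeof v);  /* ...overlapped by a 4-byte load: the
                                       store cannot be forwarded, so the
                                       load waits for it to reach cache   */
        return v;
    }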
Aliasing occurs when a load is issued after a store and their memory addresses are offset by 4K. This event counts the number of loads that aliased with a preceding store, resulting in an extended address check in the pipeline which can have a performance impact.
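A hedged sketch of a loop prone to this aliasing (names illustrative): when the buffer offset makes a load land exactly 4096 bytes from a preceding store, their low 12 address bits match and the load is falsely flagged as dependent.

    #include <stddef.h>

    void add_one(float *dst, const float *src, size_t n) {
        /* Pathological when, e.g., (char *)dst == (char *)src + 4100:
         * the store to dst[i] and the next iteration's load of src[i+1]
         * are then exactly 4096 bytes apart, so their low 12 address
         * bits match and the load takes the extended address check. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] + 1.0f;
    }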
Non-software-prefetch load dispatches that hit a fill buffer (FB) allocated for a hardware prefetch
Non-software-prefetch load dispatches that hit a fill buffer (FB) allocated for a software prefetch
Cycles when L1D is locked
Cycles when L1 and L2 are locked due to UC or split lock
Core-originated cacheable demand requests missed L3
Core-originated cacheable demand requests that refer to L3
Cycles in which 4 uops were delivered by the LSD but did not come from the decoder
Cycles in which uops were delivered by the LSD but did not come from the decoder
Number of Uops delivered by the LSD.
Number of machine clears (nukes) of any type.
Cycles during which there was a nuke; accounts for both thread-specific and all-thread nukes.
This event counts the number of executed Intel AVX masked load operations that refer to an illegal address range with the mask bits set to 0.
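A hedged sketch of the pattern this event relates to, using the AVX masked-load intrinsic (AVX2 is assumed for the mask computation; the helper is illustrative): lanes whose mask bit is clear are not architecturally loaded, so they may point past the end of an array, and such masked-off illegal addresses are what can invoke the assist.

    #include <immintrin.h>

    /* Load up to 8 floats, masking off lanes past `remaining`. */
    __m256 load_tail(const float *p, int remaining) {
        __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        /* Lane mask: all-ones (top bit set) where idx < remaining. */
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(remaining), idx);
        /* Masked-off lanes may overlap an unmapped page just past the
         * array; the architectural result is still 0 for those lanes. */
        return _mm256_maskload_ps(p, mask);
    }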
This event counts the number of memory ordering machine clears detected. Memory ordering machine clears can result from memory address aliasing or snoops from another hardware thread or core to data inflight in the pipeline. Machine clears can have a significant performance impact if they are happening frequently.
This event is incremented when self-modifying code (SMC) is detected, which causes a machine clear. Machine clears can have a significant performance impact if they are happening frequently.
Retired load uops whose data sources were an L3 hit and a cross-core snoop hit in an on-pkg core cache.
Retired load uops whose data sources were HitM responses from shared L3.
This event counts retired load uops that hit in the L3 cache, but required a cross-core snoop which resulted in a HITM (hit modified) in an on-pkg core cache. This does not include hardware prefetches. This is a precise event.
This event counts retired load uops that hit in the L3 cache, but required a cross-core snoop which resulted in a HIT in an on-pkg core cache. This does not include hardware prefetches. This is a precise event.
Retired load uops whose data sources were an L3 hit and a cross-core snoop miss in an on-pkg core cache.
Retired load uops whose data sources were an L3 hit and a cross-core snoop miss in an on-pkg core cache.
Retired load uops whose data sources were hits in L3 without snoops required.
Retired load uops whose data sources were hits in L3 without snoops required.
This event counts retired load uops where the data came from local DRAM. This does not include hardware prefetches.
This event counts retired load uops where the data came from local DRAM. This does not include hardware prefetches. This is a precise event.
Retired load uops that missed the L1 cache but hit the fill buffer (FB) due to a preceding miss to the same cache line with the data not ready.
Retired load uops that missed the L1 cache but hit the fill buffer (FB) due to a preceding miss to the same cache line with the data not ready.
Retired load uops with L1 cache hits as data sources.
Retired load uops with L1 cache hits as data sources.
Retired load uops with L1 cache misses as data sources.
This event counts retired load uops in which data sources missed in the L1 cache. This does not include hardware prefetches. This is a precise event.
Retired load uops with L2 cache hits as data sources.
Retired load uops with L2 cache hits as data sources.
Miss in mid-level (L2) cache. Excludes Unknown data-source.
Retired load uops with L2 cache misses as data sources.
Retired load uops whose data sources were data hits in L3 without snoops required.
This event counts retired load uops in which data sources were data hits in the L3 cache without snoops required. This does not include hardware prefetches. This is a precise event.
Miss in last-level (L3) cache. Excludes Unknown data-source.
Miss in last-level (L3) cache. Excludes Unknown data-source.
Loads with a latency value above 128
Loads with a latency value above 16
Loads with a latency value above 256
Loads with a latency value above 32
Loads with a latency value above 4
Loads with a latency value above 512
Loads with a latency value above 64
Loads with a latency value above 8
All retired load uops.
All retired load uops. (Precise event)
All retired store uops.
This event counts all store uops retired. This is a precise event.
Retired load uops with locked access.
Retired load uops with locked access. (Precise event)
Retired load uops that split across a cacheline boundary.
This event counts load uops retired that had memory addresses split across 2 cache lines. A line split is across 64B cache lines and may include a page split (4K). This is a precise event.
Retired store uops that split across a cacheline boundary.
This event counts store uops retired that had memory addresses split across 2 cache lines. A line split is across 64B cache lines and may include a page split (4K). This is a precise event.
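A minimal sketch of a line-split access, assuming the 64-byte lines described above (names illustrative): a 4-byte load starting 2 bytes before a line boundary spans two lines.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    uint32_t split_load_demo(void) {
        uint8_t *buf = aligned_alloc(64, 128);  /* line-aligned buffer */
        memset(buf, 0xAB, 128);

        uint32_t v;
        memcpy(&v, buf + 62, sizeof v);  /* bytes 62..65 cross the 64B
                                            boundary: a line-split load */
        free(buf);
        return v;
    }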
Retired load uops that miss the STLB.
Retired load uops that miss the STLB. (Precise event)
Retired store uops that miss the STLB.
Retired store uops that miss the STLB. (Precise event)
Speculative cache line split load uops dispatched to L1 cache
Speculative cache line split STA uops dispatched to L1 cache
Number of integer Move Elimination candidate uops that were eliminated.
Number of integer Move Elimination candidate uops that were not eliminated.
Number of SIMD Move Elimination candidate uops that were eliminated.
Number of SIMD Move Elimination candidate uops that were not eliminated.
Demand and prefetch data reads
Cacheable and noncacheable code read requests
Demand Data Read requests sent to uncore
Demand RFO requests including regular RFOs, locks, ItoM
Offcore requests buffer cannot take more entries for this thread core.
Offcore outstanding cacheable Core Data Read transactions in SuperQueue (SQ), queue to uncore
Cycles when offcore outstanding cacheable Core Data Read transactions are present in SuperQueue (SQ), queue to uncore
Cycles when offcore outstanding Demand Data Read transactions are present in SuperQueue (SQ), queue to uncore
Offcore outstanding demand RFO transactions in the SuperQueue (SQ), queue to uncore, every cycle
Offcore outstanding code read transactions in the SuperQueue (SQ), queue to uncore, every cycle
Offcore outstanding Demand Data Read transactions in uncore queue.
Cycles with at least 6 offcore outstanding Demand Data Read transactions in uncore queue
Offcore outstanding RFO store transactions in SuperQueue (SQ), queue to uncore
Offcore response events can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefined mask bit values in a dedicated MSR that specify the attributes of the offcore transaction.
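As an illustration of this dedicated-MSR model, here is a hedged user-space sketch using the Linux msr driver (assumes root and the msr kernel module; MSR_OFFCORE_RSP_0 at 0x1A6 matches recent Intel cores, and RSP_VALUE is a hypothetical request/response bit pattern; consult the SDM for the encoding on a given part):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_OFFCORE_RSP_0 0x1A6   /* dedicated off-core response MSR */
    #define RSP_VALUE 0x10001ULL      /* hypothetical: demand data read,
                                         any response */

    int main(void) {
        int fd = open("/dev/cpu/0/msr", O_WRONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
        uint64_t val = RSP_VALUE;
        /* The msr driver uses the file offset as the MSR address. */
        if (pwrite(fd, &val, sizeof val, MSR_OFFCORE_RSP_0) != sizeof val) {
            perror("pwrite");
            return 1;
        }
        close(fd);
        return 0;
    }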
Counts all demand & prefetch code reads that hit in the L3
Counts all demand & prefetch code reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all demand & prefetch code reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all demand & prefetch code reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all demand & prefetch code reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all demand & prefetch code reads that miss in the L3
Counts all demand & prefetch code reads that miss the L3 and the data is returned from local dram
Counts all demand & prefetch data reads that hit in the L3
Counts all demand & prefetch data reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all demand & prefetch data reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all demand & prefetch data reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all demand & prefetch data reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all demand & prefetch data reads that miss in the L3
Counts all demand & prefetch data reads that miss the L3 and the data is returned from local dram
Counts all prefetch code reads that hit in the L3
Counts all prefetch code reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all prefetch code reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all prefetch code reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all prefetch code reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all prefetch code reads that miss in the L3
Counts all prefetch code reads that miss the L3 and the data is returned from local dram
Counts all prefetch data reads that hit in the L3
Counts all prefetch data reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all prefetch data reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all prefetch data reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all prefetch data reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all prefetch data reads that miss in the L3
Counts all prefetch data reads that miss the L3 and the data is returned from local dram
Counts prefetch RFOs that hit in the L3
Counts prefetch RFOs that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts prefetch RFOs that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts prefetch RFOs that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts prefetch RFOs that hit in the L3 and the snoops sent to sibling cores return clean response
Counts prefetch RFOs that miss in the L3
Counts prefetch RFOs that miss the L3 and the data is returned from local dram
Counts all data/code/rfo reads (demand & prefetch) that hit in the L3
Counts all data/code/rfo reads (demand & prefetch) that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all data/code/rfo reads (demand & prefetch) that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all data/code/rfo reads (demand & prefetch) that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all data/code/rfo reads (demand & prefetch) that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all data/code/rfo reads (demand & prefetch) that miss in the L3
Counts all data/code/rfo reads (demand & prefetch) that miss the L3 and the data is returned from local dram
Counts all requests that hit in the L3
Counts all requests that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all requests that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all requests that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all requests that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all requests that miss in the L3
Counts all requests that miss the L3 and the data is returned from local dram
Counts all demand & prefetch RFOs that hit in the L3
Counts all demand & prefetch RFOs that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all demand & prefetch RFOs that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all demand & prefetch RFOs that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all demand & prefetch RFOs that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all demand & prefetch RFOs that miss in the L3
Counts all demand & prefetch RFOs that miss the L3 and the data is returned from local dram
Counts all demand code reads that hit in the L3
Counts all demand code reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all demand code reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all demand code reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all demand code reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all demand code reads that miss in the L3
Counts all demand code reads that miss the L3 and the data is returned from local dram
Counts demand data reads that hit in the L3
Counts demand data reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts demand data reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts demand data reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts demand data reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts demand data reads that miss in the L3
Counts demand data reads that miss the L3 and the data is returned from local dram
Counts all demand data writes (RFOs) that hit in the L3
Counts all demand data writes (RFOs) that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all demand data writes (RFOs) that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all demand data writes (RFOs) that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all demand data writes (RFOs) that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all demand data writes (RFOs) that miss in the L3
Counts all demand data writes (RFOs) that miss the L3 and the data is returned from local dram
Counts any other requests that hit in the L3
Counts any other requests that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts any other requests that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts any other requests that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts any other requests that hit in the L3 and the snoops sent to sibling cores return clean response
Counts any other requests that miss in the L3
Counts any other requests that miss the L3 and the data is returned from local dram
Counts all prefetch (that bring data to LLC only) code reads that hit in the L3
Counts all prefetch (that bring data to LLC only) code reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all prefetch (that bring data to LLC only) code reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all prefetch (that bring data to LLC only) code reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all prefetch (that bring data to LLC only) code reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all prefetch (that bring data to LLC only) code reads that miss in the L3
Counts all prefetch (that bring data to LLC only) code reads that miss the L3 and the data is returned from local dram
Counts prefetch (that bring data to L2) data reads that hit in the L3
Counts prefetch (that bring data to L2) data reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts prefetch (that bring data to L2) data reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts prefetch (that bring data to L2) data reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts prefetch (that bring data to L2) data reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts prefetch (that bring data to L2) data reads that miss in the L3
Counts prefetch (that bring data to L2) data reads that miss the L3 and the data is returned from local dram
Counts all prefetch (that bring data to L2) RFOs that hit in the L3
Counts all prefetch (that bring data to L2) RFOs that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all prefetch (that bring data to L2) RFOs that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all prefetch (that bring data to L2) RFOs that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all prefetch (that bring data to L2) RFOs that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all prefetch (that bring data to L2) RFOs that miss in the L3
Counts all prefetch (that bring data to L2) RFOs that miss the L3 and the data is returned from local dram
Counts prefetch (that bring data to LLC only) code reads that hit in the L3
Counts prefetch (that bring data to LLC only) code reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts prefetch (that bring data to LLC only) code reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts prefetch (that bring data to LLC only) code reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts prefetch (that bring data to LLC only) code reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts prefetch (that bring data to LLC only) code reads that miss in the L3
Counts prefetch (that bring data to LLC only) code reads that miss the L3 and the data is returned from local dram
Counts all prefetch (that bring data to LLC only) data reads that hit in the L3
Counts all prefetch (that bring data to LLC only) data reads that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all prefetch (that bring data to LLC only) data reads that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all prefetch (that bring data to LLC only) data reads that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all prefetch (that bring data to LLC only) data reads that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all prefetch (that bring data to LLC only) data reads that miss in the L3
Counts all prefetch (that bring data to LLC only) data reads that miss the L3 and the data is returned from local dram
Counts all prefetch (that bring data to LLC only) RFOs that hit in the L3
Counts all prefetch (that bring data to LLC only) RFOs that hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded
Counts all prefetch (that bring data to LLC only) RFOs that hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded
Counts all prefetch (that bring data to LLC only) RFOs that hit in the L3 and sibling core snoops are not needed as either the core-valid bit is not set or the shared line is present in multiple cores
Counts all prefetch (that bring data to LLC only) RFOs that hit in the L3 and the snoops sent to sibling cores return clean response
Counts all prefetch (that bring data to LLC only) RFOs that miss in the L3
Counts all prefetch (that bring data to LLC only) RFOs that miss the L3 and the data is returned from local dram
Number of times any microcode assist is invoked by HW upon uop writeback.
Number of transitions from AVX-256 to legacy SSE when penalty applicable.
Number of transitions from SSE to AVX-256 when penalty applicable.
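A hedged sketch of the transition pattern and its standard mitigation (the function is illustrative): executing VZEROUPPER after 256-bit AVX code avoids the AVX-to-SSE penalty when legacy-encoded SSE code follows, which matters when that SSE code was compiled without VEX encoding.

    #include <immintrin.h>

    void avx_then_sse(float *a, const float *b) {
        __m256 v = _mm256_loadu_ps(b);            /* 256-bit AVX section */
        _mm256_storeu_ps(a, _mm256_add_ps(v, v));

        _mm256_zeroupper();  /* clear upper YMM halves before any
                                legacy-encoded SSE code executes */

        __m128 s = _mm_loadu_ps(b);               /* SSE section */
        _mm_storeu_ps(a, _mm_add_ps(s, s));
    }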
Number of DTLB page walker hits in the L1+FB
Number of DTLB page walker hits in the L2
Number of DTLB page walker hits in the L3 + XSNP
Number of DTLB page walker hits in Memory
Counts the number of Extended Page Table walks from the DTLB that hit in the L1 and FB.
Counts the number of Extended Page Table walks from the DTLB that hit in the L2.
Counts the number of Extended Page Table walks from the DTLB that hit in the L3.
Counts the number of Extended Page Table walks from the DTLB that hit in memory.
Counts the number of Extended Page Table walks from the ITLB that hit in the L1 and FB.
Counts the number of Extended Page Table walks from the ITLB that hit in the L2.
Counts the number of Extended Page Table walks from the ITLB that hit in the L3.
Counts the number of Extended Page Table walks from the ITLB that hit in memory.
Number of ITLB page walker hits in the L1+FB
Number of ITLB page walker hits in the L2
Number of ITLB page walker hits in the L3 + XSNP
Number of ITLB page walker hits in Memory
Resource-related stall cycles
Cycles stalled due to re-order buffer full.
Cycles stalled due to no eligible RS entry available.
This event counts cycles during which no instructions were allocated because no Store Buffers (SB) were available.
Counts cases of saving a new LBR.
This event counts cycles when the Reservation Station ( RS ) is empty for the thread. The RS is a structure that buffers allocated micro-ops from the Front-end. If there are many cycles when the RS is empty, it may represent an underflow of instructions delivered from the Front-end.
Counts end of periods where the Reservation Station (RS) was empty. Could be useful to precisely locate Frontend Latency Bound issues.
Number of times an RTM execution aborted due to any reasons (multiple categories may count as one).
Number of times an RTM execution aborted due to various memory events (e.g., read/write capacity and conflicts)
Number of times an RTM execution aborted due to various memory events (e.g., read/write capacity and conflicts).
Number of times an RTM execution aborted due to HLE-unfriendly instructions
Number of times an RTM execution aborted due to incompatible memory type
Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)
Number of times an RTM execution aborted due to any reasons (multiple categories may count as one).
Number of times an RTM execution successfully committed
Number of times an RTM execution started.
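To make the started/committed/aborted counts concrete, a hedged sketch of an RTM region using the TSX intrinsics from immintrin.h (compile with -mrtm; the fallback path is simplified for illustration):

    #include <immintrin.h>

    static volatile int fallback_lock;

    void increment_tx(int *counter) {
        unsigned status = _xbegin();           /* RTM execution starts   */
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock)
                _xabort(0xff);                 /* keep transactions and
                                                  lock holders coherent  */
            (*counter)++;
            _xend();                           /* successful commit      */
        } else {
            /* Abort path; a real implementation would take the
             * fallback lock here. */
            __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
        }
    }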
DTLB flush attempts of the thread-specific entries
STLB flush attempts
Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this counts executions, each occurrence does not necessarily cause a transactional abort.
Counts the number of times a class of instructions (e.g., vzeroupper) that may cause a transactional abort was executed inside a transactional region
Counts the number of times an instruction execution caused the supported transactional nesting count to be exceeded
Counts the number of times a XBEGIN instruction was executed inside an HLE transactional region.
Counts the number of times an HLE XACQUIRE instruction was executed inside an RTM transactional region
Number of times a transactional abort was signaled due to a data capacity limitation for transactional writes.
Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address
Number of times an HLE transactional execution aborted due to XRELEASE lock not satisfying the address and value requirements in the elision buffer
Number of times an HLE transactional execution aborted due to NoAllocatedElisionBuffer being non-zero.
Number of times an HLE transactional execution aborted due to an unsupported read alignment from the elision buffer.
Number of times an HLE transactional region aborted due to a non-XRELEASE-prefixed instruction writing to an elided lock in the elision buffer
Number of times HLE lock could not be elided due to ElisionBufferAvailable being zero.
Each cycle, counts the number of valid entries in the Coherency Tracker queue from allocation until deallocation. Aperture requests (snoops) appear as NC decoded internally and become coherent (snoop L3, access memory).
Number of entries allocated. Accounts for any type: e.g., snoop, core aperture, etc.
Each cycle, counts the number of all core outgoing valid entries. An entry is defined as valid from its allocation until the first of the IDI0 or DRS0 messages is sent out. Accounts for coherent and non-coherent traffic.
Total number of core outgoing entries allocated. Accounts for coherent and non-coherent traffic.
Number of writes allocated, including any write transaction: full/partial writes and evictions.
L3 lookup: any request that accesses the cache and finds the line in E or S state
L3 lookup: any request that accesses the cache and finds the line in I state
L3 lookup: any request that accesses the cache and finds the line in M state
L3 lookup: any request that accesses the cache and finds the line in any MESI state
L3 lookup: external snoop request that accesses the cache and finds the line in E or S state
L3 lookup: external snoop request that accesses the cache and finds the line in I state
L3 lookup: external snoop request that accesses the cache and finds the line in M state
L3 lookup: external snoop request that accesses the cache and finds the line in any MESI state
L3 lookup: read request that accesses the cache and finds the line in E or S state
L3 lookup: read request that accesses the cache and finds the line in I state
L3 lookup: read request that accesses the cache and finds the line in M state
L3 lookup: read request that accesses the cache and finds the line in any MESI state
L3 lookup: write request that accesses the cache and finds the line in E or S state
L3 lookup: write request that accesses the cache and finds the line in I state
L3 lookup: write request that accesses the cache and finds the line in M state
L3 lookup: write request that accesses the cache and finds the line in any MESI state
A cross-core snoop resulting from an L3 eviction hits a modified line in some processor core.
An external snoop hits a modified line in some processor core.
A cross-core snoop initiated by this Cbox due to a processor core memory request hits a modified line in some processor core.
A cross-core snoop resulting from an L3 eviction hits a non-modified line in some processor core.
An external snoop hits a non-modified line in some processor core.
A cross-core snoop initiated by this Cbox due to a processor core memory request hits a non-modified line in some processor core.
A cross-core snoop resulting from an L3 eviction misses in some processor core.
An external snoop misses in some processor core.
A cross-core snoop initiated by this Cbox due to a processor core memory request misses in some processor core.
This 48-bit fixed counter counts the UCLK cycles.
Cycles per thread when uops are executed in port 0
Cycles per thread when uops are executed in port 1
Cycles per thread when uops are executed in port 2
Cycles per thread when uops are executed in port 3
Cycles per thread when uops are executed in port 4
Cycles per thread when uops are executed in port 5
Cycles per thread when uops are executed in port 6
Cycles per thread when uops are executed in port 7
Number of uops executed on the core.
Cycles at least 1 micro-op is executed from any thread on physical core
Cycles at least 2 micro-ops are executed from any thread on physical core
Cycles at least 3 micro-ops are executed from any thread on physical core
Cycles at least 4 micro-ops are executed from any thread on physical core
Cycles with no micro-ops executed from any thread on physical core
This event counts the cycles where at least one uop was executed. It is counted per thread.
This event counts the cycles where at least two uops were executed. It is counted per thread.
This event counts the cycles where at least three uops were executed. It is counted per thread.
Cycles where at least 4 uops were executed per-thread
Counts the number of cycles in which no uops were dispatched to be executed on this thread.
Cycles per thread when uops are executed in port 0
Cycles per core when uops are executed in port 0
Cycles per thread when uops are executed in port 1
Cycles per core when uops are executed in port 1
Cycles per thread when uops are executed in port 2
Cycles per core when uops are dispatched to port 2
Cycles per thread when uops are executed in port 3
Cycles per core when uops are dispatched to port 3
Cycles per thread when uops are executed in port 4
Cycles per core when uops are executed in port 4
Cycles per thread when uops are executed in port 5
Cycles per core when uops are executed in port 5
Cycles per thread when uops are executed in port 6
Cycles per core when uops are executed in port 6
Cycles per thread when uops are executed in port 7
Cycles per core when uops are dispatched to port 7
This event counts the number of uops issued by the Front-end of the pipeline to the Back-end. This event is counted at the allocation stage and will count both retired and non-retired uops.
Cycles when Resource Allocation Table (RAT) does not issue Uops to Reservation Station (RS) for all threads
Number of flags-merge uops being allocated. Such uops are considered performance-sensitive; added by GSR u-arch.
Number of Multiply packed/scalar single precision uops allocated
Number of slow LEA uops being allocated. A uop is generally considered a slow LEA if it has 3 sources (e.g., 2 sources + immediate), regardless of whether or not it results from an LEA instruction.
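For illustration, a hedged inline-assembly sketch of the three-source case (base + index + displacement) that this description treats as a slow LEA (the helper is hypothetical; GCC/Clang on x86-64):

    static inline long slow_lea(long base, long index) {
        long out;
        /* base + index + immediate displacement: 3 sources. */
        __asm__ ("leaq 8(%1,%2), %0" : "=r"(out) : "r"(base), "r"(index));
        return out;
    }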
Cycles when Resource Allocation Table (RAT) does not issue Uops to Reservation Station (RS) for the thread
Actually retired uops.
Actually retired uops.
Cycles without actually retired uops.
This event counts the number of retirement slots used each cycle. There are potentially 4 slots that can be used each cycle - meaning, 4 uops or 4 instructions could retire each cycle.
Retirement slots used.
Cycles without actually retired uops.
Cycles with less than 10 actually retired uops.