Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
The L2 is the last and longest-latency level in the memory hierarchy before the main memory (DRAM) or MCDRAM. While L2 hits are serviced much more quickly than hits in DRAM or MCDRAM, they can still incur a significant performance penalty. This metric also includes coherence penalties for shared data. The L2 Hit Bound metric shows a ratio of cycles spent handling L2 hits to all cycles. The cycles spent handling L2 hits are calculated as L2 CACHE HIT COST * L2 CACHE HIT COUNT where L2 CACHE HIT COST is a constant measured as typical L2 access latency in cycles.
A significant proportion of cycles is being spent on data fetches that miss the L1 but hit the L2. This metric includes coherence penalties for shared data. If contested accesses or data sharing are indicated as likely issues, address them first. Otherwise, consider the same performance tuning as you would apply for an L2-missing workload - reduce the data working set size, improve data access locality, consider blocking or otherwise partitioning your working set so that it fits in the L1, or better exploit hardware prefetchers. Consider using software prefetchers, but note that they can interfere with normal loads, potentially increasing latency, and that they increase pressure on the memory system.