Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help
To introduce new uOps into the pipeline, the core must either fetch them from a decoded instruction cache, or fetch the instructions themselves from memory and then decode them. On the second path, requests to memory first go through the L1I (level 1 instruction) cache, which holds the recent code working set. Front-end stalls can accrue when fetched instructions are not present in the L1I. Possible reasons are a large code working set or fragmentation between hot and cold code. With fragmentation, when a hot instruction is fetched into the L1I, any cold code on the same cache line is brought along with it. This may result in the eviction of other, hotter code.
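The effect of hot/cold fragmentation can be illustrated with a small model. The sketch below is not part of the tool; it assumes a 64-byte cache line and a hypothetical `struct func` describing each function's size and whether it is hot, lays the functions out back to back, and counts how many cache lines contain at least one byte of hot code.

```c
#include <stddef.h>

/* Assumed L1I line size in bytes; the text above does not state one. */
#define LINE_BYTES 64

/* Hypothetical description of one function in the code layout. */
struct func {
    size_t size;  /* code size in bytes */
    int hot;      /* nonzero if the function is executed frequently */
};

/*
 * Place the functions contiguously starting at address 0 and return the
 * number of distinct cache lines that hold at least one byte of hot code.
 */
size_t hot_lines(const struct func *f, size_t n)
{
    size_t addr = 0;
    size_t count = 0;
    size_t last = (size_t)-1;  /* index of the last line already counted */

    for (size_t i = 0; i < n; i++) {
        if (f[i].hot && f[i].size > 0) {
            size_t first = addr / LINE_BYTES;
            size_t end = (addr + f[i].size - 1) / LINE_BYTES;

            if (first == last)
                first++;  /* line shared with the previous hot function */
            if (end >= first) {
                count += end - first + 1;
                last = end;
            }
        }
        addr += f[i].size;
    }
    return count;
}
```

Under these assumptions, four hot and four cold 32-byte functions that alternate in memory spread the hot code over four cache lines, while the same functions with the hot ones grouped first need only two.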
A significant proportion of instruction fetches miss in the instruction cache. Use profile-guided optimization to reduce the size of hot code regions. Consider compiler options that reorder functions so that hot functions are located together. If your application makes significant use of macros, try to reduce the resulting code duplication by either converting the relevant macros to functions or using linker options to eliminate repeated code. Consider the Os/O1 optimization level or the following subset of optimizations to decrease your code footprint: a) use inlining only when it decreases the footprint; b) disable loop unrolling; c) disable intrinsic inlining.
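As an illustration of two of these suggestions, the following sketch converts a macro to a function and keeps a rarely executed path out of line. All names here (`SUM_SQUARES`, `sum_squares`, `report_null_input`, `COLD_FN`) are hypothetical. The `cold` and `noinline` attributes are GCC/Clang extensions, assumed here only as one way to hint code placement; other compilers may need a different mechanism, and the macro expands to nothing for them.

```c
#include <stddef.h>
#include <stdio.h>

/* GCC/Clang extension; expands to nothing on other compilers. */
#if defined(__GNUC__)
#define COLD_FN __attribute__((cold, noinline))
#else
#define COLD_FN
#endif

/* Before: every use expands the whole loop at the call site,
 * so N uses produce N copies of the code. */
#define SUM_SQUARES(arr, n, out)                       \
    do {                                               \
        (out) = 0;                                     \
        for (size_t i_ = 0; i_ < (n); i_++)            \
            (out) += (long)(arr)[i_] * (arr)[i_];      \
    } while (0)

/* After: a single copy of the loop, shared by all callers. */
long sum_squares(const int *arr, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += (long)arr[i] * arr[i];
    return total;
}

/* Rarely executed error path, marked cold so the compiler can place it
 * away from frequently executed code. */
COLD_FN long report_null_input(void)
{
    fputs("sum_squares_checked: null input\n", stderr);
    return -1;
}

/* Hot path stays small; the error handling lives in the cold function. */
long sum_squares_checked(const int *arr, size_t n)
{
    if (arr == NULL)
        return report_null_input();
    return sum_squares(arr, n);
}
```

The compiler may still inline `sum_squares` where it judges this worthwhile; at a size-oriented optimization level it is more likely to keep the single shared copy. Whether either change reduces instruction cache misses in a given application should be confirmed by re-profiling.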