Intel® VTune™ Amplifier XE and Intel® VTune™ Amplifier for Systems Help

Identifying OS Thread Migration Using Intel® VTune™ Amplifier

Today's complex operating systems use a scheduler to assign application threads, also known as software threads, to processor cores. The scheduler may choose the placement of the application threads on the physical cores depending on a number of different factors such as system state, system policies, and so on. A software thread may execute on a core for some period of time before being swapped out to wait. A software thread may have to wait for a number of reasons, such as being blocked for I/O. If available, another software thread may be given a chance to execute on this core. When the original software thread is once again available to execute, the scheduler may migrate the thread over to another core to ensure timely execution. This poses a problem to the newer computing architectures as this software thread migration disassociates the thread from data that has already been fetched into the caches resulting in longer data access latencies. This problem is further amplified in Non-Uniform Memory Access (NUMA) architectures, where each processor has its own local memory module that it can access directly with a distinct performance advantage. In a NUMA architecture, when a software thread is migrated to another core, the data stored in the earlier core's local memory becomes remote and memory access times increase significantly. Hence, thread migration can hurt performance making it important to identify if it is occurring in your application.

You may use the VTune Amplifier (GUI or amplxe-cl) to identify software thread migration in applications running on the Intel architecture. To identify OS thread migration, run the Basic Hotspots analysis or Advanced Hotspots analysis on your application.

To identify thread migration using the GUI, select the Core/Thread/Function/Call Stack grouping.

Expand core nodes to see the number of software threads. In general, you need the total number of threads to be less than or equal to the total number of hardware threads supported by the CPU. In addition to this, you need the threads to be equally distributed across the cores. If you see more than the expected number of software threads under any core in your result, there is a thread migration occurring in your application. In the above example, there are 12 OpenMP* worker threads instead of 2 threads (since this is an Intel® Xeon® processor supporting Intel® Hyper-Threading Technology), executing on core_8. This indicates thread migration.

Select the Thread/H/W Context grouping to analyze thread migration the Timeline pane.

Expand the thread nodes to see the number of CPUs where this thread was executed and analyze thread execution over time. In the example above, OpenMP thread #0 was executing on cpu_23 and then migrated to cpu_47.

Alternately, we can view these results directly from the command line as follows:

> amplxe-cl -group-by thread,cpuid -report hotspots -r  /temp/test/omp -s "H/W Context" -q | less
		

Thread                  H/W Context  CPU Time:Self
------------------------------  -----------  -------------
OMP Worker Thread #5 (0x3d86)    cpu_0                0.004
matmul-intel64 (0x3d52)          cpu_1                0.013
OMP Worker Thread #15 (0x3d90)   cpu_10               2.418
matmul-intel64 (0x3d52)          cpu_10               2.023
OMP Worker Thread #8 (0x3d89)    cpu_10               0.687
OMP Worker Thread #13 (0x3d8e)   cpu_10               0.097
OMP Worker Thread #6 (0x3d87)    cpu_10               0.065
OMP Worker Thread #4 (0x3d85)    cpu_10               0.059
OMP Worker Thread #1 (0x3d82)    cpu_10               0.048
OMP Worker Thread #9 (0x3d8a)    cpu_10               0.034
OMP Worker Thread #11 (0x3d8c)   cpu_10               0.009

Similarly, you can notice the large number of OpenMP worker threads running on cpu_10.

Correcting Thread Migration

You can correct the effects of thread migration by setting the thread affinity. Thread affinity refers to restricting the execution of certain threads to a subset of the physical processing units in a multiprocessor computer. Intel® runtime library has the ability to bind OpenMP threads to physical processing units. You can also use the KMP_AFFINITY and KMP_PLACE_THREADS environment variables provided by the Intel OpenMP runtime to set the thread affinity for your application.

See Also