Improving Performance on Intel Xeon Phi Coprocessors

To improve performance of Intel MKL on Intel Xeon Phi coprocessors, use the following tips, which are specific to Intel MIC Architecture. General performance improvement recommendations provided in Coding Techniques also apply.

For more information, see the Knowledge Base article at http://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor.

Memory Allocation

Many Intel MKL routines perform better when input and output data reside in memory allocated with 2 MB pages. Compared with the default page size of 4 KB, 2 MB pages let you address more memory with fewer pages, which reduces the overhead of translating between virtual and physical memory addresses. For more information, refer to Intel® 64 and IA-32 Architectures Optimization Reference Manual and Intel® 64 and IA-32 Architectures Software Developer's Manual (connect to http://www.intel.com/ and enter the name of each document in the Find Content text box).

To allocate memory with 2MB pages, you can use the mmap system call with the MAP_HUGETLB flag. You can alternatively use the libhugetlbfs library. See the white paper at http://software.intel.com/sites/default/files/Large_pages_mic_0.pdf for more information.
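As an illustration of the mmap approach, the following C sketch requests an anonymous mapping backed by 2 MB pages and falls back to default pages if the request fails (for example, when no huge pages are reserved on the system). The helper names alloc_2mb_pages and free_2mb_pages are hypothetical and are not part of Intel MKL. The sketch assumes Linux with glibc.

```c
#define _GNU_SOURCE              /* exposes MAP_ANONYMOUS and MAP_HUGETLB in glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024)   /* 2 MB */

/* Round a byte count up to a whole number of 2 MB pages. */
static size_t round_up_2mb(size_t bytes)
{
    return (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
}

/*
 * Hypothetical helper (not an Intel MKL function).
 * Maps at least `bytes` bytes, preferring 2 MB pages.
 * On success returns the mapping, stores its length in *mapped_len,
 * and sets *used_huge to 1 if MAP_HUGETLB succeeded, 0 otherwise.
 * Returns NULL if both attempts fail (including when bytes == 0).
 */
void *alloc_2mb_pages(size_t bytes, size_t *mapped_len, int *used_huge)
{
    assert(mapped_len != NULL && used_huge != NULL);

    size_t len = round_up_2mb(bytes);
    int huge = 1;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* Huge pages unavailable: fall back to the default page size. */
        huge = 0;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
    }

    *mapped_len = len;
    *used_huge = huge;
    return p;
}

/* Release a mapping obtained from alloc_2mb_pages. */
void free_2mb_pages(void *p, size_t mapped_len)
{
    if (p != NULL)
        munmap(p, mapped_len);
}
```

The length is rounded up to a multiple of 2 MB because a MAP_HUGETLB mapping is expected to cover whole huge pages. Pass the returned length, not the requested size, when releasing the memory.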

To enable allocation of memory with 2MB pages for data of size exceeding 2MB and transferred with offload pragmas, set the MIC_USE_2MB_BUFFERS environment variable to an appropriate value. This setting ensures that all pointer-based variables whose run-time length exceeds this value will be allocated in 2MB pages. For example, with MIC_USE_2MB_BUFFERS=64K, variables with run-time length exceeding 64 KB will be allocated in 2MB pages. For more details, see Intel® Compiler User and Reference Guides, available in the Intel Software Documentation Library.

Specifying the maximum amount of memory on a coprocessor that can be used for Automatic Offload computations typically enhances the performance by enabling Intel MKL to reserve and keep the memory on the coprocessor during Automatic Offload computations. You can specify the maximum memory by setting the MKL_MIC_MAX_MEMORY environment variable to a value such as 2 GB.

Data Alignment and Leading Dimensions

To improve performance of Intel MKL FFT functions, follow these recommendations:

Align the first element of the input data on a 64-byte boundary
For two- or higher-dimensional single-precision transforms, use leading dimensions (strides) divisible by 8 but not divisible by 16
For two- or higher-dimensional double-precision transforms, use leading dimensions divisible by 4 but not divisible by 8

For other Intel MKL function domains, use general recommendations for data alignment.
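The FFT recommendations above can be sketched in C as follows. The helper fft_leading_dim pads a row length to the nearest leading dimension that is divisible by the given unit but not by twice that unit (unit 8 for single precision, unit 4 for double precision), and alloc_aligned_64 returns a 64-byte-aligned buffer. All three function names are hypothetical. The sketch uses the POSIX posix_memalign call only to stay self-contained; in an Intel MKL application, mkl_malloc with an alignment of 64 serves the same purpose.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: smallest leading dimension >= n that is
 * divisible by `unit` but not by 2*unit.
 * Use unit = 8 for single precision and unit = 4 for double precision.
 */
size_t fft_leading_dim(size_t n, size_t unit)
{
    assert(unit > 0);
    size_t ld = (n + unit - 1) / unit * unit;   /* round up to a multiple of unit */
    if (ld % (2 * unit) == 0)
        ld += unit;                             /* step off the multiple of 2*unit */
    return ld;
}

/* Hypothetical helper: allocate `bytes` bytes on a 64-byte boundary. */
void *alloc_aligned_64(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, bytes) != 0)
        return NULL;
    return p;                                   /* release with free() */
}

/* Returns nonzero if p lies on a 64-byte boundary. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

For example, a single-precision transform with rows of 1024 elements would use a leading dimension of 1032 under this scheme, and a double-precision transform with rows of 512 elements would use 516.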

Number of Threads

For FFT, use a number of threads depending on the total size of the input and output data for the transform:

A power of two, if the total size is less than Number-of-Phi-Cores*0.5 MB
4*Number-of-Phi-Cores, if the total size is greater than Number-of-Phi-Cores*0.5 MB

Here Number-of-Phi-Cores is the number of cores on the Intel Xeon Phi coprocessor.
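The rule above can be expressed as a small C helper. This is a sketch under stated assumptions, not Intel MKL code: the function name fft_thread_count is hypothetical, the parameter cores stands for the value the text calls Number-of-Phi-Cores, and 1 MB is taken as 2^20 bytes. The text does not say which power of two to use for small transforms, so the sketch picks the largest power of two that does not exceed 4*cores. It also treats a total size exactly equal to the threshold as the large case, which the text leaves unspecified.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: suggest a thread count for an FFT whose input
 * and output data together occupy `total_bytes` bytes.
 * `cores` is the value the text calls Number-of-Phi-Cores.
 */
int fft_thread_count(size_t total_bytes, int cores)
{
    assert(cores > 0);

    const size_t half_mb = (size_t)512 * 1024;          /* 0.5 MB, with 1 MB = 2^20 bytes */
    const size_t threshold = (size_t)cores * half_mb;
    const int max_threads = 4 * cores;

    /* Large transform (assumption: equality counts as large). */
    if (total_bytes >= threshold)
        return max_threads;

    /* Small transform: assumption, largest power of two <= 4*cores. */
    int threads = 1;
    while (threads * 2 <= max_threads)
        threads *= 2;
    return threads;
}
```

The result can then be applied with the MKL_NUM_THREADS environment variable or the mkl_set_num_threads function. With cores equal to 60, the helper suggests 128 threads for 1 MB of data and 240 threads for 64 MB of data.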

For more information, see Improving Performance with Threading and SettingDetermining the Number of OpenMP* Threads.

OpenMP Thread Affinity

To improve performance of Intel MKL routines, set KMP_AFFINITY=balanced for all function domains.

Intel® Threading Building Blocks Facilities

To improve performance of Intel MKL routines, use the tbb::affinity_partitioner class.

To adjust the number of threads (for example, see Number of Threads for FFT), use the tbb::task_scheduler_init class.

For more information, see the Intel® TBB documentation at https://www.threadingbuildingblocks.org/documentation.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
