Intel® Math Kernel Library 11.3 Update 4 Developer Guide

Choosing Best Configuration and Problem Sizes

The performance of the Intel Optimized HPCG depends on many system parameters including, but not limited to, the hardware configuration of the host, the number and configuration of coprocessors, and the MPI implementation used. To get the best performance for a specific system configuration, choose the combination of the following parameters as explained below:

On Intel Xeon processor-based clusters, use the Intel AVX or Intel AVX2 optimized version of the benchmark, depending on the instruction set the processors support, and run one MPI process per CPU socket with one OpenMP* thread per physical CPU core, skipping simultaneous multithreading (SMT) threads.
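As a minimal sketch, the rule above translates into a launch line like the following. The node count, socket count, core count, and the executable name are assumptions for illustration only; substitute the values and the AVX2-optimized binary from your own installation:

```shell
# Hypothetical cluster: 4 nodes, 2 sockets per node, 18 physical cores
# per socket (all values are assumptions for illustration).
NODES=4
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=18

# One MPI process per CPU socket ...
RANKS=$(( NODES * SOCKETS_PER_NODE ))
# ... and one OpenMP thread per physical core, skipping SMT threads.
export OMP_NUM_THREADS=$CORES_PER_SOCKET

# The binary name is an assumption; use the AVX2-optimized executable
# shipped with your Intel Optimized HPCG package.
echo "mpirun -np $RANKS ./xhpcg_avx2"
```

With the assumed values this prepares a run of 8 MPI ranks, each using 18 OpenMP threads.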

Intel Xeon Phi coprocessor-enabled systems support symmetric and offload execution modes. In the offload mode, the benchmark uses the host for MPI communication and offloads computational work to the Intel Xeon Phi coprocessors. In the symmetric mode, MPI ranks run on both Intel Xeon processors and Intel Xeon Phi coprocessors, which potentially results in better performance. Offload mode uses fewer MPI processes per system and scales better for large runs. Symmetric mode requires more MPI processes per node to achieve good load balancing, which may, however, limit scalability.

On systems with a single Intel Xeon Phi coprocessor, use the symmetric execution mode with one MPI process per socket and two MPI processes per coprocessor. On the Intel Xeon processor host, each process should run one OpenMP thread per processor core, skipping hyper-threads. On the Intel Xeon Phi coprocessor, each process should run four OpenMP threads per core, with one core left free. For example: on an Intel Xeon Phi coprocessor 7120D, which has 61 cores, each of the two MPI processes should run 120 OpenMP threads.
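The coprocessor thread counts above follow from simple arithmetic, sketched here as a sanity check (the 61-core count is that of the Intel Xeon Phi coprocessor 7120D, and the two-ranks-per-coprocessor split follows the text):

```shell
CORES=61                 # Intel Xeon Phi coprocessor 7120D
RANKS_PER_COPROC=2       # two MPI processes per coprocessor
THREADS_PER_CORE=4       # four OpenMP threads per core

USABLE_CORES=$(( CORES - 1 ))                         # one core left free
TOTAL_THREADS=$(( USABLE_CORES * THREADS_PER_CORE ))  # threads on the card
THREADS_PER_RANK=$(( TOTAL_THREADS / RANKS_PER_COPROC ))

echo "$THREADS_PER_RANK OpenMP threads per MPI rank"  # 120, as in the text
```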

On systems with two or more Intel Xeon Phi coprocessors, offload mode works best with two MPI processes per coprocessor. Set the number of OpenMP threads to four per coprocessor core, leaving one core free. For example: on an Intel Xeon Phi coprocessor 7120D, which has 61 cores, each MPI process should run 120 OpenMP threads.

Intel Xeon Phi coprocessors may have 57, 60, or 61 cores, depending on the specific model, with each core supporting four threads. In benchmark runs with Intel Xeon Phi coprocessors, set the number of OpenMP threads to use all cores but one, reserving that core for MPI or offload communication. For example: on a 61-core coprocessor, the benchmark should use 240 threads.
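The thread count implied by this rule for each core count listed above can be tabulated with a short loop:

```shell
# Threads = (cores - 1) * 4, reserving one core for communication.
for CORES in 57 60 61; do
  echo "$CORES cores -> $(( (CORES - 1) * 4 )) threads"
done
```

This yields 224, 236, and 240 threads for the 57-, 60-, and 61-core models, respectively.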

For best performance, use a problem size that is large enough to utilize all available cores well, but small enough that all tasks fit in the available memory.
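In the reference HPCG distribution, the local problem dimensions and run time are read from an hpcg.dat input file; a sketch follows, where the 192x192x192 local grid and 60-second run time are illustrative values only, to be tuned per the guidance above:

```text
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
192 192 192
60
```

The first two lines are free-form comments; the third gives the local grid dimensions nx ny nz, and the fourth the target run time in seconds.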

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804