Intel MKL supports Intel® Xeon Phi™ coprocessors in these modes:
Native
Hybrid Offload
For details of the Native mode, see Using Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors.
The Hybrid Offload mode combines several parallelization methods with offloading of computations to coprocessors. In this mode, the host processor uses fewer cores for MPI than the total number of physical cores, uses OpenMP* or POSIX threads on the remaining cores, and offloads chunks of the problem to the Intel Xeon Phi coprocessor.
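The following is a minimal sketch of the hybrid offload idea, not the benchmark source: each MPI rank uses OpenMP threads on the host and lets MKL Automatic Offload send part of a DGEMM to a coprocessor. It assumes an Intel MPI and Intel compiler environment with the Automatic Offload control functions declared in mkl.h; the matrix size and the 50/50 work division are illustrative only.

/* Hybrid MPI + OpenMP + MKL Automatic Offload sketch (illustrative). */
#include <mpi.h>
#include <omp.h>
#include <mkl.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    mkl_mic_enable();                                   /* turn on Automatic Offload       */
    mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);   /* send ~50% of the work to mic0   */

    int n = 8192;                                       /* illustrative problem size       */
    double *a = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *b = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *c = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);

    #pragma omp parallel for
    for (long i = 0; i < (long)n * n; i++) {            /* host threads fill the matrices  */
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    /* MKL splits this DGEMM between the host cores and the coprocessor. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    mkl_free(a); mkl_free(b); mkl_free(c);
    MPI_Finalize();
    return 0;
}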
Native mode is required to run MPI processes directly on the Intel Xeon Phi coprocessors. If the MPI processes run solely on the Intel® Xeon® processors, the coprocessors are used in an offload mode.
In many cases, the host Intel Xeon processor has more memory than the Intel Xeon Phi coprocessor. Therefore, the MPI processes have access to more memory when run on the host processors than on the coprocessors.
HPL code is homogeneous by nature: it requires that each MPI process runs in an environment with similar CPU and memory constraints. If, for some reason, one node is twice as powerful as another node, in the past you could balance the workload only by running two MPI processes on the faster node.
Intel MKL now supports a heterogeneous Intel Optimized MP LINPACK Benchmark. Heterogeneity means that Intel MKL supports a data distribution that can be balanced to the performance requirements of each node, provided there is enough memory on that node to support any additional work (a conceptual sketch of such a distribution follows the list below). The Intel Optimized MP LINPACK Benchmark supports:
Intra-node heterogeneity,
where a node includes different processing units with different compute capabilities. To use intra-node heterogeneity, where work is shared between the Intel Xeon processors and Intel Xeon Phi coprocessors, use the hybrid offload techniques.
Inter-node heterogeneity,
where the nodes themselves can differ. For information on how to configure Intel MKL to use the inter-node heterogeneity, see Heterogeneous Intel Optimized MP LINPACK Benchmark.
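The following is a conceptual illustration only, not the MKL configuration interface described in Heterogeneous Intel Optimized MP LINPACK Benchmark: it shows the idea of splitting the block rows of an HPL-style problem in proportion to a per-node performance weight, so that a faster node receives more rows. The node count and weights are hypothetical.

/* Proportional distribution of block rows across nodes of unequal speed. */
#include <stdio.h>

int main(void)
{
    const int    total_block_rows = 1000;             /* illustrative problem size      */
    const double weight[]         = { 1.0, 1.0, 2.0 };/* node 2 is twice as fast        */
    const int    nodes            = 3;

    double sum = 0.0;
    for (int i = 0; i < nodes; i++) sum += weight[i];

    int assigned = 0;
    for (int i = 0; i < nodes; i++) {
        /* The last node absorbs rounding so every row is assigned exactly once. */
        int rows = (i == nodes - 1)
                   ? total_block_rows - assigned
                   : (int)(total_block_rows * weight[i] / sum);
        assigned += rows;
        printf("node %d: %d block rows\n", i, rows);
    }
    return 0;
}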
To maximize performance, increase the memory on the host processor or processors (64 GB per coprocessor is ideal) and run a large problem with a large block size. Such runs offload pieces of work to the coprocessors. Although this method increases the PCIe bus traffic, it is worthwhile for solving a problem that is large enough.
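As a rough sizing aid: an HPL-style run stores the matrix in double precision, so it needs about 8*N*N bytes, and the largest problem that fits in a given amount of memory is roughly N = sqrt(memory_bytes / 8), rounded down to a multiple of the block size. The sketch below works through the 64 GB figure above; the block size is illustrative.

/* Estimate the largest HPL problem size N that fits in a given memory budget. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double mem_bytes = 64.0 * 1024 * 1024 * 1024;  /* 64 GB budget                 */
    const int    nb        = 1280;                       /* illustrative HPL block size  */

    long n = (long)sqrt(mem_bytes / 8.0);                /* largest N that fits          */
    n -= n % nb;                                         /* align N to the block size    */

    printf("N = %ld (about %.1f GB for the matrix)\n",
           n, 8.0 * n * n / (1024.0 * 1024 * 1024));
    return 0;
}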
If the amount of memory on the host processor is small, you might get the best performance by running natively instead of offloading.