Intel® Math Kernel Library 11.3 Update 4 Developer Guide
To start working with the benchmark, follow the instructions below:
On a cluster file system, unpack the Intel Optimized HPCG package to a directory accessible by all nodes. Read and accept the license as indicated in the readme.txt file included in the package.
Change directory to hpcg/bin.
Determine which prebuilt version of the benchmark is best for your system, or follow the QUICKSTART instructions to build a version of the benchmark for your MPI implementation. Note that native runs on Intel Xeon Phi coprocessors and symmetric runs require Intel® MPI. With other MPI implementations, the only versions of the benchmark for Intel Xeon Phi coprocessors that can be built are the offload versions.
Ensure the following:
The Intel AVX and Intel AVX2 optimized versions perform best with one process per socket and one OpenMP* thread per core, skipping hyper-threads. To achieve this, set the affinity as KMP_AFFINITY=granularity=fine,compact,1,0. Specifically, for a 128-node cluster with two Intel Xeon Processor E5-2697 v3 CPUs per node, run the executable as follows:
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 -perhost 2 env OMP_NUM_THREADS=14 KMP_AFFINITY=granularity=fine,compact,1,0 bin/xhpcg_avx2 --n=168
The Intel Xeon Phi coprocessor optimized version for the offload mode performs best with one MPI process per coprocessor and four threads for each Intel Xeon Phi coprocessor core with a single core left free. Specifically, for a 128-node cluster with two Intel Xeon Phi coprocessors 7120D per node, run the executable as follows:
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 -perhost 2 env -u OMP_NUM_THREADS -u KMP_AFFINITY MIC_OMP_NUM_THREADS=240 MIC_LD_LIBRARY_PATH=./bin/lib/mic:$MIC_LD_LIBRARY_PATH LD_LIBRARY_PATH=./bin/lib/mic:./bin/lib/intel64:$LD_LIBRARY_PATH ./bin/xhpcg_offload --n=168
In the symmetric mode, choose the number of MPI processes per host and per coprocessor to balance the performance of the processes. Specifically, for a 128-node cluster with one Intel Xeon Phi coprocessor 7120D per node, two MPI ranks per host, and two MPI ranks per coprocessor, run the executable as follows:
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 -perhost 2 env OMP_NUM_THREADS=14 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_avx2 --n=144 : -n 256 -perhost 2 env OMP_NUM_THREADS=120 KMP_AFFINITY=compact ./bin/xhpcg_mic --n=144
For symmetric runs, .machinefile must include the list of Intel Xeon processor hosts followed by the list of Intel Xeon Phi coprocessors.
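The rank and thread counts used in the commands above follow from simple arithmetic, and the machinefile ordering for symmetric runs can be generated from a host list. The following is a minimal sketch, not part of the Intel Optimized HPCG package. The function names are hypothetical, and the coprocessor naming scheme "<host>-mic0" is an assumption that depends on how your cluster is configured. The thread formula assumes the rule stated for the offload mode (four threads per core, one core left free) and a coprocessor with 61 cores, which reproduces the values 240 and 120 used in the examples.

```shell
#!/bin/sh
# Hypothetical helpers for deriving launch parameters (illustration only).

# Total MPI ranks to pass to -n: number of nodes times ranks per node.
total_ranks() {
    echo $(( $1 * $2 ))
}

# OpenMP threads per coprocessor rank: four threads per core with one core
# left free, divided evenly among the ranks placed on the coprocessor.
# $1 = cores on the coprocessor, $2 = MPI ranks per coprocessor.
mic_threads_per_rank() {
    echo $(( ($1 - 1) * 4 / $2 ))
}

# Print a machinefile for symmetric runs: all processor hosts first,
# followed by all coprocessors.
# $1 = file with one host name per line.
# ASSUMPTION: each coprocessor is reachable as "<host>-mic0".
make_symmetric_machinefile() {
    cat "$1"
    sed 's/$/-mic0/' "$1"
}
```

For example, total_ranks 128 2 gives the value 256 passed to -n, and mic_threads_per_rank 61 2 gives 120 threads per rank for two ranks on one coprocessor. Redirect the output of make_symmetric_machinefile to .machinefile after checking that the generated coprocessor names match your cluster.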
When the benchmark completes execution, which usually takes a few minutes, find the YAML file with official results in the current directory. The performance rating of the benchmarked system is in the last section of the file:
HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
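To collect the rating in a script, the value can be pulled out of the results file with standard tools. The following is a minimal sketch that assumes the rating line has exactly the form shown above. The function name hpcg_rating is hypothetical, and the path to the YAML file is supplied by the caller because the file name is not specified here.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): print the GFLOP/s rating from an
# HPCG results file. $1 = path to the YAML file produced by the benchmark.
# ASSUMPTION: the file contains a line of the form
#   HPCG result is VALID with a GFLOP/s rating of: <number>
# If no such line exists (for example, the result is not VALID),
# nothing is printed.
hpcg_rating() {
    sed -n 's/.*HPCG result is VALID with a GFLOP\/s rating of:[[:space:]]*//p' "$1" |
        tail -n 1
}
```

Call it as hpcg_rating followed by the path to the results file. An empty output means no matching line was found, so inspect the file manually in that case.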
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804