Getting Started with Intel Optimized HPCG

To start working with the benchmark, follow the instructions below:

On a cluster file system, unpack the Intel Optimized HPCG package to a directory accessible by all nodes. Read and accept the license as indicated in the readme.txt file included in the package.
Change directory to hpcg/bin.
Determine which prebuilt version of the benchmark is best for your system, or follow the QUICKSTART instructions to build a version of the benchmark for your MPI implementation. Note that native runs on Intel Xeon Phi coprocessors and symmetric runs require Intel® MPI. With other MPI implementations, only offload versions of the benchmark for Intel Xeon Phi coprocessors can be built.
Ensure the following:
- Intel® C/C++ Compiler and MPI run-time libraries are available through the LD_LIBRARY_PATH environment variable.
- To run the benchmark on Intel Xeon Phi coprocessors, Intel® Manycore Platform Software Stack (Intel® MPSS) run-time libraries are available through the MIC_LD_LIBRARY_PATH environment variable.
- To run the offload version of the benchmark, Intel® Parallel Studio XE Composer Edition or its Redistributable Library package is installed (for details, see https://software.intel.com/en-us/articles/redistributables-for-intel-parallel-studio-xe-2015-composer-edition-for-linux). For the supported versions of Intel Parallel Studio XE Composer Edition, see Intel MKL System Requirements.
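The environment checks above can be scripted before launching a run. The following is a minimal sketch, not part of the package: the helper name `check_path_var` is hypothetical, and it only reports whether each variable is non-empty, not whether the directories it lists actually contain the required libraries.

```shell
# Hypothetical pre-flight helper: report whether a library search path
# variable is set to a non-empty value.
check_path_var() {
    # $1 is the name of the environment variable to inspect
    eval "value=\${$1:-}"
    if [ -n "$value" ]; then
        echo "$1 is set"
    else
        echo "$1 is NOT set"
    fi
}

check_path_var LD_LIBRARY_PATH
check_path_var MIC_LD_LIBRARY_PATH   # only needed for runs on coprocessors
```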
Run the chosen version of the benchmark as explained below:
- The Intel AVX and Intel AVX2 optimized versions perform best with one process per socket and one OpenMP* thread per core, skipping hyper-threads. Set the affinity as KMP_AFFINITY=granularity=fine,compact,1,0. Specifically, for a 128-node cluster with two Intel Xeon Processor E5-2697 v3 per node, run the executable as follows:
```
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 \
     -perhost 2 env OMP_NUM_THREADS=14 \
     KMP_AFFINITY=granularity=fine,compact,1,0 bin/xhpcg_avx2 --n=168
```
- The Intel Xeon Phi coprocessor optimized version for the offload mode performs best with one MPI process per coprocessor and four threads for each Intel Xeon Phi coprocessor core with a single core left free. Specifically, for a 128-node cluster with two Intel Xeon Phi coprocessors 7120D per node, run the executable as follows:
```
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 \
     -perhost 2 env -u OMP_NUM_THREADS -u KMP_AFFINITY \
     MIC_OMP_NUM_THREADS=240 \
     MIC_LD_LIBRARY_PATH=./bin/lib/mic:$MIC_LD_LIBRARY_PATH \
     LD_LIBRARY_PATH=./bin/lib/mic:./bin/lib/intel64:$LD_LIBRARY_PATH \
     ./bin/xhpcg_offload --n=168
```
- In the symmetric mode, choose the number of MPI processes per host and per coprocessor to balance the performance of the processes. Specifically, for a 128-node cluster with one Intel Xeon Phi coprocessor 7120D per node, two MPI ranks per host, and two MPI ranks per coprocessor, run the executable as follows:
```
#> I_MPI_ADJUST_ALLREDUCE=5 mpiexec.hydra -machinefile .machinefile -n 256 \
     -perhost 2 env OMP_NUM_THREADS=14 \
     KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_avx2 --n=144 : -n 256 \
     -perhost 2 env OMP_NUM_THREADS=120 \
     KMP_AFFINITY=compact ./bin/xhpcg_mic --n=144
```
  For symmetric runs, .machinefile must include the list of Intel Xeon processor hosts followed by the list of Intel Xeon Phi coprocessors.
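The thread and process counts in the coprocessor examples follow from the hardware layout. The sketch below reproduces that arithmetic under stated assumptions: the variable names are hypothetical, the core counts (14 per Intel Xeon Processor E5-2697 v3 socket, 61 per Intel Xeon Phi coprocessor 7120D) are taken from the processor specifications rather than from this guide, and dividing the coprocessor threads evenly between two ranks in the symmetric case is an inference that happens to match the value used above.

```shell
# Illustrative arithmetic only; all variable names are hypothetical.
nodes=128

# Intel AVX2 run: one OpenMP thread per physical core of each socket.
cores_per_socket=14                        # Intel Xeon Processor E5-2697 v3
cpu_threads=$cores_per_socket

# Offload run: one MPI process per coprocessor, four threads per core,
# one core left free.
mics_per_node=2
cores_per_mic=61                           # Intel Xeon Phi coprocessor 7120D
offload_ranks=$((nodes * mics_per_node))
mic_threads=$(( (cores_per_mic - 1) * 4 ))

# Symmetric run: one coprocessor per node, shared by two MPI ranks.
ranks_per_mic=2
sym_mic_ranks=$((nodes * ranks_per_mic))
sym_mic_threads=$((mic_threads / ranks_per_mic))

echo "AVX2:      OMP_NUM_THREADS=$cpu_threads"
echo "Offload:   -n $offload_ranks, MIC_OMP_NUM_THREADS=$mic_threads"
echo "Symmetric: -n $sym_mic_ranks on coprocessors, OMP_NUM_THREADS=$sym_mic_threads"
```

For a different cluster, change `nodes`, the per-node device counts, and the core counts, then carry the resulting values into the corresponding mpiexec.hydra command.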
When the benchmark completes execution, which usually takes a few minutes, find the YAML file with official results in the current directory. The performance rating of the benchmarked system is in the last section of the file:

HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
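To pull the rating line out of the results file without opening it, a short shell helper such as the following may be enough. This is a sketch under assumptions: the function name `print_hpcg_rating` is hypothetical, and the `*.yaml` pattern is a guess at how the results file is named, so adjust it to the file you actually find in the directory.

```shell
# Print the rating line from each YAML file in a directory (default: current).
# The *.yaml pattern is an assumption about the results file name.
print_hpcg_rating() {
    dir=${1:-.}
    for f in "$dir"/*.yaml; do
        [ -f "$f" ] || continue            # skip when the pattern matches nothing
        if rating=$(grep "GFLOP/s rating of" "$f"); then
            echo "$f: $rating"
        fi
    done
}

print_hpcg_rating .
```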

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
