Optimizing the Result on a Cluster

To benchmark a cluster, follow the sequence of steps below (some of them are optional). Pay special attention to the iterative steps 3 and 4. They make a loop that searches for HPL parameters (specified in HPL.dat) that enable you to maximize the performance of your cluster.

Install HPL and make sure HPL is functional on all the nodes.
(Optional) Run nodeperf.c (included in the distribution) to see the performance of DGEMM on all the nodes.

Compile nodeperf.c with your MPI and Intel MKL. For example:
```
mpiicc -O3 nodeperf.c -openmp -Wl,--start-group \
$MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a \
$MKLPATH/libmkl_core.a -Wl,--end-group -lpthread -o nodeperf
```
Launching nodeperf on all the nodes is especially helpful in a very large cluster. It lets you identify a potential problem spot quickly, without making numerous small runs of the Intel Optimized MP LINPACK Benchmark around the cluster in search of a bad node. nodeperf goes through all the nodes, one at a time, and reports the performance of DGEMM followed by the host identifier. The higher the reported DGEMM performance, the faster the node; a node that reports noticeably lower performance than its peers deserves a closer look.
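As a rough illustration of how the per-node report can be screened, the Python sketch below flags hosts whose DGEMM rate falls well below the cluster median. It assumes only what is stated above, namely that each result line ends with the DGEMM performance followed by the host identifier. The sample lines, the host names, and the 90% tolerance are made up for the example, and any header lines should be filtered out before parsing.

```python
from statistics import median


def parse_nodeperf(lines):
    """Return {host: best DGEMM rate} from lines ending in '<rate> <host>'.

    Assumption: the last two whitespace-separated fields of a result line
    are the DGEMM performance and the host identifier.
    """
    rates = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue
        try:
            rate = float(tokens[-2])
        except ValueError:
            continue  # not a result line (for example, a text header)
        host = tokens[-1]
        rates[host] = max(rate, rates.get(host, 0.0))
    return rates


def slow_nodes(rates, tolerance=0.90):
    """Return hosts whose rate is below tolerance * median rate."""
    if not rates:
        return []
    cutoff = tolerance * median(rates.values())
    return sorted(host for host, rate in rates.items() if rate < cutoff)


# Made-up sample in the assumed '<rate> <host>' layout.
sample = [
    "(0 of 4): 3269.5 node01",
    "(1 of 4): 3301.2 node02",
    "(2 of 4): 2105.8 node03",
    "(3 of 4): 3288.0 node04",
]

if __name__ == "__main__":
    print(slow_nodes(parse_nodeperf(sample)))
```

Comparing against the median rather than the mean keeps a single very slow node from dragging the cutoff down with it.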
Edit HPL.dat to fit your cluster needs.
See the HPL documentation for more information. Note, however, that you should use at least 4 nodes.
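For a starting point when editing HPL.dat, the Python sketch below applies two rules of thumb that are commonly used for HPL tuning but are not stated in this section: size the problem so that the N x N double-precision matrix fills roughly 80% of total cluster memory, and choose a process grid P x Q that is as close to square as possible with P no larger than Q. The function names, the 80% fraction, and the block size NB = 192 are assumptions for illustration only; treat the results as initial guesses to refine in the search described in the following steps.

```python
import math


def problem_size(total_mem_bytes, nb, mem_fraction=0.80):
    """Largest N (a multiple of nb) whose N x N matrix of 8-byte doubles
    fits in mem_fraction of the total memory."""
    max_elements = int(mem_fraction * total_mem_bytes) // 8
    n = math.isqrt(max_elements)
    return (n // nb) * nb


def process_grid(num_procs):
    """Most nearly square factorization P x Q of num_procs with P <= Q."""
    p = math.isqrt(num_procs)
    while num_procs % p:
        p -= 1
    return p, num_procs // p


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes with 64 GiB each, one MPI process per node.
    total_mem = 4 * 64 * 2**30
    print("N =", problem_size(total_mem, nb=192))
    print("P x Q =", process_grid(4))
```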
Make an HPL run, using compile options such as ASYOUGO, ASYOUGO2, or ENDEARLY to aid in your search. These options give you insight into the performance sooner than HPL normally would.

When doing so, follow these recommendations:
- Use the Intel Optimized MP LINPACK Benchmark, which is a patched version of HPL, to save time in the search.
  
  All the features impacting performance are optional in the Intel Optimized MP LINPACK Benchmark. That is, if you do not use the new options to reduce search time, these features are disabled. The primary purpose of the additions is to assist you in finding solutions.
  
  Searching over many different parameters takes a long time with HPL. In the Intel Optimized MP LINPACK Benchmark, the goal is to get the best possible number.
  
  Given that the input is not fixed, there is a large parameter space you must search over. An exhaustive search of all possible inputs is prohibitively large even for a powerful cluster. The Intel Optimized MP LINPACK Benchmark optionally prints information on performance as it proceeds. You can also terminate a run early.
- Save time by compiling with -DENDEARLY -DASYOUGO2 and using a negative threshold (do not use a negative threshold on the final run that you intend to submit as a TOP500 entry). Set the threshold in line 13 of the HPL 2.1 input file HPL.dat.
- If you are going to run a problem to completion, do it with -DASYOUGO.
Using the quick performance feedback, return to step 3 and iterate until you are sure that the performance is as good as possible.
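The negative-threshold recommendation above can be scripted so that search runs and the final run differ only in one line of the input file. The Python sketch below rewrites the threshold on line 13 of an HPL.dat held in memory. The helper name `set_threshold` and the sample file contents are illustrative assumptions; the layout follows the stock HPL input file, where each line holds a value followed by a description. Check that line 13 of your own HPL.dat is indeed the threshold line before relying on it.

```python
def set_threshold(hpl_dat_text, threshold, line_no=13):
    """Return HPL.dat text with the value on line_no replaced by threshold.

    The description after the value is kept, and all other lines are
    passed through unchanged.
    """
    lines = hpl_dat_text.splitlines()
    fields = lines[line_no - 1].split(None, 1)
    description = fields[1] if len(fields) > 1 else "threshold"
    lines[line_no - 1] = f"{threshold:<12} {description}"
    return "\n".join(lines) + "\n"


# First 13 lines of an illustrative HPL.dat (values are placeholders).
sample = """HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
"""

if __name__ == "__main__":
    search_input = set_threshold(sample, -1.0)  # for search runs only
    final_input = set_threshold(sample, 16.0)   # restore for the final run
    print(search_input.splitlines()[12])
```

Keeping the positive value in the master copy and generating the negative-threshold variant on demand makes it harder to submit a run made with the search settings by mistake.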

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

See Also