Intel® Math Kernel Library 11.3 Update 4 Developer Guide

Options to Reduce Search Time

Running large problems to completion on large numbers of nodes can take many hours. The search space for the Intel Optimized LINPACK Benchmark is also large: you can vary several parameters to improve performance, such as problem size, block size, grid layout, lookahead steps, factorization methods, and so on. You might not want to run a large problem to completion only to discover that it ran 0.01% slower than your previous best problem.

Use the following options to reduce the search time:

If you want to use the original HPL, simply omit these options and recompile from scratch. To do this, run "make arch=<arch> clean_arch_all" and then rebuild.

-DASYOUGO

-DASYOUGO reports performance data as the run proceeds. Performance always starts off high and then drops, because the LU decomposition slows down as it progresses. The ASYOUGO performance estimate is therefore usually an overestimate, but it becomes more accurate as the problem proceeds. The greater the lookahead step, the less accurate the first number may be. ASYOUGO tries to estimate where execution is in the LU decomposition that the Intel Optimized LINPACK Benchmark performs, and this estimate is always high compared to ASYOUGO2, which measures actually achieved DGEMM performance. Note that the ASYOUGO output is a subset of the information that ASYOUGO2 provides; refer to the description of the -DASYOUGO2 option below for details of the output.

-DENDEARLY

-DENDEARLY terminates the problem after a few steps, so that you can set up 10 or 20 HPL runs without monitoring them, see how they all do, and then run only the fastest ones to completion. -DENDEARLY implies -DASYOUGO; you can define both, but it is not necessary. To avoid the residual check for a problem that terminates early, set the threshold parameter in HPL.dat to a negative number when testing ENDEARLY. Compiling with -DASYOUGO2 as well sometimes provides additional information when using -DENDEARLY.
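The threshold parameter is the last value in the first parameter block of HPL.dat, after the process-grid settings. An excerpt of that block might look like the following (the values shown are illustrative, not a recommendation):

```text
1            # of problems sizes (N)
16000        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
4            Qs
-16.0        threshold
```

With a negative threshold, HPL skips the residual check, which would otherwise be meaningless for a run that terminates early.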

Usage notes on -DENDEARLY follow:

-DASYOUGO2

-DASYOUGO2 gives detailed single-node DGEMM performance information. It captures all DGEMM calls (if you use Fortran BLAS) and records their data. Because of this, the routine has a marginal performance overhead. Unlike -DASYOUGO, which does not impact performance, -DASYOUGO2 interrupts every DGEMM call to monitor its performance. You should be aware of this overhead, although for big problems, it is less than 0.2%.

A sample ASYOUGO2 output appears as follows:

Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78).

Note

The values of Col, Fract, and Mflops are also produced for ASYOUGO and ENDEARLY.

In this example, the problem size is 16000 and the block size is 128. After processing 10 blocks, or 1280 columns (Col), output is sent to the screen. Here, the fraction of columns completed (Fract) is 1280/16000=0.08. Only up to 111 outputs are printed, at various places through the matrix decomposition:
0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 0.310 0.315 0.320 0.325 0.330 0.335 0.340 0.345 0.350 0.355 0.360 0.365 0.370 0.375 0.380 0.385 0.390 0.395 0.400 0.405 0.410 0.415 0.420 0.425 0.430 0.435 0.440 0.445 0.450 0.455 0.460 0.465 0.470 0.475 0.480 0.485 0.490 0.495 0.515 0.535 0.555 0.575 0.595 0.615 0.635 0.655 0.675 0.695 0.795 0.895.

However, this problem size is so small, and the block size so large by comparison, that as soon as it prints the value for 0.045, the run is already through 0.08 of the columns. On a really big problem, the fractional number is more accurate.

-DASYOUGO2 never prints more than the 111 numbers above. So, smaller problems have fewer than 111 updates, and the biggest problems have precisely 111 updates.
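If you script around this behavior, the fixed schedule of output fractions can be reproduced in a few lines. This sketch is derived only from the list printed above (0.005 steps up to 0.495, 0.02 steps up to 0.695, then 0.795 and 0.895); it is not taken from the benchmark source:

```python
# Reconstruct the ASYOUGO2/ASYOUGO/ENDEARLY print schedule from the list above.
fractions = [round(0.005 * i, 3) for i in range(1, 100)]      # 0.005 .. 0.495
fractions += [round(0.515 + 0.02 * i, 3) for i in range(10)]  # 0.515 .. 0.695
fractions += [0.795, 0.895]

print(len(fractions))  # 111 scheduled outputs in total

# For the example above (N=16000, NB=128), the true completed fraction
# after 10 blocks is:
n, nb, blocks = 16000, 128, 10
print(blocks * nb / n)  # 0.08
```

This also makes it easy to check how far a small problem can outrun its own progress reports, as in the 0.045-versus-0.08 example above.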

Mflops is an estimate based on the number of columns of the LU decomposition that have been completed. However, with lookahead steps, that work is sometimes not actually complete when the output is printed. Nevertheless, this is a good estimate for comparing identical runs.
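One plausible way to form such a column-based estimate, assuming the standard 2n³/3 flop count for LU factorization, is sketched below. This illustrates the idea only; it is not the benchmark's actual accounting, and the function name and the 60-second timing are hypothetical:

```python
# Sketch of a column-based Mflops estimate (an assumption about the
# accounting, not the benchmark's code). LU of an n x n matrix costs
# about 2*n**3/3 flops; after k of the n columns are factored, roughly
# 2*(n-k)**3/3 flops remain, so the completed work is the difference.
def estimated_mflops(n, k, elapsed_seconds):
    flops_done = 2.0 * n**3 / 3.0 - 2.0 * (n - k) ** 3 / 3.0
    return flops_done / elapsed_seconds / 1e6  # flops/s -> Mflops

# Hypothetical example: 1280 of 16000 columns done in 60 seconds.
print(f"{estimated_mflops(16000, 1280, 60.0):.1f} Mflops")
```

Because the early columns carry most of the remaining flops, an estimate of this shape naturally starts high and falls as the factorization proceeds, matching the drop-off described for ASYOUGO above.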

The three parenthesized numbers are additional measurements that ASYOUGO2 reports. DT is the total time processor 0 has spent in DGEMM. DF is the number of billions of operations that processor 0 has performed in DGEMM. Therefore, the performance of processor 0 in DGEMM (in GFLOPS) is always DF/DT. DMF uses the number of DGEMM FLOPs as its basis instead of the number of LU FLOPs, so it gives a lower bound on the performance of the run and can be compared with the Mflops value above.
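Using the sample output line above, the per-processor DGEMM rate works out as follows (a quick arithmetic check, not benchmark code):

```python
# Values from the sample ASYOUGO2 line:
# Col=001280 Fract=0.050 Mflops=42454.99 (DT=9.5 DF=34.1 DMF=38322.78)
dt = 9.5   # seconds processor 0 has spent in DGEMM
df = 34.1  # billions of operations processor 0 has performed in DGEMM

gflops_proc0 = df / dt  # DGEMM performance of processor 0, in GFLOPS
print(f"{gflops_proc0:.2f} GFLOPS")  # about 3.59 GFLOPS
```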

When using the performance monitoring tools described in this section to compare different HPL.dat input data sets, be aware that the pattern of performance drop-off that LU experiences is sensitive to the input data sizes. For very small problems, performance drops rapidly from the initial values to the end values; the larger the problem, the smaller the drop-off, so it is probably safe to use the first few performance values to estimate the difference between, for instance, problem sizes 700000 and 701000. Another factor that influences the drop-off is the relationship of the grid dimensions P and Q: for big problems, performance tends to fall off less from the first few steps when P and Q are roughly equal. You can adjust many parameters, such as broadcast types, so that the final performance is determined very closely by the first few steps.

Use of these tools can increase the amount of data you can test.

See Also