Intel® Math Kernel Library 11.3 Update 4 Developer Guide

More Details of the Intel Optimized MP LINPACK Benchmark

The Intel Optimized MP LINPACK Benchmark performs Gaussian elimination with row pivoting to compute an LU decomposition of the matrix. If P is a permutation matrix representing the row pivots, then PA = LU, where L is a unit lower triangular matrix and U is an upper triangular matrix. The algorithm is blocked to increase cache reuse of the data. The standard algorithms closest to the Intel Optimized MP LINPACK Benchmark are DGETRF from LAPACK (sequential) and PDGETRF from ScaLAPACK (parallel), referred to here by the generic name *GETRF. However, *GETRF retains L, which is not necessary in the Intel Optimized MP LINPACK Benchmark. A system of equations Ax = b can be solved with *GETRF by performing these steps:

  1. Compute PA = LU

  2. Solve Ly=Pb for y

  3. Solve Ux = y for x
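The three steps above can be sketched with SciPy's LAPACK wrappers, which call the same DGETRF routine the text names (the matrix sizes here are illustrative, not benchmark parameters):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

# Step 1: compute PA = LU (L and U are packed into lu; piv holds the row pivots).
lu, piv = lu_factor(A)

# Steps 2 and 3: forward solve Ly = Pb, then back solve Ux = y.
x = lu_solve((lu, piv), b)

assert np.allclose(A @ x, b)
```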

If b is replaced with y as the factorization proceeds, the L factor never needs to be retained. This saves the forward solve at the end (which is not performance critical in any case), and it means that as long as you perform the LU decomposition on the column-augmented system [A|b], a right-looking algorithm can skip applying the pivots to the columns left of the current panel.
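The column-augmented trick can be verified numerically: factoring [A|b] leaves y in the last column of the upper factor, so only the back solve of step 3 remains. This is a small sketch using SciPy (which returns A = P @ L @ U), not the benchmark's own code:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal((n, 1))

# Factor the augmented system: [A|b] = P @ L @ [U|y].
P, L, Uy = lu(np.hstack([A, b]))
U, y = Uy[:, :n], Uy[:, n]

# Step 3 only: back solve Ux = y; the forward solve Ly = Pb never runs explicitly.
x = solve_triangular(U, y)
assert np.allclose(A @ x, b.ravel())
```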

Note

While the Intel Optimized MP LINPACK Benchmark may skip pivoting to the left, the LAPACK and ScaLAPACK *GETRF algorithms must apply the pivots to the left, because step 2 may be performed later.

*GETRF makes several BLAS calls. If N is the problem size and NB is the column block size used in the blocking above, then as long as N is sufficiently greater than NB, most of the floating-point operations (FLOPs) occur in *GEMM, with some also in *TRSM. Although other BLAS calls may be needed, these two are the performance-critical functions. In all, *GETRF performs its computation in three places: *GEMM, *TRSM, and a local LU factorization.

*GEMM overwrites C with
α * op(A) * op(B) + β * C.
α and β are both scalars, and A, B, and C are matrices. op(A) denotes A or the transpose of A and similarly for op(B). op(A) is an m x k matrix, op(B) is a k x n matrix, and C is an m x n matrix.
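The *GEMM operation just defined can be exercised through SciPy's BLAS wrapper for DGEMM; the dimensions below are arbitrary illustrative values:

```python
import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(2)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))   # op(A): m x k
B = rng.standard_normal((k, n))   # op(B): k x n
C = rng.standard_normal((m, n))   # C:     m x n
alpha, beta = 2.0, 0.5

# Computes alpha * op(A) * op(B) + beta * C (here op is the identity).
result = dgemm(alpha, A, B, beta=beta, c=C)
assert np.allclose(result, alpha * (A @ B) + beta * C)
```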

Assuming that N is the global problem size and that the matrix is distributed in a two-dimensional (2D) block-cyclic mapping on a P by Q processor grid with block size NB, the initial values of m, n, and k for the first few *GEMM calls are m ≈ N/P, n ≈ N/Q, and k = NB.

The number of FLOPs executed in *GEMM for each block iteration starts off at approximately 2*N*N*NB/(P*Q). The size of each *GEMM may decrease, depending on whether the node in question owns the previous block row or block column.

*TRSM solves a triangular system of equations of size NB by NB, with typically N/Q solution vectors to compute. It has roughly NB*NB*(N/Q) FLOPs. So, as long as N >> P*NB, there is more work in *GEMM than in *TRSM.
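The *TRSM step can likewise be sketched with SciPy's BLAS wrapper: an NB x NB triangular system solved against many right-hand-side columns (the sizes and the diagonal shift for conditioning are illustrative assumptions, not benchmark values):

```python
import numpy as np
from scipy.linalg.blas import dtrsm

rng = np.random.default_rng(3)
NB, nrhs = 8, 16                      # nrhs plays the role of N/Q
# A well-conditioned lower-triangular NB x NB matrix (illustrative only).
L = np.tril(rng.standard_normal((NB, NB))) + NB * np.eye(NB)
B = rng.standard_normal((NB, nrhs))

# Solves L @ X = 1.0 * B for all nrhs columns at once.
X = dtrsm(1.0, L, B, lower=1)
assert np.allclose(L @ X, B)
```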

Additionally, each iteration performs a local LU factorization of a panel whose initial size is N/P by NB. Although the row pivoting must be applied along the entire column, the floating-point work is the same as factoring an NB x NB matrix, approximately 2*NB*NB*NB/3 FLOPs. If N >> P*NB, this computation involves fewer FLOPs than *TRSM.
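The three per-iteration FLOP estimates above can be compared directly. The parameter values here are hypothetical, chosen only so that N >> P*NB holds:

```python
# Illustrative parameters satisfying N >> P*NB.
N, NB, P, Q = 40000, 336, 4, 8

gemm_flops  = 2 * N * N * NB / (P * Q)   # trailing-matrix update (*GEMM)
trsm_flops  = NB * NB * (N / Q)          # triangular solve (*TRSM)
panel_flops = 2 * NB**3 / 3              # local LU factorization of the panel

# With N >> P*NB, *GEMM dominates, then *TRSM, then the panel factorization.
assert gemm_flops > trsm_flops > panel_flops
```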

HPL requires row pivoting, which, while it does not involve FLOPs, can still be time consuming. It is not acceptable to replace the random matrix generator of HPL with a matrix that requires less pivoting (for example, a diagonally dominant matrix).

Data must also be moved around the 2D P x Q grid, because each node needs it to do its share of the work, including the row pivoting. Each node in the P x Q grid can be represented as a pair (i, j), where 0 ≤ i < P and 0 ≤ j < Q.
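Under the 2D block-cyclic mapping described earlier, ownership of each NB x NB block follows a simple modular rule. The helper below is hypothetical, written only to illustrate that rule:

```python
def block_owner(I, J, P, Q):
    """Grid coordinates (i, j) of the node owning global block (I, J)
    in a 2D block-cyclic distribution over a P x Q processor grid."""
    return (I % P, J % Q)

P, Q = 2, 3
# Successive block columns of one block row cycle across the Q process columns.
owners = [block_owner(0, J, P, Q) for J in range(6)]
assert owners == [(0, 0), (0, 1), (0, 2), (0, 0), (0, 1), (0, 2)]
```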

The algorithm involves a horizontal broadcast and a vertical broadcast across the grid.

The horizontal broadcast (which contains pivot information and the A matrix of the above *GEMM) is usually ring broadcast along Q nodes. For example, node (2,3) has to get data to every other node in the block row given by (2, j). The reason for using a ring broadcast, as opposed to a tree broadcast, is that horizontal motion in the grid can often be overlapped with other operations. The broadcast itself consumes time, but some of this effect can be hidden, depending on the input parameters.

The vertical broadcast (which contains the B matrix of the above *GEMM) is usually a tree broadcast along P nodes because the processor columns must be synchronized. In many HPL configurations, P ≤ 2Q is chosen.
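The trade-off between the two broadcast shapes can be made concrete with a simple step-count model (hypothetical cost model: one message hop per step, ignoring message size and the overlap with computation that favors the ring):

```python
import math

def ring_steps(nodes):
    """Pipelined ring broadcast: the message passes node to node."""
    return nodes - 1

def tree_steps(nodes):
    """Binary tree broadcast: the set of senders doubles each round."""
    return math.ceil(math.log2(nodes)) if nodes > 1 else 0

Q, P = 8, 4
assert ring_steps(Q) == 7   # horizontal ring along Q nodes, hidden by overlap
assert tree_steps(P) == 2   # vertical tree along P nodes, fewer serial rounds
```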

For details of broadcasting in HPL, see www.netlib.org/benchmark/hpl/algorithm.html#bcast.

The other communication is the pivoting itself.

The offload portion of the code is typically part of *GEMM or *TRSM, because these have the highest ratio of FLOPs to data moved, and all data must cross the PCIe bus.

The fastest way to offload DGEMM is to send some portion of A and B to the coprocessor and have the coprocessor return C (assuming that β is zero) via a native DGEMM call on the Intel Xeon Phi coprocessor.

For offloading, NB must be larger than the value that is optimal for Intel® Xeon® processors alone. For example, choosing 400 for NB offers no acceleration through offloading, whereas values larger than 960 have proven to offer considerable acceleration.

Large values of NB require extra memory on both the host processor and the coprocessor. If memory is limited, the problem size N may not satisfy the inequality N >> NB. In that case, it is better not to use the offload version of the Intel Optimized MP LINPACK Benchmark.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804