The LINPACK Benchmark: Past, Present, and Future*

Jack J. Dongarra†, Piotr Luszczek†, and Antoine Petitet‡

December 2001

* This research was supported in part by the Applied Mathematical Sciences Research Program of the Office of Mathematical, Information, and Computational Sciences, U.S. Department of Energy under contract DE-AC05-00OR22725 with UT-Battelle, LLC, the SciDAC PERC DE-FC02-01ER25490, and NSF Alliance Subcontract 790, Prime-ACI-961-019-07.
† University of Tennessee, Department of Computer Science, Knoxville, TN 37996-3450, U.S.A., Phone: (+865) 974-8295, Fax: (+865) 974-8296, E-mail: {dongarra,luszczek}@cs.utk.edu
‡ Sun Microsystems, Inc., Paris, France

1 LINPACK Benchmark History

1.1 Introduction

In this paper we will clear up some of the confusion and mystery surrounding the LINPACK Benchmark [57] and some of its variations. We will cover the "original" LINPACK 100 benchmark, the TOP500 [59], and the HPL [64] program that can be used in the TOP500 measurement. We will examine what is measured and describe how to interpret the results of the programs' execution. But first, a bit of history.

The original LINPACK Benchmark is, in some sense, an accident. It was originally designed to assist users of the LINPACK package [25] by providing information on execution times required to solve a system of linear equations. The first "LINPACK Benchmark" report appeared as an appendix in the LINPACK Users' Guide [25] in 1979. The appendix comprised data for one commonly used path in the LINPACK software package. Results were provided for a matrix problem of size 100, on a collection of widely used computers (23 computers in all). This was done so users could estimate the time required to solve their matrix problem by extrapolation.

Over the years additional performance data was added, more as a hobby than anything else, and today the collection includes over 1300 different computer systems. In addition to the number of computers increasing, the scope of the benchmark has also expanded. The benchmark report describes the performance for solving a general dense matrix problem Ax = b at three levels of problem size and optimization opportunity: a 100 by 100 problem (inner loop optimization), a 1000 by 1000 problem (three-loop optimization, i.e. the whole program), and a scalable parallel problem.

1.2 The LINPACK Package

The LINPACK package is a collection of Fortran subroutines for solving various systems of linear equations. The software in LINPACK is based on a decompositional approach to numerical linear algebra. The general idea is the following. Given a problem involving a matrix, A, one factors or decomposes A into a product of simple, well-structured matrices which can be easily manipulated to solve the original problem. The package has the capability of handling many different matrix types and different data types, and provides a range of options.

The package itself is based on another package, called the Level 1 Basic Linear Algebra Subroutines (BLAS) [53]. The Level 1 BLAS address simple vector operations, such as adding a multiple of a vector to another vector (SAXPY) or forming an inner product (SDOT). Most of the floating-point work within the LINPACK algorithms is carried out by the BLAS, which makes it possible to take advantage of special computer hardware without having to modify the underlying algorithm. This approach thus achieves transportability and clarity of software without sacrificing reliability.
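To fix notation for what follows, a minimal C sketch of the two Level 1 BLAS operations just mentioned, in their double precision forms, is given below. It uses the CBLAS bindings; the header and library names assume that a CBLAS implementation such as ATLAS is available, and the data are purely illustrative.

    #include <stdio.h>
    #include <cblas.h>   /* CBLAS bindings; link against a BLAS such as ATLAS */

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {4.0, 3.0, 2.0, 1.0};
        double alpha = 2.0;

        /* y <- alpha*x + y, the DAXPY operation (double precision AXPY). */
        cblas_daxpy(4, alpha, x, 1, y, 1);

        /* dot <- x . y, the DDOT operation. */
        double dot = cblas_ddot(4, x, 1, y, 1);

        printf("y = %g %g %g %g, dot = %g\n", y[0], y[1], y[2], y[3], dot);
        return 0;
    }

The benchmark's own kernels are the double precision forms DAXPY and DDOT, which is why they appear here instead of the single precision SAXPY and SDOT.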
1.3 Selection of the Algorithm

Solving a system of equations requires O(n³) floating-point operations, more specifically, 2/3 n³ + 2n² + O(n) floating-point additions and multiplications. Thus, the time required to solve such problems on a given machine can be approximated by

    time_n = time_100 × n³ / 100³.

In the LINPACK Benchmark, a matrix of size 100 was originally used because of memory limitations with the computers that were in use in 1979; the matrix has O(100²) floating-point elements and could be accommodated in most environments of that time. At the time it represented a "large enough" problem. (We will always assume 64-bit floating point arithmetic.)

The algorithm used in the timings is based on LU decomposition with partial pivoting. The matrix type is real, general, and dense, with matrix elements randomly distributed between -1 and 1. (The random number generator used in the benchmark is not sophisticated; rather, its major attribute is its compactness.)

2 The LINPACK Benchmark

2.1 Operations

The LINPACK benchmark features two routines: DGEFA and DGESL (these are the double precision versions, usually 64-bit floating point arithmetic; SGEFA and SGESL are the single precision counterparts, usually 32-bit floating point arithmetic). DGEFA performs the decomposition with partial pivoting, and DGESL uses that decomposition to solve the given system of linear equations. Most of the execution time - O(n³) floating-point operations - is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this requires O(n²) floating-point operations.

DGEFA and DGESL in turn call three BLAS routines: DAXPY, IDAMAX, and DSCAL. By far the major portion of time - over 90% at order 100 - is spent in DAXPY. DAXPY is used to multiply a scalar, α, times a vector, x, and add the result to another vector, y. It is called approximately n²/2 times by DGEFA and 2n times by DGESL with vectors of varying length. The statement y_i ← y_i + α x_i, which forms an element of the DAXPY operation, is executed approximately n³/3 + n² times, which gives rise to roughly 2/3 n³ floating-point operations in the solution. Thus, the n = 100 benchmark requires roughly 2/3 million floating-point operations.

The statement y_i ← y_i + α x_i, besides the floating-point addition and floating-point multiplication, involves a few one-dimensional index operations and storage references. While the LINPACK routines DGEFA and DGESL involve two-dimensional array references, the BLAS refer to one-dimensional arrays. The LINPACK routines in general have been organized to access two-dimensional arrays by column. In DGEFA, the call to DAXPY passes an address into the two-dimensional array A, which is then treated as a one-dimensional reference within DAXPY. Since the indexing is down a column of the two-dimensional array, the references to the one-dimensional array are sequential with unit stride. This is a performance enhancement over, say, addressing across the columns of a two-dimensional array. Since Fortran dictates that two-dimensional arrays be stored by column in memory, accesses to consecutive elements of a column lead to simple index calculations. References to consecutive elements differ by one word instead of by the leading dimension of the two-dimensional array.

2.2 Detailed Operation Counts

The results reflect only one problem area: solving dense systems of equations using the LINPACK programs in a Fortran environment. Since most of the time is spent in DAXPY, the benchmark is really measuring the performance of DAXPY.
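To illustrate why DAXPY dominates, the following simplified C sketch (not the LINPACK Fortran code itself; pivoting is omitted and all names are our own) shows a column-oriented elimination in which essentially all of the arithmetic is performed by unit-stride DAXPY calls:

    #include <stddef.h>

    /* y <- y + alpha*x on vectors of length n (the DAXPY kernel). */
    static void daxpy(size_t n, double alpha, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* Simplified column-oriented LU elimination (no pivoting shown):
     * A is n-by-n, stored by columns with leading dimension lda, as in Fortran.
     * For each pivot column k, every remaining column j is updated by a DAXPY
     * whose vectors are contiguous pieces of columns, i.e. unit-stride accesses. */
    static void lu_kernel(size_t n, double *A, size_t lda)
    {
        for (size_t k = 0; k + 1 < n; k++) {
            double pivot = A[k + k * lda];
            for (size_t i = k + 1; i < n; i++)
                A[i + k * lda] /= pivot;                /* scale the pivot column */
            for (size_t j = k + 1; j < n; j++)          /* update trailing columns */
                daxpy(n - k - 1, -A[k + j * lda],
                      &A[(k + 1) + k * lda], &A[(k + 1) + j * lda]);
        }
    }

    int main(void)
    {
        /* 3-by-3 example, stored by columns. */
        double A[9] = { 4, 2, 1,   3, 5, 2,   2, 1, 6 };
        lu_kernel(3, A, 3);
        return 0;
    }

Each update of a trailing column is exactly one DAXPY over a contiguous piece of a column, which is why the benchmark's behavior is governed by this single kernel.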
The average vector length for the algorithm used to compute the LU decomposition with partial pivoting is 2/3 n. Thus, in the benchmark with n = 100, the average vector length is 66.

In order to solve this matrix problem it is necessary to perform almost 700,000 floating point operations. The time required to solve the problem is divided into this number to determine the megaflop rate.

The routine DGEFA calls IDAMAX, DSCAL, and DAXPY. Routine IDAMAX, which computes the index of the element of a vector with the largest modulus, is called 99 times, with vector lengths running from 2 to 100. Each call to IDAMAX gives rise to n double precision absolute value computations and n - 1 double precision comparisons. The total number of operations is 5364 double precision absolute values and 4950 double precision comparisons. DSCAL is called 99 times, with vector lengths running from 1 to 99. Each call to DSCAL performs n double precision multiplies, for a total of 4950 multiplies. DAXPY does the bulk of the work. It is called n times with vectors of length n, where n varies from 1 to 99. Each call to DAXPY gives rise to one double precision comparison with zero, n double precision additions, and n double precision multiplications. This leads to 4950 comparisons against 0, 328350 additions and 328350 multiplications. In addition, DGEFA itself does 99 double precision comparisons against 0, and 99 double precision reciprocals. The total operation count for DGEFA is given in Table 1.

    Operation type      Operation count
    add                 328350
    multiply            333300
    reciprocal              99
    absolute value        5364
    compare               4950
    compare with 0        5247

Table 1: Double precision operation counts for LINPACK 100's DGEFA routine.

Routine DGESL, which is used to solve a system of equations based on the factorization from DGEFA, does a much more modest amount of floating point work. In DGESL, DAXPY is called in two places, once with vector lengths running from 1 to 99 and once with vector lengths running from 0 to 99. This leads to a total of 9900 double precision additions, the same number of double precision multiplications, and 199 compares against 0. DGESL does 100 divisions and 100 negations as well. The total operation count for DGESL is given in Table 2.

    Operation type      Operation count
    add                 9900
    multiply            9900
    divide               100
    negate               100
    compare with 0       199

Table 2: Double precision operation counts for LINPACK 100's DGESL routine.

This leads to the total operation count for the LINPACK benchmark given in Table 3, or a grand total of 697509 floating point operations. (The LINPACK benchmark assumes approximately 2/3 n³ + 2n² operations, which for n = 100 has the value 686667.)

    Operation type      Operation count
    add                 338250
    multiply            343200
    reciprocal              99
    divide                 100
    negate                 100
    absolute value        5364
    compare               4950
    compare with 0        5446
    Total               697509

Table 3: Double precision operation counts for the LINPACK 100 benchmark.

It is instructive to look just at the contribution due to DAXPY. Of these floating point operations, the calls to DAXPY from DGEFA and DGESL account for a total of 338160 adds, 338160 multiplies, and 5147 comparisons with zero. This gives a total of 681467 operations, or over 97% of all the floating point operations that are executed.

The total time is taken up with more than arithmetic operations. In particular, there is quite a lot of time spent loading and storing the operands of the floating point operations.
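For reference, the nominal operation count that the benchmark divides into the measured time can be computed directly; a trivial C sketch (the timing value below is a placeholder, not a measurement):

    #include <stdio.h>

    /* Nominal LINPACK operation count and Mflop/s rate for a problem of order n,
     * given a measured solve time in seconds. */
    int main(void)
    {
        double n = 100.0;
        double ops = 2.0 / 3.0 * n * n * n + 2.0 * n * n;   /* = 686667 for n = 100 */
        double seconds = 0.005;                             /* hypothetical measured time */
        printf("nominal operations = %.0f\n", ops);
        printf("rate = %.1f Mflop/s\n", ops / seconds / 1.0e6);
        return 0;
    }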
We can estimate the number of loads and stores by assuming that all operands must be loaded into registers, but also assuming that the compiler will do a reasonable job of promoting loop-invariant quantities out of loops, so that they need not be loaded each time within the loop. Then DAXPY accounts for 681468 double precision loads and 338160 double precision stores. IDAMAX accounts for 4950 loads, DSCAL for 5049 loads and 4950 stores, DGEFA outside of the loads and stores in the BLAS does 9990 loads and 9694 stores, and DGESL accounts for 492 loads and 193 stores. Thus, the total number of loads is 701949 and the total number of stores is 352997. Here again DAXPY dominates the statistics. The other overhead that must be accounted for is the load indexing, address arithmetic, and the overhead of argument passing and calls.

2.3 Precision

In discussions of scientific computing, one normally assumes that floating-point computations will be carried out to full precision, or 64-bit floating point arithmetic. Note that this is not an issue of single or double precision, as some systems have 64-bit floating point arithmetic as single precision. It is a function of the arithmetic used.

2.4 Loop Unrolling

It is frequently observed that the bulk of the central processor time for a program is localized in 3% or less of the source code [62]. Often the critical code (from a timing perspective) consists of one or a few short inner loops typified, for instance, by the scalar product of two vectors. On scalar computers a simple technique for optimizing such loops should then be most welcome. "Loop unrolling" (a generalization of "loop doubling") applied selectively to time-consuming loops is just such a technique [31, 52].

When a loop is unrolled, its contents are replicated one or more times, with appropriate adjustments to array indices and loop increments. Consider, for instance, the DAXPY sequence, which adds a multiple of one vector to a second vector:

          DO 10 i = 1,n
             y(i) = y(i) + alpha*x(i)
       10 CONTINUE

Unrolled to a depth of four, it would assume the following form:

          m = n - MOD(n, 4)
          DO 10 i = 1, m, 4
             y(i)   = y(i)   + alpha*x(i)
             y(i+1) = y(i+1) + alpha*x(i+1)
             y(i+2) = y(i+2) + alpha*x(i+2)
             y(i+3) = y(i+3) + alpha*x(i+3)
       10 CONTINUE

In this recoding, four terms are computed per loop iteration, with the loop increment modified to count by fours. Additional code has to be added to process the MOD(n, 4) elements remaining upon completion of the unrolled loop, should the vector length not be a multiple of the loop increment. The choice of four was for illustration, with the generalization to other orders obvious from the example. The actual choice of unrolling depth in a given instance would be guided by the contribution of the loop to total program execution time and by consideration of architectural constraints.

Why does unrolling enhance loop performance? First, there is the direct reduction in loop overhead - the increment, test, and branch function - which, for short loops, may actually dominate execution time per iteration. Unrolling simply divides the overhead by a factor equal to the unrolling depth, although the additional code required to handle "leftovers" will reduce this gain somewhat. Clearly, savings should increase with increasing unrolling depth, but the marginal savings fall off rapidly after a few terms. The reduction in overhead is the primary source of improvement on "simple" computers.
Second, for advanced architectures employing segmented functional units, the greater density of non-overhead operations permits higher levels of concurrency within a particular segmented unit. Thus, in the DAXPY example, unrolling would allow more than one multiplication to be concurrently active on a segmented machine such as the IBM Power processor. The optimal unrolling depth on such machines might well be related to the degree of functional unit segmentation.

Third, and related to the above, unrolling often increases concurrency between independent functional units on computers so equipped, or on ones with fused multiply-add instructions. Thus, in our DAXPY example, an IBM Power processor, which has independent multiplier and adder units, could obtain concurrency between the addition for one element and the multiplication for the following element, besides the segmentation concurrency obtainable within each unit.

However, on machines with vector instructions, whose compilers try to detect vector operations, the unrolling technique had the opposite effect. The unrolling inhibited the detection of vector operations in the loop, the resulting vector code might have become scalar, and the performance degraded.

2.5 Performance

The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for benchmark suites should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.

Performance is often measured in terms of megaflops, millions of floating point operations per second (Mflop/s). We usually include both additions and multiplications in the count of Mflop/s, and the reference to an operation is assumed to be on 64-bit operands.

The manufacturer usually refers to peak performance when describing a system. This peak performance is arrived at by counting the number of floating-point additions and multiplications that can be completed in a period of time, usually the cycle time of the machine. As an example, consider a Pentium III with a clock rate of 750 MHz. During a cycle, the result of either the adder or the multiplier can be completed:

    1 operation per cycle × 750 MHz = 750 Mflop/s.

Table 4 shows the peak performance for a number of high-performance computers.

    Machine               Cycle time    Peak Performance
                          [MHz]         [Mflop/s]
    Intel Pentium III        750              750
    Intel Pentium 4        1,700            1,700
    Intel Itanium            800            3,200
    AMD Athlon             1,200            2,400
    Compaq Alpha             500            1,000
    IBM RS/6000              450            1,800
    NEC SX-5                 250            8,000
    Cray SV-1                300            1,200

Table 4: Theoretical peak performance of various CPUs.

By peak theoretical performance we mean only that the manufacturer guarantees that programs will not exceed these rates, sort of a speed of light for a given computer. At one time, a programmer had to go out of his way to code a matrix routine that would not run at nearly top efficiency on any system with an optimizing compiler. Owing to the proliferation of exotic computer architectures, this situation is no longer true.
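The peak figures in Table 4 are simply the product of the clock rate and the number of floating-point results completed per cycle; the short C sketch below reproduces a few entries, with the per-cycle counts taken as those implied by the table rather than from vendor documentation:

    #include <stdio.h>

    struct machine { const char *name; double clock_mhz; double results_per_cycle; };

    int main(void)
    {
        /* results_per_cycle values are those implied by Table 4 (peak divided by
         * clock rate), not a statement about the micro-architectures themselves. */
        struct machine m[] = {
            { "Intel Pentium III",  750.0,  1.0 },
            { "AMD Athlon",        1200.0,  2.0 },
            { "NEC SX-5",           250.0, 32.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-18s peak = %6.0f Mflop/s\n",
                   m[i].name, m[i].clock_mhz * m[i].results_per_cycle);
        return 0;
    }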
    Machine               Peak           LINPACK 100    System
                          Performance    Performance    Efficiency
                          [Mflop/s]      [Mflop/s]      [%]
    Intel Pentium III          750           138         18.4
    Intel Pentium 4          1,700           313         18.4
    Intel Itanium            3,200           600         18.5
    AMD Athlon               2,400           557         23.3
    Compaq Alpha             1,000           440         44.0
    IBM RS/6000              1,800           503         27.9
    NEC SX-5                 8,000           856         10.7
    Cray SV-1                1,200           549         45.7

Table 5: LINPACK benchmark solving a 100 by 100 matrix problem.

The LINPACK Benchmark illustrates this point quite well. In practice, as Table 5 shows, there may be a significant difference between peak theoretical and actual performance [24].

If we examine the algorithm used in LINPACK and look at how the data are referenced, we see that at each step of the factorization process there are vector operations that modify a full submatrix of data. This update causes a block of data to be read, updated, and written back to central memory. The number of floating-point operations is 2/3 n³, and the number of data references, both loads and stores, is also 2/3 n³. Thus, for every add/multiply pair we must perform a load and a store of the elements, unfortunately obtaining no reuse of data. Even though the operations are fully vectorized, there is a significant bottleneck in data movement, resulting in poor performance. On vector computers this translates into two vector operations and three vector-memory references, usually limiting the performance to well below peak rates. On super-scalar computers this results in a large amount of data movement and updates. To achieve high performance rates, this operation-to-memory-reference ratio must be higher. In some sense this is a problem with doing simple vector operations on a vector or super-scalar machine. The bottleneck is in moving data, and the rate of execution is limited by this quantity. We can see this by examining the rate of data transfers and the peak performance.

To understand why the performance is so poor, consider the basic operation performed, a DAXPY. There is one parameter, R∞, that reflects the hardware performance of the idealized generic computer and gives a first-order description of any real computer. This characteristic parameter is defined as follows: R∞ - the maximum or asymptotic performance - is the maximum rate of computation in units of equivalent scalar operations performed per second (Mflop/s) [48]. The information on DAXPY and DDOT presented in Tables 6 and 7 was generated by running the following loops as in-line code and measuring the time to perform the operations:

    DAXPY:
          DO 10 i = 1,n
             y(i) = y(i) + alpha * x(i)
       10 CONTINUE

    DDOT:
          DO 10 i = 1,n
             s = s + x(i) * y(i)
       10 CONTINUE

The Level 1 BLAS operate only on vectors. The algorithms as implemented tend to do more data movement than is necessary. As a result, the performance of the routines in LINPACK suffers on high-performance computers where data movement is as costly as floating-point operations.

    Machine               DAXPY R∞     Peak
                          [Mflop/s]    [Mflop/s]
    Intel Pentium III         56          750
    Intel Pentium 4          178        1,700
    Intel Itanium             28        3,200
    AMD Athlon                66        2,400
    Compaq Alpha              92        1,000
    IBM RS/6000              100        1,800

Table 6: DAXPY's asymptotic and peak performance.

    Machine               DDOT R∞      Peak
                          [Mflop/s]    [Mflop/s]
    Intel Pentium III         82          750
    Intel Pentium 4          277.5      1,700
    Intel Itanium             33        3,200
    AMD Athlon                80        2,400
    Compaq Alpha             150        1,000
    IBM RS/6000              143        1,800

Table 7: DDOT's asymptotic and peak performance.
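A self-contained C harness in the spirit of those in-line loops might look as follows; the vector length, repetition count, and use of clock() are illustrative choices only, and a careful measurement would control for timer resolution and cache effects:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1000;          /* vector length (illustrative)              */
        const int reps = 100000;        /* repetitions to get a measurable interval  */
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        double alpha = 1.000001, s = 0.0;

        for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)      /* DAXPY: y <- y + alpha*x */
                y[i] += alpha * x[i];
        double daxpy_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)      /* DDOT: s <- s + x_i*y_i */
                s += x[i] * y[i];
        double ddot_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Both kernels perform 2 flops per element per repetition. */
        printf("DAXPY %.1f Mflop/s, DDOT %.1f Mflop/s (s=%g)\n",
               2.0 * n * reps / daxpy_s / 1e6, 2.0 * n * reps / ddot_s / 1e6, s);
        free(x); free(y);
        return 0;
    }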
2.6 Restructuring Algorithms

Today's computer architectures usually have multiple stages in the memory hierarchy. By restructuring algorithms to exploit this hierarchical organization, one can gain high performance.

A hierarchical memory structure involves a sequence of computer memories, or caches, ranging from a small but very fast memory at the bottom to a large but slow memory at the top. Since a particular memory in the hierarchy (call it M) is not as big as the memory at the next level (M'), only part of the information in M' will be contained in M. If a reference is made to information that is in M, then it is retrieved as usual. However, if the information is not in M, then it must be retrieved from M', with a loss of time. To avoid repeated retrieval, information is transferred from M' to M in blocks, the supposition being that if a program references an item in a particular block, the next reference is likely to be in the same block. Programs having this property are said to have locality of reference. Typically, there is a certain startup time associated with getting the first memory reference in a block. This startup time is amortized over the block move.

To come close to gaining peak performance, one must optimize the use of the lowest level of memory (i.e., retain information as long as possible before the next access to main memory), obtaining as much reuse as possible.

2.7 Matrix-Vector Operations

One approach to restructuring algorithms to exploit hierarchical memory involves expressing the algorithms in terms of matrix-vector operations. These operations have the benefit that they can reuse data and achieve a higher rate of execution than their vector counterparts. In fact, the number of floating-point operations remains the same; only the data reference pattern is changed. This change results in an operation-to-memory-reference rate on vector computers of effectively 2 vector floating-point operations and 1 vector-memory reference.

The Level 2 BLAS were proposed in order to support the development of software that would be both portable and efficient across a wide range of machine architectures, with emphasis on vector-processing machines. Many of the frequently used algorithms of numerical linear algebra can be coded so that the bulk of the computation is performed by calls to Level 2 BLAS routines; efficiency can then be obtained by utilizing tailored implementations of the Level 2 BLAS routines. On vector-processing machines one of the aims of such implementations is to keep the vector lengths as long as possible, and in most algorithms the results are computed one vector (row or column) at a time. In addition, on vector register machines performance is increased by reusing the results of a vector register, and not storing the vector back into memory.

Unfortunately, this approach to software construction is often not well suited to computers with a hierarchy of memory (such as global memory, cache or local memory, and vector registers) and true parallel-processing computers. For those architectures it is often preferable to partition the matrix or matrices into blocks and to perform the computation by matrix-matrix operations on the blocks. By organizing the computation in this fashion we provide for full reuse of data while the block is held in the cache or local memory. This approach avoids excessive movement of data to and from memory and gives a surface-to-volume effect for the ratio of operations to data movement. In addition, on architectures that provide for parallel processing, parallelism can be exploited in two ways: (1) operations on distinct blocks may be performed in parallel; and (2) within the operations on each block, scalar or vector operations may be performed in parallel.
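The blocking idea can be made concrete with an ordinary matrix-matrix multiply. In the C sketch below (the block size and all names are our own choices), the O(NB³) arithmetic on each pair of NB by NB tiles reuses only O(NB²) data, which is the surface-to-volume effect just described:

    #include <stddef.h>

    #define NB 64   /* block (tile) size; a tunable, machine-dependent choice */

    /* C <- C + A*B for n-by-n matrices stored by columns with leading dimension n. */
    void blocked_gemm(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t jj = 0; jj < n; jj += NB)
          for (size_t kk = 0; kk < n; kk += NB)
            for (size_t ii = 0; ii < n; ii += NB) {
                size_t jmax = jj + NB < n ? jj + NB : n;
                size_t kmax = kk + NB < n ? kk + NB : n;
                size_t imax = ii + NB < n ? ii + NB : n;
                /* Multiply one tile of A by one tile of B; all three tiles are
                 * small enough to stay resident in cache while being reused. */
                for (size_t j = jj; j < jmax; j++)
                  for (size_t k = kk; k < kmax; k++) {
                      double b = B[k + j * n];
                      for (size_t i = ii; i < imax; i++)
                          C[i + j * n] += A[i + k * n] * b;
                  }
            }
    }

    int main(void)
    {
        enum { N = 128 };
        static double A[N * N], B[N * N], C[N * N];
        for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
        blocked_gemm(N, A, B, C);
        return 0;
    }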
2.8 Matrix-Matrix Operations

A set of Level 3 BLAS has been proposed, targeted at matrix-matrix operations [26]. If the vectors and matrices involved are of order n, then the original BLAS include operations that are of order O(n), the extended or Level 2 BLAS provide operations of order O(n²), and the current proposal provides operations of order O(n³); hence the use of the term Level 3 BLAS. Such implementations can, we believe, be portable across a wide variety of vector and parallel computers and also efficient (assuming that efficient implementations of the Level 3 BLAS are available). The question of portability has been much less studied, but we hope that, by having a standard set of building blocks, research into this area will be encouraged.

In the case of matrix factorization, one must perform matrix-matrix operations rather than matrix-vector operations [28, 30, 33]. There is a long history of block algorithms for such matrix problems. Both the NAG and the IMSL libraries, for example, include such algorithms (F01BTF and F01BXF in NAG; LEQIF and LEQOF in IMSL). Many of the early algorithms utilized a small main memory, with tape or disk as secondary storage [7, 17, 18, 23, 36, 58]. Similar techniques were later used to exploit common page-swapping algorithms in virtual-memory machines. Indeed, such techniques are applicable wherever there exists a hierarchy of data storage (in terms of access speed). Additionally, full blocks (and hence the multiplication of full matrices) might appear as a subproblem when handling large sparse systems of equations [23, 29, 37, 42].

More recently, several researchers have demonstrated the effectiveness of block algorithms on a variety of modern computer architectures with vector-processing or parallel-processing capabilities, on which potentially high performance can easily be degraded by excessive transfer of data between different levels of memory (vector registers, cache, local memory, main memory, or solid-state disks) [8, 9, 16, 17, 30, 33, 50, 65, 67].

Our own efforts have been twofold. First, we are attempting to recast the algorithms from linear algebra in terms of the Level 3 BLAS (matrix-matrix operations). This involves modifying the algorithm to perform more than one step of the decomposition process at a given loop iteration. Second, to facilitate the transport of algorithms to a wide variety of architectures and to achieve high performance, we are isolating the computationally intense parts in high-level modules. When the architecture changes, we deal with the modules separately, rewriting them in terms of machine-specific operations; however, the basic algorithm remains the same. By doing so we can achieve the goal of a high operation-to-memory-reference ratio.

Recently it has been shown that matrix-matrix operations can be exploited further. LU factorization, which is usually the method of choice for the LINPACK benchmark code, can be formulated recursively [45]. The recursive formulation performs better [70] than the block algorithm [4]. This is due to the lower memory traffic of the recursive method, which is achieved through better utilization of the Level 3 BLAS.

2.9 Concluding Remarks on the LINPACK Benchmark

Over the past several years, the LINPACK Benchmark has evolved from a simple listing for one matrix problem to an expanded benchmark describing the performance at three levels of problem size on several hundred computers.
The benchmark today is used by scientists worldwide to evaluate computer performance, particularly for innovative advanced-architecture machines. Nevertheless, a note of caution is needed. Benchmarking, whether with the LINPACK Benchmark or some other program, must not be used indiscriminately to judge the overall performance of a computer system. Performance is a complex issue, dependent on a variety of diverse quantities including the algorithm, the problem size, and the implementation. The LINPACK Benchmark provides three separate benchmarks that can be used to evaluate computer performance on a dense system of linear equations: the first for a 100 by 100 matrix, the second for a 1000 by 1000 matrix, and the third for a scalable parallel problem. The third benchmark, in particular, is dependent on the algorithm chosen by the manufacturer and the amount of memory available on the computer being benchmarked.

3 The Parallel LINPACK Benchmark

In the past several years, the emergence of Distributed Memory (DM) computers [48] and their potential for the numerical solution of Grand Challenge problems [22, 51, 60, 61] has led to extensive research in benchmarking. Examples of DM computers include the IBM Scalable POWERparallel SP-2, the Intel Paragon, the Cray T3E, and Networks and Clusters of Workstations (NoWs and CoWs). The key feature they have achieved is scalable performance. These scalable parallel computers comprise an ensemble of Processing Units (PUs), where each unit consists of a processor, local memories organized in a hierarchical manner, and other supporting devices. These PUs are interconnected by a point-to-point (direct) or switch-based (indirect) network. Without modifying the basic machine architecture, these distributed memory systems are capable of proportional increases in performance as the number of PUs, their memory capacity and bandwidth, and the network and I/O bandwidth are increased.

As of today, DM computers are still being produced, and their success is apparent when considering how common they have become. Still, their limitations have been revealed and their successors have already appeared. The latter are constructed from a small number of nodes, where each node is a small DM computer featuring a virtual shared memory. These nodes are interconnected by a simple bus- or crossbar-based interconnection network. Programming these machines, as well as their production, is facilitated by the relative simplicity of the interconnection network. In addition, increasing the computational capabilities of the PUs appears to be an easier task than increasing the performance of the network. As opposed to large-scale DM computers, where each processor is much less powerful than the whole, the collection of nodes of this hierarchical architecture is only slightly more powerful than its components. The SGI SMP Power Challenge is an existing example of such an architecture. The scalability of these machines can simultaneously take advantage of the progress made by the processor and network technologies as well as the hardware and/or software mechanisms implementing the virtual shared memory. It is still unclear how these machines will be programmed. Whether these machines will in the future completely replace DM computers is also a question that is difficult to answer today. In this paper, these machines will also be considered as DM computers.
In order to fully exploit the increasing computational power of DM computers, the application software must be scalable, that is, able to take advantage of larger machine configurations to solve larger problems with the same efficiency. The LINPACK benchmark addresses scalability by introducing a new category of testing rules and environment. This new category is referred to as the Highly-Parallel LINPACK (HPL) NxN benchmark [10, 13, 14, 15, 24, 44, 71, 74]. It requires the solution of systems of linear equations by some method. The problem size is allowed to vary, and the best floating-point execution rate should be reported. In computing the execution rate, the number of operations should be 2n³/3 + 2n², independent of the actual method used. If Gaussian elimination is chosen, partial pivoting must be used. A residual for the accuracy of solution, given as ||Ax - b|| / (||A|| ||x||), should also be reported. The following quantities of the benchmark are reported in the TOP500 list:

    Rmax   - the performance in Gflop/s for the largest problem run on a machine,
    Nmax   - the size of the largest problem run on a machine,
    N1/2   - the size where half the Rmax execution rate is achieved,
    Rpeak  - the theoretical peak performance in Gflop/s for the machine.

The performance of the HPL NxN benchmark depends on the efficiency of the serial code executed on a single CPU as well as on the parallel algorithm that makes all the CPUs cooperate. The former may be designed with the use of the optimization techniques mentioned earlier. The essential ingredients of the latter are load balancing and network latency hiding. Assuming that all of the CPUs are the same, an even work and data load is usually achieved with a block-cyclic data layout [34, 47]. Latency hiding is most often performed by overlapping computation and communication, which is facilitated by an appropriate choice of virtual CPU topology, computational sequence of the LU factorization, and implementation of the collective communication algorithms. A large body of literature exists on efficient parallel implementations of the BLAS [1, 2, 3, 20, 21, 38, 39, 40, 43, 49, 56, 63, 72], the factorization algorithm [5, 6, 11, 19, 34, 41, 47, 66, 68, 69, 73], and its components [12, 46, 54, 55].
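The two quantities fixed by these rules, the operation count and the residual, can be computed as in the following C sketch; it is a simplified illustration using infinity norms (HPL's own checks, described later, are slightly different):

    #include <math.h>
    #include <stdio.h>

    /* Gflop/s rate as defined by the benchmark rules: the operation count is
     * fixed at 2n^3/3 + 2n^2 regardless of the method actually used. */
    double hpl_gflops(double n, double seconds)
    {
        return (2.0 / 3.0 * n * n * n + 2.0 * n * n) / seconds / 1.0e9;
    }

    /* A simplified residual ||Ax - b|| / (||A|| ||x||) using infinity norms;
     * A is n-by-n in column-major order. */
    double residual(size_t n, const double *A, const double *x, const double *b)
    {
        double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;
        for (size_t i = 0; i < n; i++) {
            double ri = -b[i], rowsum = 0.0;
            for (size_t j = 0; j < n; j++) {
                ri += A[i + j * n] * x[j];
                rowsum += fabs(A[i + j * n]);
            }
            if (fabs(ri) > rnorm)   rnorm = fabs(ri);
            if (rowsum > anorm)     anorm = rowsum;
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
        }
        return rnorm / (anorm * xnorm);
    }

    int main(void)
    {
        double A[4] = { 2, 0, 0, 2 };        /* 2-by-2 example, column-major */
        double x[2] = { 1, 1 }, b[2] = { 2, 2 };
        printf("rate for n=1000, t=10 s: %.3f Gflop/s\n", hpl_gflops(1000.0, 10.0));
        printf("residual: %g\n", residual(2, A, x, b));
        return 0;
    }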
4 The TOP500 List

Statistics on high-performance computers are of major interest to manufacturers and potential users. They wish to know not only the number of systems installed, but also the location of the various supercomputers within the high-performance computing community and the applications for which a computer system is being used. Such statistics can facilitate the establishment of collaborations, the exchange of data and software, and provide a better understanding of the high-performance computer market.

Statistical lists of supercomputers are not new. Every year since 1986, Hans Meuer [59] has published system counts of the major vector computer manufacturers, based principally on those at the Mannheim Supercomputer Seminar. Statistics based merely on the name of the manufacturer are no longer useful, however. New statistics are required that reflect the diversification of supercomputers, the enormous performance difference between low-end and high-end models, the increasing availability of massively parallel processing (MPP) systems, and the strong increase in computing power of the high-end models of workstation suppliers (SMP). To provide this new statistical foundation, the TOP500 list was created in 1993 to assemble and maintain a list of the 500 most powerful computer systems. The list is compiled twice a year with the help of high-performance computer experts, computational scientists, manufacturers, and the Internet community in general. In the list, computers are ranked by their performance on the HPL NxN benchmark.

5 The HPL Code

5.1 Overview

HPL [64] is a software package that generates and solves a random dense linear system of equations on distributed-memory computers, as outlined in Fig. 1. The package uses 64-bit floating point arithmetic and portable routines for linear algebra operations and message passing. It also gives the possibility of selecting one of multiple factorization algorithms and provides timing and an estimate of the accuracy of the solution. Consequently, it can be regarded as a portable implementation of the HPL NxN benchmark. The HPL package is written in C and requires an implementation of MPI and of either the BLAS [26, 27] or the Vector Signal Image Processing Library (VSIPL).

    /* Generate and partition matrix data among MPI computing nodes. */
    /* ... */

    /* All the nodes start at the same time. */
    MPI_Barrier(...);

    /* Start wall-clock timer. */
    HPL_ptimer(...);

    HPL_pdgesv(...); /* Solve system of equations. */

    /* Stop wall-clock timer. */
    HPL_ptimer(...);

    /* Obtain the maximum wall-clock time. */
    MPI_Reduce(...);

    /* Gather statistics about performance rate (based on the maximum
       wall-clock time) and accuracy of the solution. */
    /* ... */

Figure 1: Main computational steps performed by HPL to obtain the HPL NxN benchmark rating.

5.2 Algorithm

HPL solves a linear system of equations of order n,

    Ax = b,  A ∈ R^(n×n),  x, b ∈ R^n,

by first computing the LU factorization with row partial pivoting of the n by n + 1 coefficient matrix:

    P [A, b] = [[L, U], y].

Since the row pivoting (represented by the permutation matrix P) and the lower triangular factor L are applied to b as the factorization progresses, the solution x is obtained in one step by solving the upper triangular system

    U x = y.

The lower triangular matrix L is left unpivoted and the array of pivots is not returned.

Fig. 2 shows the two-dimensional block-cyclic data distribution used by HPL. The data is distributed onto a two-dimensional grid (of dimensions P by Q) of processes according to the block-cyclic scheme to ensure good load balance as well as the scalability of the algorithm. The n by n + 1 coefficient matrix is logically partitioned into blocks (each of dimension NB by NB) that are cyclically dealt onto the P by Q process grid. This is done in both dimensions of the matrix.

    P0  P2  P0  P2
    P1  P3  P1  P3
    P0  P2  P0  P2
    P1  P3  P1  P3

Figure 2: Ownership of dense subblocks in the two-dimensional block-cyclic data distribution used by HPL. The number of processes is 4 (named P0, P1, P2, and P3); they are organized in a 2 by 2 grid (P = Q = 2). The number of subblocks is 4 in both dimensions (N/NB = 4).
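The ownership rule behind Fig. 2 is compact enough to state in a few lines of C. The sketch below (our own illustration, not HPL's internal indexing code) maps the block with zero-based block coordinates (ib, jb) to the process at position (ib mod P, jb mod Q) of the grid and reproduces the layout of the figure:

    #include <stdio.h>

    /* Process coordinates owning global block (ib, jb) in a P-by-Q grid
     * under the two-dimensional block-cyclic distribution. */
    static void owner(int ib, int jb, int P, int Q, int *p, int *q)
    {
        *p = ib % P;
        *q = jb % Q;
    }

    int main(void)
    {
        const int P = 2, Q = 2, nblocks = 4;   /* the 2-by-2 grid of Fig. 2 */
        for (int ib = 0; ib < nblocks; ib++) {
            for (int jb = 0; jb < nblocks; jb++) {
                int p, q;
                owner(ib, jb, P, Q, &p, &q);
                printf("P%d ", p + q * P);     /* P0..P3 numbered down grid columns */
            }
            printf("\n");
        }
        return 0;
    }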
The right-looking variant has been chosen for the main loop of the LU factorization. This means that, at each iteration of the loop, a panel of NB columns is factored, and the trailing submatrix is updated. Therefore, this computation is logically partitioned with the same block size NB that was used for the data distribution. At a given iteration of the main loop, and because of the cartesian property of the distribution scheme, each panel factorization occurs in one column of processes. This particular part of the computation lies on the critical path of the overall algorithm.

For this operation, the user is offered a choice of three (Crout, left- and right-looking) recursive variants based on matrix-matrix multiply. The software also allows the user to choose in how many sub-panels the current panel should be divided at each recursion level. Furthermore, one can also select at run time the recursion stopping criterion in terms of the number of columns left to factor. When this threshold is reached, the sub-panel will then be factored using one of the three (Crout, left- or right-looking) factorization algorithms based on matrix-vector operations. Finally, for each panel column, the pivot search and the associated swap and broadcast operations of the pivot row are combined into one single communication step. A binary-exchange (leave-on-all) reduction performs these three operations at once.

Once the panel factorization has been performed, the factored panel of columns is broadcast to the other process columns. There are many possible broadcast algorithms, and the software currently offers the following variants: increasing ring, modified increasing ring, increasing two-ring, modified increasing two-ring, bandwidth-reducing, and modified bandwidth-reducing. The modified variants relieve the next processor (the one that would participate in the factorization of the panel after the current one) from the burden of sending messages (otherwise it has to receive as well as send matrix update data). The ring variants propagate the update data in a single pipeline fashion, whereas the two-ring variants propagate data in two pipelines concurrently. The bandwidth-reducing variants divide a message to be sent into a number of pieces and scatter them across a single row of the grid of processors, so that more messages are exchanged but the total volume of communication is independent of the number of processors. This becomes particularly important when the computing nodes are relatively much faster than the interconnect.

Once the current panel has been broadcast (or during the broadcast operation) the trailing submatrix has to be updated. As mentioned before, the panel factorization lies on the critical path. This means that when the kth panel has been factored and then broadcast, the next most urgent task to complete is the factorization and broadcast of panel k + 1. This technique is often referred to as look-ahead (or send-ahead) in the literature. HPL allows the user to select various depths of look-ahead. By convention, a depth of zero corresponds to having no look-ahead, in which case the trailing submatrix is updated by the panel currently being broadcast. Look-ahead consumes some extra memory to keep all the panels of columns currently in the look-ahead pipe. A look-ahead of depth 1 or 2 is most likely to achieve the best performance gain.

The update of the trailing submatrix by the last panel in the look-ahead pipe is performed in two phases. First, the pivots must be applied to form the current row panel of U. Second, an upper triangular solve using the column panel occurs. Finally, the updated part of U needs to be broadcast to each process within a single column so that the local rank-NB update can take place. It has been decided to combine the swapping and broadcast of U at the cost of replicating the solve. The following algorithms are available for this communication operation: binary-exchange and bandwidth-reducing. The first variant is a modified leave-on-all reduction operation.
The second one has a communication volume complexity that solely depends on the size of U (the number of process rows only impacts the number of messages being exchanged) and, consequently, should outperform the previous variant for large problems on large machine configurations. In addition, both of the previous variants may be combined in a single run of the code.

After the factorization has ended, the backward substitution remains to be done. HPL uses a look-ahead of depth one to do this. The right hand side is forwarded along process rows in a decreasing-ring fashion, so that we solve Q × NB entries at a time. At each step, this shrinking piece of the right-hand side is updated. The process just above the one owning the current diagonal block of the matrix updates its last NB entries of the vector x, forwards them to the previous process column, and then broadcasts them in the process column in a decreasing-ring fashion. The solution is then updated and sent to the previous process column. The solution of the linear system is left replicated in every process row.

To verify the result, the input matrix and right-hand side are regenerated. The following scaled residuals are computed (ε is the relative machine precision):

    r_n = ||Ax - b||_∞ / (ε ||A||_1 n),
    r_1 = ||Ax - b||_∞ / (ε ||A||_1 ||x||_1),
    r_∞ = ||Ax - b||_∞ / (ε ||A||_∞ ||x||_∞).

A solution is considered numerically correct when all of these quantities are of order O(1).

5.3 Scalability

HPL was designed for distributed-memory computers. They consist of processors that are connected using a message passing interconnection network. Each processor has its own memory, called the local memory, which is accessible only to that processor. As the time to access a remote memory is longer than the time to access a local one, such computers are often referred to as Non-Uniform Memory Access (NUMA) machines.

The interconnection network of our machine model is static, meaning that it consists of point-to-point communication links among processors. This type of network is also referred to as a direct network, as opposed to dynamic networks. The latter are constructed from switches and communication links. These links are dynamically connected to one another by the switching elements to establish, at run time, the paths between processors' memories. The interconnection network of the two-dimensional machine model considered here is a static, fully connected physical topology. It is also assumed that processors can be treated equally in terms of local performance and that the communication rate between two processors depends only on the processors being considered.

Our model assumes that a processor can send or receive data on only one of its communication ports at a time (assuming it has more than one). In the literature, this assumption is also referred to as the one-port communication model. The time spent to communicate a message between two given processors is called the communication time Tc. In our machine model, Tc is approximated by a linear function of the number L of double precision floating point numbers being communicated. Tc is the sum of the time α to prepare the message for transmission and the time βL taken by a message of length L to traverse the network to its destination:

    Tc = α + βL.

Finally, the model assumes that the communication links are bi-directional, i.e., the time for two processors to send each other a message of length L is also Tc. A processor can send and/or receive a message on only one of its communication links at a time.
In particular, a processor can send a message while receiving another message from the processor it is sending to at the same time.

Since this document is only concerned with regular local dense linear algebra operations, the time taken to perform one floating point operation is assumed to be modeled by three constants: γ1, γ2, and γ3. These quantities are single processor approximations of the floating point operation rates of vector-vector, matrix-vector and matrix-matrix operations, respectively. This approximation describes all the steps performed by a processor to complete its portion of the matrix factorization. Such a model neglects all the phenomena occurring in the processor components, such as cache misses, pipeline start-ups, memory loads or stores, floating point arithmetic, TLB misses, etc. These may influence the values of the constants, which may vary as, e.g., a function of the problem size. Similarly, the model does not make any assumptions on the amount of physical memory available on a single node. It is assumed that if a process has been spawned on a processor, one has ensured that enough memory was available on that processor. In other words, swapping will not occur during the modeled computation.

Consider an M by N panel distributed over a column of P processors. Because of the recursive formulation of the panel factorization, it is reasonable to consider that the floating point operations will be performed at the matrix-matrix multiply rate. For every column in the panel a binary exchange is performed on 2N data items. When this panel is broadcast, what matters is the time that the next process column will spend in this communication operation. Assuming one chooses the modified increasing ring variant, only one message needs to be taken into account. The execution time of the panel factorization and broadcast can thus be approximated by:

    Tpfact(M, N) = γ3 (M N² / P - N³ / 3) + N log P (α + 2 N β) + α + β M N / P.

The update phase of an N by N trailing submatrix distributed on a P by Q process grid is achieved with a triangular solve of N right hand sides and a local update of rank NB of the trailing submatrix. Assuming that the long (bandwidth-reducing) variant is chosen, the execution time of the update operation can be approximated by:

    Tupdate(N, NB) = γ3 (N NB² / Q + 2 N² NB / (P Q)) + α (log P + P - 1) + 3 β N NB / Q.

The constant 3 in front of the β term is obtained by counting one for the logarithmic spread phase and two for the rolling phase. In the case of bi-directional links this constant should be replaced by 2.

The number of floating point operations performed during the backward substitution is given by N² / (P Q). Because of the look-ahead, the communication cost can be approximated at each step by two messages of length NB, i.e., the time to communicate the piece of size NB of the solution vector from one diagonal block of the matrix to another. It follows that the execution time of the backward substitution can be approximated by:

    Tbacks(N, NB) = γ2 N² / (P Q) + N (α / NB + 2 β).

The total execution time of the algorithm described above is given by the sum over k = 0, ..., N/NB of

    Tpfact(N - k NB, NB) + Tupdate(N - (k+1) NB, NB),

plus Tbacks(N, NB). By considering only the dominant terms in α, β, and γ3:

    THPL = γ3 2 N³ / (3 P Q) + β N² (3P + Q) / (2 P Q) + α N ((NB + 1) log P + P) / NB.

The serial execution time is given by Tser = 2/3 γ3 N³.
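Plugging concrete parameters into the dominant-term expression gives a quick feel for where the time goes. The C sketch below evaluates THPL and Tser; the machine parameters are placeholders rather than measurements, and log P is interpreted as a base-2 logarithm since the exchanges in the model are binary:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Placeholder machine parameters, for illustration only. */
        double gamma3 = 1.0e-9;   /* seconds per flop at the matrix-matrix rate  */
        double alpha  = 50.0e-6;  /* message latency in seconds                  */
        double beta   = 1.0e-7;   /* seconds per double precision word moved     */
        double N = 50000.0, NB = 64.0, P = 8.0, Q = 8.0;
        double logP = log2(P);    /* log P taken base 2 (binary exchange)        */

        double Thpl = gamma3 * 2.0 * N * N * N / (3.0 * P * Q)
                    + beta * N * N * (3.0 * P + Q) / (2.0 * P * Q)
                    + alpha * N * ((NB + 1.0) * logP + P) / NB;
        double Tser = 2.0 / 3.0 * gamma3 * N * N * N;

        printf("T_HPL = %.1f s, T_ser = %.1f s\n", Thpl, Tser);
        return 0;
    }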
If we define the parallel efficiency E as the ratio Tser / (P Q THPL), we obtain:

    E = [ 1 + 3 β (3P + Q) / (4 γ3 N) + 3 α P Q ((NB + 1) log P + P) / (2 γ3 N² NB) ]^(-1).

This last equality shows that when the memory usage per processor, N² / (P Q), is maintained constant, the parallel efficiency slowly decreases only because of the α term. The communication volume (the β term), however, remains constant. Due to these results, HPL is said to be scalable not only with respect to the amount of computation, but also with respect to the communication volume.

5.4 Performance Results

Tables 8 and 9 show the system parameters and performance results of HPL for a cluster of workstations based on the AMD Athlon processor. Similarly, Tables 10 and 11 describe a cluster based on the Intel Pentium III processor. Performance tests were also performed on a Compaq cluster installed at Oak Ridge National Laboratory in Tennessee, U.S.A. This cluster is listed at position 90 on the June 2001 TOP500 list. Its description and performance are given in Tables 12 and 13. The LINPACK benchmark numbers for this system are presented in Table 14.

    CPU            AMD Athlon K7 500 MHz
    Main memory    256 MB
    Interconnect   100 Mbps Switched, 2 NICs per node (channel bonding)
    OS             RedHat Linux 6.2 (kernel 2.2.14)
    C compiler     gcc ver. 2.91.66 (egcs-1.1.2 release)
    C flags        -fomit-frame-pointer -O3 -funroll-loops
    MPI            MPICH 1.2.1
    BLAS           ATLAS 3.0 beta
    Date           September 2000

Table 8: Description of the AMD cluster used in tests.

    Processor Grid    Matrix Dimension
    Dimension         2000    5000    8000    10000
    1 by 4            1.28    1.73    1.89    1.95
    2 by 2            1.17    1.68    1.88    1.93
    4 by 1            0.81    1.43    1.70    1.80

Table 9: Performance (in Gflop/s) of HPL on an AMD cluster with 4 computing nodes.

    CPU            Intel Pentium III 550 MHz
    Main memory    512 MB
    Interconnect   Myrinet
    OS             RedHat Linux 6.1 (kernel 2.2.15)
    C compiler     gcc ver. 2.91.66 (egcs-1.1.2 release)
    C flags        -fomit-frame-pointer -O3 -funroll-loops
    MPI            MPI GM ver. 1.2.3
    BLAS           ATLAS ver. 3.0 beta
    Date           September 2000

Table 10: Description of the Pentium cluster used in tests.

    Processor Grid    Matrix Dimension
    Dimension         2000    5000    8000    10000    15000    20000
    2 by 4            1.76    2.32    2.51    2.58     2.72     2.73
    4 by 4            2.27    3.94    4.46    4.68     5.00     5.16

Table 11: Performance (in Gflop/s) of HPL on a Pentium cluster with up to 16 computing nodes.

    CPU            EV67 667 MHz
    OS             Tru64 ver. 5
    C compiler     cc ver. 6.1
    C flags        -arch host -tune host -std -O5
    MPI            native (linker flags: -lmpi -lelan)
    BLAS           CXML
    Date           September 2000

Table 12: Description of the Compaq cluster used in tests.

    Processor Grid    Matrix Dimension
    Dimension         5000     10000       25000    53000
    8 by 8            26.37    45.00.73    60.99    66.93

Table 13: Performance (in Gflop/s) of HPL on a Compaq cluster with 64 computing nodes.

    CPUs/Nodes         1/1      4/1      16/4     64/16    256/64
    N1/2               150      800      2300     5700     14000
    Nmax               6625     12350    26500    53000    106000
    Rmax [Gflop/s]     1.14     4.36     17.0     67.5     263.6
    E [%]              -        95.6     93.2     92.5     90.1

Table 14: LINPACK benchmark numbers for the Compaq cluster obtained using HPL.

5.5 Running and Tuning

In order to find the best performance of a given system, the largest problem size fitting in memory should be used. The amount of memory used by HPL is essentially the size of the coefficient matrix.

HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance; consequently, one definitely should avoid very large values of NB. From a local computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the fastest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL.
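A common way to choose the problem size follows directly from the remark above: 8N² bytes of coefficient matrix must fit in some fraction of the aggregate memory, and N is then rounded to a multiple of NB. A small C sketch of that calculation (the 80% fill factor is an illustrative assumption, not an HPL rule):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double nodes = 4.0;                     /* number of computing nodes       */
        double mem_per_node = 256.0 * 1048576;  /* bytes of memory per node        */
        double fill = 0.80;                     /* fraction of memory given to A   */
        int nb = 64;                            /* block size used by HPL          */

        /* 8*N*N bytes of matrix must fit into fill * total memory. */
        double nmax = sqrt(fill * nodes * mem_per_node / 8.0);
        int n = ((int)nmax / nb) * nb;          /* round down to a multiple of NB  */

        printf("suggested problem size N = %d\n", n);
        return 0;
    }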
6 Acknowledgements

The authors acknowledge the use of the Oak Ridge National Laboratory Compaq cluster, funded by the Department of Energy's Office of Science and Energy Efficiency programs, and of an IBM supercomputer based on the Power4 microprocessor provided as a part of the Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program. The use of an UltraSparc III server at the Department of Mathematics and Computer Science at Emory University is also greatly appreciated.

References

[1] M. Aboelaze, N. Chrisochoides, and E. Houstis. The parallelization of Level 2 and 3 BLAS operations on distributed-memory machines. Technical Report CSD-TR-91-007, Purdue University, West Lafayette, IN, 1991.

[2] R. Agarwal, S. Balle, F. Gustavson, M. Joshi, and P. Palkar. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 39(5):575-582, 1995.

[3] R. Agarwal, F. Gustavson, and M. Zubair. A high performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM Journal of Research and Development, 38(6):673-681, 1994.

[4] E. Anderson, Z. Bai, C. Bischof, Susan L. Blackford, James W. Demmel, Jack J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and Danny C. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

[5] Cleve Ashcraft. The distributed solution of linear systems using the torus-wrap data mapping. Technical Report ECA-TR-147, Boeing Computer Services, Seattle, WA, 1990.

[6] Cleve Ashcraft. A taxonomy of distributed dense LU factorization methods. Technical Report ECA-TR-161, Boeing Computer Services, Seattle, WA, 1991.

[7] D. W. Barron and H. P. F. Swinnerton-Dyer. Solution of simultaneous linear equations using a magnetic tape store. Computer Journal, 3:28-33, 1960.

[8] M. Berry, K. Gallivan, W. Harrod, W. Jalby, S. Lo, U. Meier, B. Philippe, and A. Sameh. Parallel algorithms on the CEDAR system. Technical Report 581, CSRD, 1986.

[9] C. Bischof and Charles F. Van Loan. The WY representation for products of Householder matrices. SIAM Journal on Scientific and Statistical Computing, 8(2), March 1987.

[10] R. Bisseling and L. Loyens. Towards peak parallel LINPACK performance on 400 transputers. Supercomputer, 45:20-27, 1991.

[11] R. Bisseling and J. van der Vorst. Parallel LU decomposition on a transputer network. In G. van Zee and J. van der Vorst, editors, Lecture Notes in Computer Science, volume 384, pages 61-77. Springer-Verlag, 1989.

[12] R. Bisseling and J. van der Vorst. Parallel triangular system solving on a mesh network of transputers. SIAM Journal on Scientific and Statistical Computing, 12:787-799, 1991.

[13] J. Bolen, A. Davis, B. Dazey, S. Gupta, G. Henry, D. Robboy, G. Schier, D. Scott, M. Stallcup, A. Taraghi, S. Wheat (from Intel SSD), L. Fisk, G. Istrail, C. Jong, R. Riesen, and L. Shuler (from Sandia National Laboratories). Massively parallel distributed computing: World's first 281 Gigaflop supercomputer. In Proceedings of the Intel Supercomputer Users Group, 1995.

[14] R. Brent. The LINPACK benchmark on the AP 1000. Frontiers, pages 128-135, 1992.

[15] R. Brent and P. Strazdins.
Implementation of BLAS Level 3 and LINPACK benchmark on the AP1000. Fujitsu Scientific and Technical Journal, 5(1):61-70, 1993.

[16] I. Bucher and T. Jordan. Linear algebra programs for use on a vector computer with a secondary solid state storage device. In R. Vichnevetsky and R. Stepleman, editors, Advances in Computer Methods for Partial Differential Equations, pages 546-550. IMACS, 1984.

[17] D. A. Calahan. Block-oriented local-memory-based linear equation solution on the CRAY-2: Uniprocessor algorithms. In U. Schendel, editor, Proceedings International Conference on Parallel Processing, pages 375-378. IEEE Computer Society Press, August 1986.

[18] B. Chartres. Adaption of the Jacobi and Givens methods for a computer with magnetic tape backup store. Technical Report 8, University of Sydney, 1960.

[19] J. Choi, Jack J. Dongarra, Susan Ostrouchov, Antoine Petitet, David Walker, and R. Clint Whaley. The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5:173-184, 1996.

[20] J. Choi, Jack J. Dongarra, and David Walker. PUMMA: Parallel universal matrix multiplication algorithms on distributed-memory concurrent computers. Concurrency: Practice and Experience, 6(7):543-570, 1994.

[21] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn. Parallel implementation of BLAS: General techniques for Level 3 BLAS. Concurrency: Practice and Experience, 9(9):837-857, 1997.

[22] Mathematical Committee on Physical and Engineering Sciences, editors. Grand Challenges: High Performance Computing and Communications. NSF/CISE, 1800 G Street NW, Washington, DC 20550, 1991.

[23] A. K. Dave and Iain S. Duff. Sparse matrix calculations on the CRAY-2. Technical Report CSS 197, AERE Harwell, 1986.

[24] Jack J. Dongarra. Performance of various computers using standard linear equations software. Technical Report CS-89-85, University of Tennessee, 1989. (An updated version of this report can be found at http://www.netlib.org/benchmark/performance.ps).

[25] Jack J. Dongarra, J. Bunch, Cleve Moler, and G. W. Stewart. LINPACK Users' Guide. SIAM, Philadelphia, PA, 1979.

[26] Jack J. Dongarra, J. Du Croz, Iain S. Duff, and S. Hammarling. A set of Level 3 FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16:1-17, March 1990.

[27] Jack J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14:1-17, March 1988.

[28] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia, PA, 1998.

[29] Jack J. Dongarra, Victor Eijkhout, and Piotr Luszczek. Recursive approach in sparse matrix LU factorization. In Proceedings of the 1st SGI Users Conference, pages 409-418, Cracow, Poland, October 2000. ACC Cyfronet UMM. Accepted for publication in Scientific Programming.

[30] Jack J. Dongarra and T. Hewitt. Implementing dense linear algebra algorithms using multitasking on the CRAY X-MP-4. SIAM Journal on Scientific and Statistical Computing, 7(1):347-350, January 1986.

[31] Jack J. Dongarra and A. Hinds. Unrolling loops in Fortran. Software: Practice and Experience, 9:219-226, 1979.

[32] Jack J. Dongarra, Antoine Petitet, and R. Clint Whaley. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-25, 2001.

[33] Jack J. Dongarra and Danny C. Sorensen. Linear algebra on high-performance computers. In U.
Schendel, editor, Proceedings Parallel Computing 85, pages 3-32. North Holland, 1986.

[34] Jack J. Dongarra, R. van de Geijn, and David Walker. Scalability issues in the design of a library for dense linear algebra. Journal of Parallel and Distributed Computing, 22(3):523-537, 1994. (Also LAPACK Working Note No. 43).

[35] Jack J. Dongarra and R. Clint Whaley. Automatically tuned linear algebra software (ATLAS). In Proceedings of the SC'98 Conference, 1998.

[36] J. Du Croz, S. Nugent, J. Reid, and D. Taylor. Solving large full sets of linear equations in a paged virtual store. ACM Transactions on Mathematical Software, 7(4):527-536, 1981.

[37] Iain S. Duff. Full matrix techniques in sparse Gaussian elimination. In Numerical Analysis Proceedings, number 912 in Lecture Notes in Mathematics, pages 71-84, Dundee, 1981. Springer-Verlag, Berlin, 1981.

[38] A. Elster. Basic matrix subprograms for distributed-memory systems. In David Walker and Q. Stout, editors, Proceedings of the Fifth Distributed-Memory Computing Conference, pages 311-316. IEEE Press, 1990.

[39] R. Falgout, A. Skjellum, S. Smith, and C. Still. The multicomputer toolbox approach to concurrent BLAS and LACS. In Proceedings of the Scalable High Performance Computing Conference SHPCC-92. IEEE Computer Society Press, 1992.

[40] G. Fox, S. Otto, and A. Hey. Matrix algorithms on a Hypercube I: Matrix multiplication. Parallel Computing, 3:17-31, 1987.

[41] G. Geist and C. Romine. LU factorization algorithms on distributed-memory multiprocessor architectures. SIAM Journal on Scientific and Statistical Computing, 9:639-649, 1988.

[42] A. George and H. Rashwan. Auxiliary storage methods for solving finite element systems. SIAM Journal on Scientific and Statistical Computing, 6:882-910, 1985.

[43] B. Grayson and R. van de Geijn. A high performance parallel Strassen implementation. Parallel Processing Letters, 6(1):3-12, 1996.

[44] B. Greer and G. Henry. High performance software on Intel Pentium Pro processors or micro-ops to TeraFLOPS. In Proceedings of the SuperComputing 1997 Conference, San Jose, California, 1997. ACM SIGARCH, IEEE Computer Society Press. ISBN: 0-89791-985-8.

[45] Fred G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development, 41(6):737-755, November 1997.

[46] M. Heath and C. Romine. Parallel solution of triangular systems on distributed-memory multiprocessors. SIAM Journal on Scientific and Statistical Computing, 9:558-588, 1988.

[47] B. Hendrickson and D. Womble. The torus-wrap mapping for dense matrix calculations on massively parallel computers. SIAM Journal on Scientific and Statistical Computing, 15:1201-1226, 1994.

[48] R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger Ltd, Bristol, 1981.

[49] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang. Matrix multiplication on the Intel Touchstone DELTA. Concurrency: Practice and Experience, 6(7):571-594, 1994.

[50] IBM. Engineering and Scientific Subroutine Library. IBM, 1986. Program Number: 5668-863.

[51] W. Kaufmann and L. Smarr. Supercomputing and the Transformation of Science. Scientific American Library, 1993.

[52] Donald Knuth. An empirical study of Fortran programs. Software-Practice and Experience, 1:105-133, 1971.

[53] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra Subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5:308-323, 1979.

[54] G. Li and T. Coleman. A parallel triangular solver for a distributed-memory multiprocessor.
SIAM Journal on Scientific and Statistical Computing, 9(3):485-502, 1988.

[55] G. Li and T. Coleman. A new method for solving triangular systems on distributed-memory message-passing multiprocessors. SIAM Journal on Scientific and Statistical Computing, 10(2):382-396, 1989.

[56] J. Li, R. Falgout, and A. Skjellum. A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies. Concurrency: Practice and Experience, 9(5):345-389, 1997.

[57] The LINPACK 1000x1000 benchmark program. (See http://www.netlib.org/benchmark/1000d for source code.)

[58] A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged memory systems. Communications of the ACM, 12(3):153-165, 1969.

[59] Hans W. Meuer, Erik Strohmaier, Jack J. Dongarra, and H. D. Simon. TOP500 Supercomputer Sites, 17th edition, November 2, 2001. (The report can be downloaded from http://www.netlib.org/benchmark/top500.html).

[60] Office of Science and Technology Policy, editors. A Research and Development Strategy for High Performance Computing. Executive Office of the President, 1987.

[61] Office of Science and Technology Policy, editors. The Federal High Performance Computing Program. Executive Office of the President, 1989.

[62] D. Pager. Some notes on speeding up certain loops by software, firmware, and hardware means. IEEE Transactions on Computers, pages 97-100, January 1972.

[63] Antoine Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. Ph.D. thesis, Computer Science Department, University of Tennessee, Knoxville, 1996.

[64] Antoine Petitet, R. Clint Whaley, Jack J. Dongarra, and Andy Cleary. HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. Innovative Computing Laboratory, September 2000. Available at http://icl.cs.utk.edu/hpl/ and http://www.netlib.org/hpl/.

[65] Y. Robert and P. Sguazzero. The LU decomposition algorithm and its efficient Fortran implementation on the IBM 3090 vector multiprocessor. Technical Report ECSEC Report ICE-0006, IBM, March 1987.

[66] Youcef Saad. Communication complexity of the Gaussian elimination algorithm on multiprocessors. Linear Algebra and Its Applications, 77:315-340, 1986.

[67] R. Schreiber. Engineering and Scientific Subroutine Library. Module Design Specification. SAXPY Computer Corporation, 255 San Geronimo Way, Sunnyvale, CA 94086, 1986.

[68] P. Strazdins. Matrix factorization using distributed panels on the Fujitsu AP1000. In Proceedings of the IEEE First International Conference on Algorithms And Architectures for Parallel Processing ICA3PP-95, Brisbane, 1995.

[69] P. Strazdins. Lookahead and algorithmic blocking techniques compared for parallel matrix factorization. In Proceedings of the 10th International Conference on Parallel and Distributed Computing and Systems, IASTED, Las Vegas, 1998.

[70] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4), 1997.

[71] R. van de Geijn. Massively parallel LINPACK benchmark on the Intel Touchstone DELTA and iPSC/860 systems. In 1991 Annual Users Conference Proceedings, Dallas, Texas, 1991. Intel Supercomputer Users Group.

[72] R. van de Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255-274, 1997.

[73] E. van de Velde. Experiments with multicomputer LU-decomposition. Concurrency: Practice and Experience, 2:1-26, 1990.

[74] D. Womble, D. Greenberg, D. Wheat, and S. Riesen.
LU factorization and the LINPACK benchmark on the Intel Paragon. Technical report, Sandia National Laboratories, 1994.

Appendix

Description

In the following, a number of performance results are presented for a variety of CPUs. The results show: asymptotic performance graphs of the BLAS routines relevant to the HPL NxN benchmark, single-CPU LINPACK benchmark numbers for the 1000 by 1000 problem, and selected hardware parameters of the systems used in the tests.

The original LINPACK benchmark used the Level 1 BLAS routines DDOT and DAXPY. As the following graphs show, their performance is severely limited on contemporary superscalar processors. The Level 2 BLAS routine DGEMV may be used in an implementation of a LINPACK benchmark code. As the results indicate, however, this should also be avoided, since its performance is limited by the system bus throughput. The Level 3 BLAS routine DGEMM achieves the highest fraction of peak performance and, consequently, should be the preferred BLAS routine for a LINPACK benchmark code.

The LINPACK benchmark numbers for each of the CPUs were obtained by solving a system of 1000 simultaneous linear equations, so they may be easily compared with each other. The label "Linpack source" refers to the self-contained LINPACK benchmark code (available at http://www.netlib.org/benchmark/1000d), which relies only on compiler optimizations to increase the rate of execution. The code labeled "Linpack BLAS" requires external (possibly optimized) BLAS. The codes "Right-looking", "Left-looking", and "Crout" represent variants of LU factorization based on Level 2 BLAS routines: "Right-looking" relies on DGER and is equivalent to LAPACK's DGETF2 routine [4] (its structure is sketched in the code example below); the "Left-looking" variant uses the DTRSV and DGEMV routines; and, finally, the "Crout" variant uses the DGEMV routine only. Two codes that use Level 3 BLAS are "Lapack", which solves the problem with LAPACK's DGETRF and DGETRS routines [4], and "Recursive LU", which calls the DTRSM and DGEMM routines and is based on a recursive formulation of LU factorization. Unless stated otherwise, all codes use the BLAS provided by ATLAS [32, 35] version 3.2.1 or higher (available at http://www.netlib.org/atlas/ and http://math-atlas.sourceforge.net/).
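To make the variant names concrete, here is a minimal sketch of the unblocked, right-looking LU factorization with partial pivoting that the "Right-looking" code implements. It is written in plain C rather than Fortran, it assumes column-major storage, and the function name lu_right_looking is ours and is not taken from any of the benchmark codes; the comments indicate which BLAS operation (IDAMAX, DSWAP, DSCAL, DGER) each step would map to in a DGETF2-style implementation.

#include <math.h>
#include <stdio.h>

/* Unblocked, right-looking LU factorization with partial pivoting of a
 * column-major n-by-n matrix A with leading dimension lda.  On exit A
 * holds L (unit lower triangular, below the diagonal) and U (upper
 * triangular); ipiv records the row interchanges.  Returns 0 on
 * success, or k+1 if the k-th pivot is exactly zero.                  */
int lu_right_looking(int n, double *A, int lda, int *ipiv)
{
    for (int k = 0; k < n; k++) {
        /* IDAMAX: find the row p >= k with the largest |A(p,k)|. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i + k * lda]) > fabs(A[p + k * lda])) p = i;
        ipiv[k] = p;
        if (A[p + k * lda] == 0.0) return k + 1;

        /* DSWAP: interchange rows k and p of the whole matrix. */
        if (p != k)
            for (int j = 0; j < n; j++) {
                double t = A[k + j * lda];
                A[k + j * lda] = A[p + j * lda];
                A[p + j * lda] = t;
            }

        /* DSCAL: scale the pivot column to form the multipliers. */
        for (int i = k + 1; i < n; i++)
            A[i + k * lda] /= A[k + k * lda];

        /* DGER: rank-1 update of the trailing submatrix,
         * A(k+1:n,k+1:n) -= A(k+1:n,k) * A(k,k+1:n).                */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
    }
    return 0;
}

int main(void)
{
    /* Tiny usage example: factor a 3-by-3 matrix stored by columns. */
    double A[9] = { 2, 4, 8,  1, 3, 7,  1, 3, 9 };
    int ipiv[3];
    int info = lu_right_looking(3, A, 3, ipiv);
    printf("info = %d, pivots = %d %d %d\n", info, ipiv[0], ipiv[1], ipiv[2]);
    return 0;
}

The "Recursive LU" code follows the same right-looking idea, but splits the matrix into halves so that the trailing update is carried out by DTRSM and DGEMM on large blocks instead of by a rank-1 DGER update, which is why it is generally the fastest of the codes in the tables that follow.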
Performance Results for Pentium III 550MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     41.4    10^-12    10.5
  Linpack BLAS       55.4    10^-12    10.5
Level 2 BLAS
  Right-looking      51.4    10^-13     2.3
  Left-looking       74.7    10^-13     1.4
  Crout              84.0    10^-13     1.4
Level 3 BLAS
  Lapack            278.6    10^-13     1.3
  Recursive LU      324.6    10^-13     1.3

System parameters

Clock rate [MHz]   550
Bus speed [MHz]    100
L1 cache [KB]      16+16
L2 cache [KB]      512

Performance Results for Pentium III 933MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     49.3    10^-12    10.5
  Linpack BLAS       53.6    10^-12    10.5
Level 2 BLAS
  Right-looking      86.1    10^-13     2.9
  Left-looking      144.7    10^-13     1.8
  Crout             115.7    10^-13     2.0
Level 3 BLAS
  Lapack            410.2    10^-13     1.5
  Recursive LU      506.6    10^-13     1.2

System parameters

Clock rate [MHz]   933
Bus speed [MHz]    133
L1 cache [KB]      16+16
L2 cache [KB]      256

Performance Results for Intel P4 1700MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source    182.2    10^-12    10.5
  Linpack BLAS      238.8    10^-12    10.5
Level 2 BLAS
  Right-looking     199.6    10^-13     2.2
  Left-looking      290.7    10^-13     1.7
  Crout             312.5    10^-13     2.8
Level 3 BLAS
  Lapack           1262.0    10^-13     8.4
  Recursive LU     1393.0    10^-13     7.2

System parameters

Clock rate [MHz]   1700
Bus speed [MHz]    400
L1 cache [KB]      12(I)+8(D)
L2 cache [KB]      256

Performance Results for Intel/HP Itanium 800MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     39.3    10^-12     9.4
  Linpack BLAS       36.9    10^-12     9.4
Level 2 BLAS
  Right-looking      40.5    10^-13     6.3
  Left-looking      108.5    10^-12    10.0
  Crout             123.7    10^-13    10.2
Level 3 BLAS
  Lapack            624.2    10^-13     8.5
  Recursive LU      801.8    10^-13     6.5
System parameters

Clock rate [MHz]   800
Bus speed [MHz]    100
L1 cache [KB]      16+16
L2 cache [KB]      96
L3 cache [MB]      2

Performance Results for AMD Athlon 1200MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code                    MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack 1000 (source)    63.0    10^-12    10.5
  Linpack 1000 (BLAS)      91.0    10^-12    10.5
Level 2 BLAS
  Right-looking           124.3    10^-13     6.6
  Left-looking            173.7    10^-13     7.4
  Crout                   137.0    10^-13     8.8
Level 3 BLAS
  Lapack                  835.8    10^-13     1.6
  Recursive LU            998.3    10^-13     1.1

System parameters

Clock rate [MHz]   1200
Bus speed [MHz]    200
L1 cache [KB]      64+64
L2 cache [KB]      256

Performance Results for IBM Power3 375MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source    169.3    10^-12     4.7
  Linpack BLAS      165.9    10^-12     4.7
Level 2 BLAS
  Right-looking     174.6    10^-13     4.0
  Left-looking      298.5    10^-12     4.7
  Crout             290.7    10^-12     4.7
Level 3 BLAS
  Lapack            941.8    10^-13     4.0
  Recursive LU     1078.0    10^-13     4.5

System parameters

Clock rate [MHz]   375
Bus speed [MHz]    100
L1 cache [KB]      32(I)+64(D)
L2 cache [MB]      8

Performance Results for IBM Power4 1300MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source    301.2    10^-12     4.7
  Linpack BLAS      294.6    10^-12     4.7
Level 2 BLAS
  Right-looking     288.2    10^-12     5.8
  Left-looking      625.0    10^-12     6.4
  Crout             566.7    10^-12     4.8
Level 3 BLAS
  Lapack           2157.0    10^-12     5.8
  Recursive LU     2388.0    10^-13     6.8

System parameters

Clock rate [MHz]   1300
Bus speed [MHz]    333
L1 cache [KB]      64(I)+32(D)
L2 cache [MB]      1.4
L3 cache [MB]      32

Performance Results for IBM Power4 1300MHz (ESSL BLAS)

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code                    MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack 1000 (source)   302.6    10^-12     4.7
  Linpack 1000 (BLAS)     331.0    10^-12     4.7
Level 2 BLAS
  Right-looking           339.4    10^-12     7.1
  Left-looking            825.5    10^-12     7.1
  Crout                   668.7    10^-12     7.1
Level 3 BLAS
  Lapack                 2477.0    10^-12     7.1
  Recursive LU           2786.0    10^-12     6.8

System parameters

Clock rate [MHz]   1300
Bus speed [MHz]    333
L1 cache [KB]      64(I)+32(D)
L2 cache [MB]      1.4
L3 cache [MB]      32

Performance Results for SGI Octane R12000 IP30 270MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     60.8    10^-13     6.4
  Linpack BLAS       31.8    10^-13     6.4
Level 2 BLAS
  Right-looking      38.1    10^-13     7.3
  Left-looking       93.0    10^-13     8.3
  Crout              88.5    10^-13     7.9
Level 3 BLAS
  Lapack            336.0    10^-13     7.2
  Recursive LU      400.4    10^-13     6.9

System parameters

Clock rate [MHz]   270
Bus speed [MHz]    100
L1 cache [KB]      32+32
L2 cache [MB]      2

Performance Results for Compaq/DEC Alpha 21264 EV67 500MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source    187.5    10^-13     6.5
  Linpack BLAS      167.9    10^-13     6.5
Level 2 BLAS
  Right-looking     159.2    10^-13     7.2
  Left-looking      188.4    10^-13     8.0
  Crout             214.6    10^-13     8.0
Level 3 BLAS
  Lapack            565.1    10^-13     8.5
  Recursive LU      636.9    10^-13     7.7

System parameters

Clock rate [MHz]   500
Bus speed [MHz]    333
L1 cache [KB]      64+64
L2 cache [MB]      4

Performance Results for Compaq/DEC Alpha 21164 533MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     34.0    10^-13     6.5
  Linpack BLAS       48.1    10^-13     6.5
Level 2 BLAS
  Right-looking      45.9    10^-13     5.8
  Left-looking       76.7    10^-13     8.1
  Crout              74.6    10^-13     7.3
Level 3 BLAS
  Lapack            425.6    10^-13     8.1
  Recursive LU      500.9    10^-13     6.5

System parameters

Clock rate [MHz]   533
Bus speed [MHz]    88
L1 cache [KB]      8+8
L2 cache [KB]      96

Performance Results for Sun UltraSparc III 750MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source    171.5    10^-13     6.5
  Linpack BLAS      123.8    10^-13     6.5
Level 2 BLAS
  Right-looking     150.9    10^-12     9.9
  Left-looking      219.2    10^-12     9.9
  Crout             205.7    10^-13     7.6
Level 3 BLAS
  Lapack            675.4    10^-12     9.9
  Recursive LU      734.8    10^-12     9.6

System parameters

Clock rate [MHz]   750
Bus speed [MHz]    150
L1 cache [KB]      64(I)+32(D)
L2 cache [MB]      1

Performance Results for Sun UltraSparc II 300MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     42.2    10^-13     6.5
  Linpack BLAS       37.5    10^-13     6.5
Level 2 BLAS
  Right-looking      37.1    10^-13     8.9
  Left-looking       66.8    10^-13     7.9
  Crout              65.9    10^-13     8.0
Level 3 BLAS
  Lapack            285.8    10^-13     8.8
  Recursive LU      284.5    10^-12     9.0

System parameters

Clock rate [MHz]   300
Bus speed [MHz]    100
L1 cache [KB]      16+16
L2 cache [MB]      2

Performance Results for Sun UltraSparc II 250MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     21.4    10^-13     6.5
  Linpack BLAS       18.2    10^-13     6.5
Level 2 BLAS
  Right-looking      18.4    10^-13     7.5
  Left-looking       28.0    10^-12    10.2
  Crout              28.8    10^-12    10.1
Level 3 BLAS
  Lapack            109.3    10^-13     7.5
  Recursive LU      110.3    10^-12     1.1

System parameters

Clock rate [MHz]   250
Bus speed [MHz]    83
L1 cache [KB]      16+16
L2 cache [MB]      1

Performance Results for PowerPC G4 533MHz

[Graphs: DDOT and DAXPY performance (MFLOP/s) vs. vector dimension; DGEMV and DGEMM performance (MFLOP/s) vs. matrix dimension.]

Linpack benchmark numbers

Code              MFLOP/s   ||Ax-b||   ||Ax-b|| / (||A|| ||x|| n)
Level 1 BLAS
  Linpack source     52.6    10^-12     9.4
  Linpack BLAS       76.2    10^-12     9.4
Level 2 BLAS
  Right-looking      85.2    10^-13     8.1
  Left-looking      125.2    10^-13     7.4
  Crout             120.6    10^-13     8.2
Level 3 BLAS
  Lapack            412.8    10^-13     8.2
  Recursive LU      477.6    10^-12     7.2

System parameters

Clock rate [MHz]   533
Bus speed [MHz]    100
L1 cache [KB]      32+32
L2 cache [MB]      2
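The tables above report, besides the execution rate, the residual ||Ax-b|| and the normalized residual ||Ax-b|| / (||A|| ||x|| n) of the computed solution. As a purely illustrative sketch (again in C; the choice of the infinity norm and the absence of a machine-precision factor are our assumptions, not necessarily what the benchmark drivers use), these quantities can be computed as follows:

#include <math.h>

/* Illustrative computation of the residual quantities reported in the
 * tables: given a column-major n-by-n matrix A, a computed solution x,
 * and a right-hand side b, return ||Ax-b|| and ||Ax-b||/(||A|| ||x|| n),
 * using infinity norms throughout.                                      */
void residual_check(int n, const double *A, const double *x,
                    const double *b, double *res, double *normalized)
{
    double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;
    for (int i = 0; i < n; i++) {
        double ri = -b[i];      /* i-th component of Ax - b     */
        double rowsum = 0.0;    /* i-th row sum, for ||A||_inf  */
        for (int j = 0; j < n; j++) {
            ri += A[i + j * n] * x[j];
            rowsum += fabs(A[i + j * n]);
        }
        if (fabs(ri) > rnorm) rnorm = fabs(ri);
        if (rowsum > anorm) anorm = rowsum;
        if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
    }
    *res = rnorm;
    *normalized = rnorm / (anorm * xnorm * (double)n);
}

The routine can be called with the original matrix and right-hand side, saved before the factorization overwrites them, and with the solution produced by any of the codes above.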