The LINPACK Benchmark: Past, Present, and Future

Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet
December 2001
1 LINPACK Benchmark History

1.1 Introduction
In this paper we will clear up some of the confusion and mystery surrounding the LINPACK Benchmark [57] and some of its variations. We will cover the "original" LINPACK 100 benchmark, the TOP500 [59], and the HPL [64] program that can be used in the TOP500 measurement. We will examine what is measured and describe how to interpret the results of the programs' execution. But first, a bit of history.
The original LINPACK Benchmark is, in some sense, an accident. It was originally designed to assist users of the LINPACK package [25] by providing information on execution times required to solve a system of linear equations. The first "LINPACK Benchmark" report appeared as an appendix in the LINPACK Users' Guide [25] in 1979. The appendix comprised data for one commonly used path in the LINPACK software package. Results were provided for a matrix problem of size 100, on a collection of widely used computers (23 computers in all). This was done so users could estimate the time required to solve their matrix problem by extrapolation.
Over the years additional performance data was added, more as a hobby than anything else, and today the collection includes over 1300 different computer systems. In addition to the number of computers increasing, the scope of the benchmark has also expanded. The benchmark report describes the performance for solving a general dense matrix problem Ax = b at three levels of problem size and optimization opportunity: 100 by 100 problem (inner loop optimization), 1000 by 1000 problem (three-loop optimization - the whole program), and a scalable parallel problem.
1.2 The LINPACK Package
The LINPACK package is a collection of Fortran subroutines for solving various systems of linear equations. The software in LINPACK is based on a decompositional approach to numerical linear algebra. The general idea is the following. Given a problem involving a matrix, A, one factors or decomposes A into a product of simple, well-structured matrices which can be easily manipulated to solve the original problem. The package has the capability of handling many different matrix types and different data types, and provides a range of options.
This research was supported in part by the Applied Mathematical Sciences Research Program of the Office of Mathematical, Information, and Computational Sciences, U.S. Department of Energy under contract DE-AC05-00OR22725 with UT-Battelle, LLC and the SciDAC PERC DE-FC02-01ER25490 and NSF Alliance Subcontract 790, Prime-ACI-961-019-07.
University of Tennessee, Department of Computer Science, Knoxville, TN 37996-3450, U.S.A., Phone: (+865) 974-8295, Fax: (+865) 974-8296, E-mail: {dongarra,luszczek}@cs.utk.edu
Sun Microsystems, Inc., Paris, France
The package itself is based on another package, called the Level 1 Basic Linear Algebra Subroutines (BLAS) [53]. The Level 1 BLAS address simple vector operations, such as adding a multiple of a vector to another vector (SAXPY) or forming an inner product (SDOT). Most of the floating-point work within the LINPACK algorithms is carried out by the BLAS, which makes it possible to take advantage of special computer hardware without having to modify the underlying algorithm. This approach thus achieves transportability and clarity of software without sacrificing reliability.
1.3 Selection of the Algorithm
Solving a system of equations requires O(n^3) floating-point operations, more specifically, 2/3 n^3 + 2 n^2 + O(n) floating-point additions and multiplications. Thus, the time required to solve such problems on a given machine can be approximated by

    time_n = time_100 × n^3 / 100^3 .

In the LINPACK Benchmark, a matrix of size 100 was originally used because of memory limitations with the computers that were in use in 1979; the matrix has O(100^2) floating-point elements and could be accommodated in most environments of that time. At the time it represented a "large enough" problem. (We will always assume 64-bit floating point arithmetic.)
The algorithm used in the timings is based on LU decomposition with partial pivoting. The matrix type is real, general, and dense, with matrix elements randomly distributed between -1 and 1. (The random number generator used in the benchmark is not sophisticated; rather its major attribute is its compactness.)
2 The LINPACK Benchmark

2.1 Operations
The LINPACK benchmark features two routines: DGEFA and DGESL (these are the double precision versions, usually 64-bit floating point arithmetic; SGEFA and SGESL are the single precision counterparts, usually 32-bit floating point arithmetic). DGEFA performs the decomposition with partial pivoting, and DGESL uses that decomposition to solve the given system of linear equations. Most of the execution time - O(n^3) floating-point operations - is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this requires O(n^2) floating-point operations.
DGEFA and DGESL in turn call three BLAS routines: DAXPY, IDAMAX, and DSCAL. By far the major portion of time - over 90% at order 100 - is spent in DAXPY. DAXPY is used to multiply a scalar, α, times a vector, x, and add the result to another vector, y. It is called approximately n^2/2 times by DGEFA and 2n times by DGESL with vectors of varying length. The statement y_i <- y_i + α x_i, which forms an element of the DAXPY operation, is executed approximately n^3/3 + n^2 times, which gives rise to roughly 2/3 n^3 floating-point operations in the solution. Thus, the n = 100 benchmark requires roughly 2/3 million floating-point operations.
The statement y_i <- y_i + α x_i, besides the floating-point addition and floating-point multiplication, involves a few one-dimensional index operations and storage references. While the LINPACK routines DGEFA and DGESL involve two-dimensional array references, the BLAS refer to one-dimensional arrays. The LINPACK routines in general have been organized to access two-dimensional arrays by column. In DGEFA, the call to DAXPY passes an address into the two-dimensional array A, which is then treated as a one-dimensional reference within DAXPY. Since the indexing is down a column of the two-dimensional array, the references to the one-dimensional array are sequential with unit stride. This is a performance enhancement over, say, addressing across the column of a two-dimensional array. Since Fortran dictates that two-dimensional arrays be stored by column in memory, accesses to consecutive elements of a column lead to simple index calculations. References to consecutive elements differ by one word instead of by the leading dimension of the two-dimensional array.
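To make the column orientation concrete, the following C sketch (an illustration only, not the original Fortran; it assumes column-major storage with leading dimension lda) mirrors the structure of DGEFA: for each column, an IDAMAX-style pivot search, a DSCAL-style scaling of the multipliers, and unit-stride DAXPY-style updates of the remaining columns.

#include <math.h>
#include <stddef.h>

/* Column-major indexing: element (i,j) of an lda-by-n array. */
#define A(i, j) a[(size_t)(j) * lda + (i)]

/* Unblocked, column-oriented LU with partial pivoting -- a sketch of the
   structure of LINPACK's DGEFA, not a drop-in replacement for it.       */
static int lu_sketch(int n, double *a, int lda, int *ipvt)
{
    for (int k = 0; k < n - 1; ++k) {
        /* IDAMAX: index of the largest |a(i,k)| on or below the diagonal. */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabs(A(i, k)) > fabs(A(p, k))) p = i;
        ipvt[k] = p;
        if (A(p, k) == 0.0) return k + 1;          /* singular pivot */

        /* Swap the pivot row within the remaining columns. */
        if (p != k)
            for (int j = k; j < n; ++j) {
                double t = A(k, j); A(k, j) = A(p, j); A(p, j) = t;
            }

        /* DSCAL: form the multipliers in column k (unit stride). */
        double pivinv = 1.0 / A(k, k);
        for (int i = k + 1; i < n; ++i) A(i, k) *= pivinv;

        /* DAXPY: update each remaining column, again with unit stride. */
        for (int j = k + 1; j < n; ++j) {
            double alpha = -A(k, j);
            for (int i = k + 1; i < n; ++i)
                A(i, j) += alpha * A(i, k);
        }
    }
    return 0;
}
#undef A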
2.2 Detailed Operation Counts
The results reflect only one problem area: solving dense systems of equations using the LINPACK programs in a Fortran environment. Since most of the time is spent in DAXPY, the benchmark is really measuring the performance of DAXPY. The average vector length for the algorithm used to compute the LU decomposition with partial pivoting is 2n/3. Thus in the benchmark with n = 100, the average vector length is 66.
In order to solve this matrix problem it is necessary to perform almost 700,000 floating point operations. The time required to solve the problem is divided into this number to determine the megaflops rate.
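Expressed as code, this amounts to dividing the tallied operation count by the measured time; a minimal C sketch, where elapsed_seconds is a placeholder for whatever timer the benchmark driver uses, and 697509 is the double precision operation count tallied in Table 3 below:

/* Mflop/s for the LINPACK 100 case: operation count divided by time. */
double linpack100_mflops(double elapsed_seconds)
{
    const double ops = 697509.0;    /* roughly 2/3*n^3 + 2*n^2 for n = 100 */
    return ops / elapsed_seconds / 1.0e6;
}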
The routine DGEFA calls IDAMAX, DSCAL, and DAXPY. Routine IDAMAX, which computes the index of the vector element with largest modulus, is called 99 times, with vector lengths running from 2 to 100. Each call to IDAMAX gives rise to n double precision absolute value computations and n - 1 double precision comparisons. The total number of operations is 5364 double precision absolute values and 4950 double precision comparisons.
DSCAL is called 99 times, with vector lengths running from 1 to 99. Each call to DSCAL performs n double precision multiplies, for a total of 4950 multiplies.
DAXPY does the bulk of the work. It is called n times, where n varies from 1 to 99. Each call to DAXPY gives rise to one double precision comparison with zero, n double precision additions, and n double precision multiplications. This leads to 4950 comparisons against 0, 328350 additions, and 328350 multiplications. In addition, DGEFA itself does 99 double precision comparisons against 0 and 99 double precision reciprocals. The total operation count for DGEFA is given in Table 1.
Operation type    Operation count
add                        328350
multiply                   333300
reciprocal                     99
absolute value               5364
compare                      4950
≠ 0                          5247

Table 1: Double precision operation counts for LINPACK 100's DGEFA routine.
Routine DGESL, which is used to solve a system of equations based on the factorization from DGEFA, does a much more modest amount of floating point operations. In DGESL, DAXPY is called in two places, once with vector lengths running from 1 to 99 and once with vector lengths running from 0 to 99. This leads to a total of 9900 double precision additions, the same number of double precision multiplications, and 199 compares against 0. DGESL does 100 divisions and 100 negations as well. The total operation count for DGESL is given in Table 2.
This leads to the total operation count for the LINPACK benchmark given in Table 3, a grand total of 697509 floating point operations. (The LINPACK benchmark nominally counts approximately 2/3 n^3 + 2 n^2 operations, which for n = 100 has the value 686667.)
Operation type    Operation count
add                          9900
multiply                     9900
divide                        100
negate                        100
≠ 0                           199

Table 2: Double precision operation counts for LINPACK 100's DGESL routine.
Operation type    Operation count
add                        338250
multiply                   343200
reciprocal                     99
divide                        100
negate                        100
absolute value               5364
compare                      4950
≠ 0                          5446
Total                      697509

Table 3: Double precision operation counts for the LINPACK 100 benchmark.
It is instructive to look just at the contribution due to DAXPY. Of these floating point operations, the calls to DAXPY from DGEFA and DGESL account for a total of 338160 adds, 338160 multiplies, and 5147 comparisons with zero. This gives a total of 681467 operations, or over 97% of all the floating point operations that are executed.
The total time is taken up with more than arithmetic operations. In particular, there is quite a lot of time spent loading and storing the operands of the floating point operations. We can estimate the number of loads and stores by assuming that all operands must be loaded into registers, but also assuming that the compiler will do a reasonable job of promoting loop-invariant quantities out of loops, so that they need not be loaded each time within the loop. Then DAXPY accounts for 681468 double precision loads and 338160 double precision stores. IDAMAX accounts for 4950 loads, DSCAL for 5049 loads and 4950 stores, DGEFA outside of the loads and stores in the BLAS does 9990 loads and 9694 stores, and DGESL accounts for 492 loads and 193 stores. Thus, the total number of loads is 701949 and the total number of stores is 352997. Here again DAXPY dominates the statistics. The other overhead that must be accounted for is the load indexing, address arithmetic, and the overhead of argument passing and calls.
2.3 Precision

In discussions of scientific computing, one normally assumes that floating-point computations will be carried out to full precision, or 64-bit floating point arithmetic. Note that this is not an issue of single or double precision, as some systems have 64-bit floating point arithmetic as single precision. It is a function of the arithmetic used.
2.4 Loop Unrolling
It is frequently observed that the bulk of the central processor time for a program is localized in 3% or less of the source code [62]. Often the critical code (from a timing perspective) consists of one or a few short inner loops typified, for instance, by the scalar product of two vectors. On scalar computers a simple technique for optimizing such loops should then be most welcome. "Loop unrolling" (a generalization of "loop doubling") applied selectively to time-consuming loops is just such a technique [31, 52].
When a loop is unrolled, its contents are replicated one or more times, with appropriate adjustments to array indices and loop increments. Consider, for instance, the DAXPY sequence, which adds a multiple of one vector to a second vector:

      DO 10 i = 1, n
         y(i) = y(i) + alpha*x(i)
   10 CONTINUE
Unrolled to a depth of four, it would assume the following form:

      m = n - MOD(n, 4)
      DO 10 i = 1, m, 4
         y(i)   = y(i)   + alpha*x(i)
         y(i+1) = y(i+1) + alpha*x(i+1)
         y(i+2) = y(i+2) + alpha*x(i+2)
         y(i+3) = y(i+3) + alpha*x(i+3)
   10 CONTINUE
In this recoding, four terms are computed per loop iteration, with the loop increment modified to count by fours. Additional code has to be added to process the MOD(n, 4) elements remaining upon completion of the unrolled loop, should the vector length not be a multiple of the loop increment. The choice of four was for illustration, with the generalization to other orders obvious from the example. Actual choice of unrolling depth in a given instance would be guided by the contribution of the loop to total program execution time and by consideration of architectural constraints.
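Rendered in C purely for illustration (the benchmark itself is Fortran), a depth-four unrolled DAXPY together with the cleanup loop for the leftover n mod 4 elements could look like this:

/* DAXPY unrolled to depth four, with a cleanup loop for the remaining
   n % 4 elements -- a C illustration of the Fortran fragment above.   */
void daxpy_unrolled4(int n, double alpha, const double *x, double *y)
{
    int m = n - (n % 4);
    for (int i = 0; i < m; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (int i = m; i < n; ++i)       /* remainder elements */
        y[i] += alpha * x[i];
}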
Why does unrolling enhance loop performance? First, there is the direct reduction in loop overhead (the increment, test, and branch function) which, for short loops, may actually dominate execution time per iteration. Unrolling simply divides the overhead by a factor equal to the unrolling depth, although the additional code required to handle "leftovers" will reduce this gain somewhat. Clearly, savings should increase with increasing unrolling depth, but the marginal savings fall off rapidly after a few terms. The reduction in overhead is the primary source of improvement on "simple" computers.
Second, for advanced architectures employing segmented functional units, the greater density of
non-overhead operations permits higher levels of concurrency within a particular segmented unit.
Thus, in the DAXPY example, unrolling would allow more than one multiplication to be concurrently
active on a segmented machine such as the IBM Power processor. Optimal unrolling depth on such
machines might well be related to the degree of functional unit segmentation.
Third, and related to the above, unrolling often increases concurrency between independent functional units on computers so equipped, or on ones with fused multiply-add instructions. Thus, in our DAXPY example, an IBM Power processor, which has independent multiplier and adder units, could obtain concurrency between the addition for one element and the multiplication for the following element, besides the segmentation concurrency obtainable within each unit.
However, on machines with vector instructions, whose compilers try to detect vector operations, the unrolling technique had the opposite effect. The unrolling inhibited the detection of vector operations in the loop, the resulting vector code might have become scalar, and the performance degraded.
2.5 Performance
The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for benchmark suites should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.
Performance is often measured in terms of Megaflops, millions of floating point operations per second (Mflop/s). We usually include both additions and multiplications in the count of Mflop/s, and the reference to an operation is assumed to be on 64-bit operands.
The manufacturer usually refers to peak performance when describing a system. This peak performance is arrived at by counting the number of floating-point additions and multiplications that can be completed in a period of time, usually the cycle time of the machine. As an example, consider a Pentium III with a clock rate of 750 MHz. During a cycle the result of either the adder or the multiplier can be completed, so

    1 operation per cycle × 750 MHz = 750 Mflop/s.

Table 4 shows the peak performance for a number of high-performance computers.
Machine              Clock rate    Peak Performance
                       [MHz]           [Mflop/s]
Intel Pentium III        750                750
Intel Pentium 4        1,700              1,700
Intel Itanium            800              3,200
AMD Athlon             1,200              2,400
Compaq Alpha             500              1,000
IBM RS/6000              450              1,800
NEC SX-5                 250              8,000
Cray SV-1                300              1,200

Table 4: Theoretical peak performance of various CPUs.
By peak theoretical performance we mean only that the manufacturer guarantees that programs will not exceed these rates, sort of a speed of light for a given computer. At one time, a programmer had to go out of his way to code a matrix routine that would not run at nearly top efficiency on any system with an optimizing compiler. Owing to the proliferation of exotic computer architectures, this situation is no longer true.
Machine              Peak Performance    LINPACK 100 Performance    System Efficiency
                        [Mflop/s]              [Mflop/s]                  [%]
Intel Pentium III          750                    138                    18.4
Intel Pentium 4          1,700                    313                    18.4
Intel Itanium            3,200                    600                    18.5
AMD Athlon               2,400                    557                    23.3
Compaq Alpha             1,000                    440                    44.0
IBM RS/6000              1,800                    503                    27.9
NEC SX-5                 8,000                    856                    10.7
Cray SV-1                1,200                    549                    45.7

Table 5: LINPACK benchmark solving a 100 by 100 matrix problem.
The LINPACK Benchmark illustrates this point quite well. In practice, as Table 5 shows, there may be a significant difference between peak theoretical and actual performance [24].
If we examine the algorithm used in LINPACK and look at how the data are referenced, we see that at each step of the factorization process there are vector operations that modify a full submatrix of data. This update causes a block of data to be read, updated, and written back to central memory. The number of floating-point operations is 2/3 n^3, and the number of data references, both loads and stores, is also 2/3 n^3. Thus, for every add/multiply pair we must perform a load and store of the elements, unfortunately obtaining no reuse of data. Even though the operations are fully vectorized, there is a significant bottleneck in data movement, resulting in poor performance. On vector computers this translates into two vector operations and three vector-memory references, usually limiting the performance to well below peak rates. On super-scalar computers this results in a large amount of data movement and updates. To achieve high performance rates, this operation-to-memory-reference ratio must be higher.
In some sense this is a problem with doing simple vector operations on a vector or super-scalar machine. The bottleneck is in moving data, and the rate of execution is limited by this data movement. We can see this by examining the rate of data transfers and the peak performance.
To understand why the performance is so poor, consider the basic operation performed, a DAXPY. There is one parameter, R∞, that reflects the hardware performance of the idealized generic computer and gives a first-order description of any real computer. This characteristic parameter is defined as follows:

R∞ - the maximum or asymptotic performance - the maximum rate of computation in units of equivalent scalar operations performed per second (Mflop/s) [48].
The information on DAXPY and DDOT presented in Tables 6 and 7 was generated by running the following loops as in-line code and measuring the time to perform the operations:

DAXPY:
      DO 10 i = 1, n
         y(i) = y(i) + alpha * x(i)
   10 CONTINUE

DDOT:
      DO 10 i = 1, n
         s = s + x(i) * y(i)
   10 CONTINUE
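As a rough illustration of how such an in-line measurement might be coded today, the following self-contained C program times a DAXPY loop with the standard-library CPU clock and reports an approximate Mflop/s rate; the vector length, repetition count, and timer choice are arbitrary assumptions, so the numbers it prints are only indicative.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Crude measurement of the asymptotic DAXPY rate: time many repetitions
   of y = y + alpha*x and report Mflop/s (2 flops per element).          */
int main(void)
{
    const int n = 1000000, reps = 100;    /* large enough for a measurable interval */
    double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    double alpha = 2.5;
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        for (int i = 0; i < n; ++i)
            y[i] += alpha * x[i];
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("check: %g\n", y[n / 2]);      /* keep the result live */
    printf("DAXPY ~ %.1f Mflop/s\n", 2.0 * n * reps / seconds / 1e6);
    free(x); free(y);
    return 0;
}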
The Level 1 BLAS operate only on vectors. The algorithms as implemented tend to do more data movement than is necessary. As a result, the performance of the routines in LINPACK suffers on high-performance computers where data movement is as costly as floating-point operations.
                       DAXPY R∞     Peak
Machine                [Mflop/s]    [Mflop/s]
Intel Pentium III          56          750
Intel Pentium 4           178        1,700
Intel Itanium              28        3,200
AMD Athlon                 66        2,400
Compaq Alpha               92        1,000
IBM RS/6000               100        1,800

Table 6: DAXPY's asymptotic and peak performance.
                       DDOT R∞      Peak
Machine                [Mflop/s]    [Mflop/s]
Intel Pentium III          82          750
Intel Pentium 4           277.5      1,700
Intel Itanium              33        3,200
AMD Athlon                 80        2,400
Compaq Alpha              150        1,000
IBM RS/6000               143        1,800

Table 7: DDOT's asymptotic and peak performance.
2.6 Restructuring Algorithms
Today's computer architectures usually have multiple stages in the memory hierarchy. By restructuring algorithms to exploit this hierarchical organization, one can gain high performance.
A hierarchical memory structure involves a sequence of computer memories, or caches, ranging from a small, but very fast memory at the bottom to a large, but slow memory at the top. Since a particular memory in the hierarchy (call it M) is not as big as the memory at the next level (M'), only part of the information in M' will be contained in M. If a reference is made to information that is in M, then it is retrieved as usual. However, if the information is not in M, then it must be retrieved from M', with a loss of time. To avoid repeated retrieval, information is transferred from M' to M in blocks, the supposition being that if a program references an item in a particular block, the next reference is likely to be in the same block. Programs having this property are said to have locality of reference. Typically, there is a certain startup time associated with getting the first memory reference in a block. This startup is amortized over the block move.
To come close to gaining peak performance, one must optimize the use of the lowest level of memory (i.e., retain information as long as possible before the next access to main memory), obtaining as much reuse as possible.
2.7 Matrix-Vector Operations
One approach to restructuring algorithms to exploit hierarchical memory involves expressing the algorithms in terms of matrix-vector operations. These operations have the benefit that they can reuse data and achieve a higher rate of execution than their vector counterparts. In fact, the number of floating-point operations remains the same; only the data reference pattern is changed. This change results in an operation-to-memory-reference ratio on vector computers of effectively 2 vector floating-point operations to 1 vector-memory reference.
The Level 2 BLAS were proposed in order to support the development of software that would be both portable and efficient across a wide range of machine architectures, with emphasis on vector-processing machines. Many of the frequently used algorithms of numerical linear algebra can be coded so that the bulk of the computation is performed by calls to Level 2 BLAS routines; efficiency can then be obtained by utilizing tailored implementations of the Level 2 BLAS routines. On vector-processing machines one of the aims of such implementations is to keep the vector lengths as long as possible, and in most algorithms the results are computed one vector (row or column) at a time. In addition, on vector register machines performance is increased by reusing the results held in a vector register, rather than storing the vector back into memory.
Unfortunately, this approach to software construction is often not well suited to computers with a hierarchy of memory (such as global memory, cache or local memory, and vector registers) and true parallel-processing computers. For those architectures it is often preferable to partition the matrix or matrices into blocks and to perform the computation by matrix-matrix operations on the blocks. By organizing the computation in this fashion we provide for full reuse of data while the block is held in the cache or local memory. This approach avoids excessive movement of data to and from memory and gives a surface-to-volume effect for the ratio of operations to data movement. In addition, on architectures that provide for parallel processing, parallelism can be exploited in two ways: (1) operations on distinct blocks may be performed in parallel; and (2) within the operations on each block, scalar or vector operations may be performed in parallel.
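A minimal C sketch of this blocking idea, applied to the update C = C + A*B for square column-major matrices with an illustrative block size nb (all assumptions made here for brevity; a tuned Level 3 BLAS does far more), is shown below. Each pass over an nb-by-nb tile reuses the data brought into cache before moving on.

#include <stddef.h>

#define IDX(i, j, ld) ((size_t)(j) * (ld) + (i))

/* Blocked C = C + A*B for n-by-n column-major matrices.  Working on
   nb-by-nb tiles keeps operands in fast memory while they are reused. */
void gemm_blocked(int n, int nb, const double *a, const double *b, double *c)
{
    for (int jj = 0; jj < n; jj += nb)
        for (int kk = 0; kk < n; kk += nb)
            for (int ii = 0; ii < n; ii += nb)
                for (int j = jj; j < jj + nb && j < n; ++j)
                    for (int k = kk; k < kk + nb && k < n; ++k) {
                        double bkj = b[IDX(k, j, n)];
                        for (int i = ii; i < ii + nb && i < n; ++i)
                            c[IDX(i, j, n)] += a[IDX(i, k, n)] * bkj;
                    }
}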
2.8 Matrix-Matrix Operations
A set of Level 3 BLAS has been proposed, targeted at matrix-matrix operations [26]. If the vectors and matrices involved are of order n, then the original BLAS include operations that are of order O(n), the extended or Level 2 BLAS provide operations of order O(n^2), and the current proposal provides operations of order O(n^3); hence the use of the term Level 3 BLAS. Such implementations can, we believe, be portable across a wide variety of vector and parallel computers and also efficient (assuming that efficient implementations of the Level 3 BLAS are available). The question of portability has been much less studied, but we hope that, by having a standard set of building blocks, research into this area will be encouraged.
In the case of matrix factorization, one must perform matrix-matrix operations rather than matrix-vector operations [28, 30, 33]. There is a long history of block algorithms for such matrix problems. Both the NAG and the IMSL libraries, for example, include such algorithms (F01BTF and F01BXF in NAG; LEQIF and LEQOF in IMSL). Many of the early algorithms utilized a small main memory, with tape or disk as secondary storage [7, 17, 18, 23, 36, 58]. Similar techniques were later used to exploit common page-swapping algorithms in virtual-memory machines. Indeed, such techniques are applicable wherever there exists a hierarchy of data storage (in terms of access speed). Additionally, full blocks (and hence the multiplication of full matrices) might appear as a subproblem when handling large sparse systems of equations [23, 29, 37, 42].
More recently, several researchers have demonstrated the effectiveness of block algorithms on a variety of modern computer architectures with vector-processing or parallel-processing capabilities, on which potentially high performance can easily be degraded by excessive transfer of data between different levels of memory (vector registers, cache, local memory, main memory, or solid-state disks) [8, 9, 16, 17, 30, 33, 50, 65, 67].
Our own efforts have been twofold: First, we are attempting to recast the algorithms from linear algebra in terms of the Level 3 BLAS (matrix-matrix operations). This involves modifying the algorithm to perform more than one step of the decomposition process at a given loop iteration. Second, to facilitate the transport of algorithms to a wide variety of architectures and to achieve high performance, we are isolating the computationally intense parts in high-level modules. When the architecture changes, we deal with the modules separately, rewriting them in terms of machine-specific operations; however, the basic algorithm remains the same. By doing so we can achieve the goal of a high operation-to-memory-reference ratio.
Recently it has been shown that matrix-matrix operations can be exploited further. LU factorization, which is usually the method of choice for the LINPACK benchmark code, can be formulated recursively [45]. The recursive formulation performs better [70] than the block algorithm [4] because of the lower memory traffic of the recursive method, which is achieved through better utilization of the Level 3 BLAS.
2.9 Concluding Remarks on the LINPACK Benchmark
Over the past several years, the LINPACK Benchmark has evolved from a simple listing for one matrix problem to an expanded benchmark describing the performance at three levels of problem size on several hundred computers. The benchmark today is used by scientists worldwide to evaluate computer performance, particularly for innovative advanced-architecture machines.
Nevertheless, a note of caution is needed. Benchmarking, whether with the LINPACK Benchmark or some other program, must not be used indiscriminately to judge the overall performance of a computer system. Performance is a complex issue, dependent on a variety of diverse quantities including the algorithm, the problem size, and the implementation. The LINPACK Benchmark provides three separate benchmarks that can be used to evaluate computer performance on a dense system of linear equations: the first for a 100 by 100 matrix, the second for a 1000 by 1000 matrix, and the third for a scalable parallel problem. The third benchmark, in particular, is dependent on the algorithm chosen by the manufacturer and the amount of memory available on the computer being benchmarked.
3 The Parallel LINPACK Benchmark
In the past several years, the emergence of Distributed Memory (DM) computers [48] and their potential for the numerical solution of Grand Challenge problems [22, 51, 60, 61] has led to extensive research in benchmarking. Examples of DM computers include the IBM Scalable POWERparallel SP-2, the Intel Paragon, the Cray T3E, and Networks and Clusters of Workstations (NoWs and CoWs). The key feature they have achieved is scalable performance. These scalable parallel computers comprise an ensemble of Processing Units (PUs), where each unit consists of a processor, local memories organized in a hierarchical manner, and other supporting devices. These PUs are interconnected by a point-to-point (direct) or switch-based (indirect) network. Without modifying the basic machine architecture, these distributed memory systems are capable of proportional increases in performance as the number of PUs, their memory capacity and bandwidth, and the network and I/O bandwidth are increased. As of today, DM computers are still being produced, and their success is apparent when considering how common they have become. Still, their limitations have been revealed and their successors have already appeared. The latter are constructed from a small number of nodes, where each node is a small DM computer featuring a virtual shared memory. These nodes are interconnected by a simple bus- or crossbar-based interconnection network. Programming these machines as well as their production is facilitated by the relative simplicity of the interconnection network. In addition, increasing the computational capabilities of the PUs appears to be an easier task than increasing the performance of the network. As opposed to large scale DM computers, where all processors are much less powerful than the whole, the collection of nodes of this hierarchical architecture is only slightly more powerful than its components. The SGI SMP Power Challenge is an existing example of such an architecture. The scalability of these machines can simultaneously take advantage of the progress made in processor and network technologies as well as the hardware and/or software mechanisms implementing the virtual shared memory. It is still unclear how these machines will be programmed. Whether these machines will in the future completely replace DM computers is also a question that is difficult to answer today. In this paper, these machines will also be considered as DM computers.
In order to fully exploit the increasing computational power of DM computers, the application software must be scalable, that is, able to take advantage of larger machine configurations to solve larger problems with the same efficiency. The LINPACK benchmark addresses scalability by the introduction of a new category of testing rules and environment. This new category is referred to as the Highly-Parallel LINPACK (HPL) NxN benchmark [10, 13, 14, 15, 24, 71, 44, 74]. It requires the solution of a system of linear equations by some method. The problem size is allowed to vary, and the best floating-point execution rate should be reported. In computing the execution rate, the number of operations should be 2n^3/3 + 2n^2, independent of the actual method used. If Gaussian elimination is chosen, partial pivoting must be used. A residual for the accuracy of the solution, given as ||Ax - b|| / (||A|| ||x||), should also be reported. The following quantities of the benchmark are reported in the TOP500 list:

  - Rmax:  the performance in Gflop/s for the largest problem run on a machine,
  - Nmax:  the size of the largest problem run on a machine,
  - N1/2:  the problem size where half the Rmax execution rate is achieved,
  - Rpeak: the theoretical peak performance in Gflop/s for the machine.
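Under these rules the reported rate is simply this fixed operation count divided by the measured wall-clock time; a one-line C sketch (the timing itself is assumed to be obtained elsewhere):

/* Gflop/s as reported for the HPL NxN benchmark: the operation count is
   fixed at 2n^3/3 + 2n^2 regardless of the method actually used.        */
double hpl_gflops(double n, double wall_seconds)
{
    return (2.0 / 3.0 * n * n * n + 2.0 * n * n) / wall_seconds / 1.0e9;
}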
The performance of the HPL NxN benchmark depends on the efficiency of the serial code executed on a single CPU as well as on the parallel algorithm that makes all the CPUs cooperate. The former may be designed with the use of the optimization techniques mentioned earlier. The essential ingredients of the latter are load balancing and network latency hiding. Assuming that all of the CPUs are the same, an even work and data load is usually achieved with a block-cyclic data layout [34, 47]. Latency hiding is most often performed with computation and communication overlapping, which is facilitated by an appropriate choice of virtual CPU topology, computational sequence of the LU factorization, and implementation of the collective communication algorithms. A large body of literature exists on efficient parallel implementation of the BLAS [1, 2, 3, 20, 21, 38, 39, 40, 43, 49, 56, 72, 63], the factorization algorithm [5, 6, 11, 19, 34, 41, 47, 66, 68, 69, 73], and its components [12, 46, 54, 55].
4 The TOP500 List
Statistics on high-performance computers are of major interest to manufacturers and potential users. They wish to know not only the number of systems installed, but also the location of the various supercomputers within the high-performance computing community and the applications for which a computer system is being used. Such statistics can facilitate the establishment of collaborations, the exchange of data and software, and provide a better understanding of the high-performance computer market.
Statistical lists of supercomputers are not new. Every year since 1986, Hans Meuer [59] has published system counts of the major vector computer manufacturers, based principally on those at the Mannheim Supercomputer Seminar. Statistics based merely on the name of the manufacturer are no longer useful, however. New statistics are required that reflect the diversification of supercomputers, the enormous performance difference between low-end and high-end models, the increasing availability of massively parallel processing (MPP) systems, and the strong increase in computing power of the high-end models of workstation suppliers (SMP).
To provide this new statistical foundation, the TOP500 list was created in 1993 to assemble and maintain a list of the 500 most powerful computer systems. The list is compiled twice a year with the help of high-performance computer experts, computational scientists, manufacturers, and the Internet community in general.
In the list, computers are ranked by their performance on the HPL NxN benchmark.
5 The HPL Code

5.1 Overview
HPL [64] is a software package that generates and solves a random dense linear system of equations on distributed-memory computers, as outlined in Fig. 1. The package uses 64-bit floating point arithmetic and portable routines for linear algebra operations and message passing. It also gives the possibility of selecting one of multiple factorization algorithms and provides timing and an estimate of the accuracy of the solution. Consequently, it can be regarded as a portable implementation of the HPL NxN benchmark. It requires implementations of MPI and of either the BLAS or the Vector Signal Image Processing Library (VSIPL).
/* Generate and partition matrix data among MPI
   computing nodes. */
/* ... */
/* All the nodes start at the same time. */
MPI_Barrier(...);
/* Start wall-clock timer. */
HPL_ptimer(...);
HPL_pdgesv(...); /* Solve system of equations. */
/* Stop wall-clock timer. */
HPL_ptimer(...);
/* Obtain the maximum wall-clock time. */
MPI_Reduce(...);
/* Gather statistics about performance rate
   (based on the maximum wall-clock time) and
   accuracy of the solution. */
/* ... */

Figure 1: Main computational steps performed by HPL to obtain the HPL NxN benchmark rating.
The HPL package is written in C and requires an implementation of either the BLAS [26, 27] or the Vector Signal Image Processing Library (VSIPL).
5.2 Algorithm
HPL solves a linear system of equations of order n:

    Ax = b,    A ∈ R^(n×n),  x, b ∈ R^n,

by first computing the LU factorization with row partial pivoting of the n by n + 1 coefficient matrix:

    P [A, b] = [[L, U], y].

Since the row pivoting (represented by the permutation matrix P) and the lower triangular factor L are applied to b as the factorization progresses, the solution x is obtained in one step by solving the upper triangular system:

    U x = y.

The lower triangular matrix L is left unpivoted and the array of pivots is not returned.
Fig. 2 shows the two-dimensional block-cyclic data distribution used by HPL. The data are distributed onto a two-dimensional grid (of dimensions P by Q) of processes according to the block-cyclic scheme to ensure good load balance as well as the scalability of the algorithm. The n by n + 1 coefficient matrix is logically partitioned into blocks (each of dimension NB by NB) that are cyclically dealt onto the P by Q process grid. This is done in both dimensions of the matrix.
    P0 P2 P0 P2
    P1 P3 P1 P3
    P0 P2 P0 P2
    P1 P3 P1 P3

Figure 2: Ownership of dense subblocks in the two-dimensional block-cyclic data distribution used by HPL. The number of processors is 4 (named P0, P1, P2, and P3); they are organized in a 2 by 2 grid (P = Q = 2). The number of subblocks is 4 in both dimensions (N/NB = 4).
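The ownership rule of Figure 2 reduces to a simple modular mapping from block indices to grid coordinates. The C sketch below assumes processes are ranked column-major within the P-by-Q grid (so that P0 and P1 form the first process column, as in the figure); it is only an illustration of the block-cyclic rule, not HPL's internal indexing code.

/* Owner of global matrix entry (i, j) under a 2-D block-cyclic layout with
   block size nb on a P-by-Q process grid.  Process (p, q) is assumed to be
   ranked column-major, i.e. rank = p + q*P, matching Figure 2.             */
int block_cyclic_owner(int i, int j, int nb, int P, int Q)
{
    int p = (i / nb) % P;   /* process row owning block row i/nb        */
    int q = (j / nb) % Q;   /* process column owning block column j/nb  */
    return p + q * P;
}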
The right-looking variant has been chosen for the main loop of the LU factorization. This means that, at each iteration of the loop, a panel of NB columns is factored, and the trailing submatrix is updated. Therefore, this computation is logically partitioned with the same block size NB that was used for the data distribution.
At a given iteration of the main loop, and because of the Cartesian property of the distribution scheme, each panel factorization occurs in one column of processes. This particular part of the computation lies on the critical path of the overall algorithm. For this operation, the user is offered a choice of three (Crout, left- and right-looking) recursive variants based on matrix-matrix multiply. The software also allows the user to choose in how many sub-panels the current panel should be divided at each recursion level. Furthermore, one can also select at run time the recursion stopping criterion in terms of the number of columns left to factor. When this threshold is reached, the sub-panel will then be factored using one of the three (Crout, left- or right-looking) factorization algorithms based on matrix-vector operations. Finally, for each panel's column, the pivot search and the associated swap and broadcast operations of the pivot row are combined into one single communication step. A binary-exchange (leave-on-all) reduction performs these three operations at once.
Once the panel factorization has been performed, the factored panel of columns is broadcast to the other process columns. There are many possible broadcast algorithms, and the software currently offers the following variants:

  - increasing ring,
  - modified increasing ring,
  - increasing two-ring,
  - modified increasing two-ring,
  - bandwidth-reducing,
  - modified bandwidth-reducing.
The modified variants relieve the next processor (the one that will participate in the factorization of the panel after the current one) from the burden of sending messages (otherwise it has to receive as well as send matrix update data). The ring variants propagate the update data in a single pipeline fashion, whereas the two-ring variants propagate data in two pipelines concurrently. The bandwidth-reducing variants divide a message to be sent into a number of pieces and scatter them across a single row of the grid of processors, so that more messages are exchanged but the total volume of communication is independent of the number of processors. This becomes particularly important when the computing nodes are relatively much faster than the interconnect.
Once the current panel has been broadcast (or during the broadcast operation), the trailing submatrix has to be updated. As mentioned before, the panel factorization lies on the critical path. This means that when the kth panel has been factored and then broadcast, the next most urgent task to complete is the factorization and broadcast of panel k + 1. This technique is often referred to as look-ahead (or send-ahead) in the literature. HPL allows the user to select various depths of look-ahead. By convention, a depth of zero corresponds to having no look-ahead, in which case the trailing submatrix is updated by the panel currently being broadcast. Look-ahead consumes some extra memory to keep all the panels of columns currently in the look-ahead pipe. A look-ahead of depth 1 or 2 is most likely to achieve the best performance gain.
The update of the trailing submatrix by the last panel in the look-ahead pipe is performed in two phases. First, the pivots must be applied to form the current row panel of U. Second, an upper triangular solve using the column panel occurs. Finally, the updated part of U needs to be broadcast to each process within a single column so that the local rank-NB update can take place. It has been decided to combine the swapping and broadcast of U at the cost of replicating the solve. The following algorithms are available for this communication operation:

  - binary-exchange,
  - bandwidth-reducing.

The first variant is a modified leave-on-all reduction operation. The second one has a communication volume complexity that solely depends on the size of U (the number of process rows only impacts the number of messages being exchanged) and, consequently, should outperform the previous variant for large problems on large machine configurations. In addition, both of the previous variants may be combined in a single run of the code.
After the factorization has ended, the backward substitution remains to be done. HPL uses a look-ahead of depth one to do this. The right-hand side is forwarded in process rows in a decreasing-ring fashion, so that we solve Q × NB entries at a time. At each step, this shrinking piece of the right-hand side is updated. The process just above the one owning the current diagonal block of the matrix updates its last NB entries of the vector x, forwards it to the previous process column, and then broadcasts it in the process column in a decreasing-ring fashion. The solution is then updated and sent to the previous process column. The solution of the linear system is left replicated in every process row.
To verify the result, the input matrix and right-hand side are regenerated. The following scaled residuals are computed (ε is the relative machine precision):

    r_n  = ||Ax - b||_∞ / (||A||_1 × n × ε)
    r_1  = ||Ax - b||_∞ / (||A||_1 × ||x||_1 × ε)
    r_∞  = ||Ax - b||_∞ / (||A||_∞ × ||x||_∞ × ε)

A solution is considered numerically correct when all of these quantities are of order O(1).
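These checks could be computed along the following lines; the C sketch below is a simplified serial version (HPL itself evaluates the norms on the distributed data in parallel) and uses DBL_EPSILON to stand in for the relative machine precision ε.

#include <float.h>
#include <math.h>
#include <stdlib.h>

/* Serial sketch of HPL-style scaled residuals for column-major A (n-by-n),
   computed solution x, and right-hand side b.                             */
void scaled_residuals(int n, const double *A, const double *x, const double *b,
                      double *r_n, double *r_1, double *r_inf)
{
    double eps = DBL_EPSILON;
    double res_inf = 0.0, normA1 = 0.0, normAinf = 0.0;
    double normx1 = 0.0, normxinf = 0.0;
    double *rowsum = calloc((size_t)n, sizeof *rowsum);

    for (int j = 0; j < n; ++j) {
        double colsum = 0.0;
        for (int i = 0; i < n; ++i) {
            double aij = A[(size_t)j * n + i];
            colsum    += fabs(aij);    /* builds the 1-norm (max column sum)   */
            rowsum[i] += fabs(aij);    /* builds the inf-norm (max row sum)    */
        }
        if (colsum > normA1) normA1 = colsum;
        normx1 += fabs(x[j]);
        if (fabs(x[j]) > normxinf) normxinf = fabs(x[j]);
    }
    for (int i = 0; i < n; ++i) {
        if (rowsum[i] > normAinf) normAinf = rowsum[i];
        double ri = -b[i];
        for (int j = 0; j < n; ++j)
            ri += A[(size_t)j * n + i] * x[j];   /* (Ax - b)_i */
        if (fabs(ri) > res_inf) res_inf = fabs(ri);
    }
    free(rowsum);

    *r_n   = res_inf / (normA1   * n        * eps);
    *r_1   = res_inf / (normA1   * normx1   * eps);
    *r_inf = res_inf / (normAinf * normxinf * eps);
}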
5.3 Scalability
HPL was designed for distributed-memory computers. They consist of processors that are connected using a message passing interconnection network. Each processor has its own memory, called the local memory, which is accessible only to that processor. As the time to access a remote memory is longer than the time to access a local one, such computers are often referred to as Non-Uniform Memory Access (NUMA) machines. The interconnection network of our machine model is static, meaning that it consists of point-to-point communication links among processors. This type of network is also referred to as a direct network, as opposed to dynamic networks. The latter are constructed from switches and communication links. These links are dynamically connected to one another by the switching elements to establish, at run time, the paths between processors' memories. The interconnection network of the two-dimensional machine model considered here is a static, fully connected physical topology. It is also assumed that processors can be treated equally in terms of local performance and that the communication rate between two processors depends only on the processors being considered. Our model assumes that a processor can send or receive data on only one of its communication ports at a time (assuming it has more than one). In the literature, this assumption is also referred to as a one-port communication model. The time spent to communicate a message between two given processors is called the communication time Tc. In our machine model, Tc is approximated by a linear function of the number L of double precision floating point numbers being communicated. Tc is the sum of the time α to prepare the message for transmission and the time β L taken by the message of length L to traverse the network to its destination:

    Tc = α + β L.

Finally, the model assumes that the communication links are bi-directional, i.e., the time for two processors to send each other a message of length L is also Tc. A processor can send and/or receive a message on only one of its communication links at a time. In particular, a processor can send a message while receiving another message from the processor it is sending to at the same time.
Since this document is only concerned with regular local dense linear algebra operations, the time taken to perform one floating point operation is assumed to be modeled by three constants: γ1, γ2, and γ3. These quantities are single-processor approximations of the floating point operation (FLOP) rates of the vector-vector, matrix-vector, and matrix-matrix operations, respectively. This approximation describes all the steps performed by a processor to complete its portion of the matrix factorization. Such a model neglects all the phenomena occurring in the processor components, such as cache misses, pipeline startups, memory loads or stores, floating point arithmetic, TLB misses, etc. These may influence the value of the constants, which may vary as, e.g., a function of the problem size.
Similarly, the model does not make any assumptions on the amount of physical memory available on a single node. It is assumed that if a process has been spawned on a processor, one has ensured that enough memory was available on that processor. In other words, swapping will not occur during the modeled computation.
Consider an M by N panel distributed over a column of P processors. Because of the recursive formulation of the panel factorization, it is reasonable to consider that the floating point operations will be performed at matrix-matrix multiply rate. For every column in the panel a binary-exchange is performed on 2N data items. When this panel is broadcast, what matters is the time that the next process column will spend in this communication operation. Assuming one chooses the modified increasing ring variant, only one message needs to be taken into account. The execution time of the panel factorization and broadcast can thus be approximated by:

    Tpfact(M, N) = γ3 (M/P - N/3) N^2 + N log P (α + 2 N β) + α + β M N / P.
The update phase of an N by N trailing submatrix distributed on a P by Q process grid is achieved with a triangular solve of N right-hand sides and a local rank-NB update of the trailing submatrix. Assuming that the long variant is chosen, the execution time of the update operation can be approximated by:

    Tupdate(N, NB) = γ3 (N NB^2 / Q + 2 N^2 NB / (P Q)) + α (log P + P - 1) + 3 β N NB / Q.

The constant 3 in front of the β term is obtained by counting one for the logarithmic spread phase and two for the rolling phase. In the case of bi-directional links this constant should be replaced by 2.
The number of floating point operations performed during the backward substitution is given by N^2 / (P Q). Because of the look-ahead, the communication cost can be approximated at each step by two messages of length NB, i.e., the time to communicate a piece of size NB of the solution vector from one diagonal block of the matrix to another. It follows that the execution time of the backward substitution can be approximated by:

    Tbacks(N, NB) = γ2 N^2 / (P Q) + N (α / NB + 2 β).
The total execution time of the algorithm described above is given by:

    Σ_{k=0}^{N/NB} [ Tpfact(N - k NB, NB) + Tupdate(N - (k - 1) NB, NB) ] + Tbacks(N, NB).

By considering only the dominant terms in α, β, and γ3:

    THPL = γ3 2 N^3 / (3 P Q) + β N^2 (3 P + Q) / (2 P Q) + α N ((NB + 1) log P + P) / NB.
The serial execution time is given by Tser = (2/3) γ3 N^3. If we define the parallel efficiency E as the ratio Tser / (P Q THPL), we obtain:

    E = ( 1 + 3 β (3 P + Q) / (4 γ3 N) + 3 α P Q ((NB + 1) log P + P) / (2 N^2 NB γ3) )^(-1).

This last equality shows that when the memory usage per processor N^2 / (P Q) is maintained constant, the parallel efficiency slowly decreases only because of the α term. The communication volume (the β term), however, remains constant. Due to these results, HPL is said to be scalable not only with respect to the amount of computation, but also with respect to the communication volume.
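The dominant-term model is easy to evaluate numerically. The C sketch below plugs hypothetical machine parameters into the THPL and efficiency formulas above (alpha in seconds, beta in seconds per double transferred, gamma3 in seconds per flop, with the logarithm taken base 2 as for a binary exchange); the parameter values are invented for illustration and do not describe any particular machine.

#include <math.h>
#include <stdio.h>

/* Dominant-term model of the HPL execution time, following the formula
   above.  alpha: latency [s], beta: time per double [s], gamma3: time
   per flop at matrix-matrix rate [s].  log is taken base 2.            */
static double t_hpl(double N, double NB, double P, double Q,
                    double alpha, double beta, double gamma3)
{
    return gamma3 * 2.0 * N * N * N / (3.0 * P * Q)
         + beta   * N * N * (3.0 * P + Q) / (2.0 * P * Q)
         + alpha  * N * ((NB + 1.0) * log2(P) + P) / NB;
}

int main(void)
{
    /* Hypothetical machine parameters, chosen only for illustration. */
    double alpha = 25e-6, beta = 8e-8, gamma3 = 2e-9;
    double N = 53000, NB = 80, P = 8, Q = 8;

    double t   = t_hpl(N, NB, P, Q, alpha, beta, gamma3);
    double ser = 2.0 / 3.0 * gamma3 * N * N * N;   /* serial time Tser */
    printf("modeled time = %.1f s, efficiency = %.3f\n", t, ser / (P * Q * t));
    return 0;
}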
5.4 Performance Results
Tables 8 and 9 show the system parameters and performance results of HPL for a cluster of workstations based on the AMD Athlon processor. Similarly, Tables 10 and 11 describe a cluster based on the Intel Pentium III processor. Performance tests were also performed on a Compaq cluster installed at Oak Ridge National Laboratory in Tennessee, U.S.A. This cluster is listed at position 90 on the June 2001 TOP500 list. Its description and performance are given in Tables 12 and 13. The LINPACK benchmark numbers for this system are presented in Table 14.
CPU           AMD Athlon K7, 500 MHz
Main memory   256 MB
Interconnect  100 Mbps switched, 2 NICs per node (channel bonding)
OS            RedHat Linux 6.2 (kernel 2.2.14)
C compiler    gcc ver. 2.91.66 (egcs-1.1.2 release)
C flags       -fomit-frame-pointer -O3 -funroll-loops
MPI           MPICH 1.2.1
BLAS          ATLAS 3.0 beta
Date          September 2000

Table 8: Description of the AMD cluster used in tests.
Processor Grid    Matrix Dimension
Dimension         2000   5000   8000   10000
1 by 4            1.28   1.73   1.89    1.95
2 by 2            1.17   1.68   1.88    1.93
4 by 1            0.81   1.43   1.70    1.80

Table 9: Performance (in Gflop/s) of HPL on an AMD cluster with 4 computing nodes.
5.5 Running and Tuning

In order to find the best performance of a given system, the largest problem size fitting in memory should be used. The amount of memory used by HPL is essentially the size of the coefficient matrix.
CPU           Intel Pentium III, 550 MHz
Main memory   512 MB
Interconnect  Myrinet
OS            RedHat Linux 6.1 (kernel 2.2.15)
C compiler    gcc ver. 2.91.66 (egcs-1.1.2 release)
C flags       -fomit-frame-pointer -O3 -funroll-loops
MPI           MPI GM ver. 1.2.3
BLAS          ATLAS ver. 3.0 beta
Date          September 2000

Table 10: Description of the Pentium cluster used in tests.
Processor Grid    Matrix Dimension
Dimension         2000   5000   8000   10000   15000   20000
2 by 4            1.76   2.32   2.51    2.58    2.72    2.73
4 by 4            2.27   3.94   4.46    4.68    5.00    5.16

Table 11: Performance (in Gflop/s) of HPL on a Pentium cluster with up to 16 computing nodes.
HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance, so one definitely should avoid very large values of NB. From a local computation point of view, too small a value of NB may limit computational performance by a large factor because almost no data reuse will occur in the fastest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL.
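Since the coefficient matrix takes 8 N^2 bytes, a rough sizing rule follows directly; the sketch below assumes some fraction of the aggregate memory is kept free for the operating system and communication buffers (the 0.8 used in the example is an assumption, not an HPL requirement).

#include <math.h>

/* Largest problem size whose 8-byte-per-element coefficient matrix fits in
   the given aggregate memory, keeping a safety fraction free for the OS,
   MPI buffers, etc.                                                       */
int largest_hpl_n(double total_mem_bytes, double usable_fraction)
{
    return (int)floor(sqrt(usable_fraction * total_mem_bytes / 8.0));
}

/* Example: 4 nodes with 256 MB each and 80% usable gives N around 10300,
   consistent with the problem sizes used for the AMD cluster in Table 9. */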
6 Acknowledgements

The authors acknowledge the use of the Oak Ridge National Laboratory Compaq cluster, funded by the Department of Energy's Office of Science and Energy Efficiency programs, and of the IBM supercomputer based on the Power4 microprocessor provided as a part of the Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program. The use of an UltraSparc III server at the Department of Mathematics and Computer Science at Emory University is also greatly appreciated.
CPU           EV67, 667 MHz
OS            Tru64 ver. 5
C compiler    cc ver. 6.1
C flags       -arch host -tune host -std -O5
MPI           native (linker flags: -lmpi -lelan)
BLAS          CXML
Date          September 2000

Table 12: Description of the Compaq cluster used in tests.
Processor Grid    Matrix Dimension
Dimension         5000    10000   25000   53000
8 by 8            26.37   45.73   60.99   66.93

Table 13: Performance (in Gflop/s) of HPL on a Compaq cluster with 64 computing nodes.
CPUs/Nodes    N1/2     Nmax     Rmax [Gflop/s]    E [%]
1/1            150     6625          1.14           -
4/1            800    12350          4.36          95.6
16/4          2300    26500         17.0           93.2
64/16         5700    53000         67.5           92.5
256/64       14000   106000        263.6           90.1

Table 14: LINPACK benchmark numbers for the Compaq cluster obtained using HPL.
References
[1] M. Aboelaze, N. Chrisochoides, and E. Houstis. The parallelization of Level 2 and 3 BLAS
operations on distributed-memory machines. Technical Report CSD-TR-91-007, Purdue University, West Lafayette, IN, 1991.
[2] R. Agarwal, S. Balle, F. Gustavson, M. Joshi, and P. Palkar. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 39(5):575-582, 1995.
[3] R. Agarwal, F. Gustavson, and M. Zubair. A high performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication. IBM Journal of Research and Development, 38(6):673-681, 1994.
[4] E. Anderson, Z. Bai, C. Bischof, Susan L. Blackford, James W. Demmel, Jack J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and Danny C. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
[5] Cleve Ashcraft. The distributed solution of linear systems using the torus-wrap data mapping.
Technical Report ECA-TR-147, Boeing Computer Services, Seattle, WA, 1990.
[6] Cleve Ashcraft. A taxonomy of distributed dense LU factorization methods. Technical Report
ECA-TR-161, Boeing Computer Services, Seattle, WA, 1991.
[7] D. W. Barron and H. P. F. Swinnerton-Dyer. Solution of simultaneous linear equations using a magnetic tape store. Computer J., 3:28-33, 1960.
[8] M. Berry, K. Gallivan, W. Harrod, W. Jalby, S. Lo, U. Meier, B. Philippe, and A. Sameh. Parallel algorithms on the CEDAR system. Technical Report No. 581, CSRD, 1986.
[9] C. Bischof and Charles F. Van Loan. The WY representation for products of Householder
matrices. SIAM SISSC, 8(2), March 1987.
[10] R. Bisseling and L. Loyens. Towards peak parallel LINPACK performance on 400 transputers. Supercomputer, 45:20-27, 1991.
[11] R. Bisseling and J. van der Vorst. Parallel LU decomposition on a transputer network. In G. van Zee and J. van der Vorst, editors, Lecture Notes in Computer Science, volume 384, pages 61-77. Springer-Verlag, 1989.
[12] R. Bisseling and J. van der Vorst. Parallel triangular system solving on a mesh network of transputers. SIAM Journal on Scientific and Statistical Computing, 12:787-799, 1991.
[13] J. Bolen, A. Davis, B. Dazey, S. Gupta, G. Henry, D. Robboy, G. Schier, D. Scott, M. Stallcup, A. Taraghi, S. Wheat (from Intel SSD), L. Fisk, G. Istrail, C. Jong, R. Riesen, and L. Shuler (from Sandia National Laboratories). Massively parallel distributed computing: World's first 281 Gigaflop supercomputer. In Proceedings of the Intel Supercomputer Users Group, 1995.
[14] R. Brent. The LINPACK benchmark on the AP 1000. Frontiers, pages 128-135, 1992.
[15] R. Brent and P. Strazdins. Implementation of BLAS Level 3 and LINPACK benchmark on the AP1000. Fujitsu Scientific and Technical Journal, 5(1):61-70, 1993.
[16] I. Bucher and T. Jordan. Linear algebra programs for use on a vector computer with a secondary solid state storage device. In R. Vichnevetsky and R. Stepleman, editors, Advances in Computer Methods for Partial Differential Equations, pages 546-550. IMACS, 1984.
[17] D. A. Calahan. Block-oriented local-memory-based linear equation solution on the CRAY-2: Uniprocessor algorithms. In U. Schendel, editor, Proceedings International Conference on Parallel Processing, pages 375-378. IEEE Computer Society Press, August 1986.
[18] B. Chartres. Adaption of the Jacobi and Givens methods for a computer with magnetic tape
backup store. Technical Report 8, University of Sydney, 1960.
[19] J. Choi, Jack J. Dongarra, Susan Ostrouchov, Antoine Petitet, David Walker, and R. Clint Whaley. The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5:173-184, 1996.
[20] J. Choi, Jack J. Dongarra, and David Walker. PUMMA: Parallel universal matrix multiplication algorithms on distributed-memory concurrent computers. Concurrency: Practice and Experience, 6(7):543-570, 1994.
[21] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn. Parallel implementation of BLAS: General techniques for Level 3 BLAS. Concurrency: Practice and Experience, 9(9):837-857, 1997.
[22] Mathematical Committee on Physical and Engineering Sciences, editors. Grand Challenges:
High Performance Computing and Communications. NSF/CISE, 1800 G Street NW, Washington, DC, 20550, 1991.
[23] A. K. Dave and Iain S. Duff. Sparse matrix calculations on the CRAY-2. Technical Report
CSS 197, AERE Harwell, 1986.
[24] Jack J. Dongarra. Performance of various computers using standard linear equations software.
Technical Report CS-89-85, University of Tennessee, 1989. (An updated version of this report
can be found at http://www.netlib.org/benchmark/performance.ps).
[25] Jack J. Dongarra, J. Bunch, Cleve Moler, and G. W. Stewart. LINPACK Users' Guide. SIAM,
Philadelphia, PA, 1979.
[26] Jack J. Dongarra, J. Du Croz, Iain S. Duff, and S. Hammarling. A set of Level 3 FORTRAN
Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16:1–17,
March 1990.
[27] Jack J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An extended set of FORTRAN
Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14:1–17,
March 1988.
[28] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Numerical
Linear Algebra for High-Performance Computers. SIAM, Philadelphia, PA, 1998.
[29] Jack J. Dongarra, Victor Eijkhout, and Piotr Luszczek. Recursive approach in sparse matrix
LU factorization. In Proceedings of the 1st SGI Users Conference, pages 409–418, Cracow,
Poland, October 2000. ACC Cyfronet UMM. Accepted for publication in Scientific Programming.
[30] Jack J. Dongarra and T. Hewitt. Implementing dense linear algebra algorithms using multitasking
on the CRAY X-MP-4. SIAM J. Sci. Stat. Comp., 7(1):347–350, January 1986.
[31] Jack J. Dongarra and A. Hinds. Unrolling loops in Fortran. Software-Practice and Experience,
9:219–226, 1979.
[32] Jack J. Dongarra, Antoine Petitet, and R. Clint Whaley. Automated empirical optimization
of software and the ATLAS project. Parallel Computing, 27(1–2):3–25, 2001.
[33] Jack J. Dongarra and Danny C. Sorensen. Linear algebra on high-performance computers. In
U. Schendel, editor, Proceedings Parallel Computing 85, pages 3–32. North Holland, 1986.
[34] Jack J. Dongarra, R. van de Geijn, and David Walker. Scalability issues in the design of a
library for dense linear algebra. Journal of Parallel and Distributed Computing, 22(3):523–537,
1994. (Also LAPACK Working Note No. 43).
[35] Jack J. Dongarra and R. Clint Whaley. Automatically tuned linear algebra software (ATLAS).
In Proceedings of the SC'98 Conference, 1998.
[36] J. Du Croz, S. Nugent, J. Reid, and D. Taylor. Solving large full sets of linear equations in a
paged virtual store. ACM Transactions on Mathematical Software, 7(4):527–536, 1981.
[37] Iain S. Duff. Full matrix techniques in sparse Gaussian elimination. In Numerical Analysis
Proceedings, number 912 in Lecture Notes in Mathematics, pages 71–84, Dundee, 1981.
Springer-Verlag, Berlin, 1981.
[38] A. Elster. Basic matrix subprograms for distributed-memory systems. In David Walker and
Q. Stout, editors, Proceedings of the Fifth Distributed-Memory Computing Conference, pages
311–316. IEEE Press, 1990.
[39] R. Falgout, A. Skjellum, S. Smith, and C. Still. The multicomputer toolbox approach to
concurrent BLAS and LACS. In Proceedings of the Scalable High Performance Computing
Conference SHPCC-92. IEEE Computer Society Press, 1992.
[40] G. Fox, S. Otto, and A. Hey. Matrix algorithms on a Hypercube I: Matrix multiplication.
Parallel Computing, 3:17–31, 1987.
[41] G. Geist and C. Romine. LU factorization algorithms on distributed-memory multiprocessor
architectures. SIAM Journal on Scientific and Statistical Computing, 9:639–649, 1988.
[42] A. George and H. Rashwan. Auxiliary storage methods for solving finite element systems.
SIAM SISSC, 6:882–910, 1985.
[43] B. Grayson and R. van de Geijn. A high performance parallel Strassen implementation. Parallel
Processing Letters, 6(1):3–12, 1996.
[44] B. Greer and G. Henry. High performance software on Intel Pentium Pro processors or
micro-ops to TeraFLOPS. In Proceedings of the SuperComputing 1997 Conference, San Jose,
California, 1997. ACM SIGARCH, IEEE Computer Society Press. ISBN: 0-89791-985-8.
[45] Fred G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra
algorithms. IBM Journal of Research and Development, 41(6):737–755, November 1997.
[46] M. Heath and C. Romine. Parallel solution of triangular systems on distributed-memory
multiprocessors. SIAM Journal on Scientific and Statistical Computing, 9:558–588, 1988.
[47] B. Hendrickson and D. Womble. The torus-wrap mapping for dense matrix calculations on
massively parallel computers. SIAM Journal on Scientific and Statistical Computing, 15:1201–
1226, 1994.
[48] R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger Ltd, Bristol, 1981.
[49] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang. Matrix multiplication on the Intel
Touchstone DELTA. Concurrency: Practice and Experience, 6(7):571–594, 1994.
[50] IBM. Engineering and Scientific Subroutine Library. IBM, 1986. Program Number: 5668-863.
[51] W. Kaufmann and L. Smarr. Supercomputing and the Transformation of Science. Scientific
American Library, 1993.
[52] Donald Knuth. An empirical study of Fortran programs. Software-Practice and Experience,
1:105–133, 1971.
[53] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra Subprograms for
Fortran usage. ACM Transactions on Mathematical Software, 5:308–323, 1979.
[54] G. Li and T. Coleman. A parallel triangular solver for a distributed-memory multiprocessor.
SIAM Journal on Scientific and Statistical Computing, 9(3):485–502, 1988.
[55] G. Li and T. Coleman. A new method for solving triangular systems on distributed-memory
message-passing multiprocessor. SIAM Journal on Scientific and Statistical Computing,
10(2):382–396, 1989.
[56] J. Li, R. Falgout, and A. Skjellum. A poly-algorithm for parallel dense matrix multiplication on
two-dimensional process grid topologies. Concurrency: Practice and Experience, 9(5):345–389,
1997.
[57] The LINPACK 1000x1000 benchmark program. (See http://www.netlib.org/benchmark/1000d
for source code.)
[58] A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged
memory systems. Communications of the ACM, 12(3):153–165, 1969.
[59] Hans W. Meuer, Erik Strohmaier, Jack J. Dongarra, and H. D. Simon. Top500 Supercomputer
Sites, 17th edition, November 2, 2001. (The report can be downloaded from
http://www.netlib.org/benchmark/top500.html).
[60] Office of Science and Technology Policy, editors. A Research and Development Strategy for
High Performance Computing. Executive Office of the President, 1987.
[61] Office of Science and Technology Policy, editors. The Federal High Performance Computing
Program. Executive Office of the President, 1989.
[62] D. Pager. Some notes on speeding up certain loops by software, firmware, and hardware means.
IEEE Trans. on Comp., pages 97–100, January 1972.
[63] Antoine Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. Ph.D.
thesis, Computer Science Department, University of Tennessee, Knoxville, 1996.
[64] Antoine Petitet, R. Clint Whaley, Jack J. Dongarra, and Andy Cleary. HPL – A Portable
Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers.
Innovative Computing Laboratory, September 2000. Available at http://icl.cs.utk.edu/hpl/
and http://www.netlib.org/hpl/.
[65] Y. Robert and P. Suguazerro. The LU decomposition algorithm and its efficient Fortran
implementation on the IBM 3090 vector multiprocessor. Technical Report ECSEC Report ICE-0006,
IBM, March 1987.
[66] Youcef Saad. Communication complexity of the Gaussian elimination algorithm on multiprocessors.
Linear Algebra and Its Applications, 77:315–340, 1986.
[67] R. Schreiber. Engineering and Scientific Subroutine Library. Module Design Specification.
SAXPY Computer Corporation, 255 San Geronimo Way, Sunnyvale, CA 94086, 1986.
[68] P. Strazdins. Matrix factorization using distributed panels on the Fujitsu AP1000. In Proceedings
of the IEEE First International Conference on Algorithms And Architectures for Parallel
Processing ICA3PP-95, Brisbane, 1995.
[69] P. Strazdins. Lookahead and algorithmic blocking techniques compared for parallel matrix
factorization. In Proceedings of the 10th International Conference on Parallel and Distributed
Computing and Systems, IASTED, Las Vegas, 1998.
[70] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on
Matrix Analysis and Applications, 18(4), 1997.
[71] R. van de Geijn. Massively parallel LINPACK benchmark on the Intel Touchstone DELTA
and iPSC/860 systems. In 1991 Annual Users Conference Proceedings, Dallas, Texas, 1991.
Intel Supercomputer Users Group.
[72] R. van de Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm.
Concurrency: Practice and Experience, 9(4):255–274, 1997.
[73] E. van de Velde. Experiments with multicomputer LU-decomposition. Concurrency: Practice
and Experience, 2:1–26, 1990.
[74] D. Womble, D. Greenberg, D. Wheat, and S. Riesen. LU factorization and the LINPACK
benchmark on the Intel Paragon. Technical report, Sandia National Laboratories, 1994.
Appendix
Description
In the following, a number of performance results are presented for a variety of CPUs. The results
show:

- asymptotic performance graphs of the BLAS routines relevant to the HPL NxN benchmark,
- single-CPU LINPACK benchmark numbers for the 1000 by 1000 problem,
- selected hardware parameters of the systems used in the tests.
The original LINPACK benchmark used Level 1 BLAS routines: DDOT and DAXPY. As the following graphs show, their performance is severely limited on contemporary superscalar processors.
The Level 2 BLAS DGEMV routine may be used in an implementation of a LINPACK benchmark
code. As the results indicate, however, this should also be avoided as its performance is limited by
the system bus throughput. The Level 3 BLAS DGEMM routine achieves the highest fraction of peak
performance and, consequently, should be the preferred BLAS to use for the LINPACK benchmark
code.
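The gap between the BLAS levels can be made plausible with a rough count of floating-point
operations per word of memory traffic. This is a simplification added here for orientation (it is
not taken from the benchmark codes themselves) and assumes each operand array moves between
memory and the processor exactly once:
\[
\mbox{DAXPY: } \frac{2n}{3n} \approx \frac{2}{3}, \qquad
\mbox{DGEMV: } \frac{2n^{2}}{n^{2}+3n} \approx 2, \qquad
\mbox{DGEMM: } \frac{2n^{3}}{4n^{2}} = \frac{n}{2}
\]
flops per word. Only DGEMM offers enough data reuse in cache to keep the floating-point units of a
processor busy when the memory system is much slower than the arithmetic.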
The LINPACK benchmark numbers for each of the CPUs were obtained by solving a system
of 1000 simultaneous linear equations, so they may be compared to each other directly. The label
"Linpack source" refers to the self-contained LINPACK benchmark code (available at
http://www.netlib.org/benchmark/1000d), which relies only on compiler optimizations to increase the
rate of execution. The code labeled "Linpack BLAS" requires an external (possibly optimized) BLAS.
The codes "Right-looking", "Left-looking", and "Crout" represent variants of LU factorization based
on Level 2 BLAS routines: "Right-looking" relies on DGER and is equivalent to LAPACK's DGETF2
routine [4]; the "Left-looking" variant uses the DTRSV and DGEMV routines; and the "Crout" variant
uses the DGEMV routine only. Two codes use Level 3 BLAS: "Lapack", which solves the problem with
LAPACK's DGETRF and DGETRS routines [4], and "Recursive LU", which calls the DTRSM and
DGEMM routines and is based on a recursive formulation of LU factorization.
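The reported rates follow from the measured solution time t (in seconds) under the nominal
operation count of 2/3 n^3 + 2n^2 floating-point operations for a single dense n-by-n solve, the count
customarily used for the LINPACK benchmark:
\[
\mathrm{MFLOP/s} = \frac{\tfrac{2}{3}n^{3} + 2n^{2}}{10^{6}\, t}, \qquad n = 1000 .
\]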
Unless stated otherwise, all codes use BLAS provided by ATLAS [32, 35] version 3.2.1 or higher
(available at http://www.netlib.org/atlas/ and http://math-atlas.sourceforge.net/).
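To make the distinction between the factorization variants concrete, the following is a minimal
sketch of the unblocked right-looking algorithm in free-form Fortran. It is written without BLAS calls;
the explicit rank-1 update in the innermost loops is the step that an optimized code performs with a
single call to the Level 2 BLAS routine DGER. The routine name and interface are illustrative only
and are not part of the benchmark codes.

subroutine lu_right_looking(n, a, lda, ipiv, info)
  ! Sketch of unblocked right-looking LU factorization with partial pivoting
  ! (the DGETF2-style variant described in the text).  The matrix a is stored
  ! in column-major order with leading dimension lda, as in LAPACK.
  implicit none
  integer, intent(in) :: n, lda
  double precision, intent(inout) :: a(lda, *)
  integer, intent(out) :: ipiv(*), info
  integer :: i, j, k, p
  double precision :: t

  info = 0
  do k = 1, n
     ! Find the pivot: the largest |a(i,k)| for i >= k.
     p = k
     do i = k + 1, n
        if (abs(a(i, k)) > abs(a(p, k))) p = i
     end do
     ipiv(k) = p
     if (a(p, k) == 0.0d0) then
        info = k            ! exact zero pivot: the matrix is singular
        return
     end if
     ! Interchange rows k and p across the whole matrix.
     if (p /= k) then
        do j = 1, n
           t = a(k, j); a(k, j) = a(p, j); a(p, j) = t
        end do
     end if
     ! Scale the pivot column: a(k+1:n, k) = a(k+1:n, k) / a(k, k).
     do i = k + 1, n
        a(i, k) = a(i, k) / a(k, k)
     end do
     ! Rank-1 update of the trailing submatrix; this is the DGER call in an
     ! optimized right-looking code.
     do j = k + 1, n
        do i = k + 1, n
           a(i, j) = a(i, j) - a(i, k) * a(k, j)
        end do
     end do
  end do
end subroutine lu_right_looking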
Performance Results for Pentium III 550MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     41.4     10^-12       10.5
    Linpack BLAS       55.4     10^-12       10.5
  Level 2 BLAS
    Right-looking      51.4     10^-13        2.3
    Left-looking       74.7     10^-13        1.4
    Crout              84.0     10^-13        1.4
  Level 3 BLAS
    Lapack            278.6     10^-13        1.3
    Recursive LU      324.6     10^-13        1.3

System parameters

  Clock rate [MHz]   550
  Bus speed [MHz]    100
  L1 cache [KB]      16+16
  L2 cache [KB]      512
Performance Results for Pentium III 933MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     49.3     10^-12       10.5
    Linpack BLAS       53.6     10^-12       10.5
  Level 2 BLAS
    Right-looking      86.1     10^-13        2.9
    Left-looking      144.7     10^-13        1.8
    Crout             115.7     10^-13        2.0
  Level 3 BLAS
    Lapack            410.2     10^-13        1.5
    Recursive LU      506.6     10^-13        1.2

System parameters

  Clock rate [MHz]   933
  Bus speed [MHz]    133
  L1 cache [KB]      16+16
  L2 cache [KB]      256
Performance Results for Intel P4 1700MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source    182.2     10^-12       10.5
    Linpack BLAS      238.8     10^-12       10.5
  Level 2 BLAS
    Right-looking     199.6     10^-13        2.2
    Left-looking      290.7     10^-13        1.7
    Crout             312.5     10^-13        2.8
  Level 3 BLAS
    Lapack           1262.0     10^-13        8.4
    Recursive LU     1393.0     10^-13        7.2

System parameters

  Clock rate [MHz]   1700
  Bus speed [MHz]    400
  L1 cache [KB]      12(I)+8(D)
  L2 cache [KB]      256
Performance Results for Intel/HP Itanium 800MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     39.3     10^-12        9.4
    Linpack BLAS       36.9     10^-12        9.4
  Level 2 BLAS
    Right-looking      40.5     10^-13        6.3
    Left-looking      108.5     10^-12       10.0
    Crout             123.7     10^-13       10.2
  Level 3 BLAS
    Lapack            624.2     10^-13        8.5
    Recursive LU      801.8     10^-13        6.5

System parameters

  Clock rate [MHz]   800
  Bus speed [MHz]    100
  L1 cache [KB]      16+16
  L2 cache [KB]      96
  L3 cache [MB]      2
Performance Results for AMD Athlon 1200MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code                   MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack 1000 (source)   63.0     10^-12       10.5
    Linpack 1000 (BLAS)     91.0     10^-12       10.5
  Level 2 BLAS
    Right-looking          124.3     10^-13        6.6
    Left-looking           173.7     10^-13        7.4
    Crout                  137.0     10^-13        8.8
  Level 3 BLAS
    Lapack                 835.8     10^-13        1.6
    Recursive LU           998.3     10^-13        1.1

System parameters

  Clock rate [MHz]   1200
  Bus speed [MHz]    200
  L1 cache [KB]      64+64
  L2 cache [KB]      256
Performance Results for IBM Power3 375MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source    169.3     10^-12        4.7
    Linpack BLAS      165.9     10^-12        4.7
  Level 2 BLAS
    Right-looking     174.6     10^-13        4.0
    Left-looking      298.5     10^-12        4.7
    Crout             290.7     10^-12        4.7
  Level 3 BLAS
    Lapack            941.8     10^-13        4.0
    Recursive LU     1078.0     10^-13        4.5

System parameters

  Clock rate [MHz]   375
  Bus speed [MHz]    100
  L1 cache [KB]      32(I)+64(D)
  L2 cache [MB]      8
Performance Results for IBM Power4 1300MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source    301.2     10^-12        4.7
    Linpack BLAS      294.6     10^-12        4.7
  Level 2 BLAS
    Right-looking     288.2     10^-12        5.8
    Left-looking      625.0     10^-12        6.4
    Crout             566.7     10^-12        4.8
  Level 3 BLAS
    Lapack           2157.0     10^-12        5.8
    Recursive LU     2388.0     10^-13        6.8

System parameters

  Clock rate [MHz]   1300
  Bus speed [MHz]    333
  L1 cache [KB]      64(I)+32(D)
  L2 cache [MB]      1.4
  L3 cache [MB]      32
Performance Results for IBM Power4 1300MHz (ESSL BLAS)
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code                   MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack 1000 (source)  302.6     10^-12        4.7
    Linpack 1000 (BLAS)    331.0     10^-12        4.7
  Level 2 BLAS
    Right-looking          339.4     10^-12        7.1
    Left-looking           825.5     10^-12        7.1
    Crout                  668.7     10^-12        7.1
  Level 3 BLAS
    Lapack 1000           2477.0     10^-12        7.1
    Recursive LU 1000     2786.0     10^-12        6.8

System parameters

  Clock rate [MHz]   1300
  Bus speed [MHz]    333
  L1 cache [KB]      64(I)+32(D)
  L2 cache [MB]      1.4
  L3 cache [MB]      32
Performance Results for SGI Octane R12000 IP30 270MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     60.8     10^-13        6.4
    Linpack BLAS       31.8     10^-13        6.4
  Level 2 BLAS
    Right-looking      38.1     10^-13        7.3
    Left-looking       93.0     10^-13        8.3
    Crout              88.5     10^-13        7.9
  Level 3 BLAS
    Lapack            336.0     10^-13        7.2
    Recursive LU      400.4     10^-13        6.9

System parameters

  Clock rate [MHz]   270
  Bus speed [MHz]    100
  L1 cache [KB]      32+32
  L2 cache [MB]      2
Performance Results for Compaq/DEC Alpha 21264 EV67 500MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source    187.5     10^-13        6.5
    Linpack BLAS      167.9     10^-13        6.5
  Level 2 BLAS
    Right-looking     159.2     10^-13        7.2
    Left-looking      188.4     10^-13        8.0
    Crout             214.6     10^-13        8.0
  Level 3 BLAS
    Lapack            565.1     10^-13        8.5
    Recursive LU      636.9     10^-13        7.7

System parameters

  Clock rate [MHz]   500
  Bus speed [MHz]    333
  L1 cache [KB]      64+64
  L2 cache [MB]      4
Performance Results for Compaq/DEC Alpha 21164 533MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     34.0     10^-13        6.5
    Linpack BLAS       48.1     10^-13        6.5
  Level 2 BLAS
    Right-looking      45.9     10^-13        5.8
    Left-looking       76.7     10^-13        8.1
    Crout              74.6     10^-13        7.3
  Level 3 BLAS
    Lapack            425.6     10^-13        8.1
    Recursive LU      500.9     10^-13        6.5

System parameters

  Clock rate [MHz]   533
  Bus speed [MHz]    88
  L1 cache [KB]      8+8
  L2 cache [KB]      96
Performance Results for Sun UltraSparc III 750MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source    171.5     10^-13        6.5
    Linpack BLAS      123.8     10^-13        6.5
  Level 2 BLAS
    Right-looking     150.9     10^-12        9.9
    Left-looking      219.2     10^-12        9.9
    Crout             205.7     10^-13        7.6
  Level 3 BLAS
    Lapack            675.4     10^-12        9.9
    Recursive LU      734.8     10^-12        9.6

System parameters

  Clock rate [MHz]   750
  Bus speed [MHz]    150
  L1 cache [KB]      64(I)+32(D)
  L2 cache [MB]      1
Performance Results for Sun UltraSparc II 300MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     42.2     10^-13        6.5
    Linpack BLAS       37.5     10^-13        6.5
  Level 2 BLAS
    Right-looking      37.1     10^-13        8.9
    Left-looking       66.8     10^-13        7.9
    Crout              65.9     10^-13        8.0
  Level 3 BLAS
    Lapack            285.8     10^-13        8.8
    Recursive LU      284.5     10^-12        9.0

System parameters

  Clock rate [MHz]   300
  Bus speed [MHz]    100
  L1 cache [KB]      16+16
  L2 cache [MB]      2
Performance Results for Sun UltraSparc II 250MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     21.4     10^-13        6.5
    Linpack BLAS       18.2     10^-13        6.5
  Level 2 BLAS
    Right-looking      18.4     10^-13        7.5
    Left-looking       28.0     10^-12       10.2
    Crout              28.8     10^-12       10.1
  Level 3 BLAS
    Lapack            109.3     10^-13        7.5
    Recursive LU      110.3     10^-12        1.1

System parameters

  Clock rate [MHz]   250
  Bus speed [MHz]    83
  L1 cache [KB]      16+16
  L2 cache [MB]      1
Performance Results for PowerPC G4 533MHz
[Figure: asymptotic performance of the Level 1 BLAS routines DDOT and DAXPY (MFLOP/s versus
vector dimension) and of the Level 2 BLAS routine DGEMV and the Level 3 BLAS routine DGEMM
(MFLOP/s versus matrix dimension).]

Linpack benchmark numbers

  Code              MFLOP/s   ||Ax - b||   ||Ax - b|| / (||A|| ||x|| n)
  Level 1 BLAS
    Linpack source     52.6     10^-12        9.4
    Linpack BLAS       76.2     10^-12        9.4
  Level 2 BLAS
    Right-looking      85.2     10^-13        8.1
    Left-looking      125.2     10^-13        7.4
    Crout             120.6     10^-13        8.2
  Level 3 BLAS
    Lapack            412.8     10^-13        8.2
    Recursive LU      477.6     10^-12        7.2

System parameters

  Clock rate [MHz]   533
  Bus speed [MHz]    100
  L1 cache [KB]      32+32
  L2 cache [MB]      2