Stream Computing for GPU-Accelerated HPC Applications

David Richie
Brown Deer Technology

April 6, 2009
Air Force Research Laboratory, Wright-Patterson Air Force Base

Copyright © 2009 Brown Deer Technology, LLC. All Rights Reserved.
Outline
• Modern GPUs
• Stream Computing Model
• Hardware/Software Test Setup
• Applications
  • Electromagnetics: 3D FDTD (Maxwell’s Equations)
  • Seismic: 3D VS-FDTD (Elastic Wave Equation)
  • Quantum Chemistry: Two-Electron Integrals (STO-6G 1s)
  • Molecular Dynamics: LAMMPS (PairLJCharmmCoulLong)
• State of the Technology
• Conclusions
Modern GPU Architectures
FireStream 9250
• AMD RV770 architecture
• 800 SIMD superscalar processors
• Supports SSE-like vec4 operations
• IEEE single/double precision
• 1 TFLOPS peak single precision
• 200 GFLOPS peak double precision
• 1 GB GDDR3 on-board memory
• < 120 W max, 80 W typical
• MSRP $999
AMD/ATI GPU Form Factors
Card (MSRP*)               Clock [MHz]   Memory         Single        Double       Power
Radeon HD 4850 ($170)      625           1 GB GDDR3     1.0 TFLOPS    200 GFLOPS
Radeon HD 4870 ($250)      750           512 MB GDDR5   1.2 TFLOPS    240 GFLOPS   160 W
Radeon HD 4870 X2 ($430)   750           2 GB GDDR5     2.4 TFLOPS    480 GFLOPS
Radeon HD 4890 ($250)      850           1 GB GDDR5     1.36 TFLOPS   272 GFLOPS   190 W
Radeon HD 4890 OC ($265)   900           1 GB GDDR5     1.44 TFLOPS   288 GFLOPS

HPC:
FireStream 9250 ($999)     625           1 GB GDDR3     1.0 TFLOPS    200 GFLOPS   < 120 W
FireStream 9270 ($1499)    750           2 GB GDDR5     1.2 TFLOPS    240 GFLOPS   < 220 W
1U Quad 9250 Server        625           4 GB GDDR3     4.0 TFLOPS    800 GFLOPS   < 480 W

*Data gathered April 4, 2009
ATI Stream SDK
SDK (v1.3): open-systems approach
• Brook+ compiler (C/C++ variant)   [BASIC]
• CAL low-level IL (generic ASM)    [EXPERT]
• CAL run-time API (C++)

Stream paradigm:
• Formulate the algorithm as a SIMD kernel
• Read/write streams between the host and the board
Stream Computing
Pure stream computing: elegant, not useful.
  Formulate algorithms as the element-wise processing of multiple input streams
  into multiple output streams.

Pragmatic stream computing: allows treatment of algorithms that do not fit a pure
stream computing model (most algorithms fall in this category).
• Allows the scatter/gather memory access needed by most algorithms
• The ATI Stream release of the Brook+ compiler fits this model
• One or more computational kernels are applied to a 1D, 2D or 3D stream
• The SIMT domain is driven implicitly by the dimensions of the output stream
• “Natural” streams are 2D; others use address translation (see the sketch below)

Stream Computing Model (diagram): Out = Kernel(In), the kernel applied element-wise
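To make the last point concrete, the sketch below (plain C, hypothetical names) shows one way a 3D grid index can be mapped onto a 2D “natural” stream and recovered again; the FDTD kernels later in this deck perform this kind of index arithmetic, though the exact packing differs in detail.

    #include <stddef.h>

    /* Hypothetical mapping: a 3D grid (nx x ny x nz) stored as a 2D stream with
     * nx rows and ny*nz columns.  The reverse mapping is what a kernel computes
     * from indexof() to locate its (i,j,k) grid point. */
    static void grid_to_stream(size_t i, size_t j, size_t k, size_t nz,
                               size_t *row, size_t *col)
    {
        *row = i;
        *col = j*nz + k;
    }

    static void stream_to_grid(size_t row, size_t col, size_t nz,
                               size_t *i, size_t *j, size_t *k)
    {
        *i = row;
        *j = col / nz;
        *k = col % nz;
    }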
Brook+ Programming Model
prog.cpp (original):

    ...
    float a[256];
    float b[256];
    float c[256];

    for(i=0;i<N;i++) { c[i] = a[i]*b[i]; }
    ...

prog.cpp (modified for Brook+):

    ...
    float a[256];
    float b[256];
    float c[256];

    Stream<float> s_a(1,N);
    Stream<float> s_b(1,N);
    Stream<float> s_c(1,N);

    // for(i=0;i<N;i++) { c[i] = a[i]*b[i]; }
    s_a.read(a);
    s_b.read(b);
    foo_kern(s_a, s_b, s_c);
    s_c.write(c);
    ...

foo.br:

    kernel void
    foo_kern(float A<>, float B<>, out float C<>)
    { C = A * B; }

Recipe for Brook+ acceleration:
1) Identify the critical loop or section
2) Create streams for the datasets
3) Replace the loop or section with
   a) data transfer host-to-GPU
   b) execution of the GPU kernel(s)
   c) data transfer GPU-to-host
4) Transform the loop body or section into a SIMD kernel
5) Move the kernel to a Brook+ file (.br)
Brook+ Workflow
Workflow: brcc compiles foo.br into foo_gpu.cpp / foo_gpu.h; these are compiled
together with prog.cpp by gcc and linked against libbrook.a, libc.a, ... to
produce prog.x.

Makefile:

    NAME = prog
    BRSRCS = foo.br
    OBJS += $(BRSRCS:.br=.o)
    INCS = -I/usr/local/atibrook/sdk/include
    LIBS = -L/usr/local/atibrook/sdk/lib -lbrook
    BRCC = brcc

    all: $(NAME).x

    $(NAME).x: $(NAME).o $(OBJS)
            $(CXX) $(CXXFLAGS) $(INCS) -o $(NAME).x \
                    $(NAME).o $(OBJS) $(LIBS)

    .SUFFIXES:
    .SUFFIXES: .br .cpp .o

    .br.cpp:
            $(BRCC) $(BRCCFLAGS) -o $* $<

    .cpp.o:
            $(CXX) $(CXXFLAGS) $(INCS) -c $<
GPU Kernel: Memory Model
Simple example: local weighted average of a 1D array
Stencil: g(i) = w0*f(i-1) + w1*f(i) + w2*f(i+1)

 1: kernel void
 2: waverage3_kern(
 3:    float w0, float w1, float w2,
 4:    float s_f[],
 5:    out float s_g<>
 6: )
 7: {
 8:    float i1 = indexof(s_g);
 9:    float i0 = i1 - 1.0f;
10:    float i2 = i1 + 1.0f;
11:
12:    float f0 = s_f[i0];
13:    float f1 = s_f[i1];
14:    float f2 = s_f[i2];
15:
16:    float g = w0*f0 + w1*f1 + w2*f2;
17:
18:    s_g = g;
19: }
(3) Simple scalar values, e.g., coefficients
(4) Input (gather) stream, any element is
accessible like a normal array
(5) Output stream, kernel applied per
element, only that element can be
written
(8) indexof() returns index of stream
element for this kernel invocation
(12-14) Gather (random) access needed for
non-local stencil
(16) Local calculation of weighted sum
(18) Assign result to output stream, this
must/can be done only once
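For context, host code driving this kernel might look like the sketch below, following the Stream read/kernel/write pattern from the Brook+ programming-model slide; the weights, the array size, and the neglect of the two boundary elements are assumptions of this sketch rather than details from the original example.

    // Hypothetical host-side driver for waverage3_kern, modeled on prog.cpp above.
    const int N = 256;
    float f[256], g[256];
    /* ... fill f ... */

    Stream<float> s_f(1, N);
    Stream<float> s_g(1, N);

    s_f.read(f);                                   // host -> GPU
    waverage3_kern(0.25f, 0.5f, 0.25f, s_f, s_g);  // one invocation per output element
    s_g.write(g);                                  // GPU -> host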
Performance Optimizations: float4 data
Simple Particle Pair Interactions
(pseudo-code)
    /* "Obvious" AoS approach: one float4 per particle position */
    for(i=0;i<natoms;i++) for(j=0;j<natoms;j++) {
        float4 pos0 = s_pos[i];
        float4 pos1 = s_pos[j];
        float4 d = pos1 - pos0;
        float rsq = d.x*d.x + d.y*d.y + d.z*d.z;
        f += (complicated function of rsq);
    }

    /* Packed SoA approach: each float4 holds one coordinate of four particles */
    for(i=0;i<natoms/4;i++) for(j=0;j<natoms/4;j++) {
        float4 x0 = s_x[i], y0 = s_y[i], z0 = s_z[i];
        float4 x1 = s_x[j], y1 = s_y[j], z1 = s_z[j];
        float4 dx = x1 - x0;
        float4 dy = y1 - y0;
        float4 dz = z1 - z0;
        float4 rsq = dx*dx + dy*dy + dz*dz;
        f4 += (complicated function of rsq);
        /* now repeat 3 times, swizzling x1,y1,z1 */
        x1 = x1.yzwx;   /* swizzle */
        ...
    }
“Obvious” choice: store particle positions in float4, (x,y,z) ~ pos.xyz
• Creates a scalar bottleneck
• Calculates the force on only one particle per kernel invocation

Instead, pack the quantities into separate streams to better exploit SIMD ops
• x0.xyzw ~ x-positions of four particles, etc.
• The entire calculation exploits SIMD ops
• Calculates forces on four particles per kernel invocation (~4x speedup)
• Requires some tricks, e.g., swizzling (see the repacking sketch below)
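The host-side repacking implied above might look like the following minimal sketch in plain C, assuming a simple float4 struct, natoms a multiple of 4, and hypothetical array names; the packing actually used by the kernels could differ.

    #include <stddef.h>

    typedef struct { float x, y, z, w; } float4;

    /* Repack an array-of-structures position array into three structure-of-arrays
     * streams, one float4 per four particles:
     * xs[i] = (x of particles 4i, 4i+1, 4i+2, 4i+3), and likewise ys and zs. */
    void pack_positions_soa(const float4 *pos, size_t natoms,
                            float4 *xs, float4 *ys, float4 *zs)
    {
        for (size_t i = 0; i < natoms/4; i++) {
            xs[i] = (float4){ pos[4*i].x, pos[4*i+1].x, pos[4*i+2].x, pos[4*i+3].x };
            ys[i] = (float4){ pos[4*i].y, pos[4*i+1].y, pos[4*i+2].y, pos[4*i+3].y };
            zs[i] = (float4){ pos[4*i].z, pos[4*i+1].z, pos[4*i+2].z, pos[4*i+3].z };
        }
    }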
Performance Optimizations: Shuffled Grids
[Diagram: a 1D grid f0, f1, f2, ... shuffled into float4 elements F0, F1, F2, ...,
each packing points drawn from the four quarters of the grid]
Consider a simple 1D stencil: g1 = a*f0 + b*f1 + c*f2
Would like to exploit float4 SIMD operations and update 4 points at once
Problem: stencils manipulate adjacent points ⇒ mixing components of the 4-vectors
Solution: “shuffle” the grid as shown to align adjacent points in the same 4-vector positions

Consider a float4 SIMD operation using the shuffled grid:  G1 = a*F0 + b*F1 + c*F2

Equivalent to performing:
    g1  = a*f0  + b*f1  + c*f2
    g16 = a*f15 + b*f16 + c*f17
    g31 = a*f30 + b*f31 + c*f32
    g46 = a*f45 + b*f46 + c*f47

Shuffling/unshuffling the grids can be done as a pre- and post-processing step
(see the sketch below)
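A shuffle consistent with the indices above is sketched below in plain C (hypothetical names; it assumes the grid length is a multiple of 4 and ignores the stencil end-points that straddle quarter boundaries, which the real kernels handle with masks and swizzles).

    #include <stddef.h>

    typedef struct { float x, y, z, w; } float4;

    /* Shuffle a 1D grid of n = 4*q points into q float4 elements so that
     * F[k] = ( f[k], f[k+q], f[k+2q], f[k+3q] ).  Adjacent grid points then sit
     * in the same component of adjacent 4-vectors, and a float4 stencil
     * G[k] = a*F[k-1] + b*F[k] + c*F[k+1] updates four points at once. */
    void shuffle_grid(const float *f, float4 *F, size_t q)
    {
        for (size_t k = 0; k < q; k++)
            F[k] = (float4){ f[k], f[k+q], f[k+2*q], f[k+3*q] };
    }

    /* Inverse operation, applied as a post-processing step. */
    void unshuffle_grid(const float4 *F, float *f, size_t q)
    {
        for (size_t k = 0; k < q; k++) {
            f[k]     = F[k].x;
            f[k+q]   = F[k].y;
            f[k+2*q] = F[k].z;
            f[k+3*q] = F[k].w;
        }
    }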
Test Setup
Hardware
• Host system was a simple desktop (CPU overclocked)
  • Phenom X4 9950 Black Edition, 3.0 GHz overclock (15x multiplier)
  • ASUS M3A78-T motherboard
  • 4 GB OCZ Reaper DDR2 1066 MHz (5-5-5-18) memory
  • AMD Radeon HD 4870 (512 MB GDDR5)

Software
• Scientific Linux 5.1
• AMD SDK (Brook+ compiler) v1.3
• GCC 4.1

(Results for older hardware/software will be noted)
Application Kernels
Objectives
• Evaluate representative computational kernels important in HPC
  • grids, finite-differencing, overlap integrals, particles
• Understand GPU architecture, performance and optimisations
• Understand how to design GPU-optimised stream applications

Approach
• Develop “clean” test codes, not full applications
  • Easy to instrument and modify
• Exception is LAMMPS, a real production code from DOE/Sandia
  • Exercise was to investigate treatment of a “real code”
  • Brings complexity, e.g., data structures not GPU-friendly
Electromagnetics: 3D FDTD
Direct iterative solution of Maxwell’s equations
Important for modeling electromagnetic radiation, from small devices to large-scale
radar applications
Grid-based finite-differencing
Electromagnetics: 3D FDTD: Implementation
GPU Acceleration (schematic):

    Initialization
    for each of N_step/N_burst bursts:
        Stream Read (host-to-GPU)
        for each of N_burst steps, on the GPU:
            Update E Field
            Update H Field
            Apply Excitation
        Stream Write (GPU-to-host)
    Finalization
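As a structural illustration only, the burst loop can be organized as in the sketch below (plain C, with callbacks standing in for the stream transfers and GPU kernels; all names are hypothetical). The point is that the host-GPU transfers occur once per burst while N_burst update steps run back-to-back on the GPU.

    /* Hypothetical burst driver: transfers are amortized over N_burst GPU steps. */
    typedef struct fdtd_ctx fdtd_ctx;                  /* opaque application state */

    void run_bursts(fdtd_ctx *ctx, int n_step, int n_burst,
                    void (*stream_read)(fdtd_ctx *),   /* host -> GPU           */
                    void (*update_e)(fdtd_ctx *),      /* E-field update kernel */
                    void (*update_h)(fdtd_ctx *),      /* H-field update kernel */
                    void (*excite)(fdtd_ctx *, int),   /* apply excitation      */
                    void (*stream_write)(fdtd_ctx *))  /* GPU -> host           */
    {
        for (int step = 0; step < n_step; step += n_burst) {
            stream_read(ctx);
            for (int b = 0; b < n_burst; b++) {
                update_e(ctx);
                update_h(ctx);
                excite(ctx, step + b);
            }
            stream_write(ctx);
        }
    }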
Electromagnetics: 3D FDTD
kernel void hcomp_gpu(
float t,
float nx, float ny, float nz4,
float bx, float by, float bz,
float4 EX[][], float4 EY[][], float4 EZ[][],
float4 HX0<>, float4 HY0<>, float4 HZ0<>,
out float4 HX<>, out float4 HY<>, out float4 HZ<>
) {
const float4 zero4 = float4(0.0f,0.0f,0.0f,0.0f);
const float4 one4 = float4(1.0f,1.0f,1.0f,1.0f);
const float4 bx4 = float4(bx,bx,bx,bx);
const float4 by4 = float4(by,by,by,by);
const float4 bz4 = float4(bz,bz,bz,bz);
float2 i000 = indexof(HX).xy;
float4 ix  = float4(i000.y,i000.y,i000.y,i000.y);
float  iy1 = floor(i000.x/nz4);
float4 iy  = float4(iy1,iy1,iy1,iy1);
float  iz1 = i000.x - iy1*nz4;
float4 iz  = float4(iz1,iz1,iz1,iz1);
float2 i100 = float2(i000.x,i000.y+1.0f);
float2 i010 = float2(i000.x+nz4,i000.y);
float2 i001 = float2(i000.x+1.0f,i000.y);
float2 i00o = float2(iy.x*nz4,i000.y);
...
Electromagnetics: 3D FDTD
...
float mx1 = nx-1.0f;
float4 mx = float4(mx1,mx1,mx1,mx1);
float my1 = ny-1.0f;
float4 my = float4(my1,my1,my1,my1);
float mz1 = nz4-1.0f;
float4 mz4 = float4(mz1,mz1,mz1,mz1);
float4 tmphx = HX0;
float4 tmphy = HY0;
float4 tmphz = HZ0;
float4 ex00o = EX[i00o];
float4 ex000 = EX[i000];
float4 ex001 = EX[i001];
float4 ex010 = EX[i010];
float4 ey00o = EY[i00o];
float4 ey000 = EY[i000];
float4 ey001 = EY[i001];
float4 ey100 = EY[i100];
float4 ez000 = EZ[i000];
float4 ez010 = EZ[i010];
float4 ez100 = EZ[i100];
float4 maska,maskb,maskc;
...
Electromagnetics: 3D FDTD
...
float4 maskx = one4;
float4 masky = one4;
float4 maskz = one4;
float4 maskg = float4(1.0f,1.0f,1.0f,0.0f);
ex001 = (iz == mz4)? ex00o.yzwx : ex001;
ey001 = (iz == mz4)? ey00o.yzwx : ey001;
maska = (ix == mx)? zero4 : one4;
maskb = (iy == my)? zero4 : one4;
maskc = (iz == mz4)? maskg : one4;
tmphx += bz4 * (ey001 - ey000) + by4 * (ez000 - ez010);
tmphy += bx4 * (ez100 - ez000) + bz4 * (ex000 - ex001);
tmphz += by4 * (ex010 - ex000) + bx4 * (ey000 - ey100);
tmphx *= maskb*maskc;
tmphy *= maska*maskc;
tmphz *= maska*maskb;
HX = tmphx;
HY = tmphy;
HZ = tmphz;
}
Electromagnetics: 3D FDTD
GPU vs. CPU: Time per Million Points
[Plot: time per million grid points (log scale, msec) vs. N_burst (2 to 5000 steps)
for GPU and CPU runs on 256x128x128, 384x128x128 and 512x128x128 grids]
Performing many iterations in between data transfer mitigates PCIe bottleneck
28x speedup for largest grid
Seismic: 3D VS-FDTD
Seismic simulation of velocity-stress wave propagation
Important algorithm for seismic forward modeling techniques
Used for iterative refinement and validation of sub-surface geological models
Commercial applications for oil and gas exploration
Military applications for detecting buried structures
Seismic: 3D VS-FDTD
GPU Acceleration (schematic):

    Initialization
    for each of N_step/N_burst bursts:
        Stream Read (host-to-GPU)
        for each of N_burst steps, on the GPU:
            Update Velocity Field
            Update Stress Field
            Apply Excitation
        Stream Write (GPU-to-host)
    Finalization
Seismic: 3D VS-FDTD: Benchmarks
GPU vs. CPU: Time per Million Points
[Plot: time per million grid points (log scale, msec) vs. N_burst (2 to 5000 steps)
for GPU and CPU runs on 256x128x128, 384x128x128 and 512x128x128 grids]
Results similar to electromagnetic FDTD
31x speedup for largest grid
Quantum Chemistry: Two-Electron Integrals
One of the most common approaches in quantum chemical modeling employs Gaussian
basis sets to represent the electronic orbitals of the system

A computationally costly component of these calculations is the evaluation of
two-electron integrals

For a Gaussian basis, evaluation of the two-electron integrals reduces to summation
over a closed-form expression (Boys, 1949)

Features of the expression to be evaluated:
• Certain pair quantities can be factored and pre-calculated
• The expression contains +, -, *, /, sqrt(), exp(), erf()
Quantum Chemistry: Two-Electron Integrals
GPU Acceleration (schematic):

    Initialization
    Stream Read (host-to-GPU)
    Pair Pre-Calc (GPU kernel)
    Calc 2-e Integrals (GPU kernel)
    Stream Write (GPU-to-host)
    Finalization
Quantum Chemistry: Two-Electron Integrals
Implementation Details
• Consider a simple test case: a 3D lattice of hydrogen atoms using an STO-6G basis (1s only)
• Evaluation of the two-electron integrals reduces to many summations over 36×36 = 1296 terms
• Use of float4 SIMD ops requires an inner loop of only 36×9 iterations
• Use of double2 SIMD ops requires an inner loop of only 36×18 iterations
• The most difficult part of the implementation was erf(), for which no hardware instruction exists
  • Most CPU-based codes use a piecewise approximation due to Cody (1968?)
    • Good for CPUs: reduces FLOPS at the expense of branching
    • Terrible for GPUs: branching is a performance killer
  • Used an approximation by Hastings (1949?) valid for the entire domain (with a few tricks);
    a sketch of this style of branch-free approximation is shown below
  • The quality of the erf() approximation warrants further investigation
• Benchmarks performed for various lattice dimensions (Nx,Ny,Nz), leading to a wide span in
  the number of integrals evaluated
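For illustration, a branch-free erf() of the kind described above is sketched below. This is not necessarily the fit used in this work; it is the well-known Hastings-style rational approximation given in Abramowitz & Stegun 7.1.26 (maximum absolute error about 1.5e-7), written so that every element executes the same instruction stream.

    #include <math.h>

    /* Hastings-style erf approximation (Abramowitz & Stegun 7.1.26).
     * Valid for all x; the sign is folded back in at the end rather than
     * handled with a data-dependent branch. */
    static float erf_approx(float x)
    {
        const float p  = 0.3275911f;
        const float a1 =  0.254829592f, a2 = -0.284496736f, a3 = 1.421413741f;
        const float a4 = -1.453152027f, a5 =  1.061405429f;

        float sign = (x < 0.0f) ? -1.0f : 1.0f;   /* typically compiles to a select */
        float ax   = fabsf(x);

        float t = 1.0f / (1.0f + p*ax);
        float y = 1.0f - ((((a5*t + a4)*t + a3)*t + a2)*t + a1)*t * expf(-ax*ax);

        return sign * y;
    }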
Quantum Chemistry: Two-Electron Integrals
GPU vs. CPU: Time per Million 2-e Integrals
[Plot: time per million two-electron integrals (log scale, 0.1 to 1000 sec) vs.
number of integrals (7.2E+04 to 3.2E+07); series: GPU-fNL, GPU-fFL, GPU-fU,
GPU-dU, GPU-dFLU2, CPU-f, CPU-d]
Various implementations:
float(f)/double(d), Nested-Loop (NL), Fused-Loop (FL), Unrolled (U)
Results are complex, reveal a lot about the architecture and run-time API
Best float implementation: fully unrolled loop (9 iterations)
Best double implementation: fused-loop w/partial (2 iteration) unroll
Quantum Chemistry: Two-Electron Integrals
GPU vs. CPU: Time per Million 2-e Integrals
[Plot: time per million two-electron integrals (log scale, 0.1 to 1000 sec) vs.
number of integrals (7.2E+04 to 2.0E+07); series: GPU-fU, GPU-fU-s2, GPU-dFLU2,
GPU-dFLU2-s2, CPU-f, CPU-d]
Large numbers of integrals: latency and GPU setup time are completely amortized
Small numbers of integrals: repeating the calculation (s2) reveals GPU setup vs. compute time
  The entire calculation is repeated, including the complete data transfer
  The s2 time is more reflective of real codes (integrals are re-evaluated repeatedly)
Quantum Chemistry: Two-Electron Integrals
STO-6G(1s), 4x4x4 lattice     Total                GPU Setup    GPU Compute
ATI/4870/single               0.968 sec   (244x)   0.678 sec    0.290 sec  (814x)
AMD/9950(3GHz)/single         236.242 sec
ATI/4870/double               2.728 sec   (72x)    0.241 sec    2.487 sec  (80x)
AMD/9950(3GHz)/double         198.749 sec
Nvidia/8800GTX/single*        1.123 sec
AMD/175/GAMESS*               90.6 sec

Speedups in parentheses are relative to the corresponding AMD/9950 CPU total time.
*Ufimtsev and Martinez
In the large-number-of-integrals limit (~10 million):
  SP: 774x speedup
  DP: 77x speedup
The CPU implementation is definitely not optimized;
the GPU performance/speedup will nevertheless be substantial
Molecular Dynamics: LAMMPS
[Figure: rhodopsin protein]

Fundamental technique for molecular modeling: simulate the motion of particles
subject to inter-particle forces

LAMMPS is an open-source MD code from DOE/Sandia
  Dr. Steve Plimpton, http://lammps.sandia.gov

Goal: accelerate the inter-particle force calculation
  *Original work due to Paul Crozier and Mark Stevens at Sandia National Labs

Rhodopsin Protein Benchmark (most difficult)
  Details: all-atom rhodopsin protein in a solvated lipid bilayer with the CHARMM
  force field, long-range Coulomb via PPPM, SHAKE constraints; the system contains
  counter-ions and a reduced amount of water
  Benchmark: 32,000 atoms for 100 timesteps
Molecular Dynamics: LAMMPS
GPU Acceleration (schematic):

    Initialization
    for each of N_step/N_step_nn neighbor-list intervals:
        NN Calc (neighbor list, on the CPU)
        for each of N_step_nn steps:
            Stream Read pos,vel (host-to-GPU)
            Pair Potential (GPU kernel)
            Stream Write (GPU-to-host)
            Propagator
    Finalization

Note: older results (July 2008) using a FireStream 9170 and ATI Stream SDK v1.1
Molecular Dynamics: LAMMPS
Implementation Details
• Only the pair potential calculation was moved to the GPGPU (> ~80% of run time on the CPU)
  • Specifically: PairLJCharmmCoulLong::compute()
• Basic algorithm: “for each atom i, calculate the force from atom j”
  • Atom i accessed in-order, atom j accessed out-of-order
  • Pairs defined by a pre-calculated nearest-neighbor list (updated periodically)
• CPU efficiency is achieved by using a “half list” such that j > i
  • Eliminates redundant force calculations
  • Cannot be done with GPU/Brook+ due to out-of-order writeback
• Must use a “full list” on the GPU (~2x penalty); see the sketch below
  • LAMMPS neighbor list calculation modified to generate the “full list”
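The half-list vs. full-list distinction can be seen in the two CPU-style loops sketched below; this is a generic illustration, not LAMMPS source, and pair_force() is a hypothetical 1D stand-in for the actual CHARMM/Coulomb force evaluation.

    /* Hypothetical pair force; the real code evaluates 3D CHARMM plus long-range
     * Coulomb terms.  Present only so the sketch is self-contained. */
    static double pair_force(int i, int j)
    {
        double r = (double)(i - j);
        return 1.0 / (1.0 + r*r);
    }

    /* Half list (CPU-friendly): each pair (i,j) with j > i is visited once and
     * the force is scattered to BOTH atoms, an out-of-order write to f[j]. */
    void compute_half(int natoms, const int *numneigh, int *const *neigh, double *f)
    {
        for (int i = 0; i < natoms; i++)
            for (int n = 0; n < numneigh[i]; n++) {
                int j = neigh[i][n];             /* j > i by construction */
                double fij = pair_force(i, j);
                f[i] += fij;
                f[j] -= fij;                     /* scatter: not possible per output element in Brook+ */
            }
    }

    /* Full list (GPU-friendly): each output element f[i] is owned by one kernel
     * invocation and only gathers from its neighbors, so every pair is computed
     * twice, hence the ~2x penalty noted above. */
    void compute_full(int natoms, const int *numneigh, int *const *neigh, double *f)
    {
        for (int i = 0; i < natoms; i++) {
            double fi = 0.0;
            for (int n = 0; n < numneigh[i]; n++)
                fi += pair_force(i, neigh[i][n]);
            f[i] = fi;
        }
    }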
Molecular Dynamics: LAMMPS
Implementation (More) Details

Host-side details:
• Pair potential compute function intercepted with a call to a special GPGPU function
• Nearest-neighbor list re-packed and sent to the board (only if new)
• Position/charge/type arrays repacked into GPGPU format and sent to the board
• Per-particle kernel called
• Force array read back and unpacked into LAMMPS format
• Energies and virial accumulated on the CPU (a reduce kernel was slower than the CPU)

GPU per-atom kernel details:
• Used 2D arrays except for the neighbor list
• Neighbor list used large 1D buffer(s) (no gain from use of a 2D array)
• Neighbor list padded modulo 8 (per-atom) to allow concurrent force updates; see the sketch below
• Calculated 4 force contributions per loop (no gain from 8)
• Neighbor list larger than the max stream (float4 <4194304>), so broken up into 8 lists
• Force update performed using 8 successive kernel invocations
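The modulo-8 padding might look like the sketch below (plain C; the flat packed layout and the -1 sentinel are assumptions of this sketch rather than details from the slide).

    #include <stddef.h>

    /* Pad each atom's neighbor list up to a multiple of 8 so the GPU kernel can
     * always process neighbors in fixed-size groups; pad entries use a sentinel
     * index that the kernel treats as a zero-force contribution. */
    size_t pack_padded_neighbors(int natoms, const int *numneigh,
                                 int *const *neigh, int *packed)
    {
        size_t pos = 0;
        for (int i = 0; i < natoms; i++) {
            int n    = numneigh[i];
            int npad = (n + 7) & ~7;          /* round up to a multiple of 8 */
            for (int k = 0; k < n; k++)
                packed[pos++] = neigh[i][k];
            for (int k = n; k < npad; k++)
                packed[pos++] = -1;           /* sentinel: contributes nothing */
        }
        return pos;                           /* total packed length */
    }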
Molecular Dynamics: LAMMPS
Benchmark Tests

General:
• Single-core performance benchmarks
• GPGPU implementation is single-precision
• 32,000 atoms, 100 timesteps (standard LAMMPS benchmark)

Test #1: GPGPU
  Pair potential calc on GPGPU, full neighbor list, newton=off, no Coulomb table
Test #2: CPU (“identical” algorithm, identical model); direct comparison (THEORY)
  Pair potential calc on CPU, full neighbor list, newton=off, no Coulomb table
Test #3: CPU (optimized algorithm, identical model)
  Pair potential calc on CPU, half neighbor list, newton=off, no Coulomb table
Test #4: CPU (optimized algorithm, optimized model); architecture optimized (REALITY)
  Pair potential calc on CPU, half neighbor list, newton=on, Coulomb table

ASCI RED single-core performance (from the LAMMPS website)
  Most likely a Test #4 configuration, included here for reference
Molecular Dynamics: LAMMPS
Rhodopsin Benchmark
[Chart: run time (0 to 250 sec) broken into Other, Neighbor Calc, Potential Calc,
and Total, for FireStream 9170 Test #1, Athlon 64 X2 3.2 GHz Tests #2, #3, #4,
and ASCI RED Xeon 2.66 GHz]
Amdahl’s Law: Pair Potential compared with total time: 35% (Test #1), 75% (Test #2), 83% (Test #4)
Molecular Dynamics: LAMMPS
Rhodopsin Benchmark
Speedup Using FireStream 9170 vs. CPU
[Chart: speedup (0 to 10x), for the potential calc only and for total run time,
relative to Athlon 64 X2 3.2 GHz Tests #2, #3, #4 and ASCI RED Xeon 2.66 GHz]
Molecular Dynamics: LAMMPS
Rhodopsin Benchmark
Effective* Floating-Point Performance
[Chart: effective GFLOPS (0 to 12), potential calc only and total throughput, for
FireStream 9170 Test #1, Athlon 64 X2 3.2 GHz Tests #2, #3, #4, and ASCI RED
Xeon 2.66 GHz]
Conclusions
GPUs provide tremendous raw floating-point performance

Compiler technology remains immature; however, it is relatively easy to accelerate
real algorithms

No longer limited to single precision

The days of trying to re-factor physics equations into OpenGL are over

GPU technology is advancing very rapidly in terms of price/performance and
performance/power

The commercial market that really drives this technology (video games) will not
slow down

This presents a very useful, fortunate situation for scientists and engineers who
wish to exploit this technology
Contact: [email protected]
Copyright © 2009 Brown Deer Technology, LLC. All Rights Reserved.