Stream Computing for GPU-Accelerated HPC Applications

David Richie, Brown Deer Technology
April 6th, 2009
Air Force Research Laboratory, Wright-Patterson Air Force Base
Copyright © 2009 Brown Deer Technology, LLC. All Rights Reserved.


Outline

• Modern GPUs
• Stream Computing Model
• Hardware/Software Test Setup
• Applications
  - Electromagnetics: 3D FDTD (Maxwell's Equations)
  - Seismic: 3D VS-FDTD (Elastic Wave Equation)
  - Quantum Chemistry: Two-Electron Integrals (STO-6G 1s)
  - Molecular Dynamics: LAMMPS (PairLJCharmmCoulLong)
• State of the Technology
• Conclusions


Modern GPU Architectures

FireStream 9250
• AMD RV770 architecture
• 800 SIMD superscalar processors
• Supports SSE-like vec4 operations
• IEEE single/double precision
• 1 TFLOPS peak single precision
• 200 GFLOPS peak double precision
• 1 GB GDDR3 on-board memory
• < 120 W max, 80 W typical
• MSRP $999


AMD/ATI GPU Form Factors

                              clock [MHz]   memory         single         double        power
  Radeon HD 4850 ($170)       625           1 GB GDDR3     1.0 TFLOPS     200 GFLOPS
  Radeon HD 4870 ($250)       750           512 MB GDDR5   1.2 TFLOPS     240 GFLOPS    160 W
  Radeon HD 4870X2 ($430)     750           2 GB GDDR5     2.4 TFLOPS     480 GFLOPS
  Radeon HD 4890 ($250)       850           1 GB GDDR5     1.36 TFLOPS    272 GFLOPS    190 W
  Radeon HD 4890OC ($265)     900           1 GB GDDR5     1.44 TFLOPS    288 GFLOPS
HPC:
  FireStream 9250 ($999)      625           1 GB GDDR3     1.0 TFLOPS     200 GFLOPS    < 120 W
  FireStream 9270 ($1499)     750           2 GB GDDR5     1.2 TFLOPS     240 GFLOPS    < 220 W
  1U Quad 9250 Server         625           4 GB GDDR3     4.0 TFLOPS     800 GFLOPS    < 480 W

  *Data gathered April 4, 2009


ATI Stream SDK

• SDK (v1.3)
• Open-systems approach
• Brook+ compiler (C/C++ variant): BASIC
• CAL low-level IL (generic ASM): EXPERT
• CAL run-time API (C++)

Stream paradigm:
• Formulate the algorithm as a SIMD kernel
• Read/write streams between host and board


Stream Computing

Pure stream computing: elegant, not useful
• Formulate algorithms based on the element-wise processing of multiple input streams into
  multiple output streams

Pragmatic stream computing:
• Allows treatment of algorithms that do not fit a pure stream computing model (most
  algorithms fall in this category)
• Allows scatter/gather memory access, which is needed in most algorithms
• The ATI Stream release of the Brook+ compiler fits this model
• One or more computational kernels are applied to a 1D, 2D or 3D stream
• The SIMT domain is driven implicitly by the dimensions of the output stream
• "Natural" streams are 2D; others use address translation

(Diagram: input streams feed a kernel, which produces output streams element by element.)


Brook+ Programming Model

prog.cpp (original):

    ...
    float a[256];
    float b[256];
    float c[256];

    for(i=0;i<N;i++) { c[i] = a[i]*b[i]; }
    ...

prog.cpp~ (accelerated):

    ...
    float a[256];
    float b[256];
    float c[256];

    Stream<float> s_a(1,N);
    Stream<float> s_b(1,N);
    Stream<float> s_c(1,N);

    // for(i=0;i<N;i++)
    // { c[i] = a[i]*b[i]; }
    s_a.read(a);
    s_b.read(b);
    foo_kern(s_a, s_b, s_c);
    s_c.write(c);
    ...
foo.br:

    kernel void foo_kern( float A<>, float B<>, out float C<> )
    {
        C = A * B;
    }

Recipe for Brook+ acceleration:
1) Identify the critical loop or section
2) Create streams for the datasets
3) Replace the loop or section with
   a) data transfer host-to-GPU
   b) execution of the GPU kernel(s)
   c) data transfer GPU-to-host
4) Transform the loop body or section into a SIMD kernel
5) Move the kernel to a Brook+ file (.br)


Brook+ Workflow

Build flow: brcc compiles each .br file (e.g., foo.br) into foo_gpu.cpp and foo_gpu.h, which
are then compiled and linked with the host code (prog.cpp) against libbrook (and the usual C
libraries) to produce prog.x.

Makefile:

    NAME   = prog
    BRSRCS = foo.br
    OBJS  += $(BRSRCS:.br=.o)
    INCS   = -I/usr/local/atibrook/sdk/include
    LIBS   = -L/usr/local/atibrook/sdk/lib -lbrook
    BRCC   = brcc

    all: $(NAME).x

    $(NAME).x: $(NAME).o $(OBJS)
    	$(CXX) $(CXXFLAGS) $(INCS) -o $(NAME).x \
    		$(NAME).o $(OBJS) $(LIBS)

    .SUFFIXES:
    .SUFFIXES: .br .cpp .o

    .br.cpp:
    	$(BRCC) $(BRCCFLAGS) -o $* $<

    .cpp.o:
    	$(CXX) $(CXXFLAGS) $(INCS) -c $<


GPU Kernel: Memory Model

Simple example: local weighted average of a 1D array, f(i) -> g(i)

     1: kernel void waverage3_kern(
     2:
     3:    float w0, float w1, float w2,
     4:    float s_f[],
     5:    out float s_g<> )
     6: {
     7:
     8:    float i1 = indexof(s_g);
     9:    float i0 = i1 - 1.0f;
    10:    float i2 = i1 + 1.0f;
    11:
    12:    float f0 = s_f[i0];
    13:    float f1 = s_f[i1];
    14:    float f2 = s_f[i2];
    15:
    16:    float g = w0*f0 + w1*f1 + w2*f2;
    17:
    18:    s_g = g;
    19: }

(3)     Simple scalar values, e.g., coefficients
(4)     Input (gather) stream; any element is accessible like a normal array
(5)     Output stream; the kernel is applied per element, and only that element can be written
(8)     indexof() returns the index of the stream element for this kernel invocation
(12-14) Gather (random) access needed for the non-local stencil
(16)    Local calculation of the weighted sum
(18)    Assign the result to the output stream; this must (and can) be done only once


Performance Optimizations: float4 Data

Simple particle pair interactions (pseudo-code).

"Obvious" choice: store particle positions in a float4, (x,y,z) ~ pos.xyz
• Creates a scalar bottleneck
• Calculates the force on only one particle per kernel invocation

    for(i=0;i<natoms;i++)
    for(j=0;j<natoms;j++) {
       float4 pos0 = s_pos[i];
       float4 pos1 = s_pos[j];
       float4 d = pos1 - pos0;
       float rsq = d.x*d.x + d.y*d.y + d.z*d.z;
       float f += (complicated function of rsq)
       ...

Instead, pack the quantities into separate streams to better exploit SIMD operations,
x0.xyzw ~ x-positions of four particles, etc.
• The entire calculation exploits SIMD operations
• Calculates the force on four particles per kernel invocation (~4x speedup)
• Requires some tricks, e.g., swizzling

    for(i=0;i<natoms/4;i++)
    for(j=0;j<natoms/4;j++) {
       float4 x0 = s_x[i];  float4 y0 = s_y[i];  float4 z0 = s_z[i];
       float4 x1 = s_x[j];  float4 y1 = s_y[j];  float4 z1 = s_z[j];
       float4 dx = x1 - x0;
       float4 dy = y1 - y0;
       float4 dz = z1 - z0;
       float4 rsq = dx*dx + dy*dy + dz*dz;
       float4 f += (complicated function of rsq)
       /* now repeat 3 times, swizzling x1,y1,z1 */
       x1 = x1.yzwx;   /* swizzle */
       ...
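Putting the pieces above together, a minimal Brook+-style kernel sketch for the packed layout
might look like the following. This is illustrative only: the stream names, the loop structure,
and the softened 1/r^2 force law are assumptions for this deck, not the kernel used in the
benchmarks later on.

    /* Hedged sketch only: accumulate a toy (softened) 1/r^2 contribution on four
       particles at once from the packed x/y/z float4 streams. natoms4 = natoms/4. */
    kernel void pair_force4_kern( float natoms4, float eps,
                                  float4 s_x[], float4 s_y[], float4 s_z[],
                                  out float4 s_f<> )
    {
        float  i  = indexof(s_f);            /* this invocation owns particles 4i..4i+3 */
        float4 x0 = s_x[i];
        float4 y0 = s_y[i];
        float4 z0 = s_z[i];
        float4 f    = float4(0.0f,0.0f,0.0f,0.0f);
        float4 one4 = float4(1.0f,1.0f,1.0f,1.0f);
        float4 eps4 = float4(eps,eps,eps,eps);
        float j;
        for (j = 0.0f; j < natoms4; j += 1.0f) {
            float4 x1 = s_x[j];
            float4 y1 = s_y[j];
            float4 z1 = s_z[j];
            float k;
            for (k = 0.0f; k < 4.0f; k += 1.0f) {   /* four swizzled passes cover all 4x4 pairings */
                float4 dx  = x1 - x0;
                float4 dy  = y1 - y0;
                float4 dz  = z1 - z0;
                float4 rsq = dx*dx + dy*dy + dz*dz;
                f += one4 / (rsq + eps4);           /* eps avoids the i==j singularity in this toy */
                x1 = x1.yzwx;                       /* rotate to the next pairing */
                y1 = y1.yzwx;
                z1 = z1.yzwx;
            }
        }
        s_f = f;
    }

After four swizzled passes, each of the four i-particles in this invocation has interacted with
each of the four j-particles, which is where the ~4x gain over the scalar formulation comes from.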
Performance Optimizations: Shuffled Grids

(Diagram: the original grid points f0, f1, f2, ... are interleaved into "shuffled" 4-vectors
F0, F1, F2, ...)

Consider a simple 1D stencil: g1 = a*f0 + b*f1 + c*f2
• We would like to exploit float4 SIMD operations and update 4 points at once
• Problem: stencils manipulate adjacent points, which means mixing components of the 4-vectors
• Solution: "shuffle" the grid as shown to align adjacent points in the same 4-vector positions

Consider a float4 SIMD operation using the shuffled grid:

    G1 = a*F0 + b*F1 + c*F2

This is equivalent to performing:

    g1  = a*f0  + b*f1  + c*f2
    g16 = a*f15 + b*f16 + c*f17
    g31 = a*f30 + b*f31 + c*f32
    g46 = a*f45 + b*f46 + c*f47

Shuffling/unshuffling the grids can be done as a pre- and post-processing step.


Test Setup

Hardware (the host system was a simple desktop with the CPU overclocked):
• AMD Phenom X4 9950 Black Edition overclocked to 3.0 GHz (15x multiplier)
• ASUS M3A78-T motherboard
• 4 GB OCZ Reaper DDR2-1066 memory (5-5-5-18)
• AMD Radeon HD 4870 (512 MB GDDR5)

Software:
• Scientific Linux 5.1
• AMD Stream SDK (Brook+ compiler) v1.3
• GCC 4.1

(Results for older hardware/software will be noted.)


Application Kernels

Objectives:
• Evaluate representative computational kernels important in HPC: grids, finite-differencing,
  overlap integrals, particles
• Understand GPU architecture, performance and optimizations
• Understand how to design GPU-optimized stream applications

Approach:
• Develop "clean" test codes, not full applications, that are easy to instrument and modify
• The exception is LAMMPS, a real production code from DOE/Sandia
  - The exercise was to investigate the treatment of a "real code"
  - This brings complexity, e.g., data structures that are not GPU-friendly


Electromagnetics: 3D FDTD

• Direct iterative solution of Maxwell's Equations
• Important for modeling electromagnetic radiation from small devices to large-scale radar
  applications
• Grid-based finite-differencing


Electromagnetics: 3D FDTD: Implementation

GPU-accelerated flow: after initialization, the time loop is executed in bursts. Each of the
N_step/N_burst bursts reads the field streams onto the board, runs N_burst iterations of
[update E field, update H field, apply excitation] entirely on the GPU, and writes the streams
back to the host; finalization follows. A sketch of this burst structure follows.
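A minimal host-side sketch of that burst structure, written against the Stream interface shown
on the Brook+ programming model slide. Everything here is assumed for illustration: the function
and kernel names, the single stream per field, and the stand-in kernel call (the real code uses
separate float4 streams per field component and the update kernels shown in the next section).

    // Hedged sketch of the burst structure only (all names assumed).
    void run_fdtd_bursts(float* efield, float* hfield, int npts, int nstep, int nburst)
    {
        Stream<float> s_e(1, npts);   // E-field data, resident on the board during a burst
        Stream<float> s_h(1, npts);   // H-field data

        for (int step = 0; step < nstep; step += nburst) {
            s_e.read(efield);                  // host -> GPU, once per burst
            s_h.read(hfield);
            for (int b = 0; b < nburst; b++) {
                update_fields_kern(s_e, s_h);  // assumed stand-in for the E-update, H-update,
                                               // and excitation kernels run back-to-back
            }
            s_e.write(efield);                 // GPU -> host, once per burst
            s_h.write(hfield);
        }
    }

The role of N_burst is visible in the benchmark plots that follow: the larger the burst, the more
completely the PCIe transfer cost is amortized.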
Electromagnetics: 3D FDTD: H-Field Update Kernel

The H-field update kernel (hcomp_gpu); the E-field update is analogous:

    kernel void hcomp_gpu(
       float t, float nx, float ny, float nz4,
       float bx, float by, float bz,
       float4 EX[][], float4 EY[][], float4 EZ[][],
       float4 HX0<>, float4 HY0<>, float4 HZ0<>,
       out float4 HX<>, out float4 HY<>, out float4 HZ<> )
    {
       const float4 zero4 = float4(0.0f,0.0f,0.0f,0.0f);
       const float4 one4  = float4(1.0f,1.0f,1.0f,1.0f);
       const float4 bx4   = float4(bx,bx,bx,bx);
       const float4 by4   = float4(by,by,by,by);
       const float4 bz4   = float4(bz,bz,bz,bz);

       /* 2D stream coordinate: .y encodes the x index; .x encodes iy*nz4 + iz */
       float2 i000 = indexof(HX).xy;

       float4 ix  = float4(i000.y,i000.y,i000.y,i000.y);
       float  iy1 = floor(i000.x/nz4);
       float4 iy  = float4(iy1,iy1,iy1,iy1);
       float  iz1 = i000.x - iy1*nz4;
       float4 iz  = float4(iz1,iz1,iz1,iz1);

       /* gather coordinates of the +x, +y, +z neighbors and of the start of this z-row */
       float2 i100 = float2(i000.x,      i000.y+1.0f);
       float2 i010 = float2(i000.x+nz4,  i000.y);
       float2 i001 = float2(i000.x+1.0f, i000.y);
       float2 i00o = float2(iy.x*nz4,    i000.y);

       float  mx1 = nx-1.0f;
       float4 mx  = float4(mx1,mx1,mx1,mx1);
       float  my1 = ny-1.0f;
       float4 my  = float4(my1,my1,my1,my1);
       float  mz1 = nz4-1.0f;
       float4 mz4 = float4(mz1,mz1,mz1,mz1);

       float4 tmphx = HX0;
       float4 tmphy = HY0;
       float4 tmphz = HZ0;

       float4 ex00o = EX[i00o];
       float4 ex000 = EX[i000];
       float4 ex001 = EX[i001];
       float4 ex010 = EX[i010];
       float4 ey00o = EY[i00o];
       float4 ey000 = EY[i000];
       float4 ey001 = EY[i001];
       float4 ey100 = EY[i100];
       float4 ez000 = EZ[i000];
       float4 ez010 = EZ[i010];
       float4 ez100 = EZ[i100];

       float4 maska, maskb, maskc;
       float4 maskx = one4;
       float4 masky = one4;
       float4 maskz = one4;
       float4 maskg = float4(1.0f,1.0f,1.0f,0.0f);

       /* at the end of a z-row, the z+1 neighbor wraps to the row start with a component shift */
       ex001 = (iz == mz4)? ex00o.yzwx : ex001;
       ey001 = (iz == mz4)? ey00o.yzwx : ey001;

       /* branch-free boundary handling: masks zero out updates on the outer faces */
       maska = (ix == mx)?  zero4 : one4;
       maskb = (iy == my)?  zero4 : one4;
       maskc = (iz == mz4)? maskg : one4;

       /* update H with the finite-difference curl of E (bx4/by4/bz4 are the update coefficients) */
       tmphx += bz4 * (ey001 - ey000) + by4 * (ez000 - ez010);
       tmphy += bx4 * (ez100 - ez000) + bz4 * (ex000 - ex001);
       tmphz += by4 * (ex010 - ex000) + bx4 * (ey000 - ey100);

       tmphx *= maskb*maskc;
       tmphy *= maska*maskc;
       tmphz *= maska*maskb;

       HX = tmphx;
       HY = tmphy;
       HZ = tmphz;
    }


Electromagnetics: 3D FDTD: Benchmarks

(Chart: GPU vs. CPU time [msec] per million grid points as a function of N_burst [steps], for
256x128x128, 384x128x128 and 512x128x128 grids, with N_burst from 2 to 5000.)

• Performing many iterations between data transfers mitigates the PCIe bottleneck
• 28x speedup for the largest grid


Seismic: 3D VS-FDTD

• Seismic simulation of velocity-stress wave propagation
• Important algorithm for seismic forward-modeling techniques
• Used for iterative refinement and validation of sub-surface geological models
• Commercial applications for oil and gas exploration
• Military applications for detecting buried structures
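Before moving to the seismic implementation (which follows the same burst structure), it is
worth looking back at how hcomp_gpu addresses the 3D grid. Each 3D field lives in a 2D stream
whose .y coordinate is the x index and whose .x coordinate packs (y, z); the z dimension uses
the shuffled layout from the earlier optimization slide, which is why a whole-float4 z neighbor
is simply the next stream element and only the row boundary needs the .yzwx swizzle. A hedged
host-side sketch of that packing, with the layout inferred from the kernel's index arithmetic
and all names assumed:

    // Hedged sketch (all names assumed): pack field[ix][iy][iz] into the 2D float4 layout
    // implied by hcomp_gpu. Element j of a z-row holds the four points
    // z = j, j+nz4, j+2*nz4, j+3*nz4 (the "shuffled grid").
    void pack_field(const float* field, float* stream, int nx, int ny, int nz)
    {
        int nz4 = nz / 4;                          // nz assumed divisible by 4
        for (int ix = 0; ix < nx; ix++)
        for (int iy = 0; iy < ny; iy++)
        for (int iz = 0; iz < nz; iz++) {
            int sx   = iy*nz4 + (iz % nz4);        // stream x: packed (y, shuffled z)
            int sy   = ix;                         // stream y: grid x
            int lane = iz / nz4;                   // which quarter of the z-row -> float4 lane
            stream[(sy*ny*nz4 + sx)*4 + lane] = field[(ix*ny + iy)*nz + iz];
        }
    }

Unpacking after the run is the inverse loop; as noted on the shuffled-grid slide, both are cheap
pre- and post-processing steps.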
Seismic: 3D VS-FDTD: Implementation

GPU-accelerated flow: identical in structure to the electromagnetic FDTD code. After
initialization, each of the N_step/N_burst bursts reads the streams onto the board, runs
N_burst iterations of [update velocity field, update stress field, apply excitation] on the
GPU, and writes the streams back; finalization follows.


Seismic: 3D VS-FDTD: Benchmarks

(Chart: GPU vs. CPU time [msec] per million grid points as a function of N_burst [steps], for
256x128x128, 384x128x128 and 512x128x128 grids.)

• Results are similar to the electromagnetic FDTD
• 31x speedup for the largest grid


Quantum Chemistry: Two-Electron Integrals

• One of the most common approaches in quantum chemical modeling employs Gaussian basis sets
  to represent the electronic orbitals of the system
• A computationally costly component of these calculations is the evaluation of the
  two-electron integrals
• For a Gaussian basis, evaluation of the two-electron integrals reduces to summation over a
  closed-form expression (Boys, 1949)
• Features of the expression to be evaluated:
  - Certain pair quantities can be factored out and pre-calculated
  - The expression contains +, -, *, /, sqrt(), exp(), erf()


Quantum Chemistry: Two-Electron Integrals: Implementation

GPU-accelerated flow: initialization, stream read, pair pre-calculation and evaluation of the
2-e integrals on the GPU, stream write, finalization.

Implementation details:
• Consider a simple test case: a 3D lattice of hydrogen atoms using an STO-6G basis (1s only)
• Evaluation of the two-electron integrals reduces to many summations over 36 × 36 = 1296 terms
  - Use of float4 SIMD ops requires an inner loop of only 36 × 9 iterations
  - Use of double2 SIMD ops requires an inner loop of only 36 × 18 iterations
• The most difficult part of the implementation involved erf(), for which no hardware
  instruction exists
  - Most CPU-based codes use a piecewise approximation due to Cody (1969): good for CPUs, since
    it reduces FLOPS at the expense of branching, but terrible for GPUs, where branching is a
    performance killer
  - Used an approximation by Hastings (1955) that is valid over the entire domain (with a few
    tricks); a sketch of such an approximation appears below
  - The quality of the erf() approximation warrants further investigation
• Benchmarks were performed for various lattice dimensions (Nx,Ny,Nz), leading to a wide span
  in the number of integrals evaluated


Quantum Chemistry: Two-Electron Integrals: Benchmarks

(Chart: GPU vs. CPU time [sec] per million 2-e integrals vs. the number of integrals, from
7.2E+04 to 3.2E+07, for the GPU variants fNL, fFL, fU, dU and dFLU2 and for CPU float/double.)

• Implementation variants: float (f) / double (d); nested loop (NL), fused loop (FL),
  unrolled (U)
• The results are complex and reveal a lot about the architecture and the run-time API
• Best float implementation: fully unrolled loop (9 iterations)
• Best double implementation: fused loop with a partial (2-iteration) unroll
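On the erf() point above: one widely used Hastings-type formula (the rational approximation
given in Abramowitz & Stegun 7.1.26, accurate to about 1.5e-7) is branch-free apart from the
sign handling, so it maps well onto a GPU kernel. This is a generic illustration, not
necessarily the exact variant used in these benchmarks:

    #include <math.h>

    /* Hastings-type rational approximation of erf(x) (Abramowitz & Stegun 7.1.26).
       Maximum absolute error ~1.5e-7; uses only +,-,*,/ and exp(), no piecewise branches. */
    float erf_hastings(float x)
    {
        float sign = (x < 0.0f) ? -1.0f : 1.0f;   /* erf is odd; could use copysignf() */
        float ax   = fabsf(x);
        float t    = 1.0f / (1.0f + 0.3275911f * ax);
        float poly = t * (0.254829592f +
                     t * (-0.284496736f +
                     t * (1.421413741f +
                     t * (-1.453152027f +
                     t * 1.061405429f))));
        return sign * (1.0f - poly * expf(-ax * ax));
    }

The same arithmetic can be inlined into a float4 kernel, which is the property that matters for
avoiding divergent branches on the GPU.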
Quantum Chemistry: Two-Electron Integrals: GPU Setup vs. Compute

(Chart: GPU vs. CPU time [sec] per million 2-e integrals vs. the number of integrals, from
7.2E+04 to 2.0E+07, comparing GPU-fU and GPU-dFLU2 with their repeated-run (s2) variants and
with CPU float/double.)

• For large numbers of integrals, latency and GPU setup time are completely amortized
• For small numbers of integrals, repeating the calculation (s2) separates GPU setup time from
  compute time; the entire calculation is repeated, including the complete data transfer
• The s2 time is more reflective of real codes, where the integrals are re-evaluated repeatedly


Quantum Chemistry: Two-Electron Integrals: Timings

STO-6G(1s), 4x4x4 lattice:

                              Total         GPU Setup    GPU Compute    Speedup (total / compute)
  ATI 4870, single            0.968 sec     0.678 sec    0.290 sec      244x / 814x
  AMD 9950 (3 GHz), single    236.242 sec
  ATI 4870, double            2.728 sec     0.241 sec    2.487 sec      72x / 80x
  AMD 9950 (3 GHz), double    198.749 sec
  Nvidia 8800GTX, single*     1.123 sec
  AMD/175/GAMESS*             90.6 sec

  *Ufimtsev and Martinez

• In the large-number-of-integrals limit (~10 million): 774x speedup in single precision, 77x
  in double precision
• The CPU implementation is definitely not optimized, but the GPU performance/speedup will
  nevertheless be substantial


Molecular Dynamics: LAMMPS

• Molecular dynamics is a fundamental technique for molecular modeling: simulate the motion of
  particles subject to inter-particle forces
• LAMMPS is an open-source MD code from DOE/Sandia (Dr. Steve Plimpton, http://lammps.sandia.gov)
• Goal: accelerate the inter-particle force calculation
• *Original work due to Paul Crozier and Mark Stevens at Sandia National Labs

Rhodopsin protein benchmark (the most difficult):
• All-atom rhodopsin protein in a solvated lipid bilayer with the CHARMM force field, long-range
  Coulomb via PPPM, SHAKE constraints; the system contains counter-ions and a reduced amount of
  water
• Benchmark: 32,000 atoms for 100 timesteps


Molecular Dynamics: LAMMPS: GPU Acceleration

Flow: after initialization, the nearest-neighbor (NN) list is recomputed every N_step_nn steps;
within each group of N_step_nn steps, positions and velocities are streamed to the board, the
pair potential is evaluated on the GPU, the forces are streamed back, and the propagator runs
on the host; finalization follows.

Note: these are older results (July 2008) using a FireStream 9170 and ATI Stream SDK v1.1.


Molecular Dynamics: LAMMPS: Implementation Details

• Only the pair potential calculation was moved to the GPGPU (roughly 80% or more of the run
  time on the CPU); specifically, PairLJCharmmCoulLong::compute()
• Basic algorithm: "for each atom i, calculate the force from atom j"
  - Atom i is accessed in order, atom j out of order
  - Pairs are defined by a pre-calculated nearest-neighbor list (updated periodically)
• CPU efficiency is achieved by using a "half list" such that j > i, which eliminates redundant
  force calculations
• This cannot be done with GPU/Brook+ because of the out-of-order writeback, so a "full list"
  must be used on the GPU (~2x penalty); the LAMMPS neighbor-list calculation was modified to
  generate the full list (the contrast is sketched below)
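A hedged, schematic C sketch of that half-list vs. full-list contrast (not LAMMPS source):
pair_force() stands in for the real LJ/Coulomb evaluation, and the forces are reduced to
scalars so the write patterns stay visible.

    // Half neighbor list (CPU): each pair appears once and Newton's third law updates
    // both atoms, but f[j] is an out-of-order (scatter) write.
    void accumulate_half(int natoms, const int* numneigh, int** neigh,
                         double (*pair_force)(int, int), double* f)
    {
        for (int i = 0; i < natoms; i++)
            for (int k = 0; k < numneigh[i]; k++) {
                int j = neigh[i][k];          /* list built so that j > i */
                double fp = pair_force(i, j);
                f[i] += fp;
                f[j] -= fp;                   /* scatter write: no Brook+ equivalent */
            }
    }

    // Full neighbor list (GPU model): every neighbor of i is listed, each pair is evaluated
    // twice (~2x penalty), but atom i only writes its own entry, which is exactly the
    // per-element output that a Brook+ kernel allows.
    void accumulate_full(int natoms, const int* numneigh, int** neigh,
                         double (*pair_force)(int, int), double* f)
    {
        for (int i = 0; i < natoms; i++) {
            double fi = 0.0;
            for (int k = 0; k < numneigh[i]; k++)
                fi += pair_force(i, neigh[i][k]);
            f[i] = fi;
        }
    }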
Molecular Dynamics: LAMMPS: Implementation Details (continued)

Host-side details:
• The pair-potential compute function is intercepted with a call to a special GPGPU function
• The nearest-neighbor list is re-packed and sent to the board (only if new)
• The position/charge/type arrays are repacked into GPGPU format and sent to the board
• The per-particle kernel is called
• The force array is read back and unpacked into LAMMPS format
• Energies and the virial are accumulated on the CPU (a reduce kernel was slower than the CPU)

GPU per-atom kernel details:
• Used 2D arrays except for the neighbor list
  - The neighbor list used large 1D buffer(s) (no gain from a 2D array)
• The neighbor list is padded modulo 8 (per atom) to allow concurrent force updates
  - Calculated 4 force contributions per loop iteration (no gain from 8)
• The neighbor list is larger than the maximum stream size (4,194,304 float4 elements), so it
  is broken up into 8 lists and the force update is performed with 8 successive kernel
  invocations


Molecular Dynamics: LAMMPS: Benchmark Tests

General:
• Single-core performance benchmarks
• The GPGPU implementation is single precision
• 32,000 atoms, 100 timesteps (the standard LAMMPS benchmark)

Tests:
• Test #1: GPGPU. Pair potential calculated on the GPGPU, full neighbor list, newton=off, no
  Coulomb table
• Test #2: CPU ("identical" algorithm, identical model). Pair potential calculated on the CPU,
  full neighbor list, newton=off, no Coulomb table. Direct comparison (THEORY)
• Test #3: CPU (optimized algorithm, identical model). Pair potential calculated on the CPU,
  half neighbor list, newton=off, no Coulomb table
• Test #4: CPU (optimized algorithm, optimized model). Pair potential calculated on the CPU,
  half neighbor list, newton=on, Coulomb table. Architecture optimized (REALITY)
• ASCI Red single-core performance (from the LAMMPS website), most likely a Test #4
  configuration, is included for reference


Molecular Dynamics: LAMMPS: Rhodopsin Benchmark

(Bar chart: total run time broken down into pair-potential, neighbor and other time for the
FireStream 9170 (Test #1), the Athlon 64 X2 3.2 GHz (Tests #2, #3, #4), ASCI Red and a
2.66 GHz Xeon.)

Amdahl's Law: the pair potential as a fraction of total time is 35% (Test #1), 75% (Test #2)
and 83% (Test #4); see the short calculation below.

(Bar chart: speedup of the FireStream 9170 vs. the CPU, for the potential calculation alone and
for the total run time, against the Athlon 64 X2 3.2 GHz (Tests #2, #3, #4), ASCI Red and the
2.66 GHz Xeon.)

(Bar chart: effective floating-point performance [GFLOPS], for the potential calculation alone
and for total throughput, on the FireStream 9170 (Test #1), the Athlon 64 X2 3.2 GHz (Tests #2,
#3, #4), ASCI Red and the 2.66 GHz Xeon.)
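To make the Amdahl's Law point concrete: if a fraction p of the run time is in the pair
potential and only that part is accelerated by a factor s, the overall speedup is
1/((1-p) + p/s). A small illustration using the fractions quoted above; the kernel speedup
plugged in here is a placeholder, not a measurement from the charts.

    #include <stdio.h>

    /* Overall speedup when a fraction p of the run time is accelerated by a factor s. */
    double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* With 83% of the CPU time in the pair potential (Test #4 above), even an
           infinitely fast kernel caps the whole-application speedup at 1/(1-0.83) ~ 5.9x. */
        printf("p=0.83, s=8   -> %.2fx\n", amdahl(0.83, 8.0));     /* ~3.7x */
        printf("p=0.83, s=inf -> %.2fx\n", 1.0 / (1.0 - 0.83));    /* ~5.9x */
        return 0;
    }

This is why the total-run-time speedups in the charts are far smaller than the
potential-calculation-only speedups: the neighbor-list build and the rest of the timestep still
run on the host.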
Conclusions

• GPUs provide tremendous raw floating-point performance
• Compiler technology remains immature; nevertheless, it is relatively easy to accelerate real
  algorithms
• GPUs are no longer limited to single precision
• The days of trying to re-factor physics equations into OpenGL are over
• GPU technology is advancing very rapidly in terms of price/performance and performance/power
  - The commercial market that really drives this technology (video games) will not slow down
  - This is a very useful, fortunate situation for scientists and engineers who wish to exploit
    the technology

Contact: [email protected]