Practical Parallel Rendering with DirectX 9 and 10
Windows PC Command Buffers

Vincent Scheib
Architect, Gamebryo
Emergent Game Technologies

Foundational technology, over 200 shipped titles, more than 13 genres, and multiple platforms.

Customers
[Figure: customer titles, genres, and platforms]
• Titles: Civilization 4, Munch's Oddysee, Sid Meier's Pirates!, Barbie Digital Makeover, Zero Cup Soccer, Sim Patient, Elder Scrolls, Dark Age of Camelot, Tetris Worlds, Futurama, Crash Racing
• Genres: Action, Strategy, Family, Sports, Adventure, Vis / Sim, MMO, RPG, Platformer, Racing, Puzzle
• Platforms: PC, Xbox 360, GC, PS3, PS2, Xbox, Wii

Introduction
• Take advantage of multiple cores with parallel rendering
• Performance should scale by number of cores
[Chart: Performance Ratio (0 to 4) for Single Core, Dual Core, and Quad Core. Observed data from this project, details follow.]

Presentation Outline
• Motivation and problem definition
• Command buffers
  – Requirements
  – Implementation
  – Handling effects and resources
• Application models
• Integrating to existing code
• Prototype results
• Future work

Motivation
• Take advantage of multi-core machines
  – 40% of machines have 2+ physical CPUs (Steam survey, Jul 2008)
• Rendering can have high CPU cost
• Direct3D 11 display lists are coming, but we want support for Direct3D 9 and 10 now
  – Currently 81% DX9 HW, 9% DX10 HW (Steam survey, Jul 2008)
  – Rough DX9 HW forecast: 2011 ~30% (Emergent)
  – Asia HW trends lag somewhat

Multithreaded DX Device?
• DirectX 9 and 10 are primarily designed for single-threaded game architectures
• Multithreaded mode incurs overhead
  – Cuts FPS roughly in half on DX9 for an application bound by CPU render calls
• DX is stateful
  – Requires additional synchronization for parallel rendering

Ideal Scenario
• One thread per hardware thread
• Application manages dispatching work to multiple threads
• Rendering data completely prepared, ready to be sent to a single-threaded D3D device
  – Function calls, conditionals, and final matrix multiplies are wasted time on a D3D device thread

Reality
• Update()
  – Seldom generates coherent data in an API-specific format
• Render()
  – Some work is done between calls to the DirectX API

Going Wide
[Diagram: Update and Render work spread across a Main Thread and three Worker Threads]

Command Buffers
• Record calls to D3D
  – Store in a command buffer
  – Can be done concurrently on multiple threads, to multiple command buffers
• Play back D3D commands
  – Efficiently on the main thread
  – Exact data for the DX API
  – Coherent in memory
• Clean, modular point of integration with the application

Command Buffer Requirements
• Minimal modifications to rendering code
  – Most code uses a pointer to the D3DDevice
  – Parameters from the stack, e.g., D3DRECT
  – Support most of the device API
    • Draw calls, setting state, constants, shaders, textures, stream source, and so on
  – Support effects
• Playback does not modify the buffer
• Playback has ideal performance

Command Buffer Allowances
• No support for:
  – Create methods
  – Get methods
  – Miscellaneous other functions that return values
    • QueryInterface, ShowCursor

Command Buffer: Nice to Have
• Buffers played back multiple times
• Optimization of buffers
  – Remove redundant state calls
    • Offload the main thread by doing this on recorder threads
  – Reordering of sort-independent draw calls

Design: Recording
• Wrap every API call
  – Unsupported calls: return an error
  – Supported calls
    • Store the enumeration value for the call into the buffer
    • Store the parameters into the buffer
    • Make copies of non-reference-counted objects such as D3DMATRIX, D3DRECT, shader constants, and so on

Design: Playback
• Play back by reading from the buffer, and
  – select the function pointer from a table, given the token
  – each playback function unpacks its parameters from the buffer

Recording Example

    // The recording device stores the call token, then each parameter by value.
    virtual HRESULT STDMETHODCALLTYPE DrawPrimitive(
        D3DPRIMITIVETYPE PrimitiveType,
        UINT StartVertex,
        UINT PrimitiveCount)
    {
        m_pCommandBuffer->Put(CBD3D_COMMANDS::DrawPrimitive);
        m_pCommandBuffer->Put(PrimitiveType);
        m_pCommandBuffer->Put(StartVertex);
        m_pCommandBuffer->Put(PrimitiveCount);
        return D3D_OK;
    }

Playback Example

    // Parameters are unpacked in the same order they were recorded.
    void CBPlayer9::DoDrawPrimitive()
    {
        D3DPRIMITIVETYPE arg1;
        m_pCommandBuffer->Get(&arg1);
        UINT arg2;
        m_pCommandBuffer->Get(&arg2);
        UINT arg3;
        m_pCommandBuffer->Get(&arg3);
        // __FUNCTION__ concatenates as a string literal under MSVC.
        if (FAILED(m_pDevice->DrawPrimitive(arg1, arg2, arg3)))
            OutputDebugStringA(__FUNCTION__ " failed in playback\n");
    }

Effects: Problem
• Effect takes a pointer to the device at creation
• Effect then creates resources
• At render, the effect should use our recorder
• Our recording device cannot create resources

Effects: Solutions
1. Create FX with the command buffer device
   • Fails: needs a real device for initialization
2. Wrap and record FX calls and play them back
   • Inefficient
3. Give FX an EffectStateManager class to redirect calls to the command buffer, and give it the real device for initialization
   • Disables FX use of state blocks
4. Create a redirecting device
   • Acts as the real device at init, and as the command buffer device at render time

Resource Management
• Multiple threads wish to:
  – Create resources (e.g., background loading)
  – Update resources (e.g., dynamic geometry)
• App must use the playback thread only to modify resources
  – App-specific logic
    • Deferred creation, double buffering
  – Support in command buffers (next slide)

Resource Management (2)
• Command buffer library could encapsulate the details
  – (This is Future Work)
• Gamebryo Volatile Type Buffers
  – D3DUSAGE: WRITEONLY | DYNAMIC
  – D3DLOCK: NOOVERWRITE, DISCARD
  – Lock() is stored into the command buffer
  – Memory is allocated from the command buffer and returned from Lock()
  – At playback, the true lock is performed
• Gamebryo Mutable Type Buffers
  – CPU read and infrequent access
  – Backing store required, copied on each Lock()

Implementation Considerations
• Ease of changing the implementation
  – Macros provide the implementation
  – Preprocessor and beautifier produce debuggable code
  – Many macro permutations required (~40) for different argument counts and return types
    • Generated from Excel
  – Function overloading to store non-reference-counted parameters
    • Everything but shader constants is then stored with the same function signature

Application Models
• Command buffers can be used in various ways by applications
  – Fork and join
  – Fork and join, frame deferred
  – Work queue
  – …
• Record once, play back several times

Fork & Join
[Diagram: per-frame timeline for a Main Thread and a Worker Thread]
• Main Thread: … Update …; signal threads to start; Render: record command buffer; wait for command buffers; play back command buffers
• Worker Thread: wait for signal; record command buffer; signal command buffer complete
• Diagram annotation: Starve!
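To make the recording, playback, and fork-and-join slides concrete, here is a minimal sketch in standard C++ with no Direct3D dependency. Everything in it is an assumption for illustration, not the library's code: MockDevice, CommandBuffer, Recorder, Playback, and RenderFrameForkJoin are hypothetical names, the device only logs its calls, and playback dispatches with a switch rather than the function pointer table the slides describe. It also joins the worker threads each frame instead of using the signals shown in the diagram.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

using u32 = std::uint32_t;

// Hypothetical stand-in for the real D3D device; it only logs draw calls.
struct MockDevice {
    std::vector<std::string> log;
    void DrawPrimitive(u32 type, u32 start, u32 count) {
        log.push_back("Draw " + std::to_string(type) + " " +
                      std::to_string(start) + " " + std::to_string(count));
    }
};

enum class Command : u32 { DrawPrimitive };

// Flat byte buffer: tokens and parameters are stored back to back.
class CommandBuffer {
public:
    template <typename T> void Put(const T& value) {
        const auto* p = reinterpret_cast<const unsigned char*>(&value);
        m_bytes.insert(m_bytes.end(), p, p + sizeof(T));
    }
    // Reads at a caller-owned cursor, so playback never modifies the buffer.
    template <typename T> T Get(std::size_t& cursor) const {
        T value;
        std::memcpy(&value, m_bytes.data() + cursor, sizeof(T));
        cursor += sizeof(T);
        return value;
    }
    std::size_t Size() const { return m_bytes.size(); }
private:
    std::vector<unsigned char> m_bytes;
};

// Recording device: same call shape as the device, but it only writes
// the call token and its parameters into a command buffer.
class Recorder {
public:
    explicit Recorder(CommandBuffer& cb) : m_cb(cb) {}
    void DrawPrimitive(u32 type, u32 start, u32 count) {
        m_cb.Put(Command::DrawPrimitive);
        m_cb.Put(type);
        m_cb.Put(start);
        m_cb.Put(count);
    }
private:
    CommandBuffer& m_cb;
};

// Playback: read a token, then unpack that call's parameters in record order.
void Playback(const CommandBuffer& cb, MockDevice& device) {
    std::size_t cursor = 0;
    while (cursor < cb.Size()) {
        switch (cb.Get<Command>(cursor)) {
        case Command::DrawPrimitive: {
            u32 type = cb.Get<u32>(cursor);
            u32 start = cb.Get<u32>(cursor);
            u32 count = cb.Get<u32>(cursor);
            device.DrawPrimitive(type, start, count);
            break;
        }
        default:
            assert(false && "unknown command token");
            return;
        }
    }
}

// Fork: each worker records its slice of draw calls into its own buffer.
// Join: the calling thread waits, then plays the buffers back in order.
MockDevice RenderFrameForkJoin(u32 drawCalls, unsigned workerCount) {
    std::vector<CommandBuffer> buffers(workerCount);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&buffers, w, workerCount, drawCalls] {
            Recorder recorder(buffers[w]);
            u32 begin = drawCalls * w / workerCount;
            u32 end = drawCalls * (w + 1) / workerCount;
            for (u32 i = begin; i < end; ++i)
                recorder.DrawPrimitive(4 /* D3DPT_TRIANGLELIST */, i * 3, 1);
        });
    }
    for (auto& t : workers) t.join();

    MockDevice device;  // only this thread ever touches the device
    for (const auto& cb : buffers) Playback(cb, device);
    return device;
}
```

Calling RenderFrameForkJoin(8, 4) records two draw calls on each of four threads, and the device still sees all eight in submission order because the buffers are played back in worker order. Keeping one buffer per thread means recording needs no locking; the only synchronization point is the join before playback.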

Fork & Join, Frame Deferred
[Diagram: per-frame timeline for a Main Thread and a Worker Thread]
• Main Thread: … Update …; signal threads to start; Render: record command buffer; next frame: … Update …; play back command buffers
• Worker Thread: wait for signal; record command buffer; signal command buffer complete

Work Queue
[Diagram: timeline of Update, Record, and Play tasks distributed across the Main Thread and a Worker Thread]

Adapting to an Existing Codebase
• Refactor code to take a pointer to a device that can be changed easily
  – Easy if the pointer is passed on the stack
  – Thread local storage if used from the heap
• Add ownership of recording devices, playback class, and pool of command buffers
• Determine the application model, and add high-level logic to parcel out rendering work
• Manage resources over recording and playback

Integration into DX Samples
• Instancing
  – Effects, shader constants
• Textures tutorial
  – Simple, added multithreading
• Stress test
  – Fork and join multithreading, with optional:
    • Frame delay of playback
    • Draw call count
    • CPU and memory access
    • Recorder thread count

Stress Test Information
• Render call contains:
  – Matrices computed with D3DX calls * 3
  – SetTransform * 3
  – SetRenderState
  – SetTexture
  – SetTextureStageState * 8
  – SetStreamSource
  – SetFVF
  – DrawPrimitive

CPU Busy Loops
• Draw call CPU cost varies in real applications
• Stress test simulates cost with CPU Busy Loops
  – Scattered reads from a large buffer in memory
  – Perform some logic, integer, and floating point operations
• Gamebryo render on DX9: 100-200 μs
  – (on a Pentium 4, 3 GHz, NVIDIA 7800)
• Stress test can simulate Gamebryo render calls with 0-200 loops

DX Sample Stress Test Demo

DX Call Cost vs. Recorder Cost
• Render call cost with the DirectX device is 13 times as expensive as with the command buffer recorder
  – DX: 92 μs
  – Recorder: 7 μs
  – (on a Pentium 4, 3 GHz, NVIDIA 7800)

Thread Profiler: Quad Core, 1 Recorder Thread
• CPU Busy Loops: 110
[Screenshot: thread profiler timeline with Record and Playback regions]

Thread Profiler: Quad Core, 4 Recorder Threads
[Screenshot: thread profiler timeline]

FPS by Threads and Computer
[Chart: Sum of FPS (0 to 70) by Threads (0 to 5). CPU Busy Loops: 150. DrawPrimitives: 1936.]
• Series (Cores - Computer - GPU):
  – 2 - XP-A - Intel G965 Express
  – 2 - XP-A - NVIDIA GeForce 7800 GTX
  – 2 - XP-B - NVIDIA GeForce 8800 GTS 512
  – 4 - XP-C - NVIDIA GeForce 8800 GT
  – 4 - Vista-A - NVIDIA GeForce 8800 GTX

Definition: Performance Ratio
• Charts that follow use Performance Ratio = FPS test / FPS baseline
• Normalized result
• Useful for comparisons while varying
  – Number of draw calls
  – CPU busy loops

Perf by Threads & Busy Loops
[Chart: Average of FPSPerfRatio (0 to 4) by Threads (0 to 5). Computer: XP-C. Cores: 4. DrawPrimitives: 1936. Series: CPU Busy Loops 0, 50, 100, 150, 200, 250.]

Perf by Threads & Busy Loops
[Chart: Average of FPSPerfRatio (0 to 4) by Threads (0 to 5). Computer: XP-C. Cores: 4. Series (DrawPrimitives - CPU Busy Loops): 100, 196, and 289 DrawPrimitives, each with 0, 50, 100, 150, 200, and 250 CPU Busy Loops.]

Perf by Draws & Busy Loops
[Chart: Average of FPSPerfRatio (0 to 4) by DrawPrimitives (100 to 1936). Computer: XP-C. Cores: 4. Threads: 4. Series: CPU Busy Loops 0, 50, 100, 150, 200, 250.]

Perf by Busy Loops & Draws
[Chart: Average of FPSPerfRatio (0 to 4) by CPU Busy Loops (0 to 250). Computer: XP-C. Cores: 4. Threads: 4. Series: DrawPrimitives 100, 400, 784, 1156, 1600, 1936.]

Dual Core Results
[Chart: Average of FPSPerfRatio (0 to 2) by Threads (0 to 3). Cores: 2. DrawPrimitives: 1936.]
• Series (GPU - CPU Busy Loops):
  – Intel G965 Express - 50
  – Intel G965 Express - 150
  – NVIDIA GeForce 7800 GTX - 50
  – NVIDIA GeForce 7800 GTX - 150
  – NVIDIA GeForce 8800 GTS 512 - 50
  – NVIDIA GeForce 8800 GTS 512 - 150

Future Work
• Resource management facilitated through the command buffer, instead of application logic
• Optimization of command buffers by reordering order-independent draw calls
• DirectX 10

Open Source Library
• Emergent has open sourced the command buffer library
  – Command buffer serialization
  – Recording device
  – Playback class
  – Redirecting device
  – EffectStateManager
  – DX9 only so far

Thank You. Questions?
  – [email protected]
  – Co-Developer: Bo Wilson
• For code & presentation, Google: parallel rendering scheib