Slides

DirectX10: porting, performance and “gotchas”
Guennadi Riguer
AMD
DirectX 10 and Me…
- The best DirectX ever: I love it!
  - Powerful feature set
- With power comes responsibility
  - More ways to do something wrong and kill performance
- New or changed behaviors: many porting “gotchas”
Deprecated Features
- Alpha test
- Triangle fans
- Point sprites
- Wrap texture modes
- TnL clip planes
Fullscreen Initialization
- Special flag in swap chain description to allow mode switch

    DXGI_SWAP_CHAIN_DESC scd;
    scd.Flags = DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH;
    // The swap chain description is passed by pointer
    D3D10CreateDeviceAndSwapChain(…, &scd, …);

- Otherwise the back buffer is stretch-blitted to desktop resolution
Pixel Coordinate System
- Finally pixels and texels match
- Don’t need to offset position or texture coords by 0.5 texel
[Figure: location of the (0, 0) origin on the pixel grid in DirectX 9 vs. DirectX 10]
Pixel Coordinate System
- This will affect all the screen space DirectX 9 code/shaders
- If offset is done in PS, move it to app code
  - Make shaders DX9/10 cross-compatible
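One way to keep shaders cross-compatible is to compute the half-pixel correction in application code and pass it in as a constant that is zero on DirectX 10. The sketch below is a minimal illustration of that arithmetic only; the type and function names are hypothetical, and it assumes the usual DirectX 9 fix of shifting geometry half a pixel up and to the left.

```cpp
#include <cassert>

// Hypothetical helper type, for illustration only.
struct Offset2 { float x; float y; };

// Clip-space offset that shifts geometry by half a pixel up and to the left.
// One pixel spans 2/width in clip-space x and 2/height in clip-space y,
// so half a pixel is 1/width and 1/height. Clip-space y points up.
Offset2 halfPixelClipOffset(bool isDx9, int width, int height) {
    assert(width > 0 && height > 0);
    if (!isDx9) return {0.0f, 0.0f};   // DirectX 10: pixels and texels already match
    return {-1.0f / width, 1.0f / height};
}

// Coordinate of the center of pixel i along one axis.
// DirectX 9 puts centers on integers, DirectX 10 on half-integers.
float pixelCenter(bool isDx9, int i) {
    return isDx9 ? float(i) : float(i) + 0.5f;
}
```

The application would add the returned offset to post-projection positions (scaled by w for non-screen-aligned geometry), so the same shader source can serve both APIs.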
Small Batch Problem
- DirectX 10 API helps
  - Currently up to 2x measured improvement over DX9 from API changes
- New features provide additional boost
  - Instancing
  - Uber-shaders
  - GS (e.g. render to cubemap)
  - Trade GPU performance to solve the problem
Instancing
- Draw as many objects per draw call as possible
- Create object variations with…
  - Large constant storage
  - Displacement maps
  - Uber-shaders
Uber-shaders
- Combine multiple materials within a single shader
- Pros:
  - Keep common code portion in on-chip shader cache
- Cons:
  - Flow control
  - Higher GPR pressure
Optional DX10 Features
- Yes, there’s such a thing
- Some format support is optional
  - E.g. MSAA
  - E.g. FP32 filtering
- Always check with ID3D10Device::CheckFormatSupport()
MSAA Capabilities
- Multiple support flags for MSAA:
  - D3D10_FORMAT_SUPPORT_MULTISAMPLE_RENDERTARGET
  - D3D10_FORMAT_SUPPORT_MULTISAMPLE_RESOLVE
  - D3D10_FORMAT_SUPPORT_MULTISAMPLE_LOAD
- Some formats might be renderable, but not resolvable!
Constants
- Literal HLSL constants are the fastest, don’t put common constants in buffers
  - E.g. -1, 0.5, 2…
- Non-indexed constants or constants indexed with a literal could be faster than constants indexed with a computed value
  - Is there anything you can do to help this, beyond simple unrolling?
Constant Management
- Need to specify const location in HLSL when not using Effects
- Even with Effects might want to do “smarter” custom const management
- Special syntax for “manual” constant placement
Manual Const Binding
- HLSL syntax

    // Will bind constant buffer to slot #4
    cbuffer MyConstantBuffer : register(b4)
    {
        // Specify exact offset
        float  fMyConstant1   : packoffset(c2.z);
        float4 vecMyConstant2 : packoffset(c4);
        // etc.
    };
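When the buffer is filled from application code without Effects, the CPU-side layout has to agree with the packoffset declarations. As a rough sketch, assuming each c register is a 16-byte float4 with 4-byte components, a hypothetical helper (not part of any DirectX API) can translate a packoffset into a byte offset:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset of packoffset(c<reg>.<component>)
// inside a constant buffer. Each c register holds a float4 (16 bytes),
// and the components x, y, z, w are 4 bytes apart.
std::size_t packOffsetBytes(unsigned reg, char component) {
    std::size_t comp = 0;
    switch (component) {
        case 'x': comp = 0; break;
        case 'y': comp = 1; break;
        case 'z': comp = 2; break;
        case 'w': comp = 3; break;
        default: assert(false && "component must be x, y, z or w");
    }
    return reg * 16u + comp * 4u;
}
```

With the declarations above, fMyConstant1 at c2.z would sit 40 bytes into the buffer and vecMyConstant2 at c4 would start at byte 64.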
Primitive Topology and Draw Calls
- Primitive topology is separate from the draw function

    dev->IASetPrimitiveTopology(blah);
    dev->Draw(3, 0);

  - Keep these 2 functions together
- Easy to forget to set proper topology
  - You will end up chasing false corruption, driver bugs, etc.
Input Layout
- Matching vertex data to VS inputs
- Unlike in DirectX 9 it now requires knowledge about VS (input signature)
- Don’t need to create a unique one for each VS
  - Enough if shader signature matches
Input Layout
- Knowledge of VS at decl/input layout creation time might be a problem for DX9 engine architectures
- Quick DX9 port hack
  - Keep copies of “dummy” VS just for the signatures
  - Create input layouts from them
Stream Out
- Primitive topology is converted to lists
- Subsequent passes lose the benefit of vertex reuse
- Solution:
  - Use points for stream out
  - Use indexed primitives with proper topology on the second pass
Stream Out
- With SO enabled each call appends to the end of the bound buffers
- Need to bind the same buffers again to output to the beginning of the buffers
  - Might interfere with renderer state caching
- Good idea to un-bind buffers as soon as done with SO
  - Easy to forget and cause some corruption
Stream Out FX Syntax
    // Output position only
    SetGeometryShader(ConstructGSWithSO(shader, "SV_Position"));
    // Output 2 streams, use 2D tex coords
    SetGeometryShader(ConstructGSWithSO(shader, "0:SV_Position,1:tex.xy"));
    // Output only z component of position
    SetGeometryShader(ConstructGSWithSO(shader, "SV_Position.z"));
Stream Out From VS
- Output is after GS, but GS can be a pass-through NULL shader
- Pass VS code instead of GS
  - Just requires the output signature from VS
  - Don’t write the actual GS shader!
Stream Out From VS
- Effects code:

    technique10 t0
    {
        pass p0
        {
            SetVertexShader(CompileShader(vs_4_0, VsMain()));
            SetGeometryShader(ConstructGSWithSO(
                CompileShader(vs_4_0, VsMain()), "SV_Position"));
            SetPixelShader(CompileShader(ps_4_0, PsMain()));
        }
    }
SO as VS optimization
- Unified architecture requires new thinking
  - VS isn’t free anymore when PS-bound
  - Think about total workload throughout VS/GS/PS
- Could stream out data that would otherwise be computed multiple times in VS/GS
  - E.g. animation
- SO might not be a win for small VS
Scatter Implementation
- SO isn’t flexible enough to implement scatter
- Could implement scatter with point primitives
  - Tweak point position in VS/GS
- E.g. sort on GPU, data binning, etc.
Data Binning with Scatter
- E.g. building histogram for HDR
- Draw points
  - Output is 1D render target with histogram
- For each point in the image
  - Fetch from RTT in VS
  - Adjust position (bin index) based on the texture value, output 1 for counting
  - Additively blend to count points in the bins
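The binning steps above can be mimicked on the CPU to check expected results. This is only a reference sketch, not GPU code: each input value stands in for one point primitive, the computed bin index stands in for the adjusted point position, and the increment stands in for additive blending. The function name and the assumption that values are normalized to [0, 1] are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU stand-in for the scatter pass.
// Assumes luminance values normalized to [0, 1]; out-of-range values are clamped.
std::vector<unsigned> buildHistogram(const std::vector<float>& luminance, int binCount) {
    assert(binCount > 0);
    std::vector<unsigned> bins(binCount, 0u);           // the 1D "render target"
    for (float v : luminance) {                         // one "point" per image texel
        float clamped = std::min(std::max(v, 0.0f), 1.0f);
        // "Adjust position": pick the bin from the fetched value.
        int bin = std::min(int(clamped * binCount), binCount - 1);
        bins[bin] += 1u;                                // "additive blend" of the value 1
    }
    return bins;
}
```

For example, the values 0.0, 0.1, 0.6, 1.0 and 0.9 binned into 4 bins land in bins 0, 0, 2, 3 and 3.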
Shader Signature Matching
- In DirectX 9 semantics matched automatically
- In DirectX 10 shader input/output structure order should match
  - Put optional data at the end
- Partial match might be hard to track
  - Might be a big problem for DX9 ports
GS: Edge Orientation
- When computing per-edge data need to ensure it matches for adjacent triangles
  - Could tag vertices for selecting proper orientation
  - E.g. edge tessellation, fur fins
[Figure: edge orientation on adjacent triangles, “Right” vs. “Wrong”]
GS: Order of Output Triangles
- Well defined output triangle order
  - It’s as if each triangle is processed serially
- Consider interaction with transparency
  - E.g. won’t work for fur shells generation
GS: Triangle Winding
- Backface culling happens in the rasterizer after GS
- Need to keep winding in mind when generating geometry in GS
  - Easy to neglect and blame drivers for missing geometry
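A quick way to reason about winding for generated triangles is the sign of the 2D signed area. The sketch below is a CPU-side illustration with hypothetical names; it assumes a y-up 2D space and takes the front-face convention as a parameter, since which winding counts as front-facing depends on the rasterizer state.

```cpp
#include <cassert>

struct Vec2 { float x; float y; };

// Twice the signed area of triangle (a, b, c) in a y-up 2D space:
// positive for counter-clockwise order, negative for clockwise.
float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// True if the triangle would be treated as front-facing under the
// given convention (assumption: supplied by the caller).
bool isFrontFacing(Vec2 a, Vec2 b, Vec2 c, bool frontCounterClockwise) {
    float area = signedArea2(a, b, c);
    return frontCounterClockwise ? area > 0.0f : area < 0.0f;
}
```

Swapping any two vertices flips the sign, which is exactly the mistake that makes GS-generated triangles disappear under backface culling.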
GS: Render to Cubemap
- An interesting feature to combat small batch performance
  - GS replicates triangles to different cubemap faces
- Performance tradeoff
  - Lighter CPU load (fewer draw calls)
  - Heavier GPU load
- Could cull in GS to reduce amount of generated data
Render to Cubemap Example 1/2
    [maxvertexcount(18)]
    void main(triangle GsInShadow In[3],
              inout TriangleStream<PsInShadow> Stream)
    {
        PsInShadow Out;
        // Loop through all faces
        [unroll]
        for (int k = 0; k < 6; k++) {
            // Select face target
            Out.target = k;
            // Transform verts
            float4 pos[3];
            pos[0] = mul(mvpArray[k], In[0].pos);
            pos[1] = mul(mvpArray[k], In[1].pos);
            pos[2] = mul(mvpArray[k], In[2].pos);
            // Frustum culling
            float4 t0 = saturate(pos[0].xyxy*float4(-1,-1,1,1)-pos[0].w);
            float4 t1 = saturate(pos[1].xyxy*float4(-1,-1,1,1)-pos[1].w);
            float4 t2 = saturate(pos[2].xyxy*float4(-1,-1,1,1)-pos[2].w);
            float4 t = t0 * t1 * t2;
            [branch]
            if (!any(t)) {
Render to Cubemap Example 2/2
    . . .
                // Back face culling
                float2 d0 = pos[1].xy/abs(pos[1].w)-pos[0].xy/abs(pos[0].w);
                float2 d1 = pos[2].xy/abs(pos[2].w)-pos[0].xy/abs(pos[0].w);
                [branch]
                if (d1.x * d0.y > d0.x * d1.y) {
                    // Triangle is visible - emit
                    [unroll]
                    for (int i = 0; i < 3; i++) {
                        Out.pos = pos[i];
                        // Other data processed here
                        // . . .
                        Stream.Append(Out);
                    }
                    Stream.RestartStrip();
                }
            }
        }
    }
Position in PS
- PS could have SV_Position as input
  - …VS or GS also have SV_Position output
- Same name, different values
  - VS/GS: position in clipping space
  - PS: screen space position (like vPos register) and z, rhw
    - Returns 0.5, 1.5, 2.5, … coordinates
    - Due to new pixel coordinate system
Integer Data and Instructions
- Use integer types to compact data
  - Bit packing vertex data, consts, etc.
  - E.g. conditionals for uber-shaders encoded per vertex or per primitive
- Beware of some int instruction cost
  - Division is expensive
  - Explore optimization opportunities
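As an illustration of the bit-packing idea, the application-side sketch below packs a material id and a few feature toggles into one 32-bit value that could be supplied per vertex or per primitive. The layout and all names are hypothetical. The last helper shows one possible optimization for the division cost mentioned above: for unsigned values, dividing by a power of two is a shift.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: 8-bit material id in bits 0..7, followed by three
// one-bit feature toggles that an uber-shader could branch on.
std::uint32_t packMaterial(std::uint32_t id, bool useNormalMap,
                           bool useSpecular, bool useSkinning) {
    assert(id < 256u);
    return id
         | (std::uint32_t(useNormalMap) << 8)
         | (std::uint32_t(useSpecular)  << 9)
         | (std::uint32_t(useSkinning)  << 10);
}

// Extract the material id from the low 8 bits.
std::uint32_t unpackMaterialId(std::uint32_t packed) { return packed & 0xFFu; }

// Test a single toggle bit.
bool hasFlag(std::uint32_t packed, unsigned bit) { return ((packed >> bit) & 1u) != 0u; }

// Unsigned division by 2^log2Divisor expressed as a shift.
std::uint32_t divideByPow2(std::uint32_t value, unsigned log2Divisor) {
    return value >> log2Divisor;
}
```

The same masks and shifts map directly onto HLSL integer instructions in SM 4.0.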
Backwards Compatible Shader Compilation
- Special compiler flag
  - D3D10_SHADER_ENABLE_BACKWARDS_COMPATIBILITY
  - Enables old shaders to compile to SM 4.0
  - Not valid for geometry shaders
- Could cause compilation errors when mixing GS with old VS
  - Old POSITION semantic is translated to SV_Position
  - Solution: replicate structures for GS with new semantics
Texture/Sampler States
- Samplers and textures are decoupled in DirectX 10
- Important to keep in mind when porting from DirectX 9
  - Might require significant engine rearchitecture effort
Disabling Mipmapping
- There’s no NONE mip filtering mode in DirectX 10
  - No direct way to disable mipmapping
- Hacks to emulate functionality
  - Textures with only 1 mip level
  - Setting MaxLOD in sampler state to 0

    D3D10_SAMPLER_DESC samp;
    samp.MaxLOD = 0.0f;
Integer Textures
- Default texture declaration type is float
- Integer type texture declaration

    // Using unsigned int
    Texture2D <uint4> myTex0;
    // Using int
    Texture2D <int4> myTex1;

- E.g. important when reading stencil format (X24_TYPELESS_G8_UINT)
Reading Depth/Stencil
- Cannot simultaneously read depth and stencil from the same buffer
- Create 2 separate shader resource views
  - Depth
  - Stencil
  - Treat as 2 separate textures in shader
- MSAA depth/stencil buffers can’t be read
Gradients and Flow Control
- Same as before, can’t sample textures inside flow control with varying texture coordinates (different across pixel quad)
- Compiler is smart about figuring out what is varying and what isn’t
  - Don’t blindly use t.SampleGrad() everywhere
MRT Rendering
- MRTs are more flexible in DirectX 10
  - Up to 8 render targets
  - Any of 8 slots can be set
- For performance reasons use the lowest MRT slots possible
  - Don’t leave holes in MRT slot assignment
- Can’t mix MSAA and non-MSAA
MRT Rendering
- Even if MRT #0 isn’t bound, target #0 alpha is used for Alpha-to-Coverage
- Don’t forget to enable render target masks to enable MRT output
  - Separate controls for each MRT

    // enable MRT
    D3D10_BLEND_DESC bd;
    bd.RenderTargetWriteMask[0] = 0x0f;
    bd.RenderTargetWriteMask[1] = 0x0f;
    bd.RenderTargetWriteMask[4] = 0x0f;
MRT vs. Indexable Render Target Array
- MRT and render target arrays are orthogonal
[Figure: three panels labeled “MRT example”, “Render array example” and “MRT + render array example”, showing targets MRT 0 and MRT 1]
NULL Outputs
- Can have NULL SO, RTs and depth
- Can disable RTs/depth while doing stream out only
Depth Clamping/Clipping
- Back-end always clamps depth
  - Both interpolated and from PS
- Rasterizer state can enable/disable depth clipping before it gets to clamping

    // enable depth clipping
    D3D10_RASTERIZER_DESC rd;
    rd.DepthClipEnable = true;

- Also, W < 0 will clip
sRGB implementation differences
- DirectX 10 sRGB implementation is different from DirectX 9
  - Differently spec’ed gamma curve
  - All blend and filter in linear space
    - sRGB fetch: degamma before filter
    - sRGB blend: degamma DEST before blend
- DirectX 10: correct implementation that could make DX9 content look wrong
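To make "blend in linear space" concrete, here is a small CPU-side sketch. It assumes the standard piecewise sRGB transfer curve (not a plain 2.2 power) and uses hypothetical function names; it is meant as a reference for checking content, not as a description of any particular hardware path.

```cpp
#include <cassert>
#include <cmath>

// Standard sRGB decode ("degamma"): sRGB-encoded value to linear.
float srgbToLinear(float c) {
    return c <= 0.04045f ? c / 12.92f : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Standard sRGB encode: linear value to sRGB-encoded.
float linearToSrgb(float c) {
    return c <= 0.0031308f ? c * 12.92f : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
}

// sRGB blend as listed above: decode the destination, blend in linear
// space, then encode the result. srcLinear is the (linear) shader output.
float blendSrgbTarget(float srcLinear, float dstSrgb, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    float dstLinear = srgbToLinear(dstSrgb);
    float outLinear = srcLinear * alpha + dstLinear * (1.0f - alpha);
    return linearToSrgb(outLinear);
}
```

Blending white over black at 50% this way stores roughly 0.735 rather than 0.5, which is why content tuned against blending on encoded values can look different.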
Separate Alpha Blend
- No more separate alpha blend enable
- It’s always on, so don’t forget to set
  - SrcBlendAlpha
  - DestBlendAlpha
  - BlendOpAlpha
Dual Source Color Blending
- Uses 2 PS outputs for blending equation
- New alpha blend arguments
  - D3D10_BLEND_SRC1_COLOR
  - D3D10_BLEND_INV_SRC1_COLOR
  - D3D10_BLEND_SRC1_ALPHA
  - D3D10_BLEND_INV_SRC1_ALPHA
- Doesn’t work with MRT!
Alpha-to-Coverage
- Works even without MSAA
  - Can produce screen-door effect
- Implementation is IHV dependent
  - Grey area of the spec
  - For better quantization HW can use some area dithering
  - Be careful not to make assumptions about implementation
Custom AA Resolves
- Application can implement custom resolves in PS
  - Access to individual RT samples
  - Special syntax for accessing samples

    // Declare MSAA texture with 4 samples
    Texture2DMS<float4,4> t;
    // Load sample #0
    a = t.Load(tc, 0); // tc: unnormalized tex. coords
    // Load sample #1
    b = t.Load(tc, 1);
Custom AA Resolves and HDR
- Need to perform tone mapping before resolve to get correct MSAA results
[Figure: standard resolve]
Custom AA Resolves and HDR
- Need to perform tone mapping before resolve to get correct MSAA results
[Figure: custom resolve with tone mapping]
Custom AA Resolves and HDR
    // SAMPLES and exposure are assumed to be defined elsewhere
    Texture2DMS<float4, SAMPLES> tHDR;

    float4 main(float4 pos : SV_Position) : SV_Target
    {
        int3 coord;
        coord.xy = (int2)pos.xy;
        coord.z = 0;
        // Correct exposure for individual samples and sum it up
        float4 sum = 0;
        [unroll]
        for (int i = 0; i < SAMPLES; i++)
        {
            float4 c = tHDR.Load(coord, i);
            sum.rgb += 1.0 - exp(-exposure * c.rgb);
        }
        sum *= (1.0 / SAMPLES);
        // Gamma correction
        sum.rgb = pow(sum.rgb, 1.0 / 2.2);
        return sum;
    }
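The reason the order matters can be checked with a few numbers on the CPU. The sketch below uses the same exposure operator as the shader above on a single channel; the function names are illustrative, and gamma correction is left out to isolate the effect of resolve order.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Same exposure operator as the shader: 1 - exp(-exposure * c).
float toneMap(float hdr, float exposure) {
    return 1.0f - std::exp(-exposure * hdr);
}

// Standard resolve: average the HDR samples, then tone map the average.
float resolveThenToneMap(const std::vector<float>& samples, float exposure) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) sum += s;
    return toneMap(sum / static_cast<float>(samples.size()), exposure);
}

// Custom resolve: tone map each sample, then average the results.
float toneMapThenResolve(const std::vector<float>& samples, float exposure) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) sum += toneMap(s, exposure);
    return sum / static_cast<float>(samples.size());
}
```

For an edge pixel where one of four samples hits a very bright surface (values 0, 0, 0, 100 at exposure 1), the standard order yields nearly 1.0, so the edge stays aliased, while the custom order yields about 0.25, matching the 25% coverage.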
Acknowledgement &
References
- Big thanks for help and comments to:
  - Nicolas, Emil, Natasha, Thorsten and the rest of ISV and 3DARG teams
- Real-time HDR histogram generation
  - Check out upcoming I3D ’07 paper by Thorsten Scheuermann
Questions