Link opens in new browser window.

3.0 Shaders
Jason Mitchell
ATI Research
Outline
• Vertex Shaders
– Vertex Textures
– Flow control
• Pixel Shaders
– Flow control
– Optimization
• Shadow Mapping
– New functionality
• vPos for interleaved sampling
3.0 Vertex Shaders
• Texture lookups
• Loop indexable inputs (vn) and outputs (on)
– Not just constants
• More temps (32)
• Longer programs
– At least 512 instructions. See
MaxVertexShader30InstructionSlots for exact number on a
given chip
• Same flow control as devices which
support the vs_2_a compile target
Vertex Texturing
• With vs_3_0, vertex shaders can
sample textures
• Many applications
– Displacement mapping
– Large off-chip matrix palette
– Generally cycling processed data
(pixels) back into the vertex engine
Vertex Texturing Details
• With the texldl instruction, a vs_3_0 shader
can access memory
• The LOD must be computed by the shader
• Four texture sampler stages
– D3DVERTEXTEXTURESAMPLER0..3
• Use CheckDeviceFormat() with
D3DUSAGE_QUERY_VERTEXTEXTURE to
determine format support
• Look at VertexTextureFilterCaps to
determine filtering support
vs_3_0 Outputs
• 12 generic output (on) registers
• Must declare their semantics upfront like the input registers
• Can be used for any interpolated
quantity (plus point size)
• There must be one 4-component
output with the positiont semantic
Semantic Linkage
• Must use 3.0 vertex and pixel
shaders together
• Input declarations take the usage
names, and multiple usages are
permitted for components of a
given register
Connecting VS to PS
3.0 Vertex Shader
2.0 Vertex Shader
FFunc
oFog oPos
oPts oD0 oD1 oT0 oT1 oT2 oT3 oT4 oT5 oT6 oT7
o0 o1
o2
Triangle
Setup
o3
o4
o5
o6
o7
o8
o9 o10 o11
Semantic Mapping
Triangle
Setup
v0
v1
t0
t1
t2
t3
t4
t5
t6
t7
v0
2.0 Pixel Shader
v1
v2
v3
v4
v5
v6
v7
v8
v9
vPos.xy
3.0 Pixel Shader
vFace
vs_3_0 Semantic Declaration
vs_3_0
dcl_color4 o3.x
dcl_texcoord3 o3.yz
dcl_fog o3.w
dcl_tangent o4.xyz
dcl_positiont o7.xyzw
dcl_psize o6
...
// color4 is a semantic name
// Different semantics can be packed into one register
// positiont must be declared to some unique register
// in a vertex shader, with all 4 components
// Pointsize cannot have a mask
Dynamic Flow Control
• The HLSL compiler has a set of heuristics
about when it is better to emit an algebraic
expansion, rather than use real dynamic
flow control
–
–
–
–
Number of variables changed by the block
Number of instructions in the body of the block
Type of instructions inside the block
Whether the HLSL has texture or gradient
instructions inside the block
• Blindly changing compile targets can kill your
performance, especially if you nest ifs
Hardware Parallelism
• There are many shader units executing in parallel
• Dynamic flow control can cause inefficiencies since
different pixels/vertices can take different code paths
• Hardware will compute the right results, but you will
not always see the intended performance gain
• For an if…else, there will be cases where evaluating
both the blocks is faster than using dynamic flow
control, particularly if there is a small number of
instructions in each block
• Depending on the mix of vertices or pixels, the worst
case performance can be worse than executing
straight line code without any branching at all
Caveat emptor
Pixel Shaders
• Semantic linkage with vertex shader
– Similar to vertex declarations
– Generic vn registers at asm level like vertex shader (all fp)
• Dynamic flow control
– caveat emptor
• Longer programs
– At least 512 (cap’d MaxPixelShader30InstructionSlots)
• More registers
– Constants (224) and temps (32)
• Indexable input registers (but not constants)
• tex*Dlod (texldl at asm level)
– Specify LOD (not bias) directly in texture load instruction
• New registers
–
–
–
vFace – Scalar face register
vPos - Screen (x, y) position register
aL – Loop counter
Input Registers
• Bank of 10 floating point registers
• Indexable
vFace
• Scalar register whose sign indicates
the facing-ness of the triangle
– Positive for front facing
– Negative for back facing
• Can be interesting for things like
two-sided lighting
• In future shader models, will
contain primitive area
Pixel Shader Loop Register (aL)
• Incremented by loop...endloop
block
• Can be used to index into
interpolator registers only
Looping and HLSL
• Most of the time, this is a convenience to the
developer and will actually be unrolled
• Dynamic number of iterations
– Make it obvious to the compiler that there is an upper
limit to the number of iterations that may dynamically
occur
• HLSL constructs which cause unrolling of dynamic
(not static) loops
– Anything that needs a gradient (i.e. tex2D)
– Indexing a local array, because these are not actually
indexable in the virtual shader machine
– Can index input iterators
• There is no break keyword in HLSL
– Can be generated by the compiler in the asm based
upon condition in while
– Will show this in a later example
Known bounds on iteration
float4 ps_main( float4 inTexCoord : TEXCOORD0,
float3 inOffset
: TEXCOORD1 ) : COLOR0
{
float4 fH = 0;
Speeds up
compilation
// Sample iteration map to determine how much to iterate
int nNumSamples = (int)(tex2D( sAMap, inTexCoord ).r * 255.0) % 15;
float2 dx = ddx( inTexCoord );
float2 dy = ddy( inTexCoord );
for ( int nIndex = 0; nIndex < nNumSamples; nIndex++ )
{
float2 texOffset = inTexCoord + inOffset * nIndex;
fH += tex2Dgrad( sBMap, texOffset, dx, dy ).w;
}
return fH;
}
Resulting Assembly
ps_3_0
def c0, 255, 0, 1, 0
def c1, 15, -15, 0, 0
defi i0, 15, 0, 0, 0
dcl_texcoord v0.xy
dcl_texcoord1 v1.xy
dcl_2d s0
dcl_2d s1
…
dsx r3.xy, v0
dsy r4.xy, v0
mov r1, c0.y
mov r0.w, c0.y
rep i0
break_ge r0.w, r0.z
mov r0.xy, v0
mad r0.xy, v1, r0.w, r0
texldd r2, r0, s0, r3, r4
add r0.w, r0.w, c0.z
add r1, r1, r2.w
endrep
mov oC0, r1
Returning
• If you want to return inside of an
if…else it must be symmetric
Symmetric returns
edge = tex2D(EdgeSampler, oTex0).r;
if(edge > 0)
{
return tex2Dlod(BaseSampler, oTex0);
}
else
{
return 0;
}
texld r0, v0, s1
cmp r0.w, -r0.x, c0.x, c0.y
if_ne r0.w, -r0.w
texldl oC0, v0, s0
else
mov oC0, c0.x
endif
vPos
• vPos.xy contains screen-space
position (z and w are undefined)
• Useful for screen-space
operations such as interleaved
sampling (see [Keller01])
Interleaved Sampling
• Do slightly different operations at
neighboring pixels in screen space
• Two examples shown here:
1. Volumetric Light shafts
• Tweak position used in volume rendering
2. Shadow filtering
• Vary filter kernel layout as a function of
screen position
Light Shafts with Interleaved Sampling
struct PsInput
{ float4 vWorldPos[4]
float4 vClipPos
float2 vScreenPos
};
: TEXCOORD0;
: TEXCOORD4;
: VPOS;
float4 main (PsInput i) : COLOR {
…
0
2
0
2
3
1
3
1
0
2
0
2
3
1
3
1
// Based on the screen (x,y), determine whether the pixel is even or odd
int2 vEvenOdd = (int) floor(fmod((i.vScreenPos.xy + 0.5), 2.0));
int
iIndex = abs(3 * vEvenOdd.x - 2 * vEvenOdd.y);
// Calculate the projective texture coordinate for the selected plane
float4 vTexProj = mul(i.vWorldPos[iIndex], mLightViewProjBias);
…Sample cookie, shadow and noise maps using tweaked coordinates
Compute attenuation based on tweaked position…
// Final color output
float fIntensity = fCompositeNoise * cCookie.rgb * fAtten * fScale;
o.rgb = fIntensity;
o.a = saturate(dot(o.rgb, float3(1.0f, 1.0f, 1.0f)));
return o;
}
Light Shafts with Interleaved Sampling
25 planes without
interleaved
sampling
25 planes with
interleaved
sampling
Spatially-varying PCF Offsets
4×4 (16-tap) PCF
•
Grid-based PCF kernel needs to be fairly large to eliminate aliasing
–
•
12-tap Spatially Varying PCF
with Irregular sampling
Particularly in cases with small detail popping in and out of the
the underlying hard shadow.
Irregular sampling allows us to get away with fewer samples
–
Error is still present, only the error is “unstructured” and thus
thus less noticeable
Percentage Closer Filtering
Depth Sample at 49.8
50.1
50.2
50.0
50.0
x
<49.8?
1.0
1.2
1.1
1.4
1.2
1.0
29.8
Filter Depth Map
Compare
Standard filtering: Filter depth first, then use value for shadow map comparison.
Depth Sample at 49.8
50.1
50.2
50.0
50.0
x
1.0
1.2
1.1
1.4
1.2
<49.8?
Per-Element
Compare
0
0
1
0
1
1
0
1
1
0.55
Percentage Filter
Percentage Closer Filtering: Perform shadow map comparison for each kernel
elements first, then filter results!
Irregular Filter Kernel
Spatially-Varying Rotation
// Look up rotation for this pixel
float2 rot = BX2( tex2Dlod(RotSampler,
float4(vPos.xy * g_vTexelOffset.xy, 0, 0) ));
for(int i=0; i<12; i++) // Loop over taps
{
// Rotate tap for this pixel location
rotOff.x = rot.r * quadOff[i].x + rot.g * quadOff[i].y;
rotOff.y = -rot.g * quadOff[i].x + rot.r * quadOff[i].y;
offsetInTexels = g_fSampRadius * rotOff;
// Sample the shadow map
float shadowMapVal = tex2Dlod(ShadowSampler,
float4(projCoords.xy + (g_vTexOff.xy * offInTexels.xy), 0, 0));
// Determine whether tap is in light
inLight = ( dist < shadowMapVal );
// Normalize
percentInLight += inLight;
}
Obvious Early-Out Optimizations
• Zero skin weight(s)
– Skip bone(s)
• Light attenuation to zero
– Skip light computation
• Non-positive Lambertian term
– Skip light computation
• Fully fogged pixel
– Skip the rest of the pixel shader
• Shadow Filtering
– Only run costly filter in possible penumbra regions
• Many others like these…
Shadow Filtering with ps_3_0
• Only do expensive filtering in areas
likely to be penumbra regions
– Dynamic flow control in pixel shader
• Can mask with a variety of values (no
light or full light means no penumbra!)
–
–
–
N·L
Projective Cookie texture (aka Gobo)
Edge-filtered shadow map
Simple example scene
Shadow Depth Map
Desired final image
Shadow Map Edges
Mask off expensive filtering
N·L < 0
Gobo == 0
Only the white
pixels execute the
expensive path
Shadow
Edge Filter
Union of
all three
masks
HLSL Shader With Early-Outs
...Compute projective coordinates and N.L...
if (dot(lightVal, float3(1,1,1)) == 0 ) {
return 0;
}
else
{
...Sample edge map...
if (edgeVal == 0) //compute hard shadows if we’re not near an edge
{
shadowMapVal = tex2Dlod(ShadowSampler, projCoords );
inLight = ( dist < shadowMapVal );
percentInLight = dot(inLight, 0.25f );
return (percentInLight * lightVal);
}
else
{
randRot = BX2( tex2Dlod(RandRotSampler, float4(vPos * g_vFullTexelOffset,0,0) ));
for (int i=0; i<12; i++)
{
...Do each expensive shadow sample...
}
return (percentInLight * lightVal);
}
}
Resulting Assembly
...
mul r0, r0, r1.z
dp3 r1.z, r0, c5.w
cmp r1.z, -r1_abs.z, c5.w, c5.z
if_ne r1.z, -r1.z
mov oC0, c5.z
else
rcp r5.z, r1.w
rcp r1.w, v1.w
mul r2.xy, r1.w, v1
mov r2.z, c2.x
texldl r1, r2.xyzz, s0
cmp r1.w, -r1_abs.x, c5.w, c5.z
if_ne r1.w, -r1.w
mov r2.w, c5.z
texldl r1, r2.xyww, s2
mad r1, r5.z, c1.x, -r1
cmp r1, r1, c5.z, c5.w
dp4 r1.w, r1, c6.x
mul oC0, r0, r1.w
else
mul r1.xy, vPos, c4
...130 instructions...
mul oC0, r0, r1.w
endif
endif
Aliasing due to Conditionals
• Conditionals in pixel shaders can cause
aliasing!
• You want to avoid doing a hard conditional
with a quantity that is key to determining
your final color
– Do a procedural smoothstep, use a pre-filtered
texture for the function you’re expressing or
bandlimit the expression
– This is a fine art. Huge amounts of effort go into
this in the offline world where procedural
RenderMan shaders are a staple
• On ps_2_a and ps_3_0, you can find out the
screen space derivatives of quantities in the
shader for this purpose.
Shader Antialiasing
•
•
•
Computing derivatives (actually differences) of shader quantities with respect
to screen x, y coordinates is fundamental to procedural shading
LOD is calculated automatically based on a 2×2 pixel quad, so you don’t
generally have to think about it, even for dependent texture fetches
The HLSL dsx(), dsy() derivative intrinsic functions, available when
compiling for ps_2_a and ps_3_0, can compute these derivatives
ds dt dr
dx dx dx
ds dt dr
dy dy dy
•
•
Use these derivatives to antialias your procedural shaders or
Pass results of dsx() and dsy() to texn
texnD(s, t, ddx,
ddx, ddy)
ddy)
Derivatives and Dynamic Flow Control
• The result of a gradient calculation on a
computed value (i.e. not an input such as a
texture coordinate) inside dynamic flow control
is ambiguous when neighboring pixels in a 2×2
quad may go down different paths
• Hence, nothing that requires a derivative of a
computed value may exist inside of dynamic
flow control
– This includes most texture fetches, dsx() and dsy()
– texldl and texldd work since you can compute the
LOD or derivatives outside of the dynamic flow
control
• RenderMan has similar restrictions
Derivatives and Dynamic Flow Control
float edge = tex2D(EdgeSampler, oTex0).r;
float2 duvdx = ddx(oTex0);
float2 duvdy = ddy(oTex0);
if(edge > 0)
{
return tex2D(BaseSampler, oTex0, duvdx, duvdy);
}
else
{
return 0;
}
texld r0, v0, s1
cmp r0.w, -r0.x, c0.x, c0.y
dsx r0.xy, v0
dsy r1.xy, v0
if_ne r0.w, -r0.w
texldd oC0, v0, s0, r0, r1
else
mov oC0, c0.x
endif
Summary
• Vertex Shaders
– Vertex Textures
– Flow control
• Pixel Shaders
– Flow control
– Optimization
• Shadow Mapping
– New functionality
• vPos for interleaved sampling
Acknowledgements
• Big thanks to John Isidoro, Natalya
Tatarchuk and Dan Ginsburg for
many of the examples used in this
presentation
References
• [Keller01] Alexander Keller and Wolfgang
Heidrich, “Interleaved Sampling,”
Eurographics Rendering Workshop 2001.
• [Reeves87] William T. Reeves, David H.
Salesin, and Robert L. Cook, "Rendering
Antialiased Shadows with Depth Maps",
SIGGRAPH, 1987, pp. 283-291.