Dark Secrets of Shader Development or What Your Mother Never Told You About Shaders

Overview
• What are shaders?
  – Shader compilation process
• Shader optimizations
  – Non-hardware specific shader optimizations
  – Hardware specific shader optimizations for ATI
    • Vertex shader optimizations
    • Pixel shader optimizations

What Are Shaders?
• Shaders are micro-programs controlling vertex and pixel engines in the graphics hardware
• In DirectX® shaders are streams of tokens sent to hardware through the API
  – Tokens are specially coded assembly instructions
  – Drivers receive shaders in their original form
  – DirectX® shaders are special pseudo code (p-code)
  – Drivers receive macros just like any other instructions

What Do Drivers Do With Shaders?
• DirectX® shaders don’t exactly correspond to the hardware shader implementation
• Drivers compile shaders for the specific hardware platform into special hardware microcode (µ-code)
• Drivers optimize shaders for the specific platform
  – Carefully designed shaders allow drivers to optimize shader code more efficiently for specific hardware

Shader Compilation Process
[Diagram. Front-end compiler: HLSL shader, HLSL compiler, ASM shader, ASM compiler, p-code. Driver compiler: HW optimizer, back-end compiler, HW µ-code.]

Shader at Various Compilation Stages
HLSL code:
    float x=(a*abs(f))>(b*abs(f));
Assembly code:
    abs r7.w, c2.x
    mul r2.w, r7.w, c0.x
    mul r9.w, r7.w, c1.x
    slt r0.w, r9.w, r2.w
DirectX® p-code:
    02000023 80080007 a0000002
    03000005 80080002 80ff0007 a0000000
    03000005 80080009 80ff0007 a0000001
    0300000c d00f0000 80ff0009 80ff0002
Hardware specific µ-code:
    983a28f41595
    329d8d123c04
    329d8d627c08
    3b487794333a

Shader Optimizations
• Optimization: the process of making something as perfect, functional or effective as possible
  – Make the development process as efficient as possible
  – Make shaders perform as fast as possible
  – Make rendered images as perfect looking as possible
• Optimization is an art of balancing performance, image quality and the effort it takes to make it “perfect”

Shader Optimizations
• Use tools like RenderMonkey for development and visual debugging
• Use HLSL and concentrate efforts on algorithmic optimizations
• Use lower level optimizations specific for shader processors (e.g. vectorize calculations)
  – Requires good understanding of hardware and knowledge of assembly
• Use hardware specific optimizations whenever necessary to get the most out of hardware

Optimization #1 (HLSL) Use Intrinsic Functions
• Don’t reinvent the wheel, use intrinsic functions
  – Code uses all available hardware features
  – Optimized code for each shader model
Without Optimization:
    float MyDOT(float3 v1, float3 v2)
    {
        return (v1.x * v2.x + v1.y * v2.y + v1.z * v2.z);
    }
    . . .
    float v = MyDOT(N, L);
    . . .
With Optimization:
    . . .
    float v = dot(N, L);
    . . .

Optimization #1 (HLSL) Use Intrinsic Functions
• Assembly translation of HLSL code...
Without Optimization:
    vs_2_0
    . . .
    mul r0.xy, v0, c0
    add r0.w, r0.y, r0.x
    mad oPos, c0.z, v0.z, r0.w
    . . .
With Optimization:
    vs_2_0
    . . .
    dp3 r0.w, v0, c1
    . . .

Optimization #2 (HLSL) Properly Use Data Types
• Use the most appropriate data types in calculations (float, float2, float3 and float4)
  – Don’t use vector in place of scalar calculations
  – Don’t use float4 where you could use float3
• This optimization allows the HLSL compiler and/or driver optimizer to pair shader instructions whenever possible

Optimization #3 (HLSL) Reduce Typecasting
• Get rid of typecasting when it’s not needed
  – Indirectly initialize unused vector components (e.g. alpha channel)
Without Optimization:
    sampler texMap;
    float4 diff, amb;

    float4 main(float2 t: TEXCOORD0) : COLOR
    {
        float3 color = tex2D(texMap, t);
        color *= diff + amb;
        return float4(color, 0);
    }
With Optimization:
    sampler texMap;
    float4 diff, amb;

    float4 main(float2 t: TEXCOORD0) : COLOR
    {
        float4 color = tex2D(texMap, t);
        color *= diff + amb;
        return color;
    }

Optimization #3 (HLSL) Reduce Typecasting
• And this is how it translates to assembly...
Without Optimization:
    ps_2_0
    def c2, 0, 0, 0, 0
    dcl t0.xy
    dcl_2d s0
    texld r0, t0, s0
    mov r1.xyz, c1
    add r7.xyz, c0, r1
    mul r2.xyz, r7, r0
    mov r2.w, c2.x
    mov oC0, r2
With Optimization:
    ps_2_0
    dcl t0.xy
    dcl_2d s0
    texld r0, t0, s0
    mov r1, c1
    add r7, c0, r1
    mul r2, r7, r0
    mov oC0, r2

Optimization #4 (HLSL) Avoid Integer Calculations
• Instead of integers rely on floats for math
• HLSL supports integer arithmetic, but most hardware doesn’t
• Compiler emulates int type support
  – Precision and range might vary
  – Some extra code is generated
Without Optimization:
    float4 main(int k : TEXCOORD0) : COLOR
    {
        int n = k / 3;
        return n;
    }
With Optimization:
    float4 main(float k : TEXCOORD0) : COLOR
    {
        float n = k / 3;
        return n;
    }

Optimization #4 (HLSL) Avoid Integer Calculations
• Assembly code confirms inefficiency...
Without Optimization:
    ps_2_0
    def c0, 0.333333, 1, 0, 0
    dcl t0.x
    mul r0.w, t0.x, c0.x
    frc r7.w, r0.w
    cmp r9.w, -r7.w, c0.w, c0.y
    add r4.w, r0.w, -r7.w
    cmp r11.w, r0.w, c0.w, c0.y
    mad r1, r9.w, r11.w, r4.w
    mov oC0, r1
With Optimization:
    ps_2_0
    def c0, 0.333333, 0, 0, 0
    dcl t0.x
    mul r0, t0.x, c0.x
    mov oC0, r0

Optimization #5 (HLSL) Use Integers For Indexing
• When indexing into arrays of constants use integers instead of floats
Without Optimization:
    float4x4 m[10];

    float4 main(
        float4 p: POSITION,
        float2 ind: BLENDINDICES,
        float blend: BLENDWEIGHT) : POSITION
    {
        float4 p1 = mul(p, m[ind.x]);
        float4 p2 = mul(p, m[ind.y]);
        return lerp(p1, p2, blend);
    }
With Optimization:
    float4x4 m[10];

    float4 main(
        float4 p: POSITION,
        int2 ind: BLENDINDICES,
        float blend: BLENDWEIGHT) : POSITION
    {
        float4 p1 = mul(p, m[ind.x]);
        float4 p2 = mul(p, m[ind.y]);
        return lerp(p1, p2, blend);
    }

Optimization #5 (HLSL) Use Integers For Indexing
• Assembly shows that floats add rounding
Without Optimization:
    vs_2_0
    def c40, 4, 0, 0, 0
    dcl_position v0
    dcl_blendindices v1
    dcl_blendweight v2
    frc r0.xy, v1
    add r0.xy, -r0, v1
    mul r0.xy, r0, c40.x
    mova a0.xy, r0
    dp4 r0.x, v0, c0[a0.y]
    dp4 r0.y, v0, c1[a0.y]
    dp4 r0.z, v0, c2[a0.y]
    dp4 r0.w, v0, c3[a0.y]
    dp4 r1.x, v0, c0[a0.x]
    dp4 r1.y, v0, c1[a0.x]
    dp4 r1.z, v0, c2[a0.x]
    dp4 r1.w, v0, c3[a0.x]
    add r0, r0, -r1
    mad oPos, v2.x, r0, r1
With Optimization:
    vs_2_0
    def c40, 4, 0, 0, 0
    dcl_position v0
    dcl_blendindices v1
    dcl_blendweight v2
    mul r0.xy, v1, c40.x
    mova a0.xy, r0
    dp4 r0.x, v0, c0[a0.y]
    dp4 r0.y, v0, c1[a0.y]
    dp4 r0.z, v0, c2[a0.y]
    dp4 r0.w, v0, c3[a0.y]
    dp4 r1.x, v0, c0[a0.x]
    dp4 r1.y, v0, c1[a0.x]
    dp4 r1.z, v0, c2[a0.x]
    dp4 r1.w, v0, c3[a0.x]
    add r0, r0, -r1
    mad oPos, v2.x, r0, r1

Optimization #6 (HLSL, ASM) Pack Scalar Constants
• Combine scalar constants into full vectors
  – Reduces the number of constants
  – Helps work around hardware limitations (read port limit)
Without Optimization:
    float scale, bias;

    float4 main(float4 Pos : POSITION) : POSITION
    {
        return (Pos * scale + bias);
    }
With Optimization:
    float2 scale_bias;

    float4 main(float4 Pos : POSITION) : POSITION
    {
        return (Pos * scale_bias.x + scale_bias.y);
    }

Optimization #6 (HLSL, ASM) Pack Scalar Constants
• Here’s the assembly version of this optimization
Without Optimization:
    vs_2_0
    dcl_position v0
    mul r0, v0, c0.x
    add oPos, r0, c1.x
With Optimization:
    vs_2_0
    dcl_position v0
    mad oPos, v0, c0.x, c0.y

Optimization #7 (HLSL) Pack Arrays of Constants
• Pack array elements into full constant vectors
  – Similar to the previous optimization tip
Without Optimization:
    float fArray[8];

    float4 main(float4 v: COLOR) : COLOR
    {
        float a = 0;
        int i;
        for(i = 0; i < 8; i++)
        {
            a += fArray[i];
        }
        return v * a;
    }
With Optimization:
    float4 fPackedArray[2];
    static float fArray[8] = (float[8])fPackedArray;

    float4 main(float4 v: COLOR) : COLOR
    {
        float a = 0;
        int i;
        for(i = 0; i < 8; i++)
        {
            a += fArray[i];
        }
        return v * a;
    }

Optimization #8 (HLSL) Properly Declare Constants
• For conditional compilation use boolean constants declared as static
Without Optimization:
    float4 a;
    bool b = true;

    float4 main(
        float4 i0: TEXCOORD0,
        float4 i1 : TEXCOORD1) : COLOR
    {
        if (b) return i0+a;
        else return i1+a;
    }
With Optimization:
    float4 a;
    static bool b = true;

    float4 main(
        float4 i0: TEXCOORD0,
        float4 i1 : TEXCOORD1) : COLOR
    {
        if (b) return i0+a;
        else return i1+a;
    }

Optimization #8 (HLSL) Properly Declare Constants
• Assembly shows that when the constant is not declared as static both branches are evaluated
Without Optimization:
    ps_2_0
    dcl t0
    dcl t1
    add r1, t0, c0
    add r0, t1, c0
    cmp r0, -c1.x, r0, r1
    mov oC0, r0
With Optimization:
    ps_2_0
    dcl t0
    add r0, t0, c0
    mov oC0, r0

Optimization #9 (HLSL, ASM) Vectorize Calculations
• Whenever possible vectorize code by joining similar operations together
  – It’s possible to perform up to 4 operations in one shot
Without Optimization:
    float4 main(float k: COLOR) : COLOR
    {
        float a, b, c, d;
        a = k + 1;
        b = k + 2;
        c = k + 3;
        d = k + 4;
        return float4(a, b, c, d);
    }
With Optimization:
    float4 main(float k: COLOR) : COLOR
    {
        float4 v;
        v = k + float4(1,2,3,4);
        return v;
    }

Optimization #9 (HLSL, ASM) Vectorize Calculations
• The same optimization in assembly code…
Without Optimization:
    ps_2_0
    def c0, 1, 2, 3, 4
    dcl v0.x
    add r0.x, v0.x, c0.x
    add r0.y, v0.x, c0.y
    add r0.z, v0.x, c0.z
    add r0.w, v0.x, c0.w
    mov oC0, r0
With Optimization:
    ps_2_0
    def c0, 1, 2, 3, 4
    dcl v0.x
    add r0, v0.x, c0
    mov oC0, r0

Optimization #10 (HLSL, ASM) Vectorize Even More
• Use a similar approach to take advantage of special instructions available in shaders, e.g. use dot product instructions
  – Example: a+b+c+d ⇔ a*1+b*1+c*1+d*1 ⇔ DP4
Without Optimization:
    float a, b, c, d;
    a = k * j;
    b = k + j;
    c = k - j;
    d = j - k;
    a = a + b + c + d;
With Optimization:
    float4 v;
    v.x = k * j;
    v.y = k + j;
    v.z = k - j;
    v.w = j - k;
    a = dot(v, 1);

Optimization #10 (HLSL, ASM) Vectorize Even More
• The same optimization in assembly…
Without Optimization:
    . . .
    mul r0.w, r7.w, r8.w
    add r1.w, r7.w, r8.w
    sub r2.w, r7.w, r8.w
    sub r3.w, r8.w, r7.w
    add r0.w, r0.w, r1.w
    add r0.w, r0.w, r2.w
    add r0.w, r0.w, r3.w
    . . .
With Optimization:
    def c0, 1, 0, 0, 0
    . . .
    mul r0.x, r7.w, r8.w
    add r0.y, r7.w, r8.w
    sub r0.z, r7.w, r8.w
    sub r0.w, r8.w, r7.w
    dp4 r0.w, r0, c0.x
    . . .

Optimization #11 (HLSL) Vectorize Comparisons
• Currently conditional operators in HLSL don’t properly promote scalars to vectors
  – E.g. implementing a=(c>0.5f)?0.1f:0.9f;
Without Optimization:
    float4 main(float4 c: COLOR) : COLOR
    {
        float4 a;
        a.x = (c.x > 0.5f) ? 0.1f : 0.9f;
        a.y = (c.y > 0.5f) ? 0.1f : 0.9f;
        a.z = (c.z > 0.5f) ? 0.1f : 0.9f;
        a.w = (c.w > 0.5f) ? 0.1f : 0.9f;
        return a;
    }
With Optimization:
    float4 main(float4 c: COLOR) : COLOR
    {
        float4 a;
        a = (c > float4(0.5f,0.5f,0.5f,0.5f)) ?
            float4(0.1f,0.1f,0.1f,0.1f) :
            float4(0.9f,0.9f,0.9f,0.9f);
        return a;
    }

Optimization #11 (HLSL) Vectorize Comparisons
• Vectorized comparison assembly
Without Optimization:
    ps_2_0
    def c0, 0.5, 0.1, 0.9, 0
    dcl v0
    add r1.w, -v0.x, c0.x
    add r0.w, -v0.y, c0.x
    cmp r0.x, r1.w, c0.y, c0.z
    add r1.w, -v0.z, c0.x
    cmp r0.y, r0.w, c0.y, c0.z
    add r0.w, -v0.w, c0.x
    cmp r0.z, r1.w, c0.y, c0.z
    cmp r0.w, r0.w, c0.y, c0.z
    mov oC0, r0
With Optimization:
    ps_2_0
    def c0, 0.5, 0.1, 0.9, 0
    dcl v0
    add r1, -v0, c0.x
    cmp r0, r1, c0.y, c0.z
    mov oC0, r0

Optimization #12 (HLSL) Careful With Matrix Transpose
• Avoid transposing matrices in HLSL code
• Use reversed mul() operand order for multiplication by a transposed matrix
• Use column order matrices whenever possible
Without Optimization:
    float3x4 m;

    float4 main(float4 p: POSITION) : POSITION
    {
        float3 v;
        v = mul(m, p);
        return float4(v, 1);
    }
With Optimization:
    float4x3 m;

    float4 main(float4 p: POSITION) : POSITION
    {
        float3 v;
        v = mul(p, m);
        return float4(v, 1);
    }

Optimization #12 (HLSL) Careful With Matrix Transpose
• Column order matrix transformation takes 3 DP4 instructions vs. 4 MUL/MAD
Without Optimization:
    vs_2_0
    def c4, 1, 0, 0, 0
    dcl_position v0
    mul r0.xyz, v0.x, c0
    mad r2.xyz, v0.y, c1, r0
    mad r4.xyz, v0.z, c2, r2
    mad oPos.xyz, v0.w, c3, r4
    mov oPos.w, c4.x
With Optimization:
    vs_2_0
    def c3, 1, 0, 0, 0
    dcl_position v0
    m4x3 oPos.xyz, v0, c0
    mov oPos.w, c3.x

Optimization #13 (PS: HLSL, ASM) Use Swizzles Wisely
• PS 2.0 doesn’t support arbitrary swizzles
  – Creatively use existing swizzles (e.g. .WZYX gives you access to .ZW in reversed order)
  – Can be used for constant packing and vectorization
Without Optimization:
    float2 a, b, c, d;
    sampler s;

    float4 main(float2 i0: TEXCOORD0,
                float2 i1 : TEXCOORD1) : COLOR
    {
        float4 j = tex2D(s,(i0+a)*c) + tex2D(s,(i1+b)*d);
        return j;
    }
With Optimization:
    float4 ab, cd;
    sampler s;

    float4 main(float4 i01 : TEXCOORD0) : COLOR
    {
        float4 t = (i01+ab)*cd;
        float4 j = tex2D(s,t.xy) + tex2D(s,t.wz);
        return j;
    }

Optimization #13 (PS: HLSL, ASM) Use Swizzles Wisely
• Optimization in assembly…
  – Using .ZW instead of .WZ would produce two MOV instructions instead of one
Without Optimization:
    ps_2_0
    dcl t0.xy
    dcl t1.xy
    dcl_2d s0
    add r0.xy, t0, c0
    mul r0.xy, r0, c2
    add r1.xy, t1, c1
    mul r1.xy, r1, c3
    texld r0, r0, s0
    texld r1, r1, s0
    add r0, r0, r1
    mov oC0, r0
With Optimization:
    ps_2_0
    dcl t0
    dcl_2d s0
    add r0, t0, c0
    mul r0, r0, c1
    mov r1.xy, r0.wzyx
    texld r0, r0, s0
    texld r1, r1, s0
    add r0, r0, r1
    mov oC0, r0

Optimization #14 (PS: HLSL) Use 1D Texture Fetches
• In DirectX® 1D textures are emulated by 2D textures of Nx1 dimensions
• For fetching 1D textures use the special intrinsic function tex1D even though the texture is really 2D
Without Optimization:
    sampler texMap;
    float3 L;

    float4 main(float3 V : TEXCOORD0) : COLOR
    {
        float2 t = 0;
        t.x = dot(V, L);
        return tex2D(texMap, t);
    }
With Optimization:
    sampler texMap;
    float3 L;

    float4 main(float3 V : TEXCOORD0) : COLOR
    {
        float t;
        t.x = dot(V, L);
        return tex1D(texMap, t);
    }

Optimization #14 (PS: HLSL) Use 1D Texture Fetches
• Equivalent code in assembly ...
Without Optimization:
    ps_2_0
    def c1, 0, 0, 0, 0
    dcl t0.xyz
    dcl_2d s0
    dp3 r0.x, t0, c0
    mov r0.y, c1.x
    texld r7, r0, s0
    mov oC0, r7
With Optimization:
    ps_2_0
    dcl t0.xyz
    dcl_2d s0
    dp3 r0.xy, t0, c0
    texld r7, r0, s0
    mov oC0, r7

Optimization #15 (PS: HLSL, ASM) Use Signed Textures
• Use signed texture formats to represent signed data (e.g. normal maps)
• PS 2.0 doesn’t support the _bx2 modifier, so it takes an extra MAD instruction to expand the range of unsigned data

ATI Hardware Specific Optimizations
• Optimizations for DirectX® 9 members of the RADEON family
  – RADEON 9500, 9600, 9700, 9800
• Vertex shader optimizations
• Pixel shader optimizations
• Precision in pixel shaders

Vertex Shader Optimizations For RADEON 9500+
• Only a few optimization rules apply
  – Drivers do all the dirty work for you
• Most relevant vertex shader optimizations for ATI DirectX® 9 hardware
  – Vertex data output from shaders
  – Instruction co-issue
  – Use of flow control

ATI VS Optimization #1 Vertex Data Output
• Export the computed vertex position as early as possible
  – Driver does a good job of helping you with that
  – Shortest chain of instructions computing the vertex position allows the optimizer to do its job
• Output only necessary information from the shader
  – Don’t output texture coordinates with the same data
  – Use PS 1.4 and PS 2.0 for mapping texture coordinates to samplers

ATI VS Optimization #2 Instruction Co-issue
• Vertex processor architecture allows co-issuing vertex shader instructions (4D+1D per clock), somewhat similar to pixel shaders
[Diagram: vertex processor (VP) with 4D (128-bit) and 1D (32-bit) units]
• Follow these rules to increase the chance of co-issue:
  – Don’t use scalar computations to write to output registers
  – Use write mask (.w) in POW, EXP, LOG, RCP and RSQ instructions
  – Remember that read port limits apply to the instruction pair

ATI VS Optimization #3 Flow Control
• VS 2.0 supports constant based static branching and loops
  – Simplifies shader management and reduces the number of shaders
  – The flow control instructions come for “free”
  – Drivers optimize flow control execution
• Performance of shaders with flow control might be reduced due to the limited scope of shader optimizations

Pixel Shader Optimizations For RADEON 9500+
• Optimized pixel shaders considerably increase graphics performance
  – Driver gets the most out of a carefully designed shader
• Most important pixel shader optimizations for ATI graphics hardware
  – Texture instructions
  – Dependent texture reads
  – ALU instructions
  – Instruction balancing
  – Instruction co-issue
  – Use of PS 1.4

ATI PS Optimization #1 TEXKILL Instruction
• Avoid using TEXKILL whenever possible
  – Don’t use TEXKILL for user clip planes if clipping can be done at vertex level
• Truth about the TEXKILL instruction on ATI’s hardware:
  – Positioning of TEXKILL in shader code doesn’t affect performance
  – Shaders don’t have early out ability
  – TEXKILL instruction affects “top of the pipe Z-reject” efficiency
  – TEXKILL is a texture instruction, so it contributes to the creation of levels of dependency in the shader

ATI PS Optimization #1 Dependency Levels And TEXKILL
• Example of creating an extra level of dependency in the pixel shader with TEXKILL
  – Note that the first ALU instructions without texture reads count as a dependency level in this example
    ps_2_0
    def c0, 0.3, 1, 0.5, 0
    sub r0, t0, c0          // 1st level
    texkill r0              // 2nd level
    mov r1, v0
    mov oC0, r1

ATI PS Optimization #2 Depth Value Computations
• Limit cases where the depth value is output from pixel shaders
  – PS 1.4: TEXDEPTH instruction
  – PS 2.0: output to oDepth register
• Truth about outputting the depth value from pixel shaders
  – Affects Hyper-Z efficiency
  – Interferes with “top of the pipe Z-reject”

ATI PS Optimization #3 Dependent Texture Read
• Keep in mind that dependent texture reads aren’t free
  – 1-2 levels are executed at top performance
  – 3-4 levels are executed slower, but performance is still reasonable for practical use
• Try spreading instructions equally across levels of dependency
• Avoid adding extra levels of dependency

ATI PS Optimization #3 Dependent Texture Read
• Example of creating unnecessary dependent texture reads
Unnecessary dependent reads (4 levels):
    ps_2_0
    dcl_2d s0
    dcl t0.xy
    add r0.xy, t0, c0       // 1st level
    texld r0, r0, s0        // 2nd level
    mul r1, r0, c8
    add r0.xy, t0, c1
    texld r0, r0, s0        // 3rd level
    mad r1, r0, c9, r1
    add r0.xy, t0, c2
    texld r0, r0, s0        // 4th level
    mad r1, r0, c10, r1
    mov oC0, r1
Same work in 2 levels:
    ps_2_0
    dcl_2d s0
    dcl t0.xy
    add r0.xy, t0, c0       // 1st level
    add r1.xy, t0, c1
    add r2.xy, t0, c2
    texld r0, r0, s0        // 2nd level
    mul r3, r0, c8
    texld r1, r1, s0
    mad r3, r1, c9, r3
    texld r2, r2, s0
    mad r3, r2, c10, r3
    mov oC0, r3

ATI PS Optimization #4 Arithmetic Instructions
• All simple instructions execute at the rate of 1 instruction/clock in each pipe
• Most of the macros execute as described in the DirectX® 9 documentation
  – Macro SINCOS takes up as many as 8 clocks
• There’s no penalty for immediately reusing a computed result
  – Don’t try to be too “smart” with reordering instructions to hide nonexistent latencies
    . . .
    sub r0, r0, c0
    mad r0, r0, r1, c1
    mul r0, r0, r0
    . . .
    (1 instruction per clock)

ATI PS Optimization #5 Instruction Balancing
• Pixel engines can simultaneously read textures and perform ALU operations!
• When texture bandwidth isn’t a bottleneck try to keep the number of texture instructions close to the number of arithmetic instructions
• Can use function map lookups to balance texture vs. ALU instructions
• However, always keep in mind image quality

ATI PS Optimization #6 Instruction Co-issue
• PS 2.0 model doesn’t support instruction pairing
• RADEON 9500+ is still built around a “dual-pipe” design
[Diagram: pixel pipe (PP) with 3D (RGB) and 1D (Alpha) units]
• In PS 2.0 on RADEON 9500+ it’s still possible to pair vector calculations (RGB-pipe) with scalar operations (Alpha-pipe) to be executed on the same cycle

ATI PS Optimization #6 Instruction Co-issue
• Driver optimizes for co-issue
• Rules for taking advantage of automatic instruction pairing:
  – Spread the computational load between color and alpha pipes
  – PS 2.0 doesn’t have explicit pairing; use write masks (.rgb and .a) for automatic co-issue in the driver
  – Use scalar instructions RCP, RSQ, EXP and LOG only in the alpha pipe
  – Make it easier for the optimizer to find co-issued instructions by placing them close to each other in the code

ATI PS Optimization #6 Instruction Co-issue
• Example: calculation of diffuse and specular lighting
Without Optimization:
    . . .
    dp3 r0, r1, r0              // N.H
    dp3 r2, r1, r2              // N.L
    mul r2, r2, r3              // * color
    mul r2, r2, r4              // * tex
    mul r0.r, r0.r, r0.r        // spec^2
    mul r0.r, r0.r, r0.r        // spec^4
    mul r0.r, r0.r, r0.r        // spec^8
    mad r0.rgb, r0.r, r5, r2
    . . .
    Total: 8 instructions
With Optimization:
    . . .
    dp3 r0, r1, r0              // N.H
    dp3 r2.r, r1, r2            // N.L
    mul r6.a, r0.r, r0.r        // spec^2
    mul r2.rgb, r2.r, r3        // * color
    mul r6.a, r6.a, r6.a        // spec^4
    mul r2.rgb, r2, r4          // * tex
    mul r6.a, r6.a, r6.a        // spec^8
    mad r0.rgb, r6.a, r5, r2
    . . .
    Total: 5 instructions (after co-issue)

ATI PS Optimization #7 Using PS 1.4
• PS 2.0 doesn’t expose many instruction and operand modifiers
• RADEON 9500+ architecture still supports many old modifiers
• Use PS 1.4 to get access to modifiers in cheap shaders that require a lot of modifiers
PS 1.4:
    ps_1_4
    . . .
    dp3_d4 r0, r0_bx2, r1_bx2
    . . .
PS 2.0:
    ps_2_0
    def c0, 2, -1, 0.25, 0
    . . .
    mad r0, r0, c0.x, c0.y      // _bx2: r0*2 - 1
    mad r1, r1, c0.x, c0.y      // _bx2: r1*2 - 1
    dp3 r0, r0, r1
    mul r0, r0, c0.z            // _d4: divide by 4
    . . .

Precision In Pixel Shaders
• In RADEON 9500+ all pixel calculations happen in 24-bit float format (s7e16)
• ATI hardware doesn’t support partial precision mode by design
  – Insufficient precision when working with texture coordinates
  – Lower quality when computing reflections, specular lighting and procedural textures
• Optimizations should take image quality into account, especially when cinematographic quality is a goal

Examples of Precision In Pixel Shaders
• Point light source
[Images: 16-bit precision vs. 24-bit precision]

Examples of Precision In Pixel Shaders
• Normal vector normalization in pixel shaders: cubemap vs. NRM instruction
[Images: 256x256 cubemap vs. NRM instruction]
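
The cubemap vs. NRM comparison on the last slide can be sketched in HLSL roughly as follows. This is a minimal illustration, not code from the slides: the names normCube, L, psCubemapNormalize and psNrmNormalize are hypothetical, and the cubemap is assumed to be an unsigned normalization cubemap that stores unit vectors packed into the [0,1] range.

```hlsl
// Hypothetical resources, not taken from the slides.
samplerCUBE normCube;   // assumed: unsigned normalization cubemap (e.g. 256x256 faces)
float3 L;               // assumed: unit-length light direction

// Variant 1: normalize by cubemap lookup.
// Costs a texture fetch; precision is limited by the cubemap size and format.
float4 psCubemapNormalize(float3 N : TEXCOORD0) : COLOR
{
    // Unsigned data needs a range expansion from [0,1] to [-1,1]
    // (the extra MAD mentioned in Optimization #15).
    float3 n = texCUBE(normCube, N).xyz * 2 - 1;
    return saturate(dot(n, L));
}

// Variant 2: normalize arithmetically.
// On ps_2_0 the normalize() intrinsic typically maps to the NRM macro.
float4 psNrmNormalize(float3 N : TEXCOORD0) : COLOR
{
    float3 n = normalize(N);
    return saturate(dot(n, L));
}
```

Which variant is preferable depends on the balance discussed in ATI PS Optimization #5: the cubemap version moves work to the texture unit, while the arithmetic version spends ALU instructions and keeps the full shader precision. With a signed cubemap format the range expansion in the first variant could be dropped.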