Trace-driven performance exploration of a PowerPC 601 OLTP workload
on wide superscalar processors

J. H. Moreno, M. Moudgill, J. D. Wellman, P. Bose, L. Trevillyan
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598
01/26/98


Why yet another superscalar processor model and performance evaluation?

- Superscalar processors continue to dominate the field
  - no apparent likelihood of the superscalar paradigm ending in the near future
  - continuing improvements in features and capabilities
  - certain aspects getting easier due to the number of transistors available
  - existing programs (binary compatibility)
- Need to evaluate new implementation challenges
  - high-frequency objectives
  - new structures and algorithms
  - wider instruction issue
- Need to understand the impact of various classes of workloads
  - "commercial" workloads
  - ...


The MET: Microarchitecture Exploration Toolset

- Collection of tools for exploration of microarchitecture features
  - trace-driven and execution-driven tools
  - fast simulation: >300 M instructions/hour
- Intended to support early exploration of processor organizations
  - detailed model of a generalized pipeline
  - trends among results rather than their magnitudes


Processor organization

[Block diagram: NFA/branch predictor, L1-I cache, I-TLB, I-prefetch buffer, and
I-buffer feed instruction fetch; decode/expand and rename/dispatch feed separate
integer, load/store, floating-point, and branch issue queues, each with its own
issue logic, register read, and execution units; loads and stores reach the L1-D
cache through a load/store reorder buffer, store queue, and D-TLB; miss and
cast-out queues connect to the L2 cache, TLB2, and main memory; a retirement
queue and retirement logic handle instruction completion.]


Pipeline stages

  Integer:         Fetch, Decode, Rename, Expand, Dispatch, Issue, Read, Exec, WB, Retire
  Load:            Fetch, Decode, Rename, Expand, Dispatch, Issue, Read, EA, Dcache access, WB, Retire
  Floating point:  Fetch, Decode, Rename, Expand, Dispatch, Issue, Read, Exec1, Exec2, Exec3, WB, Retire


OLTP and GCC traces

                                        OLTP                             GCC
  Length                                172 M instrs, user and kernel    1212 M instrs, user space
  Branch instructions                   18.9 %                           21.6 %
  Branches taken                        44.3 %                           56.3 %
  Instrs. in kernel space               22.1 %                           n/a
  Memory access instructions            34.8 %                           27.1 %
  Load/store multiple instructions      1.6 %                            1.6 %
  String instructions                   1.4 %                            0.3 %
  Load/store w/update instrs.           1.7 %                            2.8 %
  Average block size                    5.3 instrs.                      --
  Mispredicted instructions, addresses  No                               Yes
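The per-trace statistics above are the kind of counts produced by a single pass
over each trace. The following C fragment is a minimal sketch of that
bookkeeping, not the MET tools themselves; the record layout and field names
are assumptions made for illustration.

```c
/* Minimal sketch of trace characterization (not the MET itself).
 * The trace_record layout and field names are illustrative assumptions. */
#include <stdio.h>

enum op_class { OP_BRANCH, OP_LOAD_STORE, OP_OTHER };

struct trace_record {            /* one dynamic instruction from the trace */
    enum op_class cls;
    int taken;                   /* branches only: 1 if taken */
    int kernel;                  /* 1 if executed in kernel space */
};

struct trace_stats {
    unsigned long insts, branches, taken, mem_ops, kernel;
};

static void account(struct trace_stats *s, const struct trace_record *r)
{
    s->insts++;
    if (r->kernel) s->kernel++;
    if (r->cls == OP_BRANCH) { s->branches++; s->taken += r->taken; }
    else if (r->cls == OP_LOAD_STORE) s->mem_ops++;
}

int main(void)
{
    /* tiny synthetic trace standing in for a hundreds-of-millions-of-
     * instructions OLTP or GCC trace */
    struct trace_record trace[] = {
        { OP_OTHER, 0, 0 }, { OP_LOAD_STORE, 0, 0 }, { OP_OTHER, 0, 1 },
        { OP_BRANCH, 1, 0 }, { OP_LOAD_STORE, 0, 1 }, { OP_BRANCH, 0, 0 },
    };
    struct trace_stats s = { 0 };
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        account(&s, &trace[i]);

    double n = (double)s.insts;
    printf("branches %.1f%%, taken %.1f%% of branches, mem ops %.1f%%, kernel %.1f%%\n",
           100.0 * s.branches / n, 100.0 * s.taken / s.branches,
           100.0 * s.mem_ops / n, 100.0 * s.kernel / n);
    /* average dynamic block size = instructions per branch;
     * e.g. 1 / 0.189 = 5.3 instrs for the OLTP trace above */
    printf("avg block size: %.1f instrs\n", n / s.branches);
    return 0;
}
```

The average-block-size line also shows why the 18.9% branch fraction in the
OLTP trace corresponds to blocks of roughly 5.3 instructions.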
CPI adder for various processor configurations

[Bar chart: CPI for OLTP and gcc95 (0.0 to 1.2), split into the CPI of the
infinite/perfect configuration plus the adder contributed by the
finite/non-perfect configuration.]

- Much larger adder in the case of the OLTP workload

Miss rates (per 1000 instructions): 64K L1, 2M L2, 128-entry TLB, 8K-entry BHT

                                    OLTP    GCC
  I1                                21.3    8.3
  I2                                3.1     0.02
  D1                                22.4    9.8
  D2                                1.8     1.8
  I TLB                             0.02    ~0
  D TLB                             4.5     1.5
  Conditional branch misprediction  5.3 %   7.0 %


Exploration space (in this presentation)

  Issue policy   Width                     Cache size         Branch prediction
                 (Fetch/Dispatch/Retire)   (L1-I, L1-D, L2)
  Class-order    4/4/6                     64K, 64K, 2M       8192-entry BHT, 4096-entry BTAC
  Out-of-order   8/8/12                    128K, 128K, 4M     Perfect
                 12/12/16                  128K, 128K, Inf
                                           Inf, Inf, Inf

  Widths                    Units           Ports               Queues                Physical registers
  (Fetch/Dispatch/Retire)   (FX/FP/LS/BR)   (D-cache and TLB)   (Issue/Retire/IBuf)   (GPR/FPR/CCR/SPR)
  4/4/6                     3/2/2/2         2                   20(12)/128/24         80/80/32/64
  8/8/12                    6/4/4/4         4                   40/160/48             128/128/64/96
  12/12/16                  8/4/6/4         6                   60/160/72             128/128/64/96

Other parameters (examples)

  Sizes
    I-prefetch buffer (entries)             4
    Miss queue, cast-out queue (entries)    8
    Store queue, reorder buffer (entries)   31
    D/I-TLBs (entries)                      128
    TLB2 (entries)                          1024
    L1-I/D, L2-cache line size (bytes)      128
    Page size (bytes)                       4096
  Latencies
    I-prefetch buffer latency (cycles)      1
    D/I-TLB miss penalty (cycles)           4
    TLB2 miss penalty (cycles)              40
    L1-I/D cache miss penalty (cycles)      8, 7
    L2 cache miss penalty (cycles)          40
  Branch prediction
    BTAC (entries)                          4096
    LR stack size (entries)                 32
    Branch history table (entries)          8192


CPI adders due to issue policy (as % of base case)

[Bar charts for OLTP and GCC comparing class-order and out-of-order issue across
configurations labeled by width (4/8/12), cache size (St/Lg/IL2/Inf), and branch
predictor (Bp/Pf); data labels give the class-order adder as a percentage of the
out-of-order base case.]

CPI adders due to branch prediction (as % of base case)

[Bar charts for OLTP and GCC comparing imperfect and perfect branch prediction
across configurations labeled by issue policy (c/o), width, and cache size; data
labels give the imperfect-predictor adder as a percentage of the perfect-predictor
base case.]

CPI adders due to cache size (as % of base case)

[Bar charts for OLTP (St/Lg/IL2/Inf) and GCC (St/Inf) across configurations
labeled by issue policy, width, and branch predictor; data labels give the
finite-cache adders as a percentage of the infinite-cache base case.]

CPI adders due to processor width (as % of base case)

[Bar charts for OLTP and GCC comparing widths 4, 8, and 12 across configurations
labeled by issue policy, cache size, and branch predictor; data labels give the
narrower-width adders as a percentage of the base case.]
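All of the charts above report "CPI adders": the extra CPI a configuration pays
relative to a base case in which the factor being varied is idealized. As a
small worked example, assuming that convention and taking two entries from the
OLTP CPI table at the end of the deck (class-order vs. out-of-order issue,
width 4, St caches, Bp predictor):

```c
/* Worked example of the "CPI adder as % of base case" metric, assuming the
 * base case idealizes the factor under study (here: out-of-order issue is
 * the base for the class-order adder).  CPI values are taken from the OLTP
 * results table at the end of this deck. */
#include <stdio.h>

int main(void)
{
    double cpi_class_order  = 1.29;   /* c, width 4, St caches, Bp predictor */
    double cpi_out_of_order = 1.12;   /* o, width 4, St caches, Bp predictor */

    double adder = cpi_class_order - cpi_out_of_order;   /* 0.17 CPI */
    double pct   = 100.0 * adder / cpi_out_of_order;     /* ~15 % */

    printf("CPI adder = %.2f CPI (%.0f%% of the out-of-order base case)\n",
           adder, pct);
    return 0;
}
```

The 15% result is consistent with the lower end of the class-order degradation
range reported in the conclusions that follow.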
In OLTP workload

- "Least-aggressive" configurations considered:
  - 15 to 32% degradation due to class-order issue
    - more severe degradation expected for an in-order policy
  - 15 to 26% degradation due to imperfect branch predictor
  - 30 to 66% degradation due to finite L1 cache (128K)
  - 10 to 23% degradation due to processor width
- Diminishing benefits beyond dispatching eight operations per cycle
  - with a conventional instruction fetching mechanism
- Still many microarchitecture issues to investigate in detail


Observations

- Clear differences in OLTP behavior relative to GCC
  - memory penalties in OLTP overshadow other effects
- Caveats due to the use of traces
  - length, number of traces (just one in this presentation), observability
  - in OLTP: no mispredicted paths, time scaling
  - in GCC: no kernel code


Summary

- Environment for early exploration
  - fast, flexible
  - trends among aggressive superscalar organizations
- Behavior of OLTP workload very different from others (e.g., SPEC)
  - different microarchitecture tradeoffs
- Aggressive superscalar buildable?
  - need to quantify the potential performance of a realizable implementation
  - need to identify/develop features that provide a better return


CPI results in OLTP workload

  Issue policy: c = class-order, o = out-of-order
  Branch prediction: Bp = 2-bit branch history table (8192 entries), Pf = perfect predictor

  Issue    Width      Bp                          Pf
  policy             Inf   IL2   Lg    St        Inf   IL2   Lg    St
  c        4         0.82  1.07  1.18  1.29      0.72  0.93  1.03  1.12
  c        8         0.71  0.96  1.07  1.18      0.62  0.81  0.91  1.00
  c        12        0.70  0.95  1.06  1.17      0.60  0.79  0.89  0.97
  o        4         0.67  0.93  1.02  1.12      0.53  0.77  0.86  0.95
  o        8         0.44  0.71  0.81  0.91      0.31  0.56  0.65  0.75
  o        12        0.41  0.68  0.77  0.88      0.27  0.51  0.60  0.70
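As a closing illustration, the sketch below recomputes from the OLTP table
above how much CPI improvement each step in width buys, which is one way to see
the "diminishing benefits beyond eight operations per cycle" observation in the
numbers. The array values are copied from the table; the loop structure and
labels are illustrative only.

```c
/* Illustrative sketch: CPI improvement from widening the machine, using the
 * OLTP CPI table above.  Indices: [issue policy][width 4/8/12][predictor+cache]. */
#include <stdio.h>

int main(void)
{
    static const double cpi[2][3][8] = {
        /* class-order       Bp: Inf  IL2   Lg    St    Pf: Inf  IL2   Lg    St */
        { { 0.82, 1.07, 1.18, 1.29,   0.72, 0.93, 1.03, 1.12 },   /* width 4  */
          { 0.71, 0.96, 1.07, 1.18,   0.62, 0.81, 0.91, 1.00 },   /* width 8  */
          { 0.70, 0.95, 1.06, 1.17,   0.60, 0.79, 0.89, 0.97 } }, /* width 12 */
        /* out-of-order */
        { { 0.67, 0.93, 1.02, 1.12,   0.53, 0.77, 0.86, 0.95 },
          { 0.44, 0.71, 0.81, 0.91,   0.31, 0.56, 0.65, 0.75 },
          { 0.41, 0.68, 0.77, 0.88,   0.27, 0.51, 0.60, 0.70 } }
    };
    const char *policy[2] = { "class-order", "out-of-order" };
    const char *cfg[8] = { "Bp/Inf", "Bp/IL2", "Bp/Lg", "Bp/St",
                           "Pf/Inf", "Pf/IL2", "Pf/Lg", "Pf/St" };

    for (int p = 0; p < 2; p++)
        for (int c = 0; c < 8; c++) {
            /* relative CPI reduction when going from width 4 to 8, and 8 to 12 */
            double w4_to_w8  = 100.0 * (cpi[p][0][c] - cpi[p][1][c]) / cpi[p][0][c];
            double w8_to_w12 = 100.0 * (cpi[p][1][c] - cpi[p][2][c]) / cpi[p][1][c];
            printf("%-12s %-7s: 4->8 %5.1f%%   8->12 %4.1f%%\n",
                   policy[p], cfg[c], w4_to_w8, w8_to_w12);
        }
    return 0;
}
```

In every configuration the second widening step recovers far less CPI than the
first, in line with the observation above.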