Presented at: Int'l. Performance, Computing and Communication Conference (IPCCC), Feb 10–12, 1999

Validation of Turandot, a Fast Processor Model for Microarchitecture Exploration

Mayan Moudgill, Pradip Bose and Jaime H. Moreno
IBM T. J. Watson Research Center, Yorktown Heights, NY
February 11, 1999

Pre–Silicon Testing and Validation

Test/Validation Team Maxim (keeps scores of test engineers employed!):
  "If it wasn't tested, it doesn't work."
  – usually means: testing for function

Performance Team Maxim (keeps a few performance architects alive!):
  "It doesn't work until it works at target speed."
  – implies: testing for performance

Caveat: The whole team (test, performance and all) may be out of a job if the test/verification bottleneck delays time-to-market significantly.

What's Performance?

  CPU Execution Time (seconds/program)
    = (instrs/prog)    * (cycles/instruction) * (seconds/cycle)
    =  PL              *  CPI                 *  CT
      (compiler, ISA)    (m/c organization)     (dev/cct technology)

  PL  : path length (number of instructions executed)
  CPI : cycles per instruction
  CT  : cycle time (nanoseconds)

In this paper, our focus is on the CPI component.

How's Pre–Silicon Performance (CPI) Estimated?

Typical PowerPC processor performance modeling flow:

  source program (Fortran/C)
    -> compiler (xlf/xlc) -> binary (xcoff) file
    -> trace generator (aria), run on an RS/6000 machine -> dynamic trace
       (this step could be replaced by a s/w or h/w functional simulator)
    -> Timer (C code), driven by a micro-arch parms file and including a
       finite-cache simulator
    -> CPI, CPF stats, time-line output

Timer: a trace-driven, cycle-by-cycle pipeline simulator of the target microarchitecture.
Turandot: a fast PowerPC research timer (developed by Mayan Moudgill).

Performance Validation: Components of the Problem

  Input (workload) -> MODEL -> Output (data, results, timelines)

- Workload (trace) validation
- Model validation (our primary focus in this talk)
- Data (results) validation

Performance bugs fall into two main categories:
- overall modeling errors, s/w bugs
- design-deficiency-related performance gaps

Processor Core Organization (as modeled by Turandot)

  Icache, I-TLB -> Ifetch/BP -> decode/expand -> rename/dispatch
    -> issue queues (integer, load/store, float, branch)
    -> integer units, load-store units, floating point units, branch units

The load-store units access the Dcache and D-TLB through the load-store queues and the LS reorder buffer; instructions drain in order through the retirement queue. The L2 and the rest of the memory hierarchy are also modeled (not shown).

Realistic Approach (at this time)

- Micro-arch parms (specs), optionally biased through a parm fault model, drive test case generation (manual/automatic; focussed/random).
- Each test case is run through a "gold" reference model (a proven analytic model, eliot) to obtain expected performance signature(s), and through the s/w model or h/w box under test (R-model) to obtain measured performance signature(s).
- The two sets of signatures are then checked for agreement.

Limitations:
- coverage issues (as in classical testing)
- confidence (provability) of the "gold" signatures
- ease of integration with existing simulation-based functional validation

The Problem of Specification

Example instruction sequence (test case):

  fadd f1, f2, f3   # add: C[f2] + C[f3] --> C[f1]
  stfd f1, 8(g8)    # store f1 into addr A = C[g8] + 8
  lfd  f4, 8(g8)    # load f4 from addr A = C[g8] + 8

The ISA specs of the individual instructions allow a gold reference model to predict the visible register states. But can the "expected" cycle count be specified or predicted from a microarchitecture description? Should it run in 1, 2, 3, 4, ... machine cycles? (Note: writing a simulator to predict "gold counts" is not good enough, since that model itself may have bugs.)
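As a purely illustrative rendering of the question, the sketch below tabulates atomic per-instruction costs and pairwise dependence "bubble" penalties and sums them over a dependence chain. Every number in it is a hypothetical placeholder, not a real spec; the flow "specs" that follow show what the real per-instruction tables look like, and the subsequent discussion asks whether such pair specs are even sufficient.

  /* Illustrative sketch only: a table-driven "gold" cycle-count
   * predictor built from atomic costs plus pairwise dependence
   * "bubble" penalties.  All numbers are hypothetical placeholders. */
  #include <stdio.h>

  enum op { FADD, STFD, LFD, NOPS };

  /* hypothetical atomic cost, in cycles, of each op in isolation */
  static const int atomic_cost[NOPS] = { 3, 2, 2 };

  /* hypothetical extra "bubble" cycles when the second op of a
   * dependent pair follows the first (e.g. fadd->stfd, stfd->lfd) */
  static const int pair_bubble[NOPS][NOPS] = {
      /*           FADD STFD LFD */
      /* FADD */ {  2,   3,   0 },
      /* STFD */ {  0,   0,   2 },
      /* LFD  */ {  1,   0,   0 },
  };

  /* predict cycles for a serial dependence chain: each op pays its
   * atomic cost plus the bubble against its predecessor */
  static int predict_cycles(const enum op *seq, int n)
  {
      int cycles = 0;
      for (int i = 0; i < n; i++) {
          cycles += atomic_cost[seq[i]];
          if (i > 0)
              cycles += pair_bubble[seq[i - 1]][seq[i]];
      }
      return cycles;
  }

  int main(void)
  {
      enum op test[] = { FADD, STFD, LFD };  /* the test case above */
      printf("expected cycles: %d\n", predict_cycles(test, 3));
      return 0;
  }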
Atomic Instruction Flow "Specs"

fadd instruction:

  Cycle:    n    n+1        n+2   n+3   n+4   n+5   n+6   n+7   n+8
  Action:   IF   DE;RN;DS   FRR   ISS   EX1   EX2   EX3   WBF   CMP

stfd instruction (cracked into agen and dmov internal operations):

  Cycle:        n    n+1        n+2   n+3   n+4     n+5   n+6   n+7   n+8   n+9
  Act (agen):   IF   DE;RN;DS   IRR   ISS   EA;TL   WSQ   FIN   CMP   –     –
  Act (dmov):   IF   DE;DS      FRR   ISS   WSQ     FIN   –     CMP   –     CWR

lfd instruction:

  Cycle:    n    n+1        n+2   n+3   n+4        n+5     n+6   n+7
  Action:   IF   DE;RN;DS   IRR   ISS   EA;DL;TL   CA;AL   WBF   CMP

Can one infer pair behavior from atomic specs?

Not without additional atomic specs:
- fadd–stfd dependence "bubble" latency
- stfd–lfd dependence "bubble" latency

and other microarchitecture parms:
- issue width
- register ports
- ...

Are "pair" dependence specs always enough to allow prediction for a 3-instruction dependence chain? What about general basic-block code sequences?

- Basic-block execution cost estimator (driven by a high-level microarchitecture parms file).

Idealized Bounds Model (I-BOUND) (loop performance)

  cpi = cpI / N

where cpI is the number of cycles per iteration and N is the number of instructions per iteration.

  cpI = max(cpI_fetch-bound, cpI_agen-bound, cpI_store-port-bound,
            cpI_dispatch-bound, cpI_lsu-issue-bound, cpI_fpu-issue-bound,
            cpI_compl-bound)

  cpI_fetch-bound     = N / fetch_bw
  cpI_load-port-bound = NL / l_ports
  cpI_disp-bound      = N / disp_bw
  cpI_agen-bound      = (NL + NS) / ls_units
  ... etc.

Example I-BOUND Calculation (loop03)

  lfd   fr0, 0008(r9)
  lfdu  fr1, 0010(r8)
  fadd  fr0, fr3, fr0
  lfdu  fr2, 0010(r9)
  fadd  fr2, fr1, fr2
  lfd   fr1, 0008(r9)
  stfd  fr0, 0008(r5)
  lfd   fr0, 0008(r8)
  stfdu fr2, 0010(r5)
  lfdu  fr2, 0008(r10)
  fadd  fr0, fr0, fr1
  lfdu  fr1, 0010(r9)
  lfd   fr3, 0008(r8)
  fadd  fr1, fr2, fr1
  stfd  fr0, 0008(r5)
  stfdu fr1, 0010(r5)
  bc    /* branch conditionally to top of loop */

Idealized analytical bounds model (infinite queues/buffers, infinite cache; 2 LSU, 2 FPU, 1 cache store port, 2 cache load ports, dispatch max 4 instrs/cycle, complete max 4 instrs/cycle):

  dispatch-bound or compl-bound cpI = (16/4) + 1 = 5
  agen-bound cpI                    = 12/2       = 6
  cache-load-port-bound cpI         = 8/2        = 4
  cache-store-port-bound cpI        = 4/1        = 4

So the overall cpI bound = 6, giving an idealized steady-state cpi = 6/17 = 0.353 and a steady-state cpf = 1.5. (A code sketch of this calculation follows.)
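The loop03 bound calculation is mechanical, so it can be scripted. Below is a minimal sketch in C, assuming the parameter names from the I-BOUND description above; only the bounds exercised by the worked example are included, the "+1" term mirrors the branch adjustment used there, and cpf is taken to mean cycles per floating-point op (an assumed interpretation). It reproduces cpI = 6, cpi = 0.353 and cpf = 1.5 for the stated configuration.

  /* Minimal I-BOUND sketch: idealized cycles-per-iteration bounds for a
   * loop body, assuming infinite queues/buffers and infinite cache.
   * Only the bounds used in the loop03 example are included; the full
   * model also has fetch, LSU-issue and FPU-issue bounds. */
  #include <stdio.h>

  struct machine {
      double disp_bw;   /* max instrs dispatched per cycle */
      double compl_bw;  /* max instrs completed per cycle */
      double ls_units;  /* load-store (agen) units */
      double l_ports;   /* cache load ports */
      double s_ports;   /* cache store ports */
  };

  struct loop {
      double n;   /* instructions per iteration (incl. the branch) */
      double nl;  /* loads per iteration */
      double ns;  /* stores per iteration */
      double nf;  /* floating-point ops per iteration */
  };

  static double max2(double a, double b) { return a > b ? a : b; }

  static double cpI_bound(const struct machine *m, const struct loop *l)
  {
      double disp  = (l->n - 1) / m->disp_bw  + 1;   /* dispatch bound */
      double cmpl  = (l->n - 1) / m->compl_bw + 1;   /* completion bound */
      double agen  = (l->nl + l->ns) / m->ls_units;  /* agen bound */
      double load  = l->nl / m->l_ports;             /* load-port bound */
      double store = l->ns / m->s_ports;             /* store-port bound */
      return max2(max2(disp, cmpl), max2(agen, max2(load, store)));
  }

  int main(void)
  {
      struct machine m = { 4, 4, 2, 2, 1 };   /* configuration above */
      struct loop loop03 = { 17, 8, 4, 4 };
      double cpI = cpI_bound(&m, &loop03);
      printf("cpI bound = %.0f\n", cpI);        /* 6 */
      printf("cpi = %.3f\n", cpI / loop03.n);   /* 0.353 */
      printf("cpf = %.1f\n", cpI / loop03.nf);  /* 1.5 */
      return 0;
  }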
Fault Dictionary Structure

Fault-free signature dictionary (infinite cache/TLB mode): example structure.

Test case classes (rows):
- single-instruction test cases (SI1 ... SIn1)
- pair test cases (PA1 ... PAn2)
- block (loop) test cases (LP1 ... LPn3)
- complex, multi-block test cases

Signatures recorded per test case (columns), each numeric signature stored as an (IB : RB) pair:
- early-stage signatures:
  - cycle count (single invocation or iteration)
  - steady-state, loop-mode cpi
- late-stage signatures:
  - cycle count (fixed number of iterations, n)
  - steady-state, loop-mode cpi
  - pipeline state transition sequence
  - cycle-by-cycle pipeline state signature

Turandot Calibration (against R-model)

Validation procedure:
1. Exercise the reference R-model and the initial (non-validated) Turandot model, for the stated configurations, with the SPECint95 trace test suite (10 million instructions per workload).
2. Record the deviations in the aggregate cycles-per-instruction (CPI) for each workload.
3. If the deviations are large, focus on intrinsic testing (single instructions, pairs, sequences, and basic loop tests) to calibrate Turandot against the R-model; if the deviations are already within an acceptable margin, go to step 6 (i.e., skip steps 4–5).
4. Once intrinsic-level calibration has been achieved, exercise Turandot again with the SPECint95 trace test suite. If the deviations are within an acceptable margin, go to step 6 (i.e., skip step 5).
5. If necessary, attempt cycle-by-cycle validation for selected short instruction sequences from "hot spots" within the SPECint95 trace test suite. Use an independent analytical bounds model reference (e.g., eliot-based predictions) to aid in diagnosing discrepancies between the models, if needed.
6. Once the CPI deviations are within acceptable limits, investigate a set of other key statistics reported by the models: e.g., the number of instructions issued relative to the number of instructions completed, histograms of resource utilization, etc. If major mismatches are still observed, go back to step 5 using additional block and loop test cases as needed, and exercise untested regions of the model.
7. Terminate the procedure when an acceptable level of calibration is achieved across all test cases as well as the benchmark reference trace suite (SPECint95).

Results Summary

Table 2. Initial CPI comparisons using SPECint95 sampled traces (10M instrs each); error with respect to the R-model (%):

  Trace           InfPrf    Prf      Inf      Std
  compress        –48.0    –44.2    –37.4    –34.9
  gcc             –28.5    –31.0    –18.4    –19.8
  go              –32.1    –32.6    –24.8    –25.8
  ijpeg           –41.2    –39.0    –37.5    –36.6
  li              –25.5    –28.6    –20.7    –22.8
  m88ksim         –28.6    –22.2    –28.3    –22.0
  perl            –34.8    –34.8    –22.5    –18.6
  vortex          –24.5    –23.3    –23.5    –22.6
  Average error   –32.9    –31.9    –26.6    –25.3

Table 8. Final CPI deviation from the R-model for SPECint95 sampled traces (10M instrs); validated Turandot deviation (%):

  Trace                      InfPrf   Prf      Inf      Std
  compress                   –4.7    –4.2     –9.6     –1.7
  gcc                        –7.1    –5.9     –1.9     –0.2
  go                         –4.1    –4.8     +0.7     –0.2
  ijpeg                      –1.3    –0.6     –1.5     –1.1
  li                         –9.1    –9.8     +3.3     +1.3
  m88ksim                    –8.4    +4.4     –8.0     +3.1
  perl                       –8.3    –8.0     +5.1     +5.5
  vortex                     +2.6    +8.9     –0.4     +5.4
  Average (absolute) error    5.7     5.8      3.8      2.3

Output Data Validation (example)

Turandot-V deviation from the R-model (%):

  Trace      CPI     CPO     EXF    NOI/NOC   disp_succ
  compress   –3.8    –4.1    0.0    –4.6      13.5
  gcc        –8.3    –8.3    0.0    –2.9      11.2
  go         –5.1    –5.9    0.0    –1.9      15.1
  ijpeg      –1.3    –1.6    0.0    –7.3      40.3
  li         –9.1    –9.5    0.0    –5.6       7.3

Count and aggregate metrics:
  CYC : total number of cycles
  NIC : total number of PowerPC instructions completed
  NOC : total number of internal operations completed
  EXF : expansion factor = NOC/NIC
  CPI : cycles per instruction = CYC/NIC
  CPO : cycles per internal operation = CYC/NOC
  NOD : total number of internal operations dispatched
  NOI : total number of internal operations issued

Dispatch stall indicators:
  disp_idle : percentage of cycles when no instruction was available for dispatch
  other metrics: e.g., disp_succ, store_queue_full, etc.
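The aggregate metrics above are simple ratios of the raw event counts, and the deviations tabulated throughout are plain relative errors against the R-model. A minimal sketch of that arithmetic (the counts used here are made up purely for illustration):

  /* Minimal sketch: derived aggregate metrics from raw event counts,
   * and the percentage deviation used in the comparison tables.
   * The example counts are hypothetical, for illustration only. */
  #include <stdio.h>

  struct counts {
      double cyc;  /* CYC: total cycles */
      double nic;  /* NIC: PowerPC instructions completed */
      double noc;  /* NOC: internal operations completed */
  };

  /* percentage deviation of the model under test from the reference */
  static double deviation_pct(double model, double ref)
  {
      return 100.0 * (model - ref) / ref;
  }

  int main(void)
  {
      /* hypothetical counts for Turandot and the reference R-model */
      struct counts tur = { 9.5e6, 1.0e7, 1.1e7 };
      struct counts ref = { 1.0e7, 1.0e7, 1.1e7 };

      double cpi_t = tur.cyc / tur.nic;  /* CPI = CYC/NIC */
      double cpi_r = ref.cyc / ref.nic;
      double cpo_t = tur.cyc / tur.noc;  /* CPO = CYC/NOC */
      double exf_t = tur.noc / tur.nic;  /* EXF = NOC/NIC */

      printf("CPI = %.3f, CPO = %.3f, EXF = %.2f\n", cpi_t, cpo_t, exf_t);
      printf("CPI deviation vs R-model = %.1f%%\n",
             deviation_pct(cpi_t, cpi_r));          /* -5.0%% here */
      return 0;
  }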
Conclusion

- Pre-silicon validation of Turandot, a fast, programmable processor performance simulator:
  - against a much slower (by more than a factor of 75) pre-RTL reference R-model
  - with an analytical bounds model (eliot) as an additional, independent reference
- A systematic, step-by-step methodology was used to achieve calibration:
  - final Turandot benchmark CPI numbers are within 5% of the R-model
- The results demonstrate that the methodology allows quick convergence to the reference model, without sacrificing simulation speed.