"Validation of Turandot, a Fast Processor Model for Microarchitecture Exploration,"

Presented at: Int’l. Performance, Computing and Communication
Conference (IPCCC), Feb 10–12, 1999
Validation of Turandot, a Fast Processor Model for
Microarchitecture Exploration
Mayan Moudgill, Pradip Bose and Jaime H. Moreno
IBM T. J. Watson Research Center
Yorktown Heights, NY
February 11, 1999
Pre-Silicon Testing and Validation

Test/Validation Team Maxim (keeps scores of test engineers employed!):
    "If it wasn't tested, it doesn't work."
    - usually means: testing for function

Performance Team Maxim (keeps a few performance architects alive!):
    "It doesn't work until it works at target speed."
    - implies: testing for performance

Caveat: the whole team (test, performance, and all) may be out of a
job if the test/verification bottleneck significantly delays
time-to-market.
What’s Performance?
CPU Execution Time = Seconds/Program
                   = (Instrs/Program) * (Cycles/Instruction) * (Seconds/Cycle)
                   = PL * CPI * CT

    PL  : path length, the number of instructions (compiler, ISA)
    CPI : cycles per instruction (m/c organization)
    CT  : cycle time, in nanoseconds (dev/cct technology)

In this paper, our focus is on the CPI component.
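As a back-of-the-envelope illustration of the decomposition,
consider the C fragment below; all three input values are
hypothetical, chosen only to show the arithmetic.

    #include <stdio.h>

    /* Iron-law decomposition: Seconds/Program = PL * CPI * CT.
     * The numbers below are hypothetical, for illustration only. */
    int main(void)
    {
        double PL  = 2.0e9;   /* path length: dynamic instr count      */
        double CPI = 1.25;    /* cycles per instr (m/c organization)   */
        double CT  = 2.0e-9;  /* cycle time: 2 ns (dev/cct technology) */

        printf("CPU execution time = %.2f s\n", PL * CPI * CT); /* 5.00 s */
        return 0;
    }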
How's Pre-Silicon Performance (CPI) Estimated?

[Figure: typical PowerPC processor performance modeling flow. A
source program (Fortran/C) is compiled by xlf/xlc into a binary
(XCOFF) file. Running on an RS/6000 machine, a trace generator
(aria) produces a dynamic trace, which drives the timer (C code);
the timer is configured by micro-arch parms, includes a finite
cache simulator, and emits CPI, CPF stats, and time-line output.
The trace generator could be replaced by a s/w or h/w functional
simulator.]

Timer: a trace-driven, cycle-by-cycle pipeline simulator of the
target microarchitecture.

Turandot: a fast PowerPC research timer
(developed by Mayan Moudgill).
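In outline, a trace-driven timer consumes one trace record per
instruction while advancing machine state cycle by cycle. The
sketch below shows only that outer loop, assuming a hypothetical
trace-record layout and an ideal 4-wide dispatch machine; it is not
Turandot's actual code.

    #include <stdio.h>

    /* One dynamic-trace record (hypothetical layout). */
    typedef struct {
        unsigned long addr;    /* instruction address                */
        unsigned long ea;      /* effective address for loads/stores */
        unsigned int  opcode;
    } trace_rec_t;

    #define DISPATCH_BW 4      /* max instructions dispatched per cycle */

    void run_timer(FILE *trace)
    {
        unsigned long cycles = 0, instrs = 0;
        trace_rec_t r;
        int done = 0;

        while (!done) {
            int n = 0;
            /* dispatch up to DISPATCH_BW trace records this cycle */
            while (n < DISPATCH_BW && fread(&r, sizeof r, 1, trace) == 1) {
                n++;
                instrs++;
            }
            if (n < DISPATCH_BW)
                done = 1;          /* trace exhausted */
            if (n > 0)
                cycles++;          /* one machine cycle elapsed */
        }
        if (instrs)
            printf("CPI = %.3f over %lu instrs\n",
                   (double)cycles / instrs, instrs);
    }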
Performance Validation: Components of the Problem

[Figure: input (workload) feeds the MODEL, which produces output
(data, results, timelines).]

- Workload (trace) validation
- Model validation (our primary focus in this talk)
- Data (results) validation

Performance bugs fall into two main categories:
- overall modeling errors and s/w bugs
- performance gaps related to design deficiencies
Processor Core Organization (as modeled by Turandot)

[Figure: block diagram of the modeled core. Ifetch/BP (backed by
the Icache and I-TLB) feeds decode/expand, then rename/dispatch,
which fills four issue queues: integer, load/store, floating-point,
and branch. These feed the integer units, the load-store units
(with load-store queues and an LS reorder buffer, backed by the
Dcache and D-TLB), the floating-point units, and the branch units.
Completed operations drain through the retirement queue. The L2 and
memory hierarchy are also modeled (not shown).]
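Read as a data structure, the modeled core might be represented
along the following lines. This is a hypothetical sketch: the field
names and queue sizes are illustrative, not Turandot's.

    /* Hypothetical representation of the modeled core; field names
     * and queue sizes are illustrative only. */
    typedef struct { /* ... per-entry payload ... */ int valid; } entry_t;

    typedef struct {
        entry_t q[32];
        int     head, tail, count;
    } queue_t;

    typedef struct {
        queue_t issue_int;      /* integer issue queue        */
        queue_t issue_ls;       /* load/store issue queue     */
        queue_t issue_fp;       /* floating-point issue queue */
        queue_t issue_br;       /* branch issue queue         */
        queue_t ls_reorder;     /* load-store reorder buffer  */
        queue_t retire;         /* retirement queue           */
        int     n_int_units, n_ls_units, n_fp_units, n_br_units;
        /* Icache/I-TLB, Dcache/D-TLB, L2 and memory-hierarchy
         * state would hang off here as well. */
    } core_t;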
Realistic Approach (at this time)

[Figure: micro-arch parms, with biasing from a parm fault model,
together with specs from a proven analytic model (eliot), drive
test case generation (manual/automatic; focused/random). The
generated test cases feed both a "gold" reference model, which
produces expected performance signature(s), and the s/w model or
h/w box under test (R-model), which produces measured performance
signature(s); the two sets of signatures are checked for
agreement.]

Limitations:
- coverage issues (as in classical testing)
- confidence (provability) of "gold" signatures
- ease of integration with existing simulation-based functional
  validation
The Problem of Specification

Example instruction sequence (test case):

    fadd f1, f2, f3   # add: C[f2] + C[f3] --> C[f1]
    stfd f1, 8(g8)    # store f1 into addr A = C[g8] + 8
    lfd  f4, 8(g8)    # load f4 from addr A = C[g8] + 8

ISA specs of the individual instructions allow a gold reference
model to predict the visible register states. But can the
"expected" cycle count be specified or predicted from a
microarchitecture description? Should this sequence run in 1, 2, 3,
4, ... machine cycles?

(Note: writing a simulator to predict "gold counts" is not good
enough, since that model itself may have bugs.)
Atomic Instruction Flow "Specs"

fadd instruction:

    Cycle ->   n    n+1        n+2   n+3   n+4   n+5   n+6   n+7   n+8
    Action ->  IF   DE;RN;DS   FRR   ISS   EX1   EX2   EX3   WBF   CMP

stfd instruction:

    Cyc ->        n    n+1        n+2   n+3   n+4     n+5   n+6   n+7   n+8   n+9
    Act (agen) -> IF   DE;RN;DS   IRR   ISS   EA;TL   WSQ   FIN   CMP
    Act (dmov) ->      DE;DS      FRR   ISS   WSQ     FIN   -     CMP   -     CWR

lfd instruction:

    Cycle ->   n    n+1        n+2   n+3   n+4        n+5     n+6   n+7
    Action ->  IF   DE;RN;DS   IRR   ISS   EA;DL;TL   CA;AL   WBF   CMP
Can one infer pair behavior from atomic specs?

Not without additional atomic specs:
- fadd-stfd dependence "bubble" latency
- stfd-lfd dependence "bubble" latency

Other microarchitecture parms also matter:
- issue width
- register ports
- ....

Are "pair" dependence specs always enough to allow prediction for a
3-instruction dependence chain? What about general basic-block code
sequences? This motivates a basic-block execution cost estimator,
driven by a high-level microarchitecture parms file (see the sketch
below).
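A minimal sketch of such an estimator, assuming a hypothetical
pair-latency ("bubble") table in place of the parms file; the
opcodes and latency values below are invented for illustration.

    /* Hypothetical basic-block cost estimator: charge each group of
     * instructions its dispatch slot, plus any dependence "bubble"
     * read from a pair-latency table (values here are invented). */
    enum op { OP_FADD, OP_STFD, OP_LFD, OP_NUM };

    /* bubble[a][b]: extra cycles when b consumes a's result */
    static const int bubble[OP_NUM][OP_NUM] = {
        /*            FADD STFD LFD */
        /* FADD */ {   3,   3,  0 },
        /* STFD */ {   0,   0,  2 },
        /* LFD  */ {   2,   2,  0 },
    };

    typedef struct {
        enum op op;
        int     feeds_next;   /* does the next instr consume our result? */
    } instr_t;

    int block_cost(const instr_t *b, int n, int disp_bw)
    {
        int cycles = (n + disp_bw - 1) / disp_bw;  /* dispatch-bound floor */
        for (int i = 0; i + 1 < n; i++)
            if (b[i].feeds_next)                   /* dependent pair?      */
                cycles += bubble[b[i].op][b[i + 1].op];
        return cycles;
    }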
Idealized Bounds Model (I-BOUND)
(loop performance)

    cpi = cpI / N

    cpi : cycles per instruction
    cpI : cycles per iteration
    N   : number of instructions per iteration
    NL, NS : number of loads and stores per iteration

    cpI = max(cpI_fetch-bound, cpI_agen-bound, cpI_store-port-bound,
              cpI_dispatch-bound, cpI_lsu-issue-bound,
              cpI_fpu-issue-bound, cpI_compl-bound)

    cpI_fetch-bound     = N / fetch_bw
    cpI_load-port-bound = NL / l_ports
    cpI_disp-bound      = N / disp_bw
    cpI_agen-bound      = (NL + NS) / ls_units
    .... etc.
EXAMPLE I-BOUND CALCULATION (loop03)

    lfd    fr0, 0008(r9)
    lfdu   fr1, 0010(r8)
    fadd   fr0, fr3, fr0
    lfdu   fr2, 0010(r9)
    fadd   fr2, fr1, fr2
    lfd    fr1, 0008(r9)
    stfd   fr0, 0008(r5)
    lfd    fr0, 0008(r8)
    stfdu  fr2, 0010(r5)
    lfdu   fr2, 0008(r10)
    fadd   fr0, fr0, fr1
    lfdu   fr1, 0010(r9)
    lfd    fr3, 0008(r8)
    fadd   fr1, fr2, fr1
    stfd   fr0, 0008(r5)
    stfdu  fr1, 0010(r5)
    bc     /* branch conditionally to top of loop */

Idealized analytical bounds model (infinite queues/buffers,
infinite cache; 2 LSU, 2 FPU, 1 cache store port, 2 cache load
ports, dispatch 4 instrs/cycle max, complete 4 instrs/cycle max):

    dispatch-bound or compl-bound cpI = (16/4) + 1 = 5
    agen-bound cpI                    = 12/2     = 6
    cache-load-port-bound cpI         = 8/2      = 4
    cache-store-port-bound cpI        = 4/1      = 4

So the overall cpI bound = 6, and the idealized steady-state
cpi = 6/17 = 0.353; steady-state cpf = 1.5.
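The same bound computation can be written out directly. The counts
below are taken from the loop03 listing above (17 instructions, 8
loads, 4 stores); ceiling division reproduces the slide's numbers.

    #include <stdio.h>

    #define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

    int main(void)
    {
        /* loop03 counts and the idealized machine from above:
         * 2 LSU, 1 cache store port, 2 cache load ports,
         * dispatch/complete 4 instrs per cycle. */
        int N = 17, NL = 8, NS = 4;
        int disp_bw = 4, ls_units = 2, l_ports = 2, s_ports = 1;

        int disp  = CEIL_DIV(N, disp_bw);        /* 5 */
        int agen  = CEIL_DIV(NL + NS, ls_units); /* 6 */
        int lport = CEIL_DIV(NL, l_ports);       /* 4 */
        int sport = CEIL_DIV(NS, s_ports);       /* 4 */

        int cpI = disp;                          /* take the max bound */
        if (agen  > cpI) cpI = agen;
        if (lport > cpI) cpI = lport;
        if (sport > cpI) cpI = sport;

        printf("cpI = %d, cpi = %.3f\n", cpI, (double)cpI / N);
        /* prints: cpI = 6, cpi = 0.353 */
        return 0;
    }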
FAULT DICTIONARY STRUCTURE

Fault-free signature dictionary (infinite cache/TLB mode): example
structure.

Test case classes (rows):
- single-instruction test cases: SI1 ... SIn1
- pair test cases: PA1 ... PAn2
- block (loop) test cases: LP1 ... LPn3
- complex, multi-block test cases

Signatures recorded per test case (columns):
- early-stage signatures:
  - cycle count, single invocation or iteration (IB : RB)
  - cycle count, fixed number of iterations n (IB : RB)
  - steady-state, loop-mode cpi (IB : RB)
- late-stage signatures:
  - steady-state, loop-mode pipeline state transition sequence
  - cycle-by-cycle pipeline state signature
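One plausible in-memory form for a dictionary entry is sketched
below; this is a hypothetical structure with illustrative field
names, not taken from the actual tooling.

    /* Hypothetical in-memory form of a fault-free signature
     * dictionary entry; field names are illustrative. */
    enum tc_class { TC_SINGLE, TC_PAIR, TC_LOOP, TC_MULTIBLOCK };

    typedef struct {
        enum tc_class cls;
        char          name[16];      /* e.g. "SI1", "PA1", "LP1"       */
        /* early-stage signatures, stored as (IB : RB) pairs           */
        int           cyc_single_ib, cyc_single_rb;  /* one invocation */
        int           cyc_niter_ib,  cyc_niter_rb;   /* n iterations   */
        double        loop_cpi_ib,   loop_cpi_rb;    /* steady state   */
        /* late-stage signature: cycle-by-cycle pipeline states        */
        const char  **pipe_states;   /* state transition sequence      */
        int           n_states;
    } signature_t;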
Turandot Calibration (against R-model)

Validation Procedure:

1. Exercise the reference R-model and the initial (non-validated)
   Turandot model for the stated configurations with the SPECint95
   trace test suite (10 million instructions per workload).
2. Record the deviations in the aggregate cycles-per-instruction
   (CPI) for each workload.
3. If the deviations are large, focus on intrinsic testing (single
   instructions, pairs, sequences, and basic loop tests) to
   calibrate Turandot against the R-model; if the deviations are
   already within an acceptable margin, go to Step 6 (i.e., skip
   Steps 4-5).
4. Once intrinsic-level calibration has been achieved, exercise
   Turandot again with the SPECint95 trace test suite. If the
   deviations are within an acceptable margin, go to Step 6 (i.e.,
   skip Step 5).
5. Attempt cycle-by-cycle validation for selected short instruction
   sequences from "hot spots" within the SPECint95 trace test suite
   for calibration, if necessary. Use an independent analytical
   bounds model reference (e.g., eliot-based predictions) to aid in
   diagnosing discrepancies between the models, if needed.
6. Once the CPI deviations are within acceptable limits, investigate
   a set of other key statistics reported by the models: e.g., the
   number of instructions issued relative to the number of
   instructions completed, histograms of resource utilization, etc.
   If major mismatches are still observed, go back to Step 5 using
   additional block and loop test cases as needed, and exercise
   untested regions of the model.
7. Terminate the procedure when an acceptable level of calibration
   is achieved across all test cases as well as the benchmark
   reference trace suite (SPECint95).
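A minimal sketch of the deviation check underlying Steps 2-4, with
the acceptance margin left as a parameter; the notion of an
"acceptable margin" is from the procedure, but the code itself is
illustrative.

    #include <math.h>

    /* Per-workload CPI deviation of Turandot vs. the R-model. */
    static double deviation_pct(double cpi_turandot, double cpi_rmodel)
    {
        return 100.0 * (cpi_turandot - cpi_rmodel) / cpi_rmodel;
    }

    /* Returns 1 if every workload is within the margin (proceed to
     * Step 6), 0 otherwise (fall back to intrinsic testing). */
    int within_margin(const double *t, const double *r, int n,
                      double margin_pct)
    {
        for (int i = 0; i < n; i++)
            if (fabs(deviation_pct(t[i], r[i])) > margin_pct)
                return 0;
        return 1;
    }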
Results Summary

Table 2. Initial CPI comparisons using SPECint95 sampled traces
(10M instrs each)

                               Error with respect to R-model (%)
    Trace                      InfPrf    Prf      Inf      Std
    ------------------------------------------------------------
    compress                   -48.0    -44.2    -37.4    -34.9
    gcc                        -28.5    -31.0    -18.4    -19.8
    go                         -32.1    -32.6    -24.8    -25.8
    ijpeg                      -41.2    -39.0    -37.5    -36.6
    li                         -25.5    -28.6    -20.7    -22.8
    m88ksim                    -28.6    -22.2    -28.3    -22.0
    perl                       -34.8    -34.8    -22.5    -18.6
    vortex                     -24.5    -23.3    -23.5    -22.6
    ------------------------------------------------------------
    Average error              -32.9    -31.9    -26.6    -25.3

Table 8. Final CPI deviation from R-model for SPECint95 sampled
traces (10M instrs)

                               Validated Turandot deviation (%)
    Trace                      InfPrf    Prf      Inf      Std
    ------------------------------------------------------------
    compress                    -4.7     -4.2     -9.6     -1.7
    gcc                         -7.1     -5.9     -1.9     -0.2
    go                          -4.1     -4.8     +0.7     -0.2
    ijpeg                       -1.3     -0.6     -1.5     -1.1
    li                          -9.1     -9.8     +3.3     +1.3
    m88ksim                     -8.4     +4.4     -8.0     +3.1
    perl                        -8.3     -8.0     +5.1     +5.5
    vortex                      +2.6     +8.9     -0.4     +5.4
    ------------------------------------------------------------
    Average (absolute) error     5.7      5.8      3.8      2.3
Output Data Validation (example)

                      Turandot-V deviation (%)
    Trace        CPI     CPO     EXF    NOI/NOC   disp_succ
    -------------------------------------------------------
    compress    -3.8    -4.1     0.0     -4.6       13.5
    gcc         -8.3    -8.3     0.0     -2.9       11.2
    go          -5.1    -5.9     0.0     -1.9       15.1
    ijpeg       -1.3    -1.6     0.0     -7.3       40.3
    li          -9.1    -9.5     0.0     -5.6        7.3

Count and aggregate metrics:
    CYC : total number of cycles
    NIC : total number of PowerPC instructions completed
    NOC : total number of internal operations completed
    EXF : expansion factor = NOC/NIC
    CPI : cycles per instruction = CYC/NIC
    CPO : cycles per internal operation = CYC/NOC
    NOD : total number of internal operations dispatched
    NOI : total number of internal operations issued

Dispatch stall indicators:
    disp_idle : percentage of cycles when no instr was available
                for dispatch
    other metrics: e.g., disp_succ, store_queue_full, etc.
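The derived metrics in this list follow directly from the raw
counters; for instance, in C:

    /* Derived metrics from the raw counters defined above. */
    typedef struct {
        unsigned long CYC, NIC, NOC, NOD, NOI;
    } counters_t;

    static double exf(const counters_t *c) { return (double)c->NOC / c->NIC; }
    static double cpi(const counters_t *c) { return (double)c->CYC / c->NIC; }
    static double cpo(const counters_t *c) { return (double)c->CYC / c->NOC; }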
Conclusion

- Pre-silicon validation of Turandot, a fast, programmable
  processor performance simulator:
  - against a much slower (by a factor of more than 75) pre-RTL
    reference R-model
  - with an analytical bounds model (eliot) used as an additional,
    independent reference
- A systematic, step-by-step methodology was used to achieve
  calibration:
  - final Turandot benchmark CPI numbers are within 5% of the
    R-model
- The results demonstrate that the methodology allows quick
  convergence to the reference model without sacrificing simulation
  speed.