"Trace-driven performance exploration of a PowerPC 601 OLTP workload on wide superscalar processors,"

Superscalar performance exploration
Trace-driven performance exploration of a PowerPC 601
OLTP workload on wide superscalar processors
J. H. Moreno, M. Moudgill, J.D. Wellman, P.Bose, L. Trevillyan
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598
Why yet another superscalar processor model and performance evaluation?
Superscalar processors continue to dominate the field
No apparent end to the superscalar paradigm in the near future
Continuing improvements in features and capabilities
Certain aspects becoming easier due to the number of transistors available
Existing programs (binary compatibility)
Need to evaluate new implementation challenges
High-frequency objectives
New structures and algorithms
Wider instruction issue
Need to understand the impact of various classes of workloads
"Commercial" workloads
...
The MET: Microarchitecture Exploration Toolset
Collection of tools for exploration of microarchitecture features
Trace-driven and execution-driven tools (a minimal trace-driven loop is sketched below)
Fast simulation: >300 M instructions/hour
Intended to support early exploration of processor organizations
detailed model of a generalized pipeline
trends among results rather than their absolute magnitudes
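MET's internals are not shown in these slides; purely as an illustration of the trace-driven approach, the sketch below replays a pre-recorded instruction trace through a toy timing model. TraceRecord, the latency table, and simulate() are hypothetical names, not MET's API.

```python
from collections import namedtuple

# Hypothetical trace record: one entry per retired instruction in the trace.
TraceRecord = namedtuple("TraceRecord", ["addr", "opclass", "mem_addr", "taken"])

# Illustrative per-class latencies in cycles (not MET's actual values).
LATENCY = {"int": 1, "load": 2, "fp": 3, "branch": 1}

def simulate(trace, width=4):
    """Replay a list of TraceRecords through a toy width-limited timing model
    and return CPI; a real tool also models queues, caches, and misprediction."""
    cycles = 0
    for start in range(0, len(trace), width):              # issue up to `width` per cycle
        group = trace[start:start + width]
        cycles += max(LATENCY[r.opclass] for r in group)   # group retires together
    return cycles / len(trace) if trace else 0.0

# Example: a three-instruction trace on a 4-wide machine -> CPI = 2/3
trace = [TraceRecord(0x100, "int", None, False),
         TraceRecord(0x104, "load", 0x2000, False),
         TraceRecord(0x108, "branch", None, True)]
print(simulate(trace))
```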
Processor organization
[Block diagram: instruction fetch with NFA/branch predictor, L1-I cache, I-TLB, I-prefetch buffer, and I-buffer; decode/expand and rename/dispatch; separate issue queues, issue logic, register read, and execution units for integer, load/store, floating-point, and branch instructions; L1-D cache with D-TLB, TLB2, load/store reorder buffer, store queue, miss queue, and cast-out queue; L2 cache and main memory; retirement queue and retirement logic.]

Pipeline stages
Integer:         Fetch, Decode/Expand, Rename/Dispatch, Issue, Read, Exec, WB, Retire
Load:            Fetch, Decode/Expand, Rename/Dispatch, Issue, Read, EA, Dcache access, WB, Retire
Floating point:  Fetch, Decode/Expand, Rename/Dispatch, Issue, Read, Exec1, Exec2, Exec3, WB, Retire
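The stage sequences above can be written down as data; the snippet below (illustrative only, not MET code) records them and prints the minimum fetch-to-retire depth of each class, assuming one cycle per stage.

```python
# Pipeline stage sequences from the figure above (assumed one cycle per stage).
PIPELINES = {
    "integer":        ["Fetch", "Decode/Expand", "Rename/Dispatch", "Issue",
                       "Read", "Exec", "WB", "Retire"],
    "load":           ["Fetch", "Decode/Expand", "Rename/Dispatch", "Issue",
                       "Read", "EA", "Dcache access", "WB", "Retire"],
    "floating point": ["Fetch", "Decode/Expand", "Rename/Dispatch", "Issue",
                       "Read", "Exec1", "Exec2", "Exec3", "WB", "Retire"],
}

for cls, stages in PIPELINES.items():
    # Minimum depth if nothing stalls; queueing and misses lengthen it in practice.
    print(f"{cls}: {len(stages)} stages")
```
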
OLTP and GCC traces
                                      OLTP                      GCC
Length                                172 M instructions,       1212 M instructions,
                                      user and kernel space     user space
Branch instructions                   18.9 %                    21.6 %
Branches taken                        44.3 %                    56.3 %
Instrs. in kernel space               22.1 %                    n/a
Memory access instructions            34.8 %                    27.1 %
Load/store multiple instructions      1.6 %                     1.6 %
String instructions                   1.4 %                     0.3 %
Load/store w/update instrs.           1.7 %                     2.8 %
Average block size                    5.3 instrs.
Mispredicted instructions, addresses  No                        Yes
CPI adder for various processor configurations
[Bar chart: CPI for OLTP and gcc95 under a finite/non-perfect configuration vs. an infinite/perfect configuration; the gap between the two is the CPI adder.]

Much larger adder in the case of OLTP workload
Miss rates (per 1000 instructions): 64K L1, 2M L2, 128-entry TLBs, 8K-entry BHT

        I1     I2     D1     D2    I TLB   D TLB   Cond. branch misprediction
OLTP    21.3   3.1    22.4   1.8   0.02    4.5     5.3 %
GCC     8.3    0.02   9.8    1.8   ~0      1.5     7.0 %
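As a rough cross-check on why the OLTP adder is so much larger, the per-1000-instruction miss rates above can be weighted by the miss penalties listed on the parameters slide (L1: 8/7 cycles, L2 and TLB2: 40 cycles, first-level TLBs: 4 cycles). The sketch below assumes every miss fully stalls the pipeline and that the "8/7" L1 penalty maps to I-cache/D-cache respectively, so it is only an upper-bound estimate; the simulator models overlap and also charges branch mispredictions.

```python
# Misses per 1000 instructions (table above) and assumed penalties in cycles
# (from the "Other parameters" slide); serialized-miss upper bound only.
PENALTY = {"I1": 8, "D1": 7, "I2": 40, "D2": 40, "ITLB": 4, "DTLB": 4}

MISSES_PER_1000 = {
    "OLTP": {"I1": 21.3, "I2": 3.1,  "D1": 22.4, "D2": 1.8, "ITLB": 0.02, "DTLB": 4.5},
    "GCC":  {"I1": 8.3,  "I2": 0.02, "D1": 9.8,  "D2": 1.8, "ITLB": 0.0,  "DTLB": 1.5},
}

for workload, rates in MISSES_PER_1000.items():
    stall_cpi = sum(rates[k] / 1000.0 * PENALTY[k] for k in rates)
    print(f"{workload}: ~{stall_cpi:.2f} CPI from memory if misses never overlap")
```

Even this crude estimate puts the OLTP memory contribution at more than twice GCC's, in line with the bar chart above, although the absolute values differ from the simulated adders.
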
Exploration space (in this presentation)
Issue policy:                   Class-order; Out-of-order
Width (Fetch/Dispatch/Retire):  4/4/6; 8/8/12; 12/12/16
Cache sizes (L1-I, L1-D, L2):   64K, 64K, 2M; 128K, 128K, 4M; 128K, 128K, Inf; Inf, Inf, Inf
Branch prediction:              8192-entry BHT with 4096-entry BTAC; Perfect

Widths (Fetch/Dispatch/Retire)          4/4/6           8/8/12          12/12/16
Units (FX/FP/LS/BR)                     3/2/2/2         6/4/4/4         8/4/6/4
Data cache and TLB ports                2               4               6
Queues (Issue/Retire/IBuf)              20(12)/128/24   40/160/48       60/160/72
Physical registers (GPR/FPR/CCR/SPR)    80/80/32/64     128/128/64/96   128/128/64/96

Other parameters (examples)

Sizes
  I-prefetch buffer (entries)            4
  Miss queue, cast-out queue (entries)   8
  Store queue, reorder buffer (entries)  31
  D/I-TLBs (entries)                     128
  TLB2 (entries)                         1024
  L1-I/D, L2-cache line size (bytes)     128
  Page size (bytes)                      4096

Latencies
  I-prefetch buffer latency (cycles)     1
  D/I-TLBs miss penalty (cycles)         4
  TLB2 miss penalty (cycles)             40
  L1-I/D cache miss penalty (cycles)     8 / 7
  L2 cache miss penalty (cycles)         40

Branch prediction
  BTAC (entries)                         4096
  LR stack size (entries)                32
  Branch history table (entries)         8192
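Taken together, the tables above define each point in the exploration space; such a point can be expressed as plain data, as in the hypothetical sketch below (field names are ours, not MET's).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Config:
    # Widths and issue policy
    fetch: int
    dispatch: int
    retire: int
    issue_policy: str                  # "class-order" or "out-of-order"
    # Execution resources: FX/FP/LS/BR units and data cache/TLB ports
    units: Tuple[int, int, int, int]
    dcache_ports: int
    # Cache sizes in bytes (None = infinite) and branch prediction
    l1i: Optional[int]
    l1d: Optional[int]
    l2: Optional[int]
    bht_entries: int
    btac_entries: int
    perfect_bp: bool

# Example: the 8/8/12 out-of-order point with 64K/64K/2M caches and the 8K BHT.
base = Config(fetch=8, dispatch=8, retire=12, issue_policy="out-of-order",
              units=(6, 4, 4, 4), dcache_ports=4,
              l1i=64 * 1024, l1d=64 * 1024, l2=2 * 1024 * 1024,
              bht_entries=8192, btac_entries=4096, perfect_bp=False)
```
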
CPI adders due to issue policy (as % of base case)
[Bar charts for OLTP and GCC: CPI under class-order vs. out-of-order issue across the widths (4, 8, 12), cache configurations (St, Lg, IL2, Inf), and branch predictors (Bp, Pf); labels give the class-order CPI adder as a percentage of the out-of-order base case.]

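The deck does not spell out the two issue policies; the sketch below illustrates the reading assumed here: with class-order issue, each class's queue issues only from its head (in order within the class), while out-of-order issue may skip over a stalled older instruction. Names and the ready-set model are illustrative.

```python
def pick_to_issue(queue, ready, policy, slots):
    """Select up to `slots` instructions to issue from one class's issue queue.

    queue : instruction ids in program order (oldest first)
    ready : set of ids whose operands are available
    policy: "class-order" issues only from the head of the queue;
            "out-of-order" may bypass stalled older instructions.
    """
    issued = []
    for insn in queue:
        if len(issued) == slots:
            break
        if insn in ready:
            issued.append(insn)
        elif policy == "class-order":
            break                      # head stalled: younger instructions must wait
    return issued

queue, ready = [10, 11, 12, 13], {11, 13}               # oldest (10) not yet ready
print(pick_to_issue(queue, ready, "class-order", 2))    # -> []
print(pick_to_issue(queue, ready, "out-of-order", 2))   # -> [11, 13]
```
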
CPI adders due to branch prediction (as % of base case)
[Bar charts for OLTP and GCC: CPI with the imperfect (8192-entry BHT) vs. perfect branch predictor across issue policies (c, o), widths (4, 8, 12), and cache configurations (St, Lg, IL2, Inf); labels give the imperfect-predictor CPI adder as a percentage of the perfect-predictor base case.]

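The imperfect predictor in these experiments is the 8192-entry, 2-bit branch history table from the exploration-space slide. The generic 2-bit saturating-counter sketch below shows the mechanism; the indexing and initial state are assumptions, not the modeled design.

```python
class TwoBitBHT:
    """Generic branch history table of 2-bit saturating counters.
    Counter values 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, entries=8192):
        self.entries = entries
        self.table = [1] * entries            # assumed weakly not-taken start

    def _index(self, pc):
        return (pc >> 2) % self.entries       # assumed simple PC-based indexing

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

bht, wrong = TwoBitBHT(), 0
for taken in [True, True, False, True]:        # outcomes of one branch at 0x400
    if bht.predict(0x400) != taken:
        wrong += 1
    bht.update(0x400, taken)
print(wrong)                                   # 2 mispredictions for this toy history
```
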
CPI adders due to cache size (as % of base case)
[Bar charts for OLTP and GCC: CPI across the cache configurations (St, Lg, IL2, Inf) for each issue policy, width, and branch predictor; labels give the finite-cache CPI adders as a percentage of the base case.]

CPI adders due to processor width (as % of base case)
[Bar charts for OLTP and GCC: CPI at widths 4, 8, and 12 for each issue policy, cache configuration, and branch predictor; labels give the CPI adders of the narrower configurations as a percentage of the base case.]

In OLTP workload
"Least-aggressive" configurations considered:
  15 to 32% degradation due to class-order issue
    more severe degradation expected for an in-order policy
  15 to 26% degradation due to the imperfect branch predictor
  30 to 66% degradation due to finite L1 caches (128K)
  10 to 23% degradation due to processor width
Diminishing benefits beyond dispatching eight operations per cycle
  with a conventional instruction fetching mechanism
Still many microarchitecture issues to investigate in detail
Observations
Clear differences in OLTP behavior relative to GCC
  memory penalties in OLTP shadow other effects
Caveats due to use of traces
  length
  number of traces (just one in this presentation)
  observability
    in OLTP: no mispredicted paths
    in GCC: no kernel code
  time scaling
Summary
Environment for early exploration
  fast, flexible
  trends among aggressive superscalar organizations
Behavior of OLTP workload
  very different from others (e.g., SPEC)
  different microarchitecture tradeoffs
Aggressive superscalar
  buildable?
  need to quantify potential performance from a realizable implementation
  need to identify/develop features that provide a better return
CPI results in OLTP workload
Issue policy: c = class-order, o = out-of-order
Bp = 2-bit branch history table (8192 entries); Pf = perfect branch predictor

                        Bp                               Pf
Issue   Width     Inf    IL2    Lg     St          Inf    IL2    Lg     St
c       4         0.82   1.07   1.18   1.29        0.72   0.93   1.03   1.12
c       8         0.71   0.96   1.07   1.18        0.62   0.81   0.91   1.00
c       12        0.70   0.95   1.06   1.17        0.60   0.79   0.89   0.97
o       4         0.67   0.93   1.02   1.12        0.53   0.77   0.86   0.95
o       8         0.44   0.71   0.81   0.91        0.31   0.56   0.65   0.75
o       12        0.41   0.68   0.77   0.88        0.27   0.51   0.60   0.70
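
The percentage adders plotted earlier can be recovered directly from this table, e.g. the branch-prediction adder for the out-of-order, 8-wide, St-cache point is (0.91 - 0.75) / 0.75, about 21%. The snippet below repeats that arithmetic for a few cases; the pairings are read from the table above.

```python
def adder_pct(cpi, base):
    """CPI adder of `cpi` relative to `base`, as a percentage of the base case."""
    return 100.0 * (cpi - base) / base

# Branch prediction: out-of-order, 8-wide, St caches (Bp vs. Pf)
print(round(adder_pct(0.91, 0.75)))    # ~21 %

# Issue policy: 4-wide, St caches, BHT (class-order vs. out-of-order)
print(round(adder_pct(1.29, 1.12)))    # ~15 %, the low end of the range cited earlier

# Cache size: out-of-order, 12-wide, BHT (St vs. Inf caches)
print(round(adder_pct(0.88, 0.41)))    # ~115 %
```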