Presentation

Kilo-instructions Processors
Speaker: Mateo Valero, UPC
Team: Adrián Cristal, Pepe Martínez, Josep Llosa, Daniel Ortega and Mateo Valero
IBM Seminar on Compilers and Architecture
Haifa, November 11th, 2003
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2000. CPU ("Moore's Law", µProc): 60%/yr. DRAM: 7%/yr. Processor-memory performance gap grows 50% per year.]
Source: D.A. Patterson, "New Directions in Computer Architecture", Berkeley, June 1998
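The 50% per year figure follows from the two growth rates on the slide. A minimal sketch of that arithmetic, assuming both curves are normalised to 1 in 1980 and compound annually (the names `CPU_RATE`, `DRAM_RATE` and `gap_after` are illustrative and do not come from the talk):

```python
# Annual improvement rates quoted on the slide.
CPU_RATE = 0.60   # processor performance: 60% per year
DRAM_RATE = 0.07  # DRAM performance: 7% per year


def gap_after(years: int) -> float:
    """Processor/DRAM performance ratio after `years`,
    assuming both start at 1 and compound annually."""
    return ((1 + CPU_RATE) / (1 + DRAM_RATE)) ** years


# Year-over-year growth of the ratio: 1.60 / 1.07 - 1, roughly 0.50.
annual_gap_growth = (1 + CPU_RATE) / (1 + DRAM_RATE) - 1

print(f"gap grows {annual_gap_growth:.1%} per year")
print(f"gap after 20 years (1980 to 2000): {gap_after(20):.0f}x")
```

Under these assumptions the ratio compounds to a factor in the thousands over the two decades shown on the time axis.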
The situation has changed
Memory Latency and IPC
[Figure: IPC versus memory latency and number of in-flight instructions. Surviving labels: "int way", "fp way". Callouts: 38%, 60%.]
Reducing Memory Latency
Outline
Kilo-instructions processors
[Figure: IPC versus ROB size. Series are labelled "latency - branch pred." with perfect and perceptron branch predictors, plus a perfect-memory case (MemPerf). Callouts: 17%, 44%, 26%, 22%.]
Kilo-instructions processors: specfp2000
[Figure: IPC versus ROB size for SPECfp2000. Series are labelled "latency - branch pred." with perfect and perceptron branch predictors, plus a perfect-memory case (MemPerf). Callouts: 85%, 38%, 250%, 230%.]
Scalability
[Diagram labels: Integer Reg. File, FP Reg. File, IQ, FPQ, LSQ, Reorder Buffer.]
Utilization of the Resources
Instructions in-flight (FP, ROB=2048)
[Figure: Number of In-flight Instructions (SpecFP), y-axis 0 to 2000. Percentile markers: 10%, 25%, 50%, 75%, 90%. Marked values: 1168, 1382, 1607, 1868, 1955, 2034. Annotation: "Often nearly full".]
Instructions in-flight (Int, ROB=2048)
[Figure: Number of In-flight Instructions (SpecInt), y-axis 0 to 2000. Percentile markers: 10%, 25%, 50%, 75%, 90%. Marked values: 20, 108, 435, 1004, 1361, 1756. Annotation: "Branches".]
State of Registers (FP, ROB=2048)
[Figure: FP Registers (y-axis, 0 to 1400) by state: Dead, Blocked-Long, Blocked-Short, Live. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 1168, 1382, 1607, 1868, 1955.]
State of Registers (Int, ROB=2048)
[Figure: Int. Registers (y-axis, 0 to 1000) by state: Dead, Blocked-Long, Blocked-Short, Live. X-axis: Number of In-flight Instructions (SpecInt), marked values 20, 108, 435, 1004, 1361, 1756. Percentile markers: 10%, 25%, 50%, 75%, 90%.]
State of FP Queues (specFP, ROB=2048)
[Figure: FP Queue occupancy (y-axis, 0 to 600) by state: Blocked-Long, Blocked-Short, Ready. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 1168, 1382, 1607, 1868, 1955.]
State of Int Queues (specInt, ROB=2048)
[Figure: Int. Queue occupancy (y-axis, 0 to 450) by state: Blocked-Long, Blocked-Short, Ready. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 20, 108, 435, 1004, 1361.]
State of LD Queues (specFP, ROB=2048)
[Figure: LD Queue occupancy, FP (y-axis, 0 to 600) by state: Dead, Blocked-Long, Blocked-Short, Replayable, Live. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 1168, 1382, 1607, 1868, 1955. Annotation: "Checkpointing".]
State of LD Queues (specInt, ROB=2048)
[Figure: LD Queue occupancy, Int (y-axis, 0 to 450) by state: Dead, Blocked-Long, Blocked-Short, Replayable, Live. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 20, 108, 435, 1004, 1361. Annotation: "Checkpointing".]
State of ST Queues (specFP, ROB=2048)
[Figure: ST Queue occupancy, FP (y-axis, 0 to 300) by state: Ready, Address Ready, Blocked-Long, Blocked-Short. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 1168, 1382, 1607, 1868, 1955.]
State of ST Queues (specInt, ROB=2048)
[Figure: ST Queue occupancy, Int (y-axis, 0 to 250) by state: Ready, Address Ready, Blocked-Long, Blocked-Short. X-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100). Number of Instructions marked: 20, 108, 435, 1004, 1361.]
Out-of-order Commit Processors
Checkpointing Instructions
[Diagram in three steps: in-flight instructions along the Instruction Flow, from Oldest to Newest.]
Checkpoint Information
Ephemeral registers
Virtual-Physical Registers
[Diagram: pipeline from Icache through Decode&Rename to Commit, with intervals marked "Register Unused" and "Register Used".]
Early Release + Virtual Registers I
Early Release + Virtual Registers II
[Diagram labels: CAM Map Tables (Map Table … Map Table N), Virtual Registers, # Logical register, # Physical register, Pending Counter.]
Instruction Queues
[Diagram in three steps: in-flight instructions from Oldest to Newest along the Instruction Flow, with data dependences between a load (LD) and instructions a and b. Structures shown: Load/Store Queue, Pseudo ROB, Instruction Queue, Slow Lane Instruction Queue.]
Performance Evaluation
Slow-Lane Instruction Queue
[Figure: IPC versus Slow Lane Instruction Queue size, COoO compared with Baseline.]
SLIQ to IQ cycles delay and Performance Degradation
Length of the Virtual ROB
[Figure: In-flight Instructions versus Slow Lane Instruction Queue size, COoO compared with Baseline.]
Status of the removed instructions
[Figure: percentage breakdown of removed instructions. Legend: Stores, Long Lat. Loads, Finished Loads, Short Lat., Finished, Moved.]
Number of Checkpointers and Performance
[Figure: IPC for different numbers of checkpoints, compared with a Limit configuration.]
Memory Latency, SLIQ size and Performance
[Figure: IPC versus Slow Lane Instruction Queue size for several memory latencies, with Baseline and Limit configurations.]
Putting All Together
[Figure: IPC versus number of Virtual Registers (Virtual Tags) and Physical Registers for several memory latencies, with Baseline and Limit configurations. IQs of 128 entries.]
Memory Latency
Large Reorder Buffers
Checkpointing
Instruction Queues
Register File
Load-Store Queues
Conclusion
Acknowledgments
Ephemeral Registers
[Figures (two panels): Relative IPC versus number of Physical Registers for the Ephemeral, Ephemeral *, Cherry, Late Allocation and Normal schemes.]