Kilo-instructions Processors
Speaker: Mateo Valero, UPC
Team: Adrián Cristal, Pepe Martínez, Josep Llosa, Daniel Ortega and Mateo Valero
IBM Seminar on Compilers and Architecture, Haifa, November 11th, 2003

2 Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU ("Moore's Law"): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: grows 50%/year.]
Source: D.A. Patterson, "New Directions in Computer Architecture", Berkeley, June 1998

3 The situation has changed
[Slide text not recoverable.]

4 Memory Latency and IPC
[Figure residue. Labels: in-flight inst., latency, IPC, int way + fp way. Annotations: 38%, 60%.]

5 Reducing Memory Latency
[Slide text not recoverable.]

6 Outline
[Outline items not recoverable.]

7 Kilo-instructions processors
[Figure. Axis labels: IPC, ROB. Legend: latency - branch pred. (perfect, perceptron); MemPerf - perfect; MemPerf - perceptron. Annotations: 17%, 44%, 26%, 22%.]

8 Kilo-instructions processors: specfp2000
[Figure. Axis labels: IPC, ROB. Legend: latency - branch pred. (perfect, perceptron); MemPerf - perfect; MemPerf - perceptron. Annotations: 85%, 38%, 250%, 230%.]

9 Scalability
Diagram labels: Integer Reg. File, FP Reg. File, IQ & LSQ, FPQ, Reorder Buffer. [Remaining text not recoverable.]
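The growth rates quoted on the Processor-DRAM Gap slide (processor 60%/yr, DRAM 7%/yr, gap growing about 50%/yr) can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name `relative_gap` and the normalisation to 1.0 at year 0 are assumptions made here, not something taken from the slide.

```python
# Growth rates quoted on the Processor-DRAM Gap slide.
CPU_GROWTH = 0.60   # processor performance: +60% per year
DRAM_GROWTH = 0.07  # DRAM performance: +7% per year


def relative_gap(years: int) -> float:
    """CPU/DRAM performance ratio after `years`, normalised to 1.0 at year 0."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


if __name__ == "__main__":
    # Yearly growth of the gap: 1.60 / 1.07 - 1, which is just under 50%.
    print(f"gap growth per year: {relative_gap(1) - 1:.1%}")
    for span in (5, 10, 20):
        print(f"after {span:2d} years the gap has grown by a factor of {relative_gap(span):,.0f}")
```

Because the two rates compound, the ratio is exponential in time, which is why the figure uses a logarithmic performance axis.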
10 Utilization of the Resources
[Outline slide; items not recoverable.]

11 Instructions in-flight (FP, ROB=2048)
[Figure: Number of In-flight Instructions (SpecFP) at the 10%, 25%, 50%, 75% and 90% points of the distribution. Values marked on the plot: 1168, 1382, 1607, 1868, 1955, 2034. Annotation: "Often nearly full".]

12 Instructions in-flight (Int, ROB=2048)
[Figure: Number of In-flight Instructions (SpecInt) at the 10%, 25%, 50%, 75% and 90% points of the distribution. Values marked on the plot: 20, 108, 435, 1004, 1361, 1756. Annotation: "Branches".]

13 State of Registers (FP, ROB=2048)
[Figure: FP Registers. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions (1, 10, 25, 50, 75, 90, 100), with in-flight counts 1168, 1382, 1607, 1868, 1955. Legend: Dead, Blocked-Long, Blocked-Short, Live.]

14 State of Registers (Int, ROB=2048)
[Figure: Int. Registers. x-axis: Number of In-flight Instructions (SpecInt), with counts 20, 108, 435, 1004, 1361, 1756 at the 10% to 90% points. Legend: Dead, Blocked-Long, Blocked-Short, Live.]

15 State of FP Queues (specFP, ROB=2048)
[Figure: FP Queue. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions, with in-flight counts 1168, 1382, 1607, 1868, 1955. Legend: Blocked-Long, Blocked-Short, Ready.]

16 State of Int Queues (specInt, ROB=2048)
[Figure: Int. Queue. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions, with in-flight counts 20, 108, 435, 1004, 1361. Legend: Blocked-Long, Blocked-Short, Ready.]
17 State of LD Queues (specFP, ROB=2048)
[Figure: LD Queue. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions, with in-flight counts 1168, 1382, 1607, 1868, 1955. Legend: Dead, Blocked-Long, Blocked-Short, Replayable, Live. Annotation: "Checkpointing".]

18 State of LD Queues (specInt, ROB=2048)
[Figure: LD Queue. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions, with in-flight counts 20, 108, 435, 1004, 1361. Legend: Dead, Blocked-Long, Blocked-Short, Replayable, Live. Annotation: "Checkpointing".]

19 State of ST Queues (specFP, ROB=2048)
[Figure: ST Queue. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions, with in-flight counts 1168, 1382, 1607, 1868, 1955. Legend: Ready, Address Ready, Blocked-Long, Blocked-Short.]

20 State of ST Queues (specInt, ROB=2048)
[Figure: ST Queue. y-axis: Number of Instructions. x-axis: Distribution of in-flight Instructions, with in-flight counts 20, 108, 435, 1004, 1361. Legend: Ready, Address Ready, Blocked-Long, Blocked-Short.]

21 Out-of-order Commit Processors
[Outline slide; items not recoverable.]

22 Checkpointing Instructions
[Diagram: Instruction Flow from Oldest to Newest. Remaining labels not recoverable.]

23 Checkpointing Instructions
[Diagram: Instruction Flow from Oldest to Newest. Remaining labels not recoverable.]

24 Checkpointing Instructions
[Diagram: Instruction Flow from Oldest to Newest. Remaining labels not recoverable.]

25 Checkpoint Information
[Slide text not recoverable.]

26 Ephemeral registers
[Outline slide; items not recoverable.]

27 Virtual-Physical Registers
[Diagram: pipeline from Icache and Decode&Rename to Commit, with intervals marked "Register Unused" and "Register Used". Remaining labels not recoverable.]

28 Early Release + Virtual Registers I
[Slide text not recoverable.]
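The text of the checkpointing slides (22 to 25) did not survive extraction, so the following is only a minimal sketch of the general idea of checkpoint-based out-of-order commit, not the authors' design. Everything in it is an assumption made for illustration: the class names `Checkpoint` and `CheckpointedCore`, the fixed `interval` policy for taking checkpoints, and the per-checkpoint `pending` counter. Instructions finish in any order; a whole window is committed once it is the oldest one and nothing in it is pending, and a misprediction or exception restarts from the checkpoint covering the faulting instruction.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical record covering every instruction up to the next checkpoint."""
    first_seq: int        # sequence number of the first instruction it covers
    pending: int = 0      # covered instructions that have not finished yet
    closed: bool = False  # True once a younger checkpoint has been taken


class CheckpointedCore:
    def __init__(self, interval: int):
        self.interval = interval   # assumed policy: one checkpoint every `interval` instructions
        self.checkpoints = deque() # oldest on the left, newest on the right
        self.owner = {}            # unfinished instruction -> its checkpoint
        self.next_seq = 0
        self.committed = 0         # instructions with seq < committed are committed

    def dispatch(self) -> int:
        seq = self.next_seq
        self.next_seq += 1
        if not self.checkpoints or seq - self.checkpoints[-1].first_seq >= self.interval:
            if self.checkpoints:
                self.checkpoints[-1].closed = True
            self.checkpoints.append(Checkpoint(first_seq=seq))
        cp = self.checkpoints[-1]
        cp.pending += 1
        self.owner[seq] = cp
        return seq

    def finish(self, seq: int) -> None:
        """Mark one instruction as finished; completion order is arbitrary."""
        self.owner.pop(seq).pending -= 1
        # Commit whole windows, oldest first, with no per-instruction ordering.
        while (self.checkpoints and self.checkpoints[0].closed
               and self.checkpoints[0].pending == 0):
            self.checkpoints.popleft()
            self.committed = self.checkpoints[0].first_seq

    def rollback(self, seq: int) -> int:
        """Squash from the checkpoint covering unfinished instruction `seq`."""
        cp = self.owner[seq]
        while self.checkpoints[-1] is not cp:
            self.checkpoints.pop()          # discard younger checkpoints
        for s in [s for s in self.owner if s >= cp.first_seq]:
            del self.owner[s]               # discard squashed instructions
        cp.pending, cp.closed = 0, False
        self.next_seq = cp.first_seq        # fetch restarts here
        return cp.first_seq


if __name__ == "__main__":
    core = CheckpointedCore(interval=4)
    for _ in range(10):
        core.dispatch()
    for seq in (5, 6, 7, 4, 3, 1, 0):
        core.finish(seq)
    print(core.committed)   # window 0..3 still has one pending instruction
    core.finish(2)
    print(core.committed)   # both complete windows commit together
```

The cost of this simplification is visible in `rollback`: work that had already finished inside the restarted window is executed again, which is the price paid for not tracking every instruction individually.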
29 Early Release + Virtual Registers II
[Diagram labels: CAM Map Tables; Virtual Registers; # Logical register; # Physical register; Map Table … Map Table N; Pending Counter. Remaining text not recoverable.]

30 Instruction Queues
[Outline slide; items not recoverable.]

31 Instruction Queues
[Diagram: Instruction Flow from Oldest to Newest, with Data Dependence arrows. Structures: Load/Store Queue, Pseudo ROB, Instruction Queue, Slow Lane Instruction Queue.]

32 Instruction Queues
[Diagram: next step of the same animation, with the same structures.]

33 Instruction Queues
[Diagram: next step of the same animation, with the same structures.]

34 Performance Evaluation
[Configuration table not recoverable.]

35 Slow-Lane Instruction Queue
[Figure: IPC versus Slow Lane Instruction Queue size. Series: Baseline, COoO.]

36 SLIQ to IQ cycles delay and Performance Degradation
[Figure; labels not recoverable.]

37 Length of the Virtual ROB
[Figure: In-flight Instructions versus Slow Lane Instruction Queue size. Series: Baseline, COoO.]

38 Status of the removed instructions
[Figure: percentage breakdown. Legend as extracted: Stores, Long Lat. Loads, Finished Loads, Short Lat. Finished, Moved.]

39 Number of Checkpointers and Performance
[Figure: IPC, with a "Limit" reference series.]

40 Memory Latency, SLIQ size and Performance
[Figure: IPC versus Slow Lane Instruction Queue size for several memory latencies. Series: Limit, Baseline.]

41 Putting All Together
[Figure: IPC versus Physical Registers and Virtual Registers for several memory latencies. Series: Limit, Baseline. Notes: IQs of 128 entries; Virtual Tags.]

42 Memory Latency I
[Slide text not recoverable.]
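The Instruction Queues diagrams (slides 31 to 33) name a Pseudo ROB, an Instruction Queue and a Slow Lane Instruction Queue (SLIQ), but the animation itself is lost. The sketch below is one plausible reading, offered as an assumption rather than as the mechanism presented in the talk: when an instruction leaves the pseudo-ROB, it is moved from the fast queue to the SLIQ if it depends, directly or transitively, on an outstanding long-latency load, and it returns to the fast queue when that load completes. The names `Instr` and `SlowLaneSketch`, the `long_latency` flag and the register-based dependence tracking are all hypothetical, and loads are assumed never to depend on a blocked register.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple = ()
    long_latency: bool = False  # assumed known, e.g. a load that missed in the cache


class SlowLaneSketch:
    def __init__(self, pseudo_rob_size: int):
        self.pseudo_rob_size = pseudo_rob_size
        self.pseudo_rob = deque()  # FIFO of the most recently dispatched instructions
        self.iq = []               # small, fast instruction queue
        self.sliq = []             # (instruction, outstanding loads it waits for)
        self.pending = {}          # register -> outstanding loads its value depends on

    def dispatch(self, instr: Instr) -> None:
        self.iq.append(instr)
        self.pseudo_rob.append(instr)
        if len(self.pseudo_rob) > self.pseudo_rob_size:
            self._leave_pseudo_rob(self.pseudo_rob.popleft())

    def _leave_pseudo_rob(self, instr: Instr) -> None:
        if instr.long_latency:
            # The load is outstanding in the memory system; it no longer needs an IQ entry.
            self.iq.remove(instr)
            self.pending[instr.dest] = {instr.name}
            return
        blockers = set()
        for src in instr.srcs:
            blockers |= self.pending.get(src, set())
        if blockers:
            # Depends on a long-latency load: free the fast queue entry.
            self.iq.remove(instr)
            self.sliq.append((instr, blockers))
            self.pending[instr.dest] = set(blockers)
        else:
            # Independent instruction: it stays in the IQ and redefines its register.
            self.pending.pop(instr.dest, None)

    def load_returned(self, load_name: str) -> None:
        """Wake up SLIQ entries that were waiting only for this load."""
        for loads in self.pending.values():
            loads.discard(load_name)
        self.pending = {reg: loads for reg, loads in self.pending.items() if loads}
        still_waiting = []
        for instr, blockers in self.sliq:
            blockers.discard(load_name)
            if blockers:
                still_waiting.append((instr, blockers))
            else:
                self.iq.append(instr)  # back to the fast queue, in SLIQ order
        self.sliq = still_waiting
```

The design point illustrated here is that the fast queue only holds instructions that can issue soon, so it can stay small while the much larger SLIQ absorbs the instructions blocked behind a cache miss.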
43 Large Reorder Buffers
[Slide text not recoverable.]

44 Checkpointing
[Slide text not recoverable.]

45 Instruction Queues
[Slide text not recoverable.]

46 Register File
[Slide text not recoverable.]

47 Load-Store Queues
[Slide text not recoverable.]

48 Conclusion
[Slide text not recoverable.]

50 Acknowledgments
[Names not recoverable.]

51 Ephemeral Registers

52
[Figure, two panels. x-axis: Physical Registers. Legend as extracted: Ephemeral, Ephemeral *, Cherry, Late Allocation, Normal.]

53
[Figure, two panels, each labelled "Relative IPC". Remaining text not recoverable.]