AMD's Next-Generation Microprocessor Architecture
Fred Weber, October 2001

"Hammer" Goals
• Build a next-generation system architecture which serves as the foundation for future processor platforms
• Enable a full line of server and workstation products
  – Leading-edge x86 (32-bit) performance and compatibility
  – Native 64-bit support
  – Establish the x86-64 Instruction Set Architecture
  – Extensive multiprocessor support
  – RAS features
• Provide top-to-bottom desktop and mobile processors

Agenda
• x86-64™ Technology
• "Hammer" Architecture
• "Hammer" System Architecture

x86-64™ Technology

Why 64-Bit Computing?
• Required for large-memory programs
  – Large databases
  – Scientific and engineering problems
  – Designing CPUs ☺
• But:
  – Limited demand for applications that require 64 bits
  – Most applications can remain 32-bit x86 code, provided the processor continues to deliver leading-edge x86 performance
• And:
  – Software is a huge investment (tool chains, applications, certifications)
  – An instruction set is first and foremost a vehicle for compatibility
    • Binary compatibility
    • Interpreter/JIT support is increasingly important

x86-64 Instruction Set Architecture
• x86-64 mode is built on x86
  – Similar to the previous extension from 16-bit to 32-bit
  – Vast majority of opcodes and features unchanged
  – Integer/address register files and datapaths are native 64-bit
  – 48-bit virtual address space, 40-bit physical address space
• Enhancements
  – Eight new integer registers
  – PC-relative addressing
  – Full support for an SSE/SSE2-based floating-point Application Binary Interface (ABI), including 16 registers
  – Additional registers and data sizes are encoded by reclaiming the one-byte increment/decrement opcodes (0x40–0x4F) for use as a single optional prefix (see the sketch below)
• Public specification
  – www.x86-64.org
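The slide above says only that the one-byte inc/dec opcodes 0x40–0x4F are reclaimed as a single optional prefix. In the published x86-64 specification this byte is the REX prefix, whose low four bits (W, R, X, B) select 64-bit operand size and extend the instruction's register fields to reach the eight new registers. Below is a minimal decoding sketch; the byte values in the worked example are standard x86-64 encodings rather than anything taken from these slides.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative decode of the reclaimed 0x40-0x4F prefix byte (the REX
 * prefix in the published x86-64 specification).  The four low bits are:
 *   W - selects 64-bit operand size
 *   R - extends the ModRM.reg field to reach registers 8-15
 *   X - extends the SIB.index field
 *   B - extends the ModRM.rm / SIB.base field
 */
typedef struct { int w, r, x, b; } rex_t;

static int   is_rex(uint8_t byte)     { return (byte & 0xF0) == 0x40; }
static rex_t decode_rex(uint8_t byte)
{
    rex_t p = { (byte >> 3) & 1, (byte >> 2) & 1, (byte >> 1) & 1, byte & 1 };
    return p;
}

int main(void)
{
    /* Worked example: the three bytes 4C 01 C0 encode "add rax, r8".
     * 0x4C is a REX byte with W=1 (64-bit operation) and R=1 (the
     * ModRM reg field names r8 instead of rax). */
    uint8_t insn[] = { 0x4C, 0x01, 0xC0 };
    if (is_rex(insn[0])) {
        rex_t p = decode_rex(insn[0]);
        printf("REX prefix: W=%d R=%d X=%d B=%d\n", p.w, p.r, p.x, p.b);
    }
    return 0;
}
```

Because the prefix is optional, code that stays within the original eight registers and 32-bit operand sizes pays no size penalty, which is consistent with the small code-size growth quoted on the next slide.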
x86-64 Programmer's Model
[Diagram: register file, "in x86" vs. "added by x86-64" — the eight 32-bit general-purpose registers (EAX–EDI) are widened to 64 bits and joined by eight new registers R8–R15; the eight 128-bit SSE/SSE2 registers XMM0–XMM7 are joined by XMM8–XMM15; the 80-bit x87 registers are unchanged; the 32-bit EIP program counter is widened to 64 bits.]

x86-64 Code Generation and Quality
• Compiler and tool chain is a straightforward port
• Instruction set is designed to offer the advantages of both CISC and RISC
  – Code density of CISC
  – Register usage and ABI models of RISC
  – Enables easy application of standard compiler optimizations
• SPECint2000 code generation (compared with 32-bit x86)
  – Code size grows <10%, due mostly to instruction prefixes
  – Static instruction count SHRINKS by 10%
  – Dynamic instruction count SHRINKS by at least 5%
  – Dynamic load/store count SHRINKS by 20%
  – All without any x86-64-specific code optimizations

x86-64™ Summary
• Processor is fully x86 capable
  – Full native performance with 32-bit applications and OS
  – Full compatibility (BIOS, OS, drivers)
• Flexible deployment
  – Best-in-class 32-bit x86 performance
  – Excellent 64-bit x86-64 instruction execution when needed
• Server, workstation, desktop, and mobile share the same architecture
  – OS, drivers, and applications can be the same
  – CPU vendor focus is not split, ISV focus is not split
  – Support, optimization, etc. are all designed to be the same

The "Hammer" Architecture

The "Hammer" Architecture
[Diagram: "Hammer" processor die — processor core with L1 instruction cache, L1 data cache, and L2 cache, plus an integrated DDR memory controller and HyperTransport™ links.]

Processor Core Overview
[Diagram, shown across three slides: instruction TLB and Level 1 instruction cache feeding a two-stage fetch and a pick stage; branch prediction with 2K branch targets and 16K history counters; return-address stack (RAS) and target-address logic; three decode pipes (Decode 1, Decode 2, Pack, Decode); three 8-entry integer schedulers, each feeding an ALU and an AGU; a 36-entry floating-point scheduler feeding FADD, FMUL, and FMISC units; data TLB and ECC-protected Level 1 data cache; Level 2 cache with ECC on data and tags; System Request Queue (SRQ), crossbar (XBAR), and the integrated memory controller and HyperTransport™ links.]

"Hammer" Pipeline
[Diagram: overall stage map — Fetch/decode and execute occupy roughly the first 12 stages, an L2 hit returns data around stage 19–20, and a DRAM access extends out to roughly stage 32.]

Fetch/Decode Pipeline (stages 1–7 of the stage map above)
• Fetch 1, Fetch 2, Pick, Decode 1, Decode 2, Pack, Pack/Decode

Execute Pipeline (stages 8–12)
• Dispatch, Schedule, AGU/ALU, Data Cache 1, Data Cache 2 (the data-cache access is annotated at ~1 ns)

L2 Pipeline (stages 13 to about 20)
• L2 Request, Address to L2 Tag, L2 Tag, L2 Tag/L2 Data, L2 Data, Data from L2, Data to DC MUX, Write L1 and Forward (annotated at ~1 ns to the tags and ~5 ns for the data return)

DRAM Pipeline (beyond the L2 pipeline, out to roughly stage 32)
• Address to NB Clock Boundary, SRQ Load, SRQ Schedule, GART/Address Map CAM, GART/Address Map RAM, XBAR, Coherence/Order Check, MCT Schedule, DRAM Command Queue Load, DRAM Page Status Check, DRAM Command Queue Schedule, Request to DRAM Pins, ..., DRAM Access, Pins to MCT, Through NB Clock Boundary, Across CPU, ECC and MUX, Write DC (the diagram carries ~1 ns, ~5 ns, and ~12 ns timing annotations; a worked latency example follows below)
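The stage counts and nanosecond annotations on the pipeline slides can be turned into rough latency figures. The sketch below does that arithmetic; the 2 GHz clock is an assumed figure chosen only for illustration (the slides give stage numbers and ns annotations, not a frequency), and the stage boundaries are my reading of the stage map above.

```c
#include <stdio.h>

/* Converts stage-count deltas read off the pipeline diagrams into time at
 * an assumed clock frequency, for comparison with the ~1 ns / ~5 ns /
 * ~12 ns annotations on the slides.  The frequency is an assumption. */
int main(void)
{
    const double clock_ghz = 2.0;          /* assumed, not stated on the slides */
    const double cycle_ns  = 1.0 / clock_ghz;

    int exec_done = 12;   /* execute / L1 access complete (from the stage map) */
    int l2_done   = 20;   /* L2 hit data returned                              */
    int dram_path = 32;   /* end of the on-chip portion of a DRAM access       */

    printf("L2 hit adds    ~%2d stages = %4.1f ns\n",
           l2_done - exec_done, (l2_done - exec_done) * cycle_ns);
    printf("DRAM path adds ~%2d stages = %4.1f ns (plus the DRAM access itself)\n",
           dram_path - exec_done, (dram_path - exec_done) * cycle_ns);
    return 0;
}
```

Read this way, the extra on-chip stages land in the same neighborhood as the ~5 ns and ~12 ns annotations, which is the point of the slide: with the memory controller on the die, most of the miss path runs at CPU speed.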
Large Workload Branch Prediction
[Diagram: instruction fetch can be redirected four ways — sequential fetch; predicted fetch, driven by branch selectors, a global history counter array (16K 2-bit counters), a target array (2K targets), and a 12-entry return address stack (RAS); fetch redirected by the branch target address calculator (BTAC); and mispredicted fetch redirected from the execution stages. Branch selectors are preserved in the L2 cache alongside evicted lines.]

Large Workload TLBs
[Diagram: TLB hierarchy —
• L1 instruction TLB: 40 entries, fully associative, 4M/2M and 4K pages
• L1 data TLBs: one 40-entry, fully associative TLB per data-cache port (port 0 and port 1), 4M/2M and 4K pages
• L2 instruction TLB and L2 data TLB: 512 entries each, 4-way associative, filled on TLB reload
• 24-entry page descriptor cache (PDC) holding PDP and PDE entries for the table walker, reloaded from the L2 data cache
• 32-entry flush-filter CAM tracking CR3, PDP, and PDE probes/modifications, tagged with address space numbers (ASNs)
An illustrative lookup calculation for the 512-entry, 4-way L2 TLBs follows below.]
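To make the "512-entry, 4-way associative" organization concrete, the sketch below splits a virtual address into a set index and tag for such a TLB with 4 KB pages and the 48-bit virtual addresses described earlier. The exact bit positions and the example address are illustrative assumptions, not details taken from the slides.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative lookup split for a 512-entry, 4-way set-associative TLB
 * holding 4 KB page translations under 48-bit virtual addresses.
 * 512 entries / 4 ways = 128 sets -> 7 index bits taken from the VPN.
 * Bit positions are assumptions for illustration only. */
#define PAGE_SHIFT   12                          /* 4 KB pages  */
#define TLB_ENTRIES  512
#define TLB_WAYS     4
#define TLB_SETS     (TLB_ENTRIES / TLB_WAYS)    /* 128 sets    */

int main(void)
{
    uint64_t va  = 0x00007F12345678ABULL;        /* hypothetical 48-bit VA */
    uint64_t vpn = va >> PAGE_SHIFT;             /* virtual page number    */
    uint64_t set = vpn % TLB_SETS;               /* low 7 bits of the VPN  */
    uint64_t tag = vpn / TLB_SETS;               /* remaining high bits    */

    printf("VA  = 0x%012llx\n", (unsigned long long)va);
    printf("VPN = 0x%09llx, set index = %llu, tag = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```

The 40-entry L1 TLBs, by contrast, are fully associative: every entry's tag is compared against the whole virtual page number, so no set index is needed.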
DDR Memory Controller
• Integrated memory controller details
  – 8- or 16-byte DRAM interface
  – The 16-byte interface supports
    • Direct connection to 8 registered DIMMs
    • Chipkill ECC
  – Unbuffered or registered DIMMs
  – PC1600, PC2100, and PC2700 DDR memory
• Integrated memory controller benefits
  – Significantly reduces DRAM latency
  – Memory latency improves as CPU and HyperTransport™ link speeds improve
  – Bandwidth and capacity grow with the number of CPUs
  – Snoop probe throughput scales with CPU frequency

Reliability and Availability
• L1 data cache ECC protected
• L2 cache AND cache tags ECC protected
• DRAM ECC protected, with Chipkill ECC support
• On-chip and off-chip ECC-protected arrays include background hardware scrubbers
• Remaining arrays parity protected
  – L1 instruction cache, TLBs, tags
  – Generally read-only data which can be recovered
• Machine Check Architecture
  – Reports failures and predictive-failure results
  – Mechanism for hardware/software error containment and recovery

HyperTransport™ Technology
• Next-generation computing performance goes beyond the microprocessor
• Screaming I/O for chip-to-chip communication
  – High bandwidth
  – Reduced pin count
  – Point-to-point links
  – Split transaction and full duplex
• Open standard
  – Industry enabler for building high-bandwidth I/O subsystems
  – I/O subsystems: PCI-X, Gigabit Ethernet, InfiniBand, etc.
• Strong industry acceptance
  – 100+ companies evaluating the specification and several licensing the technology through AMD (2000)
  – First HyperTransport technology-based southbridge announced by nVIDIA (June 2001)
• Enables scalable 2–8 processor SMP systems
  – Glueless MP

CPU With Integrated Northbridge
[Diagram: each "Hammer" CPU contains a System Request Queue (SRQ), crossbar (XBAR), and memory controller (MCT) with its own DRAM, plus a HyperTransport™ host bridge (HT-HB) to I/O and additional HyperTransport links; multiple CPUs are joined by coherent HyperTransport links. Legend: HT = HyperTransport technology, HB = host bridge.]

Northbridge Overview
• Units: System Request Queue (SRQ), Advanced Programmable Interrupt Controller (APIC), crossbar (XBAR), memory controller (MCT), DRAM controller (DCT), and three HyperTransport™ links (Link 0, Link 1, Link 2)
• CPU 0 and CPU 1 each present data, probe, request, and interrupt interfaces into the SRQ/APIC
• Datapaths: 64-bit data and 64-bit command/address between units; 16-bit data/command/address per HyperTransport link; RAS/CAS/control and DRAM data to the DIMMs

Northbridge Command Flow
[Diagram: per-CPU command buffers — victim buffer (8-entry), write buffer (4-entry), instruction MAB (2-entry), and data MAB (8-entry), all holding 64-bit command/address entries — feed a 24-entry System Request Queue with an address map and GART; routers from the SRQ and from each HyperTransport™ link input feed the crossbar (XBAR) through command buffers of 10, 16, 16, 16, and 12 entries; a 20-entry memory command queue goes to the DCT; three HyperTransport link outputs leave the XBAR.]

Northbridge Data Flow
[Diagram: all data buffers hold 64-byte cache lines — per-CPU victim (8-entry) and write (4-entry) buffers; an 8-entry input buffer on each of the three HyperTransport™ links; 5- and 8-entry buffers from the host bridge and the DCT into the crossbar; a 12-entry System Request Data Queue back to the CPU; an 8-entry memory data queue to the DCT; and three HyperTransport link outputs.]

Coherent HyperTransport™ Read Request (Steps 1–9)
[Diagram sequence across nine slides, four-CPU system: CPU 0 issues a RdBlk read for a cache line homed in another node's memory. The RdBlk request is routed hop by hop over coherent HyperTransport™ links toward the home node; the home node's memory controller accepts it, reads its memory, and broadcasts probe requests (PRQ0–PRQ3) to every CPU. Each CPU returns a probe response (TRSP) to the requester, and the home node returns the read response (RdRsp) carrying the data. Once CPU 0 has collected all probe responses and the read response, it completes the read and sends a Source Done (SrcDn) back to the home node, which retires the transaction. A sketch of this completion rule appears below.]
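To make the message flow in the step diagrams concrete, here is a minimal sketch of the requester-side bookkeeping: the read completes only after the requesting node has seen a probe response from every node plus the read response carrying the data, and only then does it issue the Source Done. The message names follow the diagram labels; everything else (types, counts, function names) is illustrative, not AMD's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative requester-side state for one coherent HyperTransport read
 * (RdBlk), following the labels in the step diagrams.  This is only the
 * completion rule the diagrams depict, not AMD's implementation. */
#define NUM_NODES 4

typedef struct {
    int  num_nodes;              /* CPUs that will send probe responses */
    int  probe_responses_seen;   /* TRSPn messages received so far      */
    bool read_response_seen;     /* RdRsp with the cache line received  */
} rdblk_txn;

static bool txn_complete(const rdblk_txn *t)
{
    return t->read_response_seen && t->probe_responses_seen == t->num_nodes;
}

int main(void)
{
    rdblk_txn t = { NUM_NODES, 0, false };

    /* Steps 1-3: requester sends RdBlk toward the home node, which reads
     * its DRAM and broadcasts probe requests to every CPU.              */

    /* Steps 4-7: responses arrive at the requester in some order.       */
    t.read_response_seen = true;                  /* RdRsp with the line */
    for (int node = 0; node < NUM_NODES; node++)  /* TRSP0 .. TRSP3      */
        t.probe_responses_seen++;

    /* Steps 8-9: only when everything has arrived does the requester use
     * the data and send Source Done (SrcDn) back to the home node.      */
    if (txn_complete(&t))
        printf("all %d probe responses + read response seen: send SrcDn\n",
               t.probe_responses_seen);
    return 0;
}
```

The Source Done message is what lets the home node retire the transaction and safely order later requests to the same line behind it.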
"Hammer" Architecture Summary
• 8th-generation microprocessor core
  – Improved IPC and operating frequency
  – Support for large workloads
• Cache subsystem
  – Enhanced TLB structures
  – Improved branch prediction
• Integrated DDR memory controller
  – Reduced DRAM latency
• HyperTransport™ technology
  – Screaming I/O for chip-to-chip communication
  – Enables glueless MP

"Hammer" System Architecture

"Hammer" System Architecture: 1-way
[Diagram: a single "Hammer" processor with HyperTransport™ links to an 8x AGP bridge (or integrated graphics) and a southbridge.]

"Hammer" System Architecture — Glueless Multiprocessing: 2-way
[Diagram: two "Hammer" processors joined by a coherent HyperTransport™ link, with HyperTransport links to an 8x AGP bridge, a PCI-X bridge, and a southbridge.]

"Hammer" System Architecture — Glueless Multiprocessing: 4-way
[Diagram: four "Hammer" processors joined by coherent HyperTransport™ links, with HyperTransport links to PCI-X bridges, a southbridge, and an optional 8x AGP bridge.]

"Hammer" System Architecture — Glueless Multiprocessing: 8-way
[Diagram: eight "Hammer" processors connected gluelessly by coherent HyperTransport™ links.]

MP System Architecture
• Software view of memory is SMP
  – Physical address space is flat and fully coherent
  – Latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and a DRAM page conflict
  – DRAM placement can be contiguous or interleaved
• Multiprocessor support designed in from the beginning
  – Lower overall chip count
  – All MP system functions use CPU technology and run at CPU frequency
• 8P system parameters (see the worked bandwidth example below)
  – 64 DIMMs (up to 128GB) directly connected
  – 4 HyperTransport™ links available for I/O (25GB/s)

The Rewards of Good Plumbing
• Bandwidth
  – A 4P system is designed to achieve 8GB/s aggregate memory-copy bandwidth, with data spread throughout the system
  – Leading-edge bus-based systems are limited to about 2.1GB/s aggregate bandwidth (3.2GB/s theoretical peak)
• Latency
  – Average unloaded latency in a 4P system (page miss) is designed to be 140ns
  – Average unloaded latency in an 8P system (page miss) is designed to be 160ns
  – Latency under load is planned to increase much more slowly than in bus-based systems because of the available bandwidth
  – Latency shrinks quickly with increasing CPU clock speed and HyperTransport™ link speed
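The 25GB/s and 8GB/s figures above can be cross-checked with simple arithmetic, as in the sketch below. The HyperTransport™ link rate (16 bits wide at 1600 MT/s per direction) and the choice of PC2700 DIMMs for the memory calculation are assumptions of mine; the slides state the 16-bit link width and list PC2700 as a supported speed but do not spell out this calculation.

```c
#include <stdio.h>

/* Back-of-the-envelope cross-check of the system bandwidth figures.
 * Assumptions (not stated on these slides): each HyperTransport link runs
 * 16 bits wide at 1600 MT/s per direction, and each node drives its
 * 16-byte DDR interface with PC2700 (333 MT/s) DIMMs. */
int main(void)
{
    /* HyperTransport: 2 bytes x 1.6 GT/s per direction, full duplex. */
    double ht_per_dir_gbs  = 2.0 * 1.6;              /* 3.2 GB/s             */
    double ht_per_link_gbs = 2.0 * ht_per_dir_gbs;   /* 6.4 GB/s both ways   */
    double ht_io_gbs       = 4.0 * ht_per_link_gbs;  /* 4 I/O links          */

    /* DRAM: 16-byte interface x 333 MT/s per node.                     */
    double dram_node_gbs = 16.0 * 0.333;             /* ~5.3 GB/s per node   */
    double dram_4p_gbs   = 4.0 * dram_node_gbs;      /* ~21 GB/s peak in 4P  */

    printf("HT I/O bandwidth, 4 links: %.1f GB/s (slide: 25GB/s)\n", ht_io_gbs);
    printf("DRAM peak, 4P system     : %.1f GB/s (slide: 8GB/s copy target)\n",
           dram_4p_gbs);
    return 0;
}
```

A memory copy reads and writes every byte, so the 8GB/s copy target corresponds to roughly 16GB/s of traffic at the DRAM pins, comfortably under the ~21GB/s aggregate peak of four nodes under these assumptions.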
"Hammer" Summary
• 8th-generation CPU core
  – Delivering high performance through an optimum balance of IPC and operating frequency
• x86-64™ technology
  – Compelling 64-bit migration strategy without any significant sacrifice of the existing code base
  – Full-speed support for the x86 code base
  – Unified architecture from notebook through server
• DDR memory controller
  – Significantly reduces DRAM latency
• HyperTransport™ technology
  – High-bandwidth I/O
  – Glueless MP
• Foundation for a future portfolio of processors
  – Top-to-bottom desktop and mobile processors
  – High-performance 1-, 2-, 4-, and 8-way servers and workstations

©2001 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, 3DNow!, and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other product names are for informational purposes only and may be trademarks of their respective companies.