AMD’s Next-Generation Microprocessor Architecture
Fred Weber
October 2001
"Hammer" Goals
• Build a next-generation system architecture
which serves as the foundation for future
processor platforms
• Enable a full line of server and workstation
products
– Leading edge x86 (32-bit) performance and
compatibility
– Native 64-bit support
– Establish x86-64 Instruction Set Architecture
– Extensive Multiprocessor support
– RAS (reliability, availability, and serviceability) features
• Provide top-to-bottom desktop and mobile
processors
Agenda
• x86-64™ Technology
• "Hammer" Architecture
• "Hammer" System Architecture
x86-64™ Technology
Why 64-Bit Computing?
• Required for large-memory programs (sizes sketched below)
– Large databases
– Scientific and engineering problems
– Designing CPUs, for instance
• But,
– Demand for applications that require 64 bits is limited
• Most applications can remain 32-bit x86 code, provided the processor continues to deliver leading-edge x86 performance
• And,
– Software is a huge investment (tool chains, applications, certifications)
– An instruction set is first and foremost a vehicle for compatibility
• Binary compatibility
• Interpreter/JIT support is increasingly important
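To put the address-space jump in perspective, a back-of-envelope comparison; the 48-bit virtual and 40-bit physical widths are quoted from the next slide, the arithmetic is ours:

```c
#include <stdio.h>

int main(void) {
    printf("32-bit VA: %llu GB\n", (1ULL << 32) >> 30);  /* 4 GB   */
    printf("48-bit VA: %llu TB\n", (1ULL << 48) >> 40);  /* 256 TB */
    printf("40-bit PA: %llu TB\n", (1ULL << 40) >> 40);  /* 1 TB   */
    return 0;
}
```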
x86-64 Instruction Set Architecture
• x86-64 mode is built on x86
– Similar to the earlier extension from 16-bit to 32-bit
– The vast majority of opcodes and features are unchanged
– Integer/address register files and datapaths are native 64-bit
– 48-bit virtual address space, 40-bit physical address space
• Enhancements
– Adds 8 new integer registers
– Adds PC-relative addressing
– Adds full support for an SSE/SSE2-based floating-point Application Binary Interface (ABI)
• including 16 registers
– Additional registers and data sizes are encoded by reclaiming the one-byte increment/decrement opcodes (0x40-0x4F) as a single optional prefix (decoded in the sketch below)
• Public specification
– www.x86-64.org
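The reclaimed 0x40-0x4F byte is the prefix the public spec calls REX. A minimal sketch of how its four low bits extend the instruction that follows (field handling simplified):

```c
/* REX prefix: 0100WRXB. The four low bits widen operands and extend
 * register fields to reach the new registers 8-15. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned w; /* 1 = 64-bit operand size */
    unsigned r; /* extends the ModRM.reg field */
    unsigned x; /* extends the SIB index field */
    unsigned b; /* extends ModRM.rm / SIB base / opcode reg */
} Rex;

static int decode_rex(uint8_t byte, Rex *out) {
    if ((byte & 0xF0) != 0x40) return 0;   /* not a REX prefix */
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b = byte & 1;
    return 1;
}

int main(void) {
    /* 0x48 = REX.W: e.g. "48 01 d8" is a 64-bit add rax, rbx. */
    Rex rex;
    if (decode_rex(0x48, &rex))
        printf("REX.W=%u R=%u X=%u B=%u\n", rex.w, rex.r, rex.x, rex.b);
    return 0;
}
```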
x86-64 Programmer’s Model
[Figure: architectural state, “in x86” vs. “added by x86-64”]
• SSE & SSE2: 128-bit XMM0-XMM7 in x86; XMM8-XMM15 added by x86-64
• Integer: 32-bit EAX-EDI (with AH/AL-style sub-registers) widened to 64-bit RAX-RDI; R8-R15 added, for 16 GPRs
• x87: eight 80-bit floating-point registers, unchanged
• Program counter: 32-bit EIP widened to 64 bits
x86-64 Code Generation and Quality
• The compiler and tool chain are a straightforward port
• The instruction set is designed to offer the advantages of both CISC and RISC
– Code density of CISC
– Register usage and ABI models of RISC
– Enables easy application of standard compiler optimizations
• SPECint2000 code generation (compared to 32-bit x86)
– Code size grows <10%
• Due mostly to instruction prefixes
– Static instruction count SHRINKS by 10%
– Dynamic instruction count SHRINKS by at least 5%
– Dynamic load/store count SHRINKS by 20%
– All without any x86-64-specific code optimizations (see the sketch below)
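One reason the dynamic load/store count shrinks: with eight more GPRs, the x86-64 ABI passes the leading integer arguments in registers rather than on the stack. A hedged illustration; exact codegen is compiler-dependent:

```c
#include <stdio.h>

long dot3(long ax, long ay, long az, long bx, long by, long bz)
{
    /* 32-bit x86: six stack loads just to reach the arguments.
       x86-64:    rdi, rsi, rdx, rcx, r8, r9 already hold them. */
    return ax * bx + ay * by + az * bz;
}

int main(void)
{
    printf("%ld\n", dot3(1, 2, 3, 4, 5, 6)); /* 4 + 10 + 18 = 32 */
    return 0;
}
```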
x86-64™ Summary
• Processor is fully x86 capable
– Full native performance with 32-bit applications and
OS
– Full compatibility (BIOS, OS, Drivers)
• Flexible deployment
– Best-in-class 32-bit, x86 performance
– Excellent 64-bit, x86-64 instruction execution when
needed
• Server, Workstation, Desktop, and Mobile share
same architecture
– OS, Drivers and Applications can be the same
– Neither CPU vendor focus nor ISV focus is split
– Support, optimization, etc. are all designed to be the same
The "Hammer"
Architecture
The “Hammer” Architecture
[Figure: chip block diagram: the “Hammer” processor core with L1 instruction cache, L1 data cache, and L2 cache, an integrated DDR memory controller, and HyperTransport™ links]
Processor Core Overview
[Figure: core block diagram]
• Front end: Level 1 instruction cache with instruction TLB, 2k branch targets, and 16k history counters; Fetch (two transit stages) and Pick
• Decode: three parallel pipes of Decode 1, Decode 2, and Pack feeding decode, with RAS & target-address logic alongside
• Integer: three 8-entry schedulers feeding three AGUs and three ALUs; data TLB; ECC-protected Level 1 data cache
• Floating point: one 36-entry scheduler feeding FADD, FMUL, and FMISC units
• Beyond the core: Level 2 cache with L2 ECC, L2 tags, and L2 tag ECC; System Request Queue (SRQ); crossbar (XBAR); memory controller & HyperTransport™
"Hammer" Pipeline
1
Fetch
7
8
12
13
Exec
L2
19
20
DRAM
32
15
Fetch/Decode Pipeline
[Figure: stages 1-7, expanded from the overview]
1. Fetch 1
2. Fetch 2
3. Pick
4. Decode 1
5. Decode 2
6. Pack
7. Pack/Decode
Execute Pipeline
[Figure: stages 8-12, expanded from the overview]
8. Dispatch
9. Schedule
10. AGU/ALU
11. Data Cache 1
12. Data Cache 2
(the figure marks the two data-cache stages at about 1 ns)
L2 Pipeline
[Figure: stages 13-20, expanded from the overview]
13. L2 Request
14. Address to L2 Tag
15. L2 Tag
16. L2 Tag, L2 Data
17. L2 Data
18. Data from L2
19. Data to DC MUX
20. Write L1, Forward
(the figure marks about 1 ns to the tag lookup and about 5 ns for the full L2 access)
DRAM Pipeline
[Figure: the L2-miss path out to DRAM and back, continuing to about stage 32]
• L2 lookup first: L2 Request; Address to L2 Tag; L2 Tag; L2 Tag, L2 Data; L2 Data; Data from L2; Data to DC MUX; Write L1, Forward
• Then the miss path: Address to NB; Clock Boundary; SRQ Load; SRQ Schedule; GART/AddrMap CAM; GART/AddrMap RAM; XBAR; Coherence/Order Check; MCT Schedule; DRAM Cmd Q Load; DRAM Page Status Check; DRAM Cmd Q Schedule; Request to DRAM Pins; … DRAM Access; Pins to MCT; Through NB; Clock Boundary; Across CPU; ECC and MUX; Write DC
• Timing marks in the figure: about 1 ns, 5 ns, and 12 ns along the path
Large Workload Branch Prediction
[Figure: fetch path with its prediction structures]
• Sequential fetch: branch selectors, which are preserved in the L2 cache alongside evicted data
• Predicted fetch: branch selectors, a global history counter (16k 2-bit counters), a target array (2k targets), and a 12-entry return address stack (RAS)
• Mispredicted fetch: a branch target address calculator (BTAC) between the fetch and execution stages (a predictor sketch follows)
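As a rough model of the 16k-counter structure above: a global-history predictor with 2-bit saturating counters. The XOR indexing and the 16-bit history length are our assumptions, not the actual design:

```c
#include <stdint.h>
#include <stdio.h>

#define NCOUNTERS 16384              /* 16k 2-bit counters */
static uint8_t  counters[NCOUNTERS]; /* 0..3: strong NT .. strong T */
static uint16_t ghist;               /* global branch history */

static int predict(uint32_t pc) {
    uint32_t idx = (pc ^ ghist) & (NCOUNTERS - 1);
    return counters[idx] >= 2;       /* predict taken if 2 or 3 */
}

static void update(uint32_t pc, int taken) {
    uint32_t idx = (pc ^ ghist) & (NCOUNTERS - 1);
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    ghist = (uint16_t)((ghist << 1) | (taken & 1));
}

int main(void) {
    uint32_t pc = 0x400123;
    for (int i = 0; i < 20; i++) update(pc, 1);  /* always taken */
    printf("prediction after training: %d\n", predict(pc)); /* 1 */
    return 0;
}
```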
Large Workload TLBs
[Figure: TLB hierarchy with ASN tagging and table-walk support]
• L1 instruction TLB: 40 entries, fully associative, 4M/2M & 4k pages, ASN-tagged
• L2 instruction TLB: 512 entries, 4-way associative
• L1 data TLB: two ports (Port 0 and Port 1), each 40 entries, fully associative, 4M/2M & 4k pages, ASN-tagged (lookup sketched below)
• L2 data TLB: 512 entries, 4-way associative
• 24-entry page descriptor cache (PDC) holding PDP/PDE entries, reloaded from the L2 data cache on table walks
• 32-entry flush filter CAM watching CR3/PDP/PDE probes and modifications, so the current ASN (address space number) can avoid unnecessary TLB flushes
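A minimal sketch of an ASN-tagged, fully associative lookup in the spirit of the 40-entry L1 TLBs above; the entry layout and the miss path are simplifying assumptions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool     valid;
    uint8_t  asn;      /* address space number: avoids flushes on switch */
    uint64_t vpage;    /* virtual page number */
    uint64_t ppage;    /* physical page number */
} TlbEntry;

#define L1_TLB_ENTRIES 40
static TlbEntry l1_dtlb[L1_TLB_ENTRIES];

/* Returns true on hit; a miss would fall back to the 512-entry L2 TLB
 * and finally the hardware table walker. */
static bool tlb_lookup(uint8_t cur_asn, uint64_t vpage, uint64_t *ppage) {
    for (int i = 0; i < L1_TLB_ENTRIES; i++) {      /* CAM-style match */
        if (l1_dtlb[i].valid && l1_dtlb[i].asn == cur_asn &&
            l1_dtlb[i].vpage == vpage) {
            *ppage = l1_dtlb[i].ppage;
            return true;
        }
    }
    return false;
}

int main(void) {
    l1_dtlb[0] = (TlbEntry){true, 3, 0x7f123, 0x00456};
    uint64_t pp;
    printf("hit=%d\n", tlb_lookup(3, 0x7f123, &pp));
    return 0;
}
```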
DDR Memory Controller
• Integrated memory controller details
– 8- or 16-byte interface
• The 16-byte interface supports
– Direct connection to 8 registered DIMMs
– Chipkill ECC
• Unbuffered or registered DIMMs
• PC1600, PC2100, and PC2700 DDR memory (peak rates sketched below)
• Integrated memory controller benefits
– Significantly reduces DRAM latency
– Memory latency improves as CPU and HyperTransport™ link speed improves
– Bandwidth and capacity grow with the number of CPUs
– Snoop probe throughput scales with CPU frequency
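For scale, the peak bandwidth of the 16-byte interface at each supported DIMM speed; the DIMM grades are from the slide, the arithmetic is ours:

```c
#include <stdio.h>

int main(void) {
    /* PC1600/PC2100/PC2700 = DDR200/DDR266/DDR333 (MT/s per pin). */
    const double mts[]  = {200.0, 266.0, 333.0};
    const char  *name[] = {"PC1600", "PC2100", "PC2700"};
    for (int i = 0; i < 3; i++) {
        double gbs = mts[i] * 1e6 * 16 / 1e9;  /* 16-byte interface */
        printf("%s: %.1f GB/s peak\n", name[i], gbs);
    }
    return 0; /* 3.2, 4.3, and 5.3 GB/s respectively */
}
```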
Reliability and Availability
• L1 data cache: ECC protected
• L2 cache AND cache tags: ECC protected
• DRAM: ECC protected
– With Chipkill ECC support
• On-chip and off-chip ECC-protected arrays include background hardware scrubbers (pattern sketched below)
• Remaining arrays are parity protected
– L1 instruction cache, TLBs, tags
– Generally read-only data, which can be recovered
• Machine check architecture
– Reports failures and predictive-failure results
– Mechanism for hardware/software error containment and recovery
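The scrubbing pattern, shown with a toy majority-vote code standing in for the real SEC-DED ECC; purely illustrative, since the actual scrubbers are hardware state machines:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t copy[3]; } Word;   /* toy redundancy */
#define WORDS 1024
static Word mem[WORDS];

static uint32_t vote(const Word *w) {
    return (w->copy[0] & w->copy[1]) | (w->copy[1] & w->copy[2])
         | (w->copy[0] & w->copy[2]);        /* bitwise majority */
}

/* Walk the array in the background, correct any single-copy flip,
 * and write the fix back before errors can pair up. */
static void scrub(void) {
    for (int i = 0; i < WORDS; i++) {
        uint32_t good = vote(&mem[i]);
        for (int c = 0; c < 3; c++)
            if (mem[i].copy[c] != good) {
                mem[i].copy[c] = good;
                printf("scrubbed word %d copy %d\n", i, c);
            }
    }
}

int main(void) {
    mem[7].copy[1] ^= 1u << 5;   /* inject a single-bit flip */
    scrub();
    return 0;
}
```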
HyperTransport™ Technology
• Next-generation computing performance goes beyond the microprocessor
• Screaming I/O for chip-to-chip communication
– High bandwidth
– Reduced pin count
– Point-to-point links
– Split transaction and full duplex
• Open standard
– Industry enabler for building high-bandwidth I/O subsystems
– I/O subsystems: PCI-X, Gigabit Ethernet, InfiniBand, etc.
• Strong industry acceptance
– 100+ companies evaluating the specification & several licensing the technology through AMD (2000)
– First HyperTransport technology-based south bridge announced by nVIDIA (June 2001)
• Enables scalable 2-8 processor SMP systems
– Glueless MP
CPU With Integrated Northbridge
[Figure: 1-, 2-, and 4-node configurations. Each CPU integrates a System Request Queue (SRQ), crossbar (XBAR), and memory controller (MCT) with locally attached DRAM; HyperTransport™ links (HT), some acting as host bridges (HT-HB), connect to I/O, while coherent HyperTransport links connect CPU to CPU]
HT = HyperTransport™ technology; HB = Host Bridge
Northbridge Overview
[Figure: northbridge block diagram]
• CPU 0 and CPU 1 data, probes, requests, and interrupts feed the System Request Queue (SRQ) and the Advanced Programmable Interrupt Controller (APIC)
• The crossbar (XBAR) connects the SRQ, the memory controller (MCT), and three HyperTransport™ links (Link 0, 1, 2), each 16-bit data/command/address
• The MCT talks to the DRAM controller (DCT) over 64-bit data and 64-bit command/address paths; the DCT drives RAS/CAS/control and DRAM data pins
Northbridge Command Flow
[Figure: command/address flow]
• Per CPU (CPU 0 and CPU 1): victim buffer (8-entry), write buffer (4-entry), instruction MAB (2-entry), data MAB (8-entry); all buffers are 64-bit command/address
• 24-entry System Request Queue with address map & GART
• Routers feed the XBAR through five buffers (10-, 16-, 16-, 16-, and 12-entry) from the SRQ and the three HyperTransport™ link inputs
• From the XBAR: a 20-entry memory command queue to the DCT, the three HyperTransport link outputs, and responses back to the CPU (a toy routing sketch follows)
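A toy model of the routing decision in the figure: requests for the local node go to the MCT, everything else goes out an HT link chosen by destination. The routing table and node numbering are invented for illustration:

```c
#include <stdio.h>

enum Port { CPU0, CPU1, HT0, HT1, HT2, MCT, NPORTS };

/* Toy routing for a fixed point-to-point topology: local requests
 * terminate at the MCT; remote ones are forwarded toward the
 * destination node via a static table. */
static enum Port route(int dest_node, int my_node) {
    if (dest_node == my_node) return MCT;
    static const enum Port toward[4] = { HT0, HT0, HT1, HT2 };
    return toward[dest_node & 3];
}

int main(void) {
    const char *name[] = {"CPU0", "CPU1", "HT0", "HT1", "HT2", "MCT"};
    for (int d = 0; d < 4; d++)
        printf("request for node %d from node 0 -> %s\n",
               d, name[route(d, 0)]);
    return 0;
}
```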
Northbridge Data Flow
[Figure: data flow]
• All buffers hold 64-byte cache lines
• Per CPU (CPU 0 and CPU 1): victim buffer (8-entry) and write buffer (4-entry), plus a path from the host bridge
• An 8-entry buffer sits on each HyperTransport™ link input (Link 0, 1, 2) into the XBAR; per the figure, 5-entry and 8-entry buffers sit on the DCT-side paths to the link outputs
• From the XBAR: a 12-entry system request data queue to the CPU and host bridge, and an 8-entry memory data queue to the DCT
Coherent HyperTransport™ Read Request
[Figure sequence, steps 1-9: four CPUs (0-3), each with local memory and I/O; CPU 0 reads a cache line homed in another node's memory]
• Step 1: CPU 0 initiates “Read Cache Line”
• Step 2: the request leaves CPU 0 (1: RdBlk)
• Step 3: the request is forwarded toward the home node (2: RdBlk), which issues probe requests (Probe Request 0, 2, 3)
• Step 4: the home node starts its DRAM read (3: RdBlk) while the probes fan out (3: PRQ0, 3: PRQ2, 3: PRQ3; then 4: PRQ1); CPU 3 returns a probe response
• Step 5: memory returns the read response; probe responses propagate toward CPU 0 (4: TRSP3, 5: TRSP0)
• Steps 6-7: the read response is routed hop by hop to CPU 0 (5: RDRSP, 6: RDRSP, 7: RDRSP) along with the remaining probe responses (5: TRSP3, 6: TRSP2)
• Steps 8-9: CPU 0 signals completion with Source Done back to the home node (9: SrcDn), replayed in code below
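The same transaction as a schematic replay; the message names come from the slides, while the hop-by-hop routing is our simplification of the figure:

```c
#include <stdio.h>

struct Hop { int step; const char *msg; const char *route; };

static const struct Hop trace[] = {
    {1, "RdBlk", "CPU0 -> neighbor (toward the home node)"},
    {2, "RdBlk", "neighbor -> home node"},
    {3, "RdBlk", "home node -> its DRAM (read starts)"},
    {3, "PRQ*",  "home node -> every CPU (coherence probes)"},
    {4, "TRSP3", "CPU3 probe response, heading for CPU0"},
    {5, "RDRSP", "home node -> toward CPU0 (DRAM data)"},
    {5, "TRSP0", "CPU0's own probe response"},
    {6, "TRSP2", "CPU2 probe response -> CPU0"},
    {7, "RDRSP", "data arrives at CPU0"},
    {9, "SrcDn", "CPU0 -> home node (transaction complete)"},
};

int main(void) {
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("step %d: %-6s %s\n", trace[i].step,
               trace[i].msg, trace[i].route);
    return 0;
}
```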
"Hammer" Architecture Summary
• 8th Generation microprocessor core
– Improved IPC and operating frequency
– Support for large workloads
• Cache subsystem
– Enhanced TLB structures
– Improved branch prediction
• Integrated DDR memory controller
– Reduced DRAM latency
• HyperTransport™ technology
– Screaming I/O for chip-to-chip communication
– Enables glueless MP
"Hammer" System
Architecture
“Hammer” System Architecture
1-way
[Figure: a single "Hammer" processor with 8x AGP (or integrated graphics) and a southbridge, connected over HyperTransport™]
“Hammer” System Architecture
Glueless Multiprocessing: 2-way
[Figure: two "Hammer" processors joined by a coherent HyperTransport™ link, with 8x AGP, HyperTransport-attached PCI-X bridges, and a southbridge]
“Hammer” System Architecture
Glueless Multiprocessing: 4-way
[Figure: four "Hammer" processors in a HyperTransport™ mesh, with HyperTransport-attached PCI-X bridges and a southbridge; 8x AGP optional]
“Hammer” System Architecture
Glueless Multiprocessing: 8-way
[Figure: eight "Hammer" processors connected by HyperTransport™ links]
MP System Architecture
• Software view of memory is SMP
– Physical address space is flat and fully coherent
– The latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and a DRAM page conflict
– DRAM placement can be contiguous or interleaved
• Multiprocessor support designed in from the
beginning
– Lower overall chip count
– All MP system functions use CPU technology and
frequency
• 8P system parameters
– 64 DIMMs (up to 128 GB) directly connected
– 4 HyperTransport links available for I/O (25 GB/s; arithmetic sketched below)
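A sanity check on the 25 GB/s figure, assuming 16-bit links at 1.6 GT/s counted full duplex; the link speed is our assumption, only the four-link total is from the slide:

```c
#include <stdio.h>

int main(void) {
    double gt_per_s = 1.6e9;          /* transfers/s per pin (assumed) */
    double bytes_per_transfer = 2.0;  /* 16-bit link */
    double per_dir  = gt_per_s * bytes_per_transfer / 1e9; /* 3.2 GB/s */
    double per_link = 2.0 * per_dir;  /* full duplex: 6.4 GB/s */
    printf("4 links: %.1f GB/s aggregate\n", 4.0 * per_link); /* 25.6 */
    return 0;
}
```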
The Rewards of Good Plumbing
• Bandwidth
– 4P system designed to achieve 8GB/s aggregate
memory copy bandwidth
• With data spread throughout system
– Leading-edge bus-based systems are limited to about 2.1 GB/s aggregate bandwidth (3.2 GB/s theoretical peak)
• Latency
– Average unloaded latency in a 4P system (page miss) is designed to be 140 ns
– Average unloaded latency in an 8P system (page miss) is designed to be 160 ns
– Latency under load is planned to increase much more slowly than in bus-based systems, thanks to the available bandwidth
– Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed
"Hammer" Summary
• 8th generation CPU core
– Delivering high-performance through an optimum balance of
IPC and operating frequency
• x86-64™ technology
– Compelling 64-bit migration strategy without any significant sacrifice of the existing code base
– Full speed support for x86 code base
– Unified architecture from notebook through server
• DDR memory controller
– Significantly reduces DRAM latency
• HyperTransport™ technology
– High-bandwidth I/O
– Glueless MP
• Foundation for future portfolio of processors
– Top-to-bottom desktop and mobile processors
– High-performance 1-, 2-, 4-, and 8-way servers and
workstations
©2001 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, 3DNow!, and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other product names are for informational purposes only and may be trademarks of their respective companies.