Towards Optimal Custom Instruction Processors

Wayne Luk
Kubilay Atasu, Rob Dimond and Oskar Mencer
Department of Computing
Imperial College London
HOT CHIPS 18
Overview
1. background: extensible processors
2. design flow: C to custom processor silicon
3. instruction selection: bandwidth/area constraints
4. application-specific processor synthesis
5. results: 3x area delay product reduction
6. current and future work + summary
1. Instruction-set extensible processors
● base processor + custom logic
  – partition data-flow graphs into custom instructions
[Figure: datapath — register file supplies data in to the ALU and custom logic; data out returns to the register file]
Previous work
● many techniques, e.g.
  – Atasu et al. (DAC 03, CODES 05)
  – Biswas et al. (DATE 05), Cong et al. (FPGA 04)
  – Clark et al. (MICRO 03, HOT CHIPS 04)
  – Goodwin and Petkov (CASES 03)
  – Sun et al. (ICCAD 03), Yu et al. (CASES 04)
● current challenges
  – optimality and robustness of heuristics
  – complete tool chain: application to silicon
  – research infrastructure for custom processor design
2. Custom processor research at Imperial
● focus on effective optimization techniques
  – e.g. Integer Linear Programming (ILP)
● complete tool-chain
  – high-level descriptions to custom processor silicon
● open infrastructure for research in
  – custom processor synthesis
  – automatic customization techniques
● current tools
  – optimizing compiler (Trimaran) for custom CPUs
  – custom processor synthesis tool
Application to custom processor flow
[Flow diagram: application source (C) and an area constraint drive Template Generation and Template Selection; a processor description feeds Generate Custom Unit and Generate Base CPU, which ASIC tools synthesize to report area and timing]
Custom instruction model
[Figure: the register file feeds a custom unit's input ports through an input register; pipeline registers separate multi-cycle stages; results return through an output register to the output ports]
3. Optimal instruction identification
● minimize schedule length of program data flow graphs (DFGs)
● subject to constraints
  – convexity: ensure feasible schedules
  – fixed processor critical path: pipeline for multi-cycle instructions
  – fixed data bandwidth: limited by register file ports
● steps: based on Integer Linear Programming (ILP)
  a. template generation
  b. template selection
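The two feasibility constraints above can be checked directly on a candidate subgraph. A minimal Python sketch follows; the adjacency-dict DFG encoding and the function names are illustrative, not the actual tool's API:

```python
def is_convex(dfg, nodes):
    """A template is convex if no path between two of its nodes
    passes through a node outside the template."""
    member = set(nodes)

    def reaches_member(n, seen):
        # DFS over DFG edges; stop as soon as a member is reached
        if n in member:
            return True
        if n in seen:
            return False
        seen.add(n)
        return any(reaches_member(s, seen) for s in dfg[n])

    for n in member:
        for succ in dfg[n]:
            if succ not in member and reaches_member(succ, set()):
                return False  # path leaves and re-enters the template
    return True

def io_feasible(dfg, nodes, max_in, max_out):
    """Live-ins are operands produced outside the template;
    live-outs are results consumed outside it. Both must fit
    the register-file port budget."""
    member = set(nodes)
    preds = {n: [] for n in dfg}
    for n, succs in dfg.items():
        for s in succs:
            preds[s].append(n)
    live_in = {p for n in member for p in preds[n] if p not in member}
    live_out = {n for n in member if any(s not in member for s in dfg[n])}
    return len(live_in) <= max_in and len(live_out) <= max_out
```

The ILP formulation enforces the same conditions as linear constraints over node-selection variables; the checks above are the post-hoc equivalents.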
a. Template generation
1. Solve ILP for DFG to generate a template
2. Collapse template to a single DFG node
3. Repeat while (objective > 0)
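The generate-and-collapse loop above can be sketched as follows. Here `find_template` stands in for the ILP solve, returning the best template and its objective value; the graph encoding and all names are illustrative:

```python
def collapse(dfg, template, new_name):
    """Replace the template's nodes by a single node; edges crossing
    the template boundary are redirected to the new node."""
    member = set(template)
    merged = {}
    for n, succs in dfg.items():
        if n in member:
            continue
        merged[n] = sorted({new_name if s in member else s for s in succs})
    # successors of the collapsed node: external successors of any member
    merged[new_name] = sorted({s for n in member
                               for s in dfg[n] if s not in member})
    return merged

def generate_templates(dfg, find_template):
    """Repeat: solve for the best template, collapse it into one node,
    until the objective (estimated gain) drops to zero."""
    templates, i = [], 0
    while True:
        template, gain = find_template(dfg)
        if gain <= 0:
            break
        templates.append(template)
        dfg = collapse(dfg, template, f"T{i}")
        i += 1
    return templates, dfg
```

Collapsing keeps the DFG acyclic for subsequent iterations, so each solve works on a strictly smaller problem.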
b. Template selection
● determine isomorphism classes
  – find templates that can be implemented using the same instruction
  – calculate speed-up potential of each class
● solve Knapsack problem using ILP
  – maximize speedup within area constraint
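The slides solve the selection step as an ILP; for integer area costs the same 0/1 Knapsack can be sketched with dynamic programming. The class names and numbers below are made up for illustration:

```python
def select_templates(classes, area_budget):
    """0/1 knapsack over isomorphism classes. Each class has an
    integer area cost and an estimated cycle saving.
    classes: list of (name, area, saving). Returns (chosen, saving)."""
    n = len(classes)
    best = [[0] * (area_budget + 1) for _ in range(n + 1)]
    for i, (_, area, saving) in enumerate(classes, 1):
        for a in range(area_budget + 1):
            best[i][a] = best[i - 1][a]  # skip class i
            if area <= a and best[i - 1][a - area] + saving > best[i][a]:
                best[i][a] = best[i - 1][a - area] + saving  # take it
    # backtrack to recover the chosen classes
    chosen, a = [], area_budget
    for i in range(n, 0, -1):
        if best[i][a] != best[i - 1][a]:
            chosen.append(classes[i - 1][0])
            a -= classes[i - 1][1]
    return chosen[::-1], best[n][area_budget]
```

An ILP formulation handles side constraints (e.g. shared sub-structures between classes) that plain knapsack cannot, which is why the tool uses ILP throughout.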
Optimizing compilation flow
[Flow diagram: application in C/C++ → Impact front-end → CDFG formation → (a) template generation under data-bandwidth constraints, with gain and area fed back from Synopsys synthesis of the generated VHDL → (b) template selection under area and data-bandwidth constraints → MDES generation and instruction replacement → Elcor backend (scheduling, register allocation) → assembly code and statistics]
4. Application-specific processor synthesis
● design space exploration framework
  – Processor Component Library
  – specialized structural description
● prototype: MIPS integer instruction set
  – custom instructions
  – flexible micro-architecture
● evaluate using actual implementation
  – timing and area
Processor synthesis flow
[Diagram: custom data paths from the compiler are merged with the Processor Component Library, adding state registers and a processor interface (data in/out, stall control); a pipeline description (parameters, FE/EX/M/W stages) completes the custom processor]
Implementation
● based on Python scripts
  – structural meta-language for processors
  – combine RTL (Verilog/VHDL) IP blocks
  – module generators for custom units
● generate 100s of designs automatically
  – ASIC processor cores
  – complete system on FPGA: CPU + memory + I/O
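In the spirit of the Python-based structural meta-language described above, a module generator might emit Verilog for a chained single-cycle custom unit. This toy sketch is entirely illustrative and is not the actual tool:

```python
def gen_custom_unit(name, ops):
    """Emit a Verilog module applying a chain of binary ops to rs1.
    ops: list of (verilog_operator, operand_expr) pairs, applied in
    sequence, e.g. [('^', "32'h5a5a5a5a"), ('+', 'rs2')]."""
    body = ['module {}(input clk, input [31:0] rs1, input [31:0] rs2,'
            .format(name),
            '               output reg [31:0] rd);',
            '  wire [31:0] t0 = rs1;']
    for i, (op, operand) in enumerate(ops, 1):
        body.append('  wire [31:0] t{} = t{} {} {};'
                    .format(i, i - 1, op, operand))
    # register the final result on the clock edge (output register)
    body += ['  always @(posedge clk) rd <= t{};'.format(len(ops)),
             'endmodule']
    return '\n'.join(body)
```

Generating RTL from parameterized scripts like this is what makes producing hundreds of design points automatic: each (instruction set, constraint) pair yields a different emitted module.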
5. Results
● cryptography benchmarks: C source
  – AES decrypt, AES encrypt, DES, MD5, SHA
● 4/5 stage pipelined MIPS base processor
  – 0.225mm2 area, 200 MHz clock speed
  – single issue processor
  – register file with 2 input ports, 1 output port
● processors synthesized to 130nm library
  – Synopsys DC and Cadence SoC Encounter
  – also synthesize to Xilinx FPGA for testing
AES Decryption Processor
[Chip layout: 130nm CMOS, 200 MHz, 0.307mm2]
● 35% area cost (mostly one instruction)
● 76% cycle reduction
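The headline 3x area-delay figure is consistent with the AES-decrypt numbers above, taking cycle count as a proxy for delay since the 200 MHz clock is unchanged:

```python
# AES decrypt, from the slides: base CPU 0.225 mm^2, custom 0.307 mm^2,
# 76% cycle reduction at the same 200 MHz clock
area_ratio = 0.307 / 0.225      # ~1.36x area
cycle_ratio = 1.0 - 0.76        # 0.24x cycles (delay proxy)
adp = area_ratio * cycle_ratio  # ~0.33x area-delay product
print(round(1 / adp, 1))        # ~3.1x reduction
```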
Execution time
[Chart: normalised number of cycles (0-120) against area constraint (0-70, in ripple-carry-adder units) for AES decrypt, AES encrypt, DES, MD5 and SHA; register file in all cases: 2 input ports, 1 output port. Custom instructions use up to 4 inputs and 1, 2 or 4 outputs; cycle reductions include 43% (4 inputs, 2 outputs), 63% (4 inputs, 4 outputs) and 76% (4 inputs, 1 output)]
Timing
● 48% of designs meet timing at 200 MHz without manual optimization
[Scatter plot: slack (ns, from +0.5 down to -3.5) against cell area (50,000-120,000 µm2)]
Area (for maximum speedup)
[Bar chart: cell area and chip area (µm2, up to ~500,000) for AES dec, AES enc, DES, MD5, SHA and the base CPU; area overheads at maximum speedup range from 23% to 93% across the benchmarks (93%, 42%, 35%, 28%, 23%)]
6. Current and future work
● support memory access in custom instructions
  – automate data partitioning for memory access
  – automate SIMD load/store instructions for state registers
● use architectural techniques e.g. shadow registers
  – improve bandwidth without additional register file ports
● extend compiler: e.g. ILP model for cyclic graphs
  – adapt software pipelining for hardware
● multiple issue vs custom instructions
  – study trade-offs for VLIW style
● more applications: multimedia, multi-core, reconfig.
  – include power consumption etc in optimization
Summary
● complete flow from C to custom processor
● automatic instruction set extension
  – based on integer linear programming
  – optimize schedule length under constraints
● application-specific processor synthesis
  – complete flow: permits real hardware evaluation
● up to 76% reduction in execution cycles
  – max speedup: 23% to 93% area overhead
● 3x area delay product reduction