Towards Optimal Custom Instruction Processors
Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer
Department of Computing, Imperial College London
HOT CHIPS 18

Overview
1. background: extensible processors
2. design flow: C to custom processor silicon
3. instruction selection: bandwidth/area constraints
4. application-specific processor synthesis
5. results: 3x area-delay product reduction
6. current and future work + summary

1. Instruction-set extensible processors
● base processor + custom logic
  – partition data-flow graphs into custom instructions
[Figure: base processor datapath — register file supplying data in to the ALU and custom logic, data out written back to the register file]

Previous work
● many techniques, e.g.
  – Atasu et al. (DAC 03, CODES 05)
  – Biswas et al. (DATE 05), Cong et al. (FPGA 04)
  – Clark et al. (MICRO 03, HOT CHIPS 04)
  – Goodwin and Petkov (CASES 03)
  – Sun et al. (ICCAD 03), Yu et al. (CASES 04)
● current challenges
  – optimality and robustness of heuristics
  – complete tool chain: application to silicon
  – research infrastructure for custom processor design

2. Custom processor research at Imperial
● focus on effective optimization techniques
  – e.g. Integer Linear Programming (ILP)
● complete tool-chain
  – high-level descriptions to custom processor silicon
● open infrastructure for research in
  – custom processor synthesis
  – automatic customization techniques
● current tools
  – optimizing compiler (Trimaran) for custom CPUs
  – custom processor synthesis tool

Application to custom processor flow
[Figure: design flow — application source (C) and an area constraint drive template generation and template selection against a processor description; the selected templates generate a custom unit and base CPU, which pass through ASIC tools to report area and timing]

Custom instruction model
[Figure: custom instruction datapath — input register fed from the register file input ports, pipeline register, output register driving the register file output ports]
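Partitioning a data-flow graph into custom instructions means checking each candidate template against the register-file port budget (how many operands it reads and writes) and convexity (no path may leave the template and re-enter it, or no feasible schedule exists). A minimal sketch of both checks, assuming the DFG is held as predecessor/successor maps; the toy graph, node names, and the 2-input/1-output budget are illustrative only:

```python
def port_counts(pred, succ, template):
    """Operands a template reads from / writes back to the register file."""
    ins = {p for n in template for p in pred.get(n, ()) if p not in template}
    outs = {n for n in template
            if not succ.get(n) or any(s not in template for s in succ[n])}
    return len(ins), len(outs)

def is_convex(succ, template):
    """A template is convex if no path leaves it and later re-enters it."""
    stack = [s for n in template for s in succ.get(n, ()) if s not in template]
    seen = set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for v in succ.get(u, ()):
            if v in template:
                return False  # left the template and came back: not schedulable
            stack.append(v)
    return True

# Toy DFG: inputs i1, i2 feed a; a feeds b and c; b and c feed d.
pred = {"a": ["i1", "i2"], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
succ = {"i1": ["a"], "i2": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"]}

print(port_counts(pred, succ, {"a", "b", "c", "d"}))  # (2, 1): fits 2-in/1-out
print(is_convex(succ, {"a", "d"}))  # False: a -> b -> d exits and re-enters
```

Sink nodes with no successors are treated as live-out here, so they always count as outputs; a real identification pass would use liveness information instead.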
3. Optimal instruction identification
● minimize schedule length of program data-flow graphs (DFGs)
● subject to constraints
  – convexity: ensure feasible schedules
  – fixed processor critical path: pipeline for multi-cycle instructions
  – fixed data bandwidth: limited by register file ports
● steps: based on Integer Linear Programming (ILP)
  a. template generation
  b. template selection

a. Template generation
1. Solve ILP for DFG to generate a template
2. Collapse template to a single DFG node
3. Repeat while (objective > 0)

b. Template selection
● determine isomorphism classes
  – find templates that can be implemented using the same instruction
  – calculate speed-up potential of each class
● solve knapsack problem using ILP
  – maximize speedup within area constraint

Optimizing compilation flow
[Figure: compilation flow — application in C/C++ enters the Impact front-end; CDFG formation feeds (a) template generation under data-bandwidth constraints and (b) template selection under area constraints, with gain/area figures supplied by Synopsys synthesis of the VHDL; MDES generation, instruction replacement, and the Elcor back-end (scheduling, register allocation) produce assembly code and statistics]

4. Application-specific processor synthesis
● design space exploration framework
  – Processor Component Library
  – specialized structural description
● prototype: MIPS integer instruction set
  – custom instructions
  – flexible micro-architecture
● evaluate using actual implementation
  – timing and area

Processor synthesis flow
[Figure: synthesis flow — custom data paths from the compiler are merged (adding state registers and a processor interface) with the Processor Component Library and a pipeline description (FE, EX, M, W stages plus parameters) to produce the custom processor; the interface handles data in/out and stall control]

Implementation
● based on Python scripts
  – structural meta-language for processors
  – combine RTL (Verilog/VHDL) IP blocks
  – module generators for custom units
● generate 100s of designs automatically
  – ASIC processor cores
  – complete system on FPGA: CPU + memory + I/O
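The template selection step above is a 0/1 knapsack: pick the isomorphism classes that maximize total speed-up within the custom-logic area budget. The talk solves it with ILP; the sketch below uses a plain dynamic-programming knapsack instead, with made-up class names and gain/area numbers, just to make the formulation concrete:

```python
def select_templates(classes, area_budget):
    """0/1 knapsack over isomorphism classes: maximize total speed-up
    (cycles saved) subject to an integer area budget.
    classes: list of (name, gain, area) tuples."""
    # best[a] = (total_gain, chosen_names) achievable with area at most a
    best = [(0, [])] * (area_budget + 1)
    for name, gain, area in classes:
        # iterate areas downward so each class is used at most once
        for a in range(area_budget, area - 1, -1):
            g, chosen = best[a - area]
            if g + gain > best[a][0]:
                best[a] = (g + gain, chosen + [name])
    return best[area_budget]

# Hypothetical classes: (name, cycles saved, area in ripple-carry adders)
classes = [("aes_round", 900, 40), ("sbox", 300, 10), ("mixcol", 500, 25)]
print(select_templates(classes, 50))  # -> (1200, ['aes_round', 'sbox'])
```

The ILP formulation additionally lets identical templates share one instruction decoder slot and one area charge; the DP above treats each class independently, which is the simplest correct reading of "maximize speedup within area constraint".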
5. Results
● cryptography benchmarks: C source
  – AES decrypt, AES encrypt, DES, MD5, SHA
● 4/5 stage pipelined MIPS base processor
  – 0.225mm2 area, 200 MHz clock speed
  – single-issue processor
  – register file with 2 input ports, 1 output port
● processors synthesized to 130nm library
  – Synopsys DC and Cadence SoC Encounter
  – also synthesized to Xilinx FPGA for testing

AES Decryption Processor
● 130nm CMOS, 200 MHz, 0.307mm2
● 35% area cost (mostly one instruction)
● 76% cycle reduction
[Figure: layout plot of the AES decryption processor]

Execution time
[Figure: normalized cycle count vs. area constraint (in ripple-carry adders) for AES decrypt, AES encrypt, DES, MD5 and SHA; register file in all cases: 2 input ports, 1 output port; annotated points use custom instructions with 4 inputs and 1, 2 or 4 outputs, giving cycle reductions of 43%, 63% and up to 76% (AES decrypt)]

Timing
● 48% of designs meet timing at 200 MHz without manual optimization
[Figure: timing slack (ns) vs. cell area (µm2) across the generated designs]

Area (for maximum speedup)
[Figure: cell and chip area (µm2) at maximum speedup for aes dec, aes enc, des, md5, sha and the base CPU; area overheads range from 23% to 93%]

6. Current and future work
● support memory access in custom instructions
  – automate data partitioning for memory access
  – automate SIMD load/store instructions for state registers
● use architectural techniques, e.g. shadow registers
  – improve bandwidth without additional register file ports
● study trade-offs for VLIW style
  – multiple issue vs custom instructions
● extend compiler: e.g. ILP model for cyclic graphs
  – adapt software pipelining for hardware
● more applications: multimedia, multi-core, reconfig.
● include power consumption etc. in optimization

Summary
● complete flow from C to custom processor
● automatic instruction set extension
  – based on integer linear programming
  – optimize schedule length under constraints
● application-specific processor synthesis
  – complete flow: permits real hardware evaluation
● up to 76% reduction in execution cycles
  – 3x area-delay product reduction
  – max speedup: 23% to 93% area overhead
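The 3x area-delay figure in the summary follows directly from the reported AES-decrypt numbers: a 35% larger core (0.307mm2 vs. the 0.225mm2 base) running 76% fewer cycles at the same 200 MHz clock. A quick check of the arithmetic:

```python
base_area, custom_area = 0.225, 0.307  # mm^2, from the results slides
cycle_factor = 1.0 - 0.76              # 76% cycle reduction, same clock

# Area-delay product of the custom processor relative to the base processor
adp_ratio = (custom_area / base_area) * cycle_factor
print(f"ADP improvement: {1 / adp_ratio:.2f}x")  # about 3x
```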