Accurate Fine-Grained Processor Power Proxies

Alper Buyuktosunoglu∗, Wei Huang†, Charles Lefurgy∗, William Kuk∗∗,‡, Michael Floyd∗∗, Karthick Rajamani∗, Malcolm Allen-Ware∗, Bishop Brock∗∗
†AMD, ∗IBM Research, ‡Purdue University, ∗∗IBM System and Technology Group
The 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Situation

Practical ways to directly measure power consumption of a core within a microprocessor do not exist. Servers commonly measure power consumption of chips at the voltage regulator. Power management could be improved by providing accurate, fine-grained power measurements for individual processor cores. Core power measurement would enable virtual machine energy billing and forecasting the effects of power management actuators.

[Figure: Real power measurement for system management. Bulk power supply, Vdd rail (12 V), voltage regulator, sense circuitry, A/D, POWER7+ chip, EnergyScale microcontroller.]

Solution

Determine idle power model.
• On idle chip, sweep voltage and frequency (253 measurement points).
• Data set: <Power, Voltage, Frequency, Temp> x 253 x 4 chips.
• Genetic algorithm-based optimization: find fitting parameters to minimize Pmeasure - Pidle for all measurement points.
• Result: fitting parameters of the idle power model.

Determine active power model.
• Run training kernel workload (V=Vnom, F=Fnom).
• Find weights to minimize Pactive - ActivityProxy/R.
• Result: 51 weights of the active power model.

Opportunity

POWER7 includes on-chip hardware to compute per-chiplet activity proxies used to estimate active power. Activity counters and the calculation of activity proxies are implemented in hardware logic of each core. Instead of implementing a weighted sum, some weights are applied to groups of activity counters to reduce circuit area.

[Figure: core unit activity counters, including DFU and FXU.]

• Trained with 762 kernels, spanning a range of memory sizes and threading modes.
• Idle power model has accuracy of 3% across voltage and frequency range.
• Fit using 4 chips from distinct process corners.

Results
• Unsigned error is 1.8% (2.0% std dev) across all tested workloads (only SPEC CPU2006 shown).

Experiment
• Calibrate POWER7+ power proxy hardware.
• Run workloads (SPEC CPU2006, SPECpower_ssj2008, etc.).
• Measure power of Vdd voltage rail.

Chip Vdd power proxy has a mean error of 0.2% (2.6% std dev).

Power proxy tracks change in voltage, frequency, temperature, and workload activity.

[Figure: Normalized chip power (left axis: power measurement and power proxy) and Vdd voltage in mV (right axis) versus time (s).]
• Fixed frequency run of dealII workload.
• Power proxy continues to track actual power while undervolting up to 112.5 mV.

Observations
• Tracks changes in voltage, frequency, temperature, and workload activity.
• Power estimation made every 32 ms (30x faster than prior work).
• Power proxy is accurate even when voltage and frequency do not have fixed pairings. Useful for undervolting (with fixed frequency) and overclocking (with fixed voltage) scenarios.
• Power proxy implementation on service processor and system management network does not impact workload performance.

Applications
• Enable billing of energy consumption for virtual machines on a per-core basis.

[Figure: POWER7 chiplet (core + L2 + L3), Core 0. Core activity from the IFU, ISU, LSU, and VSU & FPU units, L2 (5 events), L3 (4 events), and NCU feeds the activity proxy. Legend marks the activity sense points.]

The weights to different activity events are programmable by writing to special on-chip registers. The EnergyScale microcontroller receives the activity proxies and adjusts them to account for the effects of leakage, temperature, process variations and voltage to form chip and core power proxies.
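The grouping of counters under shared weights, and the later adjustment of the activity proxy into a power proxy, can be sketched as below. This is only an illustration under stated assumptions: the `CounterGroup` type, the counter names, and the weight values are hypothetical, and the `(V/Vnom)^2` scaling is a textbook dynamic-power approximation used as a stand-in for the EnergyScale adjustment, whose actual form the poster does not give.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CounterGroup:
    """Hypothetical container: one programmable weight shared by a group of counters."""
    weight: float
    counters: List[str]


def activity_proxy(counts: Dict[str, int], groups: List[CounterGroup]) -> float:
    """Sum the counters in each group, then apply that group's single weight."""
    return sum(g.weight * sum(counts[name] for name in g.counters) for g in groups)


def power_proxy(proxy: float, idle_power: float, v: float, v_nom: float) -> float:
    """Assumed adjustment (not from the poster): scale the active estimate by
    (V/Vnom)^2, then add an idle-power term supplied by the caller."""
    return idle_power + proxy * (v / v_nom) ** 2


# Illustrative values only; counter names and weights are made up.
groups = [
    CounterGroup(2.0, ["ifu_fetch", "isu_dispatch"]),
    CounterGroup(0.5, ["lsu_load"]),
]
counts = {"ifu_fetch": 1, "isu_dispatch": 2, "lsu_load": 4}
proxy = activity_proxy(counts, groups)
estimate = power_proxy(proxy, idle_power=1.0, v=0.9, v_nom=1.0)
```

Sharing one weight across a group needs one multiplier per group rather than one per counter, which is the circuit-area saving the poster refers to.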
[Figure: Estimation of core power. Bulk power supply, Vdd rail (12 V), voltage regulator, sense circuitry, A/D, POWER7+ chip, EnergyScale microcontroller. Inputs to the microcontroller: activity proxy per core, clock frequency per core, temperature per core, voltage.]

[Figure: Conventional power measurement of a chip voltage rail.]

Applications: Improve power management controllers by forecasting power due to change in voltage, frequency, temperature, and workload.

[Figure: Per-core power estimation. Normalized chip Vdd power versus time (32 ms intervals). Series: C2 power proxy (mload), C3 power proxy (sqroot), C4 power proxy (mcopy), C5 power proxy (fma), C6 power proxy (daxpy), C7 power proxy (mlr), chip measured Vdd power.]
• Run 6 workloads on 6-core chip and compare to real chip Vdd power (3% err).
• Power proxy tracks thermal rise.

[Figure: Absolute percent err (%) per SPEC CPU2006 benchmark: astar, bwaves, bzip2, cactusADM, calculix, dealII, gamess, gcc, GemsFDTD, gobmk, gromacs, h264ref, hmmer, lbm, leslie3d, libquantum, mcf, milc, namd, omnetpp, perlbench, povray, sjeng, soplex, sphinx3, tonto, wrf, xalancbmk, zeusmp, and the average.]

[Figure: Model training flow.
Idle power: measure power on Vdd rail and measure chip temperature.
Counters and active power: measure activity counters; measure power on Vdd rail; Pactive = Pmeasured - Pidle(V,F,T); genetic algorithm-based optimization.
Validation: Pchip measurement matches Pidle + Pactive models.]

[Figure: Normalized chip Vdd power versus time (s) at 100%, 40%, and 20% load. Series: power sensor and power proxy for DPS, for DPS with UV, and for nominal.]
• SPECpower_ssj2008 run under different conditions. UV = undervolting, DPS = Dynamic Power Saver (voltage and frequency scaling).
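The active-power training step, Pactive = Pmeasured - Pidle(V,F,T) followed by a search for weights, can be illustrated with a small sketch. The poster uses a genetic algorithm. The code below swaps in a much simpler mutation-only hill climb (keep a randomly perturbed candidate only if it lowers the error), so it should be read as a stand-in rather than the authors' optimizer. All function names, the sample data, and the constant idle value are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def active_power(p_measured: float, p_idle: float) -> float:
    """Pactive = Pmeasured - Pidle(V, F, T); the idle value is supplied by the caller."""
    return p_measured - p_idle


def proxy(weights: Sequence[float], counts: Sequence[float]) -> float:
    """Weighted sum of activity counts for one training sample."""
    return sum(w * c for w, c in zip(weights, counts))


def loss(weights: Sequence[float], samples, targets) -> float:
    """Mean absolute difference between active power and the proxy."""
    total = sum(abs(t - proxy(weights, s)) for s, t in zip(samples, targets))
    return total / len(targets)


def fit_weights(samples, targets, iterations=2000, step=0.05, seed=0) -> List[float]:
    """Mutation-only hill climb: accept a perturbed candidate only if the loss drops."""
    rng = random.Random(seed)
    best = [0.0] * len(samples[0])
    best_loss = loss(best, samples, targets)
    for _ in range(iterations):
        cand = [w + rng.gauss(0.0, step) for w in best]
        cand_loss = loss(cand, samples, targets)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best


# Illustrative data only (not from the poster): two counters per training kernel.
counts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
measured = [12.0, 13.0, 15.0, 17.0]  # rail power, arbitrary units
idle = 10.0                          # stand-in for Pidle(V, F, T)
targets = [active_power(p, idle) for p in measured]
weights = fit_weights(counts, targets)
```

Because a candidate is accepted only when it improves the loss, the fitted weights never do worse than the all-zero starting point. A population-based genetic algorithm explores more of the search space, which matters for a problem the size of the poster's 51 weights.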