TAPO: Thermal-Aware Power Optimization Techniques for Servers and Data Centers
Wei Huang
© 2011 IBM Corporation IBM Research – Austin, Wei Huang et al.

Overview
  Team work at IBM Research:
    Austin: Malcolm Allen-Ware, John Carter, Mootaz Elnozahy, Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
    T. J. Watson: Hendrik Hamann
  Objective: power optimization of an entire system (e.g., server, DC), with explicit consideration of cooling power
  Hierarchical techniques:
    Server-level power (TAPO-server):
      Fan power vs. leakage power
      Goal: minimize aggregate fan + leakage power
      Prototyped on a POWER 750 Express server (POWER7-based)
    Datacenter-level power (TAPO-dc):
      HVAC power vs. server fan power
      Goal: minimize aggregate HVAC + server power
      Analysis based on realistic models

Background
  Thermal setpoints are fixed:
    Server temperature setpoint, e.g. 70C for POWER7 processors
    Data center (DC) HVAC chiller setpoint (chilled water), e.g. 10C
    System dynamics are not considered, which can be power inefficient: overcooling wastes cooling power
  Cooling-related power components:
    DC HVAC power (chiller, blower, etc.):
      Comparable to IT power
      Characteristics: a warmer environment allows a higher chiller setpoint, which lowers chiller power
    Server fan power:
      Has been counted as part of IT power, but really should be considered separately (PUE is not an accurate indicator)
      Strong superlinear (~ quadratic or cubic) relationship to fan speed
    Server (processor) leakage power:
      Strongly temperature dependent
      Reducing leakage requires more server fan power to cool the chips down
Overview
  Team work:
    Austin: Wei Huang, Malcolm Allen-Ware, John Carter, Mootaz Elnozahy, Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
    Watson: Hendrik Hamann
  Objective: optimize power and/or performance of an entire system (e.g., server, DC), with explicit consideration of cooling power
  Hierarchical techniques:
    Server-level power (TAPO-server):
      Fan power vs. leakage power
      Goal: minimize aggregate fan + leakage power
      Prototyped on a POWER 750 Express server
    Datacenter-level power (TAPO-dc):
      HVAC power vs. server fan power
      Goal: minimize aggregate HVAC + server power
      Analysis based on realistic models

TAPO-server
  Optimizing server fan + processor leakage power: what is the power saving potential?
  Manual characterization on a POWER7-based server:
    Turbo frequency (3.864 GHz), CPU-intensive workload, L2 resident, 32 SMT4 cores
  [Figure: normalized total server power vs. fan speed (RPM). Annotations: default system operating point (settles down to ~70C); optimal operating point; 5.4% peak power saving; frequency drop due to overheating]
  [Figure: normalized server fan power vs. server fan speed (RPM), with fitted curve y = 5E-08x^2 - 0.0002x + 0.2602. Annotation: 70% fan power reduction (2% processor power increase)]

Search for optimal thermal setpoint in TAPO-server
  Change the processor thermal setpoint (Tthr), which indirectly changes fan speed
  On the server power vs. fan RPM curve:
    Left: fan speed is low, more thermal-induced leakage power
    Right: system is cool, but more fan power
  Control loop: monitor ∆P and ∆T between Time 1 and Time 2, then adjust the thermal setpoint:
    ∆P>0, ∆T>0 → decrease Tthr to raise fan speed
    ∆P<0, ∆T>0 → increase Tthr to reduce fan speed
  [Diagram: server with fan and air flow; plot of server power vs. fan RPM marking the operating point at Time 1 and Time 2]
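The setpoint search above can be sketched as a simple hill climb. This is only an illustration under stated assumptions, not the prototype's controller: the fan curve is the fitted polynomial from the slide, but the fan response to Tthr, the leakage model, the watt scaling, the step size and the 60C to 80C bounds are all hypothetical. The slide gives the rule for ∆T>0 only; the sketch assumes the mirror-image rule for ∆T<0.

```python
import math

def fan_power_norm(rpm):
    """Fitted curve from the slide: normalized fan power vs. fan speed (RPM)."""
    return 5e-8 * rpm**2 - 0.0002 * rpm + 0.2602

# Everything below is a hypothetical plant model, not data from the slides.
FAN_SCALE_W = 120.0  # assumed watts per unit of normalized fan power

def rpm_for_setpoint(t_thr):
    """Assumed fan response: 60C -> 6000 RPM, 80C -> 2000 RPM, linear between."""
    return min(6000.0, max(2000.0, 6000.0 - 200.0 * (t_thr - 60.0)))

def leakage_w(temp_c):
    """Assumed exponential growth of leakage power with chip temperature."""
    return 40.0 * math.exp(0.03 * (temp_c - 60.0))

def measure_power(t_thr):
    """Stand-in for: apply T_thr, wait for temperature to settle, read power."""
    return FAN_SCALE_W * fan_power_norm(rpm_for_setpoint(t_thr)) + leakage_w(t_thr)

def search_setpoint(measure, t_start, step=1.0, t_min=60.0, t_max=80.0, max_steps=50):
    """Hill-climb the thermal setpoint toward minimum fan + leakage power."""
    t, p = t_start, measure(t_start)
    direction = 1.0   # first probe raises T_thr (slower fans)
    reversals = 0
    for _ in range(max_steps):
        t_next = min(t_max, max(t_min, t + direction * step))
        # A move blocked by a bound is treated like a move that made power worse.
        p_next = measure(t_next) if t_next != t else float("inf")
        if p_next < p:
            t, p = t_next, p_next      # dP < 0: keep moving the same way
        else:
            direction = -direction     # dP >= 0: stay at t and try the other way
            reversals += 1
            if reversals == 2:
                break                  # power is worse on both sides of t
    return t, p

t_opt, p_opt = search_setpoint(measure_power, 70.0)
print(f"setpoint {t_opt:.0f}C, fan + leakage power {p_opt:.1f} W")
```

With these assumed numbers the search starting at 70C stops at 76C. On real hardware each call to `measure` would be slow, because the temperature has to settle after every fan speed change.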
TAPO-server discussions
  Power convergence threshold: 5 Watts. Sampled every 32 ms.
  Depends entirely on measurements; no models involved.
  Reduces peak power at peak performance:
    Saves ~5% of peak power; a perfect solution would have saved 5.4%
    No observed performance loss (frequency and voltage are fixed)
  Regardless of workload, chip variations and environment, TAPO-server should adaptively find the optimal point.
  Slow convergence: must wait long enough (30 seconds to 2 minutes) for temperature to settle down after each fan speed change.
  For safety, there is an upper limit on the thermal threshold (if exceeded, DVFS is used to prevent a thermal emergency).

TAPO-server results
  [Figures: four traces over time (0 to 40 minutes): normalized server fan power; normalized total processor power; normalized total server power; processor thermal setpoint (C, axis from 70 to 75)]
  A prototyped new model-based control method reduces convergence time to ~1 minute

Overview
  Team work:
    Austin: Wei Huang, Malcolm Allen-Ware, John Carter, Mootaz Elnozahy, Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
    Watson: Hendrik Hamann
  Objective: optimize power and/or performance of an entire system (e.g., server, DC), with explicit consideration of cooling power
  Hierarchical techniques:
    Server-level power (TAPO-server):
      Fan power vs. leakage power
      Goal: minimize aggregate fan + leakage power
      Prototyped on a P7 HV32 server
    Datacenter-level power (TAPO-dc):
      HVAC power vs. server fan power
      Goal: minimize aggregate HVAC + server power
      Analysis based on realistic models
TAPO-dc
  Tradeoff between HVAC power and server (IT) fan power
  Use the chilled water setpoint to adjust HVAC power
  Based on published component power models
  Two chiller designs (COP 3.0-6.0 and 4.1-5.5)
  T_inlet = T_chiller + 10C
  Server inlet temperature range: 20C ~ 40C
  [Figure: normalized server fan power vs. server fan speed (RPM), with fitted curve y = 5E-08x^2 - 0.0002x + 0.2602]

TAPO-dc results
  Assuming a rack of ten POWER 750 Express servers
  Fully utilized DC cooling zone
  [Figure: "Fully Utilized Datacenter". X-axis: server inlet temperature (C), 20 to 40. Bars: normalized power breakdown into IT power w/o fan, chiller power, blower power and IT fan power, for the wide and narrow chiller COP designs. Lines: normalized total power for wide COP and narrow COP (left Y-axis)]

TAPO-dc results (cont'd)
  10% utilized DC cooling zone
  [Figure: "10% Utilized Datacenter". Axes as in the fully utilized case]

TAPO-dc results (cont'd)
  60% utilized DC cooling zone
  [Figure: "60% Utilized Datacenter". Axes as in the fully utilized case]
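The chiller side of the TAPO-dc tradeoff can be illustrated with the COP definition (heat removed divided by electrical power drawn). The sketch below uses the two COP ranges and the T_inlet = T_chiller + 10C relation from the model slide. It assumes, purely for illustration, that COP rises linearly with the chilled water setpoint and that the ends of each COP range line up with the 20C and 40C inlet limits. The slides do not state this mapping, and the 10 kW heat load is a made-up number.

```python
def chiller_cop(t_chiller_c, cop_min, cop_max, t_lo=10.0, t_hi=30.0):
    """Assumed linear COP model between the coldest and warmest setpoints."""
    frac = (t_chiller_c - t_lo) / (t_hi - t_lo)
    frac = min(1.0, max(0.0, frac))
    return cop_min + frac * (cop_max - cop_min)

def chiller_power_w(heat_load_w, t_inlet_c, cop_min, cop_max):
    """Chiller electrical power needed to remove heat_load_w of heat."""
    t_chiller_c = t_inlet_c - 10.0   # from the slide: T_inlet = T_chiller + 10C
    return heat_load_w / chiller_cop(t_chiller_c, cop_min, cop_max)

HEAT_LOAD_W = 10_000.0   # hypothetical rack heat load

for name, (cop_min, cop_max) in {"wide": (3.0, 6.0), "narrow": (4.1, 5.5)}.items():
    cold = chiller_power_w(HEAT_LOAD_W, 20.0, cop_min, cop_max)
    warm = chiller_power_w(HEAT_LOAD_W, 40.0, cop_min, cop_max)
    print(f"{name} COP: {cold:.0f} W at 20C inlet, {warm:.0f} W at 40C inlet")
```

Under these assumptions the wide-COP chiller gains more from a warmer setpoint than the narrow-COP one. The full tradeoff also needs the server fan and leakage power, which grow as the inlet air gets warmer.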
TAPO-dc control method
  No single thermal setpoint is optimal
  Dynamically searching for the optimal point is not tractable
    Thermal mass, HVAC complexity
  Binary control, based on utilization level:
    Monitor the average utilization level of a DC cooling zone (e.g. over 1 hour)
    Is the utilization level low (e.g. <50%)?
      Yes: set the HVAC setpoint to a high temperature (e.g. 35C)
      No: set the HVAC setpoint to a low temperature (e.g. 27C)
    Wait for a fixed period or for stabilization of DC thermals (e.g. 1 hour)
  [Figure: power saving w.r.t. worst case vs. datacenter utilization level (0 to 1). Series: binary, wide COP; optimal, wide COP; binary, narrow COP; optimal, narrow COP]

Conclusions and ongoing work
  Finding the right thermal setpoint helps save total system power, without a performance hit
    TAPO-server and TAPO-dc
  Ongoing work:
    Prototype TAPO-dc in a real data center
    Make TAPO-server converge faster
    Understand the delicate interactions between the two techniques:
      Warmer ambient from TAPO-dc makes TAPO-server more valuable
      TAPO-server lowers server fan power, favoring TAPO-dc with a warmer chiller setpoint to reduce HVAC power
    Reliability concerns of server components running at slightly hotter temperatures

Thank you. Questions?

More materials…

Overview
  It is a team effort:
    Austin: Wei Huang, Malcolm Allen-Ware, John Carter, Mootaz Elnozahy, Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
    Watson: Hendrik Hamann
  Objective: optimize power and/or performance of an entire system (e.g., server, DC), with explicit consideration of cooling power
  Hierarchical techniques:
    Server-level power (TAPO-server):
      Fan power vs. leakage power
      Goal: minimize aggregate fan + leakage power
      Prototyped on a P7 HV32 server
    Datacenter-level power (TAPO-dc):
      HVAC power vs. server fan power
      Goal: minimize aggregate HVAC + server power
    Server-level performance (TAPO-shift):
      Load imbalance in different cooling zones
      State of the art can't fully exploit power shifting from an idle zone to an active zone, due to thermal limitations
      Goal: maximize active zone performance, within power and thermal budgets

TAPPO-shift
  Power shifting:
    Idea: shift unused power budget from underutilized parts to boost the performance of highly utilized parts
    Total power constraint, thermal constraint
    Shifting among cooling zones. Examples: socket to socket, server to server, rack to rack, DC zone to DC zone, etc.
  Limitations:
    Each cooling zone is designed independently, without cooling capability for significantly more power
    On the other hand, server processors can be overclocked by ~25% above nominal, which is hard to achieve in reality due to thermal limits

TAPPO-shift
  Solution: over-provisioned cooling capacity (by a large margin) in each cooling zone
    Cost is small: better/more fans
    Benefit: higher performance (e.g. processor can run at much higher frequency with shifted power)
    Within the same overall power budget across cooling zones, no thermal violation

TAPPO-shift: an illustrative example
  Case A: one zone, balanced, fully loaded
  Case B: one zone, unbalanced
  Case C: two zones, unbalanced, power shifting
  [Diagram: one zone (a shared fan cooling Component 1 and Component 2) vs. two zones (Zone 1 and Zone 2, each with its own "turbo" fan); bar chart of power for C1, C2 and the fans in Cases A to C against the server power budget]
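The power-shifting idea can be illustrated with a two-zone calculation. This is a rough sketch under stated assumptions rather than the method behind the slides: it assumes socket power grows with the cube of frequency, ignores the extra fan power needed in the active zone, and uses made-up wattages. The function names are hypothetical.

```python
def shifted_zone_power_w(active_nominal_w, idle_unused_w, active_cooling_cap_w):
    """Power the active zone may draw after receiving the idle zone's unused
    budget, limited by how much heat the active zone's cooling can remove."""
    return min(active_nominal_w + idle_unused_w, active_cooling_cap_w)

def frequency_gain(new_power_w, nominal_power_w, exponent=3.0):
    """Relative frequency gain, assuming power scales as frequency**exponent."""
    return (new_power_w / nominal_power_w) ** (1.0 / exponent) - 1.0

NOMINAL_W = 100.0   # hypothetical per-zone nominal power
UNUSED_W = 50.0     # hypothetical budget left unused by the idle zone

# Cooling sized only for nominal power: the shifted budget cannot be used.
limited = shifted_zone_power_w(NOMINAL_W, UNUSED_W, active_cooling_cap_w=100.0)
# Over-provisioned cooling: the active zone can absorb the whole shifted budget.
boosted = shifted_zone_power_w(NOMINAL_W, UNUSED_W, active_cooling_cap_w=200.0)

print(f"cooling-limited gain: {frequency_gain(limited, NOMINAL_W):.1%}")
print(f"over-provisioned gain: {frequency_gain(boosted, NOMINAL_W):.1%}")
```

In both cases the total draw across the two zones stays within the original combined budget. Only the split between the zones changes.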
TAPPO-shift results
  Use P7 power-frequency relationship (cubic)
  Use P7 HV32 system power and fan power (almost cubic in RPM)
  4 sockets divided into two cooling zones (each has separate fan control and better fans)
  Potentially 16% higher than P7 Turbo frequency
  [Figure: "Power scaling with DVFS (4 early samples)": normalized socket power vs. normalized socket frequency, fit with R^2 = 0.9982]
  [Figure: relative socket frequency gain vs. portion of zone computing power shifting]

Combined TAPPO techniques – qualitative example
  Two DC cooling zones: Zone1 is 80% utilized, Zone2 is 10% utilized
  Workload migration to make Zone2 idle
  Observations:
    Migration itself does not save power, but turning off the idle zone does!
    TAPPO-dc and -server can save about 9% power in this example
    Combined with TAPPO-shift, can boost active zone utilization by 10% with about the same power
  [Figure: "Migrating 80% and 10% utilized cooling zones in a datacenter". Steps: no migration; migration; add TAPPO-dc; add TAPPO-server; add TAPPO-shift. Bars: normalized total power (left Y-axis). Lines: zone 1 and zone 2 utilization levels (right Y-axis)]
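For reference, the binary TAPO-dc control method described earlier (average the zone utilization, then pick a high or a low HVAC setpoint) can be sketched in a few lines. The 50% threshold and the 35C / 27C setpoints are the example values from that slide. The function name and the way utilization samples are supplied are hypothetical.

```python
from statistics import mean

def choose_hvac_setpoint_c(utilization_samples, low_threshold=0.5,
                           high_setpoint_c=35.0, low_setpoint_c=27.0):
    """Binary control: warm setpoint for a lightly used zone, cool otherwise.

    utilization_samples holds the zone utilization readings (0.0 to 1.0)
    collected over the monitoring window, e.g. the last hour.
    """
    average = mean(utilization_samples)
    return high_setpoint_c if average < low_threshold else low_setpoint_c

print(choose_hvac_setpoint_c([0.10, 0.20, 0.15]))   # lightly used zone
print(choose_hvac_setpoint_c([0.80, 0.90, 0.85]))   # busy zone
```

A real controller would apply the chosen setpoint and then wait for a fixed period, or for the DC thermals to stabilize, before evaluating the utilization again.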