
TAPO: Thermal-Aware Power Optimization
Techniques for Servers and Data Centers
Wei Huang
© 2011 IBM Corporation
IBM Research – Austin, Wei Huang et al.
Overview
Team work at IBM Research:
Austin: Malcolm Allen-Ware, John Carter, Mootaz Elnozahy, Tom
Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
T. J. Watson: Hendrik Hamann
Objective: Power optimization of an entire system (e.g., server, DC), with explicit consideration of cooling power
Hierarchical Techniques:
Server-level power (TAPO-server):
Fan power vs. leakage power
Goal: minimize aggregate fan+leakage power
Prototyped on a POWER 750 Express server (POWER7-based).
Datacenter-level power (TAPO-dc):
HVAC power vs. server fan power
Goal: minimize aggregate HVAC+server power
Analysis based on realistic models
Background
Thermal setpoints are fixed
Server temperature setpoint, e.g. 70C for POWER7 processors
Data Center (DC) HVAC chiller setpoint (chilled water), e.g. 10C
System dynamics are not considered, which can be power-inefficient: overcooling wastes cooling power.
Cooling-related power components
DC HVAC power (chiller, blower, etc.)
Comparable to IT power
Characteristics: a warmer environment allows a higher chiller setpoint and therefore lower chiller power
Server fan power:
Has been counted as part of IT power, but really should be considered separately
PUE is not an accurate indicator
Strongly superlinear (~ quadratic or cubic) relationship to fan speed
Server (processor) leakage power:
Strongly temperature dependent
To reduce leakage, want more server fan power to cool chips down
Overview
Team work:
Austin: Wei Huang, Malcolm Allen-Ware, John Carter, Mootaz Elnozahy,
Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
Watson: Hendrik Hamann
Objective: Optimize power and/or performance of an entire system
(e.g., server, DC), with explicit consideration of cooling power
Hierarchical Techniques:
Server-level power (TAPO-server):
Fan power vs. leakage power
Goal: minimize aggregate fan+leakage power
Prototyped on a POWER 750 Express server.
Datacenter-level power (TAPO-dc):
HVAC power vs. server fan power
Goal: minimize aggregate HVAC+server power
Analysis based on realistic models
TAPO-server
Optimize server fan + processor leakage power: what is the power saving potential?
Manual characterization:
POWER7-based server
Turbo frequency (3.864 GHz), CPU-intensive workload, L2-resident, 32 SMT4 cores
[Figure: normalized total server power vs. fan speed (RPM). Annotations: default system operating point (settles at ~70C); optimal operating point (5.4% peak power saving); frequency drop due to overheating.]
[Figure: normalized server fan power vs. server fan speed (RPM), with fitted curve $y = 5\times10^{-8}x^{2} - 0.0002x + 0.2602$. Annotation: 70% fan power reduction (2% processor power increase).]
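The quadratic fit shown on the fan power plot can be evaluated directly. The Python sketch below plugs a fan speed into that fit. The function name and the 2000 to 6000 RPM validity range (read off the plot's x-axis ticks) are assumptions made for illustration, not part of the slides.

```python
def normalized_fan_power(rpm: float) -> float:
    """Evaluate the slide's quadratic fit of normalized fan power vs. fan speed."""
    # Assumption: the fit is only meaningful over the plotted x-axis range.
    if not 2000 <= rpm <= 6000:
        raise ValueError("fit is only plotted for 2000-6000 RPM")
    return 5e-08 * rpm**2 - 0.0002 * rpm + 0.2602


if __name__ == "__main__":
    for rpm in (2000, 3000, 4000, 5000, 6000):
        print(f"{rpm} RPM -> {normalized_fan_power(rpm):.4f}")
```

By this fit, slowing the fans from 6000 to 4000 RPM lowers normalized fan power from about 0.86 to about 0.26, which illustrates how strongly fan power depends on speed.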
Search for optimal thermal setpoint in TAPO-server
Change processor thermal setpoint (Tthr)
Indirectly change fan speed
On the curve:
Left: fan speed low, more thermal-induced leakage power
Right: system is cool, but more fan power
Monitor ∆P and ∆T between Time 1 and Time 2, then adjust thermal setpoint:
∆P>0, ∆T>0 → decrease Tthr to raise fan speed
∆P<0, ∆T>0 → increase Tthr to reduce fan speed
[Figure: server power vs. fan RPM curve with sample points at Time 1 and Time 2; diagram of air flow from the fan through the server.]
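The adjustment rule on this slide can be written as a small decision function. This is a minimal sketch, not the prototype's code: the function name, the 1C step size, and the handling of the cooling case (∆T<0, which the slide does not list) are assumptions.

```python
def next_setpoint(t_thr: float, d_power: float, d_temp: float, step: float = 1.0) -> float:
    """Return the next thermal setpoint from changes seen between two samples.

    d_power: change in total server power (Time 2 minus Time 1), in watts.
    d_temp:  change in processor temperature (Time 2 minus Time 1), in C.
    step:    hypothetical setpoint increment in C (not given on the slide).
    """
    if d_power == 0 or d_temp == 0:
        return t_thr  # nothing to learn from this pair of samples
    if d_temp > 0:
        # From the slide: warming that raised power means leakage dominates,
        # so lower Tthr (fans speed up); warming that lowered power means the
        # fan savings won, so raise Tthr (fans slow down).
        return t_thr - step if d_power > 0 else t_thr + step
    # Assumption (not on the slide): mirror the rule when the system cooled.
    return t_thr + step if d_power > 0 else t_thr - step
```

For example, `next_setpoint(70.0, d_power=3.0, d_temp=1.0)` lowers the setpoint by one step, while `next_setpoint(70.0, d_power=-3.0, d_temp=1.0)` raises it.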
TAPO-server discussions
Power convergence threshold: 5 watts.
Sampled every 32 ms.
Depends entirely on measurements; no models involved.
Reduces peak power at peak performance.
Saves ~5% of peak power; a perfect solution would have saved 5.4%.
No observed performance loss (frequency and voltage are fixed).
Regardless of workload, chip variations, and environment, TAPO-server should adaptively find the optimal point.
Slow convergence: must wait long enough (30 seconds to 2 minutes) for temperature to settle after each fan speed change.
For safety, there is an upper limit on the thermal threshold (if exceeded, DVFS is used to prevent a thermal emergency).
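To show how the 5 W convergence threshold and the upper safety limit might fit into a measurement-only search, here is a hedged Python sketch. It uses a plain "keep moving while power drops, otherwise reverse" hill climb rather than the prototype's actual controller. The toy power model, the 75C limit, the 1C step, and all names are hypothetical.

```python
CONVERGENCE_W = 5.0    # from the slide: power convergence threshold
T_UPPER_LIMIT = 75.0   # assumption: the slide gives no value for the safety limit


def toy_server_power(t_thr: float) -> float:
    """Hypothetical stand-in for a settled power measurement (watts)."""
    return 900.0 + 4.0 * (t_thr - 73.0) ** 2


def search_setpoint(measure, t_start: float, step: float = 1.0, max_moves: int = 20) -> float:
    """Hill-climb the thermal setpoint using only power measurements."""
    t_thr, power = t_start, measure(t_start)
    direction = 1.0  # start by letting the chip run warmer
    for _ in range(max_moves):
        candidate = min(T_UPPER_LIMIT, t_thr + direction * step)
        if candidate == t_thr:
            direction = -direction  # pinned at the safety limit: try cooler
            continue
        # A real controller would wait here (30 s to 2 min) for the
        # temperature to settle before trusting the power reading.
        new_power = measure(candidate)
        d_power = new_power - power
        if d_power < 0:
            t_thr, power = candidate, new_power  # improvement: keep the move
        else:
            direction = -direction  # got worse: search the other way
        if abs(d_power) < CONVERGENCE_W:
            break  # change is below the convergence threshold
    return t_thr


if __name__ == "__main__":
    print(search_setpoint(toy_server_power, 70.0))
```

With this toy model, a search started at 70C stops at 73C after three moves, because the third move changes power by less than 5 W.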
TAPO-server results
[Figure: four panels plotted against time (0 to 40 minutes): normalized server fan power, normalized total processor power, normalized total server power, and processor thermal setpoint (C, 70 to 75).]
A newly prototyped model-based control method reduces convergence time to ~1 minute
TAPO-dc
Tradeoff between HVAC power and server fan power
Use chilled water setpoint to adjust HVAC power
Based on published component power models
Two chiller designs (COP 3.0-6.0 and 4.1-5.5)
T_inlet = T_chiller + 10C
Server inlet temperature range: 20C ~ 40C
[Figure: HVAC and IT power diagram; normalized server fan power vs. server fan speed (RPM), with fitted curve $y = 5\times10^{-8}x^{2} - 0.0002x + 0.2602$.]
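The relation T_inlet = T_chiller + 10C and the two COP ranges can be combined into a rough chiller power estimate. The sketch below is illustrative only. The slides do not say how COP varies across its range, so linear interpolation between the range endpoints over the 20C to 40C inlet range is an assumption, as are all function names. Dividing heat load by COP follows from the definition of COP.

```python
def chiller_setpoint_c(t_inlet_c: float) -> float:
    """Slide relation T_inlet = T_chiller + 10C, solved for the chiller setpoint."""
    if not 20.0 <= t_inlet_c <= 40.0:
        raise ValueError("slides consider inlet temperatures of 20C to 40C")
    return t_inlet_c - 10.0


def chiller_cop(t_inlet_c: float, cop_min: float, cop_max: float) -> float:
    """Assumption: COP rises linearly from cop_min (20C inlet) to cop_max (40C inlet)."""
    frac = (chiller_setpoint_c(t_inlet_c) - 10.0) / 20.0
    return cop_min + frac * (cop_max - cop_min)


def chiller_power_kw(heat_load_kw: float, t_inlet_c: float,
                     cop_min: float, cop_max: float) -> float:
    """Chiller electrical power = heat removed / COP (definition of COP)."""
    return heat_load_kw / chiller_cop(t_inlet_c, cop_min, cop_max)


if __name__ == "__main__":
    # Hypothetical 12 kW heat load with the wide-COP design (3.0 to 6.0).
    for t_inlet in (20.0, 30.0, 40.0):
        print(t_inlet, round(chiller_power_kw(12.0, t_inlet, 3.0, 6.0), 2))
```

Under these assumptions, raising the inlet temperature from 20C to 40C halves the chiller power for the wide-COP design. That saving is the HVAC side of the tradeoff against higher server fan power.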
TAPO-dc results
Assuming a rack of ten POWER 750 Express servers
Fully utilized DC cooling zone
[Figure: "Fully Utilized Datacenter". Normalized total power vs. server inlet temperature (20C to 40C). Bars: IT power w/o fan, IT fan power, blower power, and chiller power, each for the wide and narrow chiller COP designs. Lines: normalized total power for wide COP and narrow COP.]
TAPO-dc results (cont’d)
10% utilized DC cooling zone
[Figure: "10% Utilized Datacenter". Normalized total power (bars and lines) vs. server inlet temperature (20C to 40C).]
TAPO-dc results (cont’d)
60% utilized DC cooling zone
[Figure: "60% Utilized Datacenter". Normalized total power (bars and lines) vs. server inlet temperature (20C to 40C).]
TAPO-dc control method
No single thermal setpoint is optimal
Dynamically searching for the optimal point is not tractable
Thermal mass, HVAC complexity
Binary control, based on utilization level
Monitor average utilization level of a DC cooling zone (e.g. over 1 hour)
Utilization level is low? (e.g. <50%)
Y: Set HVAC setpoint to high temperature (e.g. 35C)
N: Set HVAC setpoint to low temperature (e.g. 27C)
Wait for fixed period or for stabilization of DC thermals (e.g. 1 hour)
[Figure: power saving w.r.t. worst case vs. datacenter utilization level (0 to 1), comparing binary and optimal control for the wide and narrow chiller COP designs.]
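The binary control flowchart reduces to a threshold test on average utilization. The Python sketch below uses the slide's example values (50% threshold, 35C and 27C setpoints) as defaults. The function names and the sample data are hypothetical, and it assumes the flowchart's "Y" branch leads to the high setpoint.

```python
from statistics import mean


def hvac_setpoint_c(avg_utilization: float, threshold: float = 0.5,
                    high_c: float = 35.0, low_c: float = 27.0) -> float:
    """Pick the HVAC setpoint from a cooling zone's average utilization (0 to 1)."""
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    # Low utilization: servers need less cooling, so a warm setpoint saves HVAC power.
    return high_c if avg_utilization < threshold else low_c


def control_step(utilization_samples: list[float]) -> float:
    """One control decision from samples gathered over the monitoring window."""
    return hvac_setpoint_c(mean(utilization_samples))


if __name__ == "__main__":
    print(control_step([0.10, 0.15, 0.05]))  # lightly loaded zone
    print(control_step([0.70, 0.90, 0.80]))  # heavily loaded zone
```

A real deployment would call `control_step` once per monitoring window (e.g. hourly) and then wait for the DC thermals to stabilize before the next decision.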
Conclusions and Ongoing work
Finding the right thermal setpoint helps save total system power, without a performance hit
TAPO-server and TAPO-dc
Ongoing work
Prototype TAPO-dc in a real data center
Make TAPO-server converge faster
Understand the delicate interactions between the two techniques
Warmer ambient from TAPO-dc makes TAPO-server more valuable
TAPO-server lowers server fan power, favoring TAPO-dc with a warmer chiller setpoint to reduce HVAC power.
Reliability concerns of server components running at slightly hotter temperatures
Thank you. Questions?
More materials…
Overview
Team work:
Austin: Wei Huang, Malcolm Allen-Ware, John Carter, Mootaz Elnozahy, Tom Keller, Charles Lefurgy, Jian Li, Karthick Rajamani, Juan Rubio
Watson: Hendrik Hamann
Objective: Optimize power and/or performance of an entire system (e.g., server, DC), with explicit consideration of cooling power
Hierarchical Techniques:
Server-level power (TAPO-server):
Fan power vs. leakage power
Goal: minimize aggregate fan+leakage power
Prototyped on a P7 HV32 server.
Datacenter-level power (TAPO-dc):
HVAC power vs. server fan power
Goal: minimize aggregate HVAC+server power
Server-level performance (TAPO-shift):
Load imbalance in different cooling zones
State of the art can't fully exploit power shifting from an idle zone to an active zone, due to thermal limitations
Goal: maximize active zone performance, within power and thermal budgets
TAPPO-shift
Power shifting:
Idea: Shift unused power budget in underutilized parts to boost performance of highly utilized parts
Total power constraint, thermal constraint
Shifting among cooling zones. Examples: socket to socket, server to server, rack to rack, DC zone to DC zone, etc.
Limitations:
Each cooling zone is designed independently, without cooling capability for significantly more power
On the other hand, server processors can be overclocked by ~25% above nominal, but this is hard to achieve in reality due to thermal limits
TAPPO-shift
Solution: over-provisioned cooling capacity (by a large margin) in each cooling zone
Cost is small: better/more fans
Benefit: higher performance (e.g. processor can run at much higher frequency with shifted power)
Within the same overall power budget across cooling zones, no thermal violation
TAPPO-shift: an illustrative example of power shifting
Case A: one zone, balanced, fully loaded
Case B: one zone, unbalanced
Case C: two zones, unbalanced, shifting
[Figure: diagrams of one zone with a shared bank of 2N fans and of two zones (Zone 1, Zone 2), each with its own N "turbo" fans; bars of fan power and component power (C1, C2) against the server power budget for Cases A, B, and C.]
TAPPO-shift results
Use P7 power-frequency relationship (cubic)
Use P7 HV32 system power and fan power (almost cubic in RPM)
4 sockets divided into two cooling zones (each has separate fan control and better fans)
Potentially 16% higher than P7 Turbo frequency
[Figure: "Power scaling with DVFS (4 early samples)". Normalized socket power (0% to 250%) vs. normalized socket frequency (0% to 140%), fit with R^2 = 0.9982.]
[Figure: relative socket frequency gain (0 to 0.18) vs. portion of zone computing power shifting (0 to 1.2).]
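A cubic power-frequency relationship bounds how much frequency a shifted power budget can buy. The sketch below assumes a pure P ∝ f^3 law and ignores the extra fan power that the slides' analysis includes, so it is an optimistic simplification and is not expected to reproduce the 16% figure. The function name and its parameter are hypothetical.

```python
def relative_frequency_gain(shifted_fraction: float) -> float:
    """Frequency gain when a socket's power budget grows by shifted_fraction.

    Assumes power scales with the cube of frequency, so a budget of
    (1 + s) times the original allows a frequency of (1 + s)**(1/3)
    times the original. Extra fan power is ignored (assumption).
    """
    if shifted_fraction < 0:
        raise ValueError("shifted_fraction must be non-negative")
    return (1.0 + shifted_fraction) ** (1.0 / 3.0) - 1.0


if __name__ == "__main__":
    for s in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        print(f"shift {s:.1f} -> frequency gain {relative_frequency_gain(s):.3f}")
```

The cube root explains why the gain curve flattens: each additional watt shifted buys less frequency than the one before it.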
Combined TAPPO techniques – qualitative example
Two DC cooling zones: Zone 1 is 80% utilized, Zone 2 is 10% utilized
Workload migration to make Zone 2 idle
Observations:
Migration itself does not save power, but turning off the idle zone does!
TAPPO-dc and -server can save about 9% power in this example
Combined with TAPPO-shift, can boost active zone utilization by 10% with about the same power
[Figure: "Migrating 80% and 10% utilized cooling zones in a datacenter". Bars: normalized total power (left Y-axis, 0.9 to 1.02). Lines: zone 1 and zone 2 utilization (right Y-axis, 0% to 120%). Steps: no migration, migration, add TAPPO-dc, add TAPPO-server, add TAPPO-shift.]