
Building Adaptive Performance
Models for Dynamic Resource
Allocation in Cloud Data Centers
Jin Chen
University of Toronto
Joint work with Gokul Soundararajan and Prof. Cristiana Amza.
Today’s Cloud
Pay for your resources!
small, large, extra large instances
data stored, I/O rate, data transferred
$8 per user
Dream: QoS Cloud
Pay for your performance!
Average query latency < 1 sec
Customer: Our workload is usually stable, but there will be a few unpredicted peak times.
Cloud Admin: How many resources should we provision dynamically?
Cloud Admin: What If Question
What is the performance of this application given 2 CPUs, 4 GB RAM, 300 IOPS?
Challenges
• Performance interference
  - Between consolidated workloads
  - Increases with the # of resources, e.g. multi-level cache
  Uncontrolled resource sharing affects performance
• Increasingly complicated systems
  - System behavior varies, e.g. pre-fetching, background operations, ...
• Workload itself causes variance/noise
Our Solution
• Build predictable systems in the cloud
  - Partition critical resources: CPU, memory, network, storage
• Build performance models for each app
  - Answer what-if questions on-line
  - Consolidate workloads in the cloud
• Dynamically allocate resources
• Monitor and correct performance models
Build Predictable System in Cloud
Partition and allocate critical resources dynamically:
  - Database servers: CPU, DB buffer pool
  - Storage array: storage cache, storage bandwidth
  - Switch/router: network bandwidth
Applications will have minimal interference from each other!
Performance Model for Application
[Figure: 3D surface of average query latency over buffer pool size and storage cache size, ranging from high latency (small caches) to low latency (large caches)]
Multi-level Resource Allocation
[Figure: two latency surfaces over buffer pool size and storage cache size, illustrating allocation trade-offs across the two cache levels]
Our Goal
• Challenges of building performance models
  - Large number of configurations to predict (CPU, buffer pool size, network, storage cache, ...)
  - Increasingly complicated systems
  - Acceptable accuracy required
• Build a multi-level performance model
  - In a short amount of time
  - Adapting to system changes
Outline
• Overview of different types of models
  - Analytical models
  - Black box models
• Our approach: Chorus
• Experimental results
Different Types of Models
Analytical:
  - Extensive domain knowledge required
  - Fast to get results
  - Acceptable accuracy
  - Difficult to adapt
Black Box:
  - Minimum domain knowledge required
  - May take a long time
  - Higher accuracy
  - Easy to adapt
Black Box Performance Models
[Figure: 3D surface of average query latency over buffer pool size and storage cache size (256-1024 MB each), as learned by a black box model]
Outline
• Overview of different types of models
• Our approach: Chorus
  - Exploit incomplete expert knowledge
  - Leverage individual models
  - Automate the modeling process
  - Optimizations
• Experimental results
Exploit Partial Expert Knowledge
A spectrum of models: Analytical → Gray Box → Black Box
Expert Knowledge: Cache Inclusiveness
[Figure: a DB buffer pool stacked on a storage cache, both managed by LRU; blocks cached in the buffer pool are duplicated in the storage cache below it]
Expert Knowledge: Cache Inclusiveness
[Figure: the same two-level LRU hierarchy; because of inclusiveness, hits are served by the buffer pool while the storage cache holds mostly duplicate blocks]
Approximate Single Cache Model (LRU)
[Figure: the two-level LRU hierarchy approximated as a single LRU cache of the larger size]
Gray Box Multi-level Cache Model
f(CPU, buffer pool size, storage cache size, storage bandwidth)
Gray box model: ignore the smaller cache size
f(CPU, bigger cache size, storage bandwidth)
Greatly reduces the # of configurations to predict!
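The inclusiveness argument can be sketched in a few lines of Python (a toy simulation, not the authors' code; block ids and sizes are made up): with LRU at both levels and the storage cache seeing only buffer-pool misses, an uncoordinated storage cache no larger than the buffer pool ends up duplicating its contents.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache over block ids; access() returns True on a hit."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block):
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)      # refresh recency
        else:
            self.blocks[block] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict the LRU block
        return hit

def hierarchy_ios(trace, bp_size, sc_size):
    """Disk I/Os for a DB buffer pool stacked on a storage cache:
    the storage cache only sees the buffer pool's misses."""
    bp, sc = LRUCache(bp_size), LRUCache(sc_size)
    ios = 0
    for block in trace:
        if not bp.access(block):        # buffer pool miss
            if not sc.access(block):    # storage cache miss too: disk I/O
                ios += 1
    return ios
```

When the storage cache is no bigger than the buffer pool, the two levels converge on the same hot blocks and the second level contributes almost no extra hits, which is why the gray box model can keep only the bigger cache size.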
Gray Box Curve Fitting Model
Average latency Ld as a function of disk bandwidth ρd:
  Analytical: Ld(ρd) = Ld(1) / ρd
  Gray box:   Ld(ρd) = α / ρd^β
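The gray box form keeps the analytical model's inverse shape but fits α and β to measurements. A minimal sketch (hypothetical sample data; taking logs turns the fit into linear least squares):

```python
import math

def fit_inverse_power(samples):
    """Fit Ld(rho) = alpha / rho**beta by least squares on logs:
    log L = log(alpha) - beta * log(rho) is linear in log(rho)."""
    xs = [math.log(rho) for rho, _ in samples]
    ys = [math.log(lat) for _, lat in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta = -slope
    alpha = math.exp(my + beta * mx)   # intercept of the log-log line
    return alpha, beta
```

With β fixed at 1 and α at Ld(1) this collapses back to the analytical model; freeing both parameters lets the curve absorb system behavior the queueing formula misses.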
Outline
• Overview of different types of models
• Our approach: Chorus
  - Exploit incomplete expert knowledge: gray box model
  - Leverage individual models
  - Automate the modeling process
  - Optimizations
• Experimental results
Build Performance Model
Measuring all configurations exhaustively: 11 days!
[Figure: query latency surface over buffer pool size (MB), storage cache size (256-1024 MB), and disk bandwidth]
Leverage Individual Models
Building individual per-resource models instead: 5 days! 3 days (manually)!
[Figure: separate latency curves for CPU, disk bandwidth, buffer pool size, and storage cache size]
Have we simplified the management problem?
Iterative Training
[Figure: candidate models are trained per region and refined if necessary]
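The per-region ensemble step can be sketched as follows (a simplified illustration, not the Chorus implementation; the model names and error function are stand-ins): every candidate predicts each region, and the region keeps whichever model predicts it best.

```python
def select_per_region(regions, candidates, error):
    """Per-region ensemble selection: keep, for each region, the candidate
    model with the lowest validation error. Regions where even the best
    error is high are the ones the iterative loop would refine with more
    training samples."""
    return {region: min(candidates,
                        key=lambda name: error(candidates[name], samples))
            for region, samples in regions.items()}
```

This is what lets a linear model win in one region of the configuration space while an SVM wins in another, instead of forcing one model to fit everywhere.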
Optimizations of Chorus
• Train models from history
  - Train a new workload from similar saved workloads' models
  - Similarity test; only train regions with low accuracy
• Prune configurations
  - Find the boundary configurations meeting the SLA
  - Cut configurations with fewer resources than the boundary ones
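The pruning step can be sketched as follows (a hypothetical illustration, assuming latency only improves with more of each resource): a configuration with no more of any resource than some boundary configuration, which is a minimal allocation meeting the SLA, cannot meet the SLA either, so it needs no prediction.

```python
def prune_below_boundary(configs, boundary):
    """Drop configurations dominated by a boundary configuration:
    if every resource amount is <= a boundary config's (and the configs
    differ), the candidate cannot meet the SLA under the monotonicity
    assumption and is removed from the prediction set."""
    def below(c):
        return any(all(ci <= bi for ci, bi in zip(c, b)) and c != b
                   for b in boundary)
    return [c for c in configs if not below(c)]
```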
Outline
• Overview of different types of models
• Our approach: Chorus
  - Exploit incomplete expert knowledge: gray box model
  - Leverage individual models: ensemble learning
  - Automate the modeling process: iterative training
  - Optimizations: history and pruning
• Experimental results
Evaluation Platform
Chorus Composition
• Gray box models
  - G-LR: region-based linear model
  - G-INV: inverse-shape curve fitting model
• Analytical models
  - CPU and DISK
• Black box models
  - B-SVM: support vector machine regression
  - B-CR: uses the per-region average as the prediction
Evaluate Prediction Accuracy
Accuracy: the percentage of good predictions, i.e. those within a small number of standard deviations of the measured mean.
e.g. 3/5 = 60%
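The accuracy criterion can be written down directly (a sketch; the exact threshold k is an assumption based on the 0~1 and 1~2 STD bands in the charts):

```python
def accuracy(predictions, means, stds, k=2.0):
    """Fraction of predictions landing within k standard deviations of
    the measured mean (the 'good prediction' criterion)."""
    good = sum(abs(p - m) <= k * s
               for p, m, s in zip(predictions, means, stds))
    return good / len(predictions)
```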
Orion Workload: Accuracy of Predictions
[Figure: accuracy (0-100%, split into 0~1 STD and 1~2 STD bands) of G-LR, G-INV, B-CR, B-SVM, and Chorus at 15%, 30%, and 60% of total training samples]
Orion Workload: Composition of Chorus
[Figure: share of regions won by G-LR, G-INV, B-CR, and B-SVM within Chorus at 15%, 30%, and 60% of total training samples]
TPC-C Workload: Accuracy of Predictions
[Figure: accuracy (0-100%, split into 0~1 STD and 1~2 STD bands) of G-LR, G-INV, B-CR, B-SVM, and Chorus at 15%, 30%, and 60% of total training samples]
TPC-C Workload: Composition of Chorus
[Figure: share of regions won by G-LR, G-INV, B-CR, and B-SVM within Chorus at 15%, 30%, and 60% of total training samples]
TPC-W Workload: Accuracy of Predictions
[Figure: accuracy (0-100%, split into 0~1 STD and 1~2 STD bands) of the CPU model, the DISK model, and Chorus at 15%, 30%, and 60% of total training samples]
TPC-C Resource Allocation
[Figure: TPC-C latency (ms) under the Chorus, Equal, and IDEAL* allocation schemes, compared against the SLO]
Conclusion
• Keys to implementing a QoS Cloud
  - Build predictable systems
  - Build accurate performance models
• Our Chorus builds accurate models by:
  - Exploiting partial domain knowledge: gray box models
  - Leveraging individual models: ensemble learning
  - Training from scratch and from history
• Chorus allocates resources properly in a QoS Cloud
Lessons and Future Work
• A QoS Cloud is still difficult
  - Some workloads are hard to model with high accuracy
  - May relax application goals, e.g. allow larger variations
• Feedback loop between the admin and Chorus
  - Chorus suggests a new model, and the admin verifies it
  - The admin adds a new model, and Chorus verifies it
End.
Thanks!