Building Adaptive Performance Models for Dynamic Resource Allocation in Cloud Data Centers
Jin Chen, University of Toronto
Joint work with Gokul Soundararajan and Prof. Cristiana Amza

Today's Cloud
Pay for your resources!
• Small, large, extra-large instances
• Data stored, I/O rate, data transferred
• $8 per user

Dream: QoS Cloud
Pay for your performance!
• e.g., average query latency < 1 sec

Customer: "Our workload is usually stable, but there will be a few unpredicted peak times."
Cloud Admin: "How many resources should we provision dynamically?"

The Cloud Admin's What-If Question
What is the performance of this application given 2 CPUs, 4 GB RAM, and 300 IOPS?

Challenges
• Performance interference
- Between consolidated workloads
- Grows with the number of shared resources, e.g., multi-level caches
- Uncontrolled resource sharing affects performance
• Increasingly complicated systems
- System behavior varies, e.g., pre-fetching, background operations, ...
• The workload itself causes variance/noise

Our Solution
• Build predictable systems in the cloud
- Partition critical resources: CPU, memory, network, storage
• Build performance models for each application
- Answer what-if questions on-line
• Consolidate workloads in the cloud
- Dynamically allocate resources
- Monitor and correct the performance models

Build a Predictable System in the Cloud
Partition and allocate critical resources dynamically:
• Database servers: CPU, DB buffer pool
• Switch/router: network bandwidth
• Storage array: storage cache, storage bandwidth
Applications will have minimal interference with each other!

Performance Model for an Application
[Figure: 3D surface of average latency over buffer pool size and storage cache size, ranging from low to high latency]

Multi-level Resource Allocation
[Figure: two latency surfaces over buffer pool size and storage cache size, illustrating allocation trade-offs across resource levels]

Our Goal
• Challenges of building performance models
- Large number of configurations over which to predict performance (CPU, buffer pool size, network, storage cache, ...)
- Increasingly complicated systems
- Acceptable accuracy required
• Build a multi-level performance model that is
- Trained in a short amount of time
- Able to adapt to system changes

Outline
• Overview of different types of models
- Analytical models
- Black box models
• Our approach: Chorus
• Experimental results

Different Types of Models
• Analytical models
- Extensive domain knowledge required
- Fast to get results
- Acceptable accuracy
- Difficult to adapt
• Black box models
- Minimal domain knowledge required
- May take a long time to train
- Higher accuracy
- Easy to adapt

Black Box Performance Models
[Figure: measured latency surface over buffer pool size (MB) and storage cache size (MB), from low to high latency]

Outline
• Overview of different types of models
• Our approach: Chorus
- Exploit incomplete expert knowledge
- Leverage individual models
- Automate the modeling process
- Optimizations
• Experimental results

Exploit Partial Expert Knowledge
Analytical → Gray Box → Black Box

Expert Knowledge: Cache Inclusiveness
[Figure: I/O blocks flowing through the DB buffer pool (LRU) and the storage cache (LRU); with inclusiveness, blocks held in the buffer pool are also present in the storage cache]

Approximate Single Cache Model (LRU)
[Figure: the two-level LRU hierarchy approximated as a single, larger LRU cache serving the same I/O stream]

Gray Box Multi-level Cache Model
• Full model: f(CPU, buffer pool size, storage cache size, storage bandwidth)
• Gray box model: ignore the smaller cache size
- f(CPU, bigger cache size, storage bandwidth)
• Greatly reduces the number of configurations to predict!

Gray Box Curve Fitting Model
• Analytical: L_d(ρ_d) = L_d(1) / ρ_d
• Gray box: L_d(ρ_d) = α · ρ_d^β
[Figure: average latency vs. disk bandwidth, with the analytical and fitted gray-box curves]

Outline
• Overview of different types of models
• Our approach: Chorus
- Exploit incomplete expert knowledge: gray box model
- Leverage individual models
- Automate the modeling process
- Optimizations
• Experimental results

Build Performance Model
11 days!
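The gray-box curve fit above, L_d(ρ_d) = α · ρ_d^β, can be estimated from a handful of measurements by plain least squares in log-log space, since log L = log α + β log ρ. A minimal sketch; the sample points are invented for illustration and this is not the talk's actual implementation:

```python
import math

# Hypothetical measurements: (disk-bandwidth share rho_d, avg. latency in ms).
# These numbers are made up for illustration only.
samples = [(0.25, 40.0), (0.5, 21.0), (0.75, 14.5), (1.0, 10.0)]

# Fit L_d(rho_d) = alpha * rho_d**beta via linear regression in log-log space.
xs = [math.log(rho) for rho, _ in samples]
ys = [math.log(lat) for _, lat in samples]
n = len(samples)
mx, my = sum(xs) / n, sum(ys) / n
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
       sum((x - mx) ** 2 for x in xs)
alpha = math.exp(my - beta * mx)

def predict_latency(rho_d: float) -> float:
    """Gray-box prediction of average latency at disk-bandwidth share rho_d."""
    return alpha * rho_d ** beta
```

With only four training points the fitted β comes out close to -1 here, matching the analytical inverse-bandwidth shape; `predict_latency(0.5)` then interpolates the latency at half the disk bandwidth.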
[Figure: query latency surface over buffer pool size (MB), storage cache size (MB), and disk bandwidth, from low to high latency]

Leverage Individual Models
5 days! 3 days (manually)!
[Figure: the query latency surface split into regions, each covered by an individual model over CPU, buffer pool size, storage cache size, and disk bandwidth]
Have we simplified the management problem?

Iterative Training
• Refine if necessary, per region

Optimizations of Chorus
• Train models from history
- Train a new workload from similar saved workloads' models
- Similarity test; only re-train regions with low accuracy
• Prune configurations
- Find the boundary configurations meeting the SLA
- Cut configurations with fewer resources than the boundary ones

Outline
• Overview of different types of models
• Our approach: Chorus
- Exploit incomplete expert knowledge: gray box model
- Leverage individual models: ensemble learning
- Automate the modeling process: iterative training
- Optimizations: history and pruning
• Experimental results

Evaluation Platform

Chorus Composition
• Gray box models
- G-LR: region-based linear model
- G-INV: inverse-shape curve-fitting model
• Analytical models
- CPU and DISK
• Black box models
- B-SVM: support vector machine regression
- B-CR: uses the per-region average as the prediction

Evaluate Prediction Accuracy
• Predictions are compared against the measured mean
• Accuracy: the percent of good predictions, e.g.
3/5 = 60%

Orion Workload
[Figure: prediction accuracy of G-LR, G-INV, B-CR, B-SVM, and Chorus on the Orion workload with 15%, 30%, and 60% of the total training samples; bars show predictions within 0-1 and 1-2 standard deviations of the measured mean]

Composition of Chorus (Orion Workload)
[Figure: fraction of the configuration space covered by G-LR, G-INV, B-CR, and B-SVM within Chorus at 15%, 30%, and 60% of the total training samples]

TPC-C Workload
[Figure: prediction accuracy of G-LR, G-INV, B-CR, B-SVM, and Chorus on the TPC-C workload with 15%, 30%, and 60% of the total training samples]

Composition of Chorus (TPC-C Workload)
[Figure: fraction of the configuration space covered by each individual model within Chorus on TPC-C]

TPC-W Workload
[Figure: prediction accuracy of the CPU and DISK analytical models vs. Chorus on the TPC-W workload with 15%, 30%, and 60% of the total training samples]

TPC-C Resource Allocation
[Figure: average TPC-C latency (ms) relative to the SLO under Chorus, an equal-allocation scheme, and IDEAL*]

Conclusion
• Key to implementing a QoS cloud
- Build predictable systems
- Build accurate performance models
• Chorus builds accurate models
- Exploits partial domain knowledge: gray box model
- Leverages individual models: ensemble learning
- Works from scratch and from history
- Allocates resources properly in a QoS cloud

Lessons and Future Work
• QoS clouds are still difficult
- Some workloads are hard to model with high accuracy
- May relax application goals, e.g., allow larger variations
• Feedback loop between the admin and Chorus
- Chorus suggests a new model, and the admin verifies it
- The admin adds a new model, and Chorus verifies it

End. Thanks!
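The per-region ensemble idea at the heart of Chorus, training several simple models and keeping whichever fits best in each region of the configuration space, can be sketched as follows. The model formulas, the region split, and all numbers are invented for illustration; this is not the actual Chorus implementation:

```python
# Candidate models predicting latency (ms) from a configuration.
# The formulas are hypothetical stand-ins for the talk's G-LR, G-INV, B-CR.
def linear_model(cfg):           # stand-in for G-LR (region-based linear)
    return 50.0 - 0.03 * cfg["cache_mb"]

def inverse_model(cfg):          # stand-in for G-INV (inverse-shape fit)
    return 4000.0 / cfg["cache_mb"]

def mean_model(cfg, mean=25.0):  # stand-in for B-CR (per-region average)
    return mean

candidates = {"G-LR": linear_model, "G-INV": inverse_model, "B-CR": mean_model}

# Training samples: (configuration, measured latency), split into two regions
# of the configuration space by cache size.
samples = [
    ({"cache_mb": 128}, 31.0), ({"cache_mb": 256}, 15.8),  # small-cache region
    ({"cache_mb": 512}, 34.5), ({"cache_mb": 768}, 27.0),  # large-cache region
]
regions = {
    "small": [s for s in samples if s[0]["cache_mb"] <= 256],
    "large": [s for s in samples if s[0]["cache_mb"] > 256],
}

def mean_abs_error(model, region_samples):
    return sum(abs(model(cfg) - measured)
               for cfg, measured in region_samples) / len(region_samples)

# Ensemble step: keep the lowest-error candidate in each region.
best = {
    name: min(candidates,
              key=lambda m: mean_abs_error(candidates[m], region_samples))
    for name, region_samples in regions.items()
}

def chorus_predict(cfg):
    """Route the what-if query to the winning model for its region."""
    region = "small" if cfg["cache_mb"] <= 256 else "large"
    return candidates[best[region]](cfg)
```

With these made-up samples the inverse model wins the small-cache region and the linear model wins the large-cache region, mirroring how Chorus composes different individual models across the configuration space.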