Scalable Distributed Decision Trees in Spark MLlib
Manish Amde, Origami Logic
Hirakendu Das, Yahoo! Labs
Evan Sparks, UC Berkeley
Ameet Talwalkar, UC Berkeley

Who is this guy?
• Ph.D. in ECE from UC San Diego
• Currently, data science at Origami Logic
• Origami Logic
  • Search-based marketing intelligence platform
  • Works with global brands on large, unstructured marketing datasets
  • Uses Spark for analytics

Overview
• Decision Tree 101
• Distributed Decision Trees in MLlib
• Experiments
• Ensembles
• Future work

Supervised Learning
• Train: learn a model from instances with features and known labels
• Predict: apply the model to new instances whose labels are unknown

Classification / Regression
• Classification: labels denote classes
• Regression: labels denote real numbers

Car mileage from 1971!

  horsepower   weight   mileage
  95           low      low
  90           low      low
  70           low      high
  86           low      high
  76           high     low
  88           high     low

Learn a model to predict the mileage (binary classification!)

Let's Learn: Rule 1
• If weight is high, mileage is low

Find Best Split
• Training data labels: {high: 2, low: 4}
• Split candidates: (weight, {low, high}), (hp, {70, 76, 86, 88, 90, 95})
• Candidate split "weight == high":
  • Yes: labels {high: 0, low: 2}
  • No: labels {high: 2, low: 2}
• Choose the split that causes the maximum reduction in label variability, i.e., the split that maximizes "information gain"
• On the "yes" branch no further increase in information gain is possible, so it becomes a Low Mileage leaf
• The "no" branch still has mixed labels, so we still have work to do there

Let's Learn: Rule 2
• Remaining instances (weight = low):

  horsepower   weight   mileage
  95           low      low
  90           low      low
  70           low      high
  86           low      high

• Candidate thresholds on horsepower: 70, 86, 90, 95
• If horsepower <= 86, mileage is high. Else, it's low.
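To make the split scoring concrete, here is a small, self-contained Scala sketch (illustrative only, not MLlib code) that computes entropy-based information gain for Rule 1 on the full toy dataset and for Rule 2 on the remaining weight-low instances; the object and helper names (ToyInfoGain, entropy, infoGain) are invented for this example.

```scala
object ToyInfoGain {
  // (horsepower, weight, mileage) rows from the toy dataset above.
  val data = Seq(
    (95, "low", "low"), (90, "low", "low"), (70, "low", "high"),
    (86, "low", "high"), (76, "high", "low"), (88, "high", "low"))

  // Entropy of a multiset of labels: -sum(p * log2 p).
  def entropy(labels: Seq[String]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Information gain of a boolean split: parent entropy minus the
  // size-weighted entropy of the two children.
  def infoGain(rows: Seq[(Int, String, String)],
               pred: ((Int, String, String)) => Boolean): Double = {
    val (yes, no) = rows.partition(pred)
    val n = rows.size.toDouble
    entropy(rows.map(_._3)) -
      (yes.size / n) * entropy(yes.map(_._3)) -
      (no.size / n) * entropy(no.map(_._3))
  }

  def main(args: Array[String]): Unit = {
    val isHighWeight: ((Int, String, String)) => Boolean = _._2 == "high"
    val hpAtMost86: ((Int, String, String)) => Boolean = _._1 <= 86
    // Rule 1: score the "weight == high" split on all six instances.
    println(f"gain(weight == high)            = ${infoGain(data, isHighWeight)}%.3f")
    // Rule 2: score "horsepower <= 86" on the remaining weight == low instances.
    val remaining = data.filterNot(isHighWeight)
    println(f"gain(horsepower <= 86 | subset) = ${infoGain(remaining, hpAtMost86)}%.3f")
  }
}
```

Applying Rule 1 and then Rule 2 greedily yields the tree below.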
Mileage Classification Tree
• Root: labels {high: 2, low: 4}
• Weight == High?
  • Yes: Low Mileage (labels {high: 0, low: 2})
  • No: Horsepower <= 86? (labels {high: 2, low: 2})
    • Yes: High Mileage (labels {high: 2, low: 0})
    • No: Low Mileage (labels {high: 0, low: 2})

Let's predict

  horsepower   weight   mileage prediction
  90           high     low                  Correct!
  80           low      high                 Correct!
  70           high     low                  Wrong!

Complex in Practice
• (Rendered example: a tree learned on a real dataset of 569 instances, with impurity values and label counts at every node; it is already hard to read at depth two.)

Why Decision Trees?
• Easy to interpret
• Handle categorical variables
• (Multi-class) classification and regression
• No feature scaling
• Capture non-linearities and feature interactions
• Handle missing values
• Ensembles are top performers

Dataset: Single Machine
• Typically, the dataset is loaded in memory as a matrix or data frame
• Perform multiple passes over the data
• R, scikit-learn, …

Distributed Dataset
• Data is partitioned across machines, e.g.:
  • Partition 1: (95, low, low), (90, low, low)
  • Partition 2: (70, low, high), (86, low, high)
  • Partition 3: (76, high, low), (88, high, low)
• One option: learn a model per partition and combine the models
  • Does not work well for all data partitionings
  • Still need inter-machine communication to combine the models

Distributed Dataset: Prior Work
• Hadoop MapReduce
  • No implementations when we started
  • Currently: RHadoop, Oryx, 0xdata, …
• PLANET
  • Decision trees using MapReduce
  • Not open source
  • We extend it with several optimizations
• Spark
  • Designed for iterative machine learning
  • No tree support in its initial versions

Split Candidates for a Distributed Implementation
• Split candidates for continuous features:
  • Costly to find all unique feature values
  • Sorted splits desirable for fast computation
  • High cardinality of splits leads to significant computation and communication overhead
• Use approximate quantiles (percentiles by default), e.g. the median (2-quantile) of the horsepower values {70, 86, 88, 90, 95}

Typical MapReduce Implementation: Algorithm
• flatMap: input is an instance; output is a list of (split, label) pairs
• reduceByKey: input is (split, list of labels); output is (split, label histogram)
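A rough Spark sketch of this naive algorithm (not the actual MLlib source): every instance emits a (split, label) pair for each candidate it satisfies, and reduceByKey collapses the pairs into per-split label histograms. The Instance case class, the split-key encoding, and the threshold list are assumptions made for the illustration.

```scala
import org.apache.spark.SparkContext._  // pair-RDD functions such as reduceByKey (Spark 1.x idiom)
import org.apache.spark.rdd.RDD

// Toy instance type for the mileage example (names are illustrative).
case class Instance(hp: Double, weightHigh: Boolean, label: String)

object NaiveSplitStats {
  // Split candidates known in advance: one categorical feature, several thresholds.
  val hpThresholds = Seq(70.0, 76.0, 86.0, 88.0, 90.0, 95.0)

  // flatMap step: each instance emits a (splitKey, label) pair for every
  // candidate split whose "left" side it falls into.
  def emit(inst: Instance): Seq[(String, String)] = {
    val weightKey = if (inst.weightHigh) "weight==high" else "weight==low"
    val hpPairs = hpThresholds.filter(inst.hp <= _).map(t => (s"hp<=$t", inst.label))
    (weightKey, inst.label) +: hpPairs
  }

  // reduceByKey step: collapse the emitted pairs into per-split label histograms.
  def splitHistograms(data: RDD[Instance]): RDD[(String, Map[String, Long])] =
    data.flatMap(emit)
      .map { case (split, label) => (split, Map(label -> 1L)) }
      .reduceByKey { (a, b) =>
        (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
      }
}
```

The example below traces these two steps for a single instance.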
Typical MapReduce Implementation: Example
• Instance: hp = 76, weight = high, mileage = low
• flatMap emits one (split, label) pair per split candidate the instance satisfies:
  ((weight, high), low), ((hp, 76), low), ((hp, 86), low), ((hp, 88), low), ((hp, 90), low), ((hp, 95), low)
• reduceByKey then collects the labels per split and builds histograms:
  (weight, high): [low, low], giving {low: 2, high: 0}
  (weight, !high): {low: 2, high: 2}

Typical MapReduce Implementation: Issues
• For k features, m splits/feature and n instances, the map operation emits O(k·m·n) values per best-split computation at a node
• Communication overhead
• Can we do better?

Avoiding Map in MapReduce
• The map operation is essential when the keys are not known in advance (e.g., words in word count)
• Here the splits are known in advance
• No map:
  • avoids object creation overhead
  • avoids communication overhead due to the shuffle

Optimization 1: Aggregate (Distributed Fold)
• Start from an empty statistics array
• Each RDD partition folds its instances into a partial-statistics aggregate (Agg 1, Agg 2, …, Agg N)
• The partial aggregates are then reduced into a single set of sufficient statistics

Sufficient Statistics
• Left and right child node statistics for each split
• Classification: label counts
• Regression: count, sum, sum²

Optimization 2: Binning
• The split candidates (e.g., hp thresholds 70, 86, 90) divide each continuous feature into bins
• For each instance, update only the label histogram of the bin its feature value falls into (e.g., hp = 95 with mileage = low increments the low count of the right-most bin)
• The statistics for every candidate split are then recovered from the bin histograms
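A rough sketch of how Optimizations 1 and 2 combine (again illustrative, not the MLlib source): a single RDD.aggregate pass folds every instance into a flat array of per-(feature, bin, label) counts, locating each bin with a binary search over that feature's sorted thresholds. Labels are assumed to be 0-based class indices, and the names BinnedAggregate and binnedCounts are invented for the example.

```scala
import java.util.Arrays
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object BinnedAggregate {
  /**
   * One pass over the data that builds label counts per (feature, bin).
   * thresholds(f) holds the sorted split candidates of feature f, so
   * feature f has thresholds(f).length + 1 bins.
   */
  def binnedCounts(data: RDD[LabeledPoint],
                   thresholds: Array[Array[Double]],
                   numClasses: Int): Array[Long] = {
    val numFeatures = thresholds.length
    val numBins = thresholds.map(_.length + 1)
    val offsets = numBins.scanLeft(0)(_ + _)   // start index of each feature's block of bins
    val size = offsets.last * numClasses

    // Each task gets its own copy of the zero array, so in-place updates are safe.
    def seqOp(agg: Array[Long], p: LabeledPoint): Array[Long] = {
      var f = 0
      while (f < numFeatures) {
        // Binary search for the first threshold >= the feature value.
        val raw = Arrays.binarySearch(thresholds(f), p.features(f))
        val bin = if (raw >= 0) raw else -raw - 1
        agg((offsets(f) + bin) * numClasses + p.label.toInt) += 1L
        f += 1
      }
      agg
    }

    def combOp(a: Array[Long], b: Array[Long]): Array[Long] = {
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    }

    data.aggregate(new Array[Long](size))(seqOp, combOp)
  }
}
```

Because the aggregated array is only (features × bins × classes) long, cumulative sums over the bins recover the statistics for every candidate split cheaply.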
Bin-based Info Gain
• m split candidates per feature
• Binning with binary search: log(m) work per instance and feature instead of m
• Updating a single bin histogram instead of every split's histogram: a further factor-of-m savings
• Significant savings in computation

Optimization 3: Level-wise training
• The split candidates for all nodes at the same level (Node 1; then Nodes 2 and 3; …) are computed in a single pass over the cached dataset
• Perform level-wise training of nodes

Level-wise training
• L passes over the data instead of 2^L − 1 for a full tree of depth L
• Depth 4: 4 passes instead of 15
• Depth 10: 10 passes instead of 1023

MLlib decision tree features
• Binary classification and regression (1.0)
• Categorical variable support (1.0)
• Arbitrarily deep trees (1.1)
• Multiclass classification* (under review for 1.1)
• Sample weights* (under review for 1.1)
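For reference, this is roughly how those features are exercised through MLlib's public RDD API (the DecisionTree.trainClassifier helper available from around the 1.1 release); the app name, local master, and the toy label/feature encoding below are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object TrainMileageTree {
  def main(args: Array[String]): Unit = {
    // Local mode just for the example.
    val sc = new SparkContext(new SparkConf().setAppName("mileage-tree").setMaster("local[*]"))

    // Toy data: label (0.0 = low mileage, 1.0 = high mileage), features = (hp, weight).
    // weight is categorical with 2 values: 0.0 = low, 1.0 = high.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(95.0, 0.0)), LabeledPoint(0.0, Vectors.dense(90.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(70.0, 0.0)), LabeledPoint(1.0, Vectors.dense(86.0, 0.0)),
      LabeledPoint(0.0, Vectors.dense(76.0, 1.0)), LabeledPoint(0.0, Vectors.dense(88.0, 1.0))))

    // Arguments: (input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins).
    val model = DecisionTree.trainClassifier(data, 2, Map(1 -> 2), "gini", 3, 32)

    // Predict mileage for a new car: hp = 80, weight = low.
    println(model.predict(Vectors.dense(80.0, 0.0)))
    sc.stop()
  }
}
```

The categoricalFeaturesInfo map is how the categorical-variable support listed above is declared: feature index to number of categories.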
Strong Scaling Experiment
• Baseline: 20M samples with 20 features on 2 workers
• (Plot: speedup vs. number of machines against the ideal 8×) measured speedups of roughly 1.84× on 4 machines, 3.7× on 8, and 7.22× on 16

Strong Scaling Results
• Synthetic datasets
• 10 to 50 million instances
• 10 to 50 features
• 2 to 16 machines
• 700 MB to 18 GB datasets
• Average speedup from 2 to 16 machines was 6.6×

Large-scale experiment
• 0.5 billion instances, 20 features, 90 GB dataset
• (Plot: training time in seconds, log scale, vs. 30 to 600 machines for tree depths 3, 5 and 10)
• Works on large datasets
• Deeper trees require more computation
• Computation vs. communication trade-off

Tree Ensembles
• Decision trees are building blocks
• Boosting
  • sequential
  • sample weights
• Random Forests
  • parallel construction
  • level-wise training extension to multiple trees
• AdaBoost wrapper

Future Work
• Ensembles (stretch goal for 1.1)
• Feature importances
• Decision tree visualizations
• Testing over a variety of user datasets

Requests
• (slide reprises the car mileage dataset)

Thanks!
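Supplementary sketch for the Ensembles section above: a minimal bagging-style wrapper that trains several independent MLlib trees on bootstrap samples and takes a majority vote. This only illustrates the parallel-construction idea; it is not MLlib's random forest or the AdaBoost wrapper, and the BaggedTrees object and its parameters are invented for the example.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

object BaggedTrees {
  /** Train numTrees classifiers, each on its own bootstrap sample of the data. */
  def train(data: RDD[LabeledPoint], numTrees: Int, numClasses: Int,
            maxDepth: Int): Seq[DecisionTreeModel] = {
    (0 until numTrees).map { i =>
      val sample = data.sample(withReplacement = true, fraction = 1.0, seed = i.toLong)
      // Arguments: (input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins).
      DecisionTree.trainClassifier(sample, numClasses, Map[Int, Int](), "gini", maxDepth, 32)
    }
  }

  /** Majority vote over the individual tree predictions. */
  def predict(models: Seq[DecisionTreeModel], features: Vector): Double =
    models.map(_.predict(features)).groupBy(identity).maxBy(_._2.size)._1
}
```

Each tree here is trained as its own distributed job, one after another, whereas the talk's random forests extend the level-wise training itself to multiple trees.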