Optimization Strategies for Decision Trees on Spark

Scalable Distributed Decision Trees in Spark MLlib

Manish Amde, Origami Logic
Hirakendu Das, Yahoo! Labs
Evan Sparks, UC Berkeley
Ameet Talwalkar, UC Berkeley
Who is this guy?
• Ph.D. in ECE from UC San Diego
• Currently, data science at Origami Logic
• Origami Logic
  • Search-based marketing intelligence platform
  • Work with global brands on large, unstructured marketing datasets
  • Using Spark for analytics
Overview
• Decision Tree 101
• Distributed Decision Trees in MLlib
• Experiments
• Ensembles
• Future work
Supervised Learning
[Figure: "Train" — rows of features with known labels; "Predict" — a new row of features with an unknown label (?)]
Classification / Regression
• Classification: labels denote classes
• Regression: labels denote real numbers
Car mileage from 1971!

horsepower | weight | mileage
        95 |    low |     low
        90 |    low |     low
        70 |    low |    high
        86 |    low |    high
        76 |   high |     low
        88 |   high |     low

Learn a model to predict the mileage (binary classification!)
Let’s Learn: Rule 1

horsepower | weight | mileage
        95 |    low |     low
        90 |    low |     low
        70 |    low |    high
        86 |    low |    high
        76 |   high |     low
        88 |   high |     low

If weight is high, mileage is low
Find Best Split

Training data labels: {high: 2, low: 4}
Split candidates: (weight, {low, high}), (hp, {70, 76, 86, 88, 90, 95})

Choose the split that causes the maximum reduction in label variability,
i.e., the split that maximizes “information gain”.

Best split: Weight == High
  Yes → labels: {high: 0, low: 2} → no further information gain possible → Low Mileage
  No  → labels: {high: 2, low: 2} → still have work to do here…
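A minimal sketch of the information-gain computation described above, using entropy as the impurity (MLlib also supports Gini); the function names are illustrative, not MLlib API:

```scala
// Entropy of a label histogram (counts per class), in nats.
def entropy(counts: Seq[Double]): Double = {
  val total = counts.sum
  counts.filter(_ > 0).map { c => val p = c / total; -p * math.log(p) }.sum
}

// Information gain of a split, given parent and child label histograms.
def infoGain(parent: Seq[Double], left: Seq[Double], right: Seq[Double]): Double =
  entropy(parent) -
    (left.sum / parent.sum) * entropy(left) -
    (right.sum / parent.sum) * entropy(right)

// Split "weight == high" on the mileage data (histograms as (high, low) counts):
// infoGain(Seq(2, 4), Seq(0, 2), Seq(2, 2))  // ≈ 0.17 nats
```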
Let’s Learn: Rule 2

Remaining rows (weight == low):

horsepower | weight | mileage
        95 |    low |     low
        90 |    low |     low
        70 |    low |    high
        86 |    low |    high

[Figure: horsepower number line — 70, 86 | 90, 95]

If horsepower <= 86, mileage is high. Else, it’s low.
Mileage Classification Tree

Weight == High?                       labels: {high: 2, low: 4}
├─ Yes → Low Mileage                  labels: {high: 0, low: 2}
└─ No  → Horsepower <= 86?            labels: {high: 2, low: 2}
         ├─ Yes → High Mileage        labels: {high: 2, low: 0}
         └─ No  → Low Mileage         labels: {high: 0, low: 2}
Let’s predict

horsepower | weight | mileage prediction
        90 |   high | low    (Correct!)
        80 |    low | high   (Correct!)
        70 |   high | low    (Wrong!)
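The learned tree written out as plain code (an illustrative helper, not MLlib output), applying the two rules above in order:

```scala
// Rule 1, then Rule 2.
def predictMileage(horsepower: Double, weightIsHigh: Boolean): String =
  if (weightIsHigh) "low"            // Rule 1: high weight → low mileage
  else if (horsepower <= 86) "high"  // Rule 2: low horsepower → high mileage
  else "low"

predictMileage(90, weightIsHigh = true)   // "low"
predictMileage(80, weightIsHigh = false)  // "high"
predictMileage(70, weightIsHigh = true)   // "low"
```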
Complex in Practice

[Figure: a decision tree learned on a real dataset (569 samples, two classes); internal nodes test features such as X[0] <= 15.045 and X[1] <= 19.61, and each node reports its impurity, sample count, and per-class counts]
Why Decision Trees?
• Easy to interpret
• Handle categorical variables
• (Multi-class) classification and regression
• No feature scaling
• Capture non-linearities and feature interactions
• Handle missing values
• Ensembles are top performers
Overview
• Decision Tree 101
• Distributed Decision Trees in MLlib
• Experiments
• Ensembles
• Future work
Dataset: Single Machine
• Typically, the dataset is loaded in memory as a matrix or dataframe
• Perform multiple passes over the data
• R, scikit-learn, …
[Figure: the full mileage table resident on a single machine]
Distributed Dataset

[Figure: the mileage table partitioned row-wise across multiple machines, each partition holding a few (hp, weight, mileage) rows]
Distributed Dataset

Learn multiple models and combine them?

[Figure: a separate model trained on each partition of the table]

• Does not work well for all data partitionings
• Still need inter-machine communication to combine models
Distributed Dataset
• Hadoop MapReduce
  • No implementations when we started
  • Currently: RHadoop, Oryx, 0xdata, …
• PLANET
  • Decision trees using MapReduce
  • Not open source
  • We extend it with several optimizations
• Spark
  • Iterative machine learning
  • No tree support in initial versions
Split Candidates for Distributed Implementation
• Split candidates for continuous features
  • Costly to find all unique feature values
  • Sorted splits desirable for fast computation
  • High cardinality of splits leads to significant computation and communication overhead
• Approximate quantiles (percentiles by default) — see the sketch below

[Figure: horsepower values 70, 86, 88, 90, 95 on a number line, with the median (2-quantile) marked]
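A minimal sketch of quantile-based split candidates for one continuous feature, assuming a hypothetical `candidateSplits` helper that samples the column rather than sorting it in full; an illustration, not the MLlib internals:

```scala
import org.apache.spark.rdd.RDD

// Pick approximate quantile boundaries of a feature column as split candidates.
def candidateSplits(values: RDD[Double], numBins: Int,
                    sampleFraction: Double = 0.01): Array[Double] = {
  // Sort a small sample instead of all unique values.
  val sample = values.sample(withReplacement = false, sampleFraction).collect().sorted
  val step = math.max(1, sample.length / numBins)
  // Every step-th sampled value approximates a quantile (e.g., the median for 2 bins).
  (step until sample.length by step).map(sample(_)).distinct.toArray
}
```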
Typical MapReduce Implementation: Algorithm

flatMap
  input:  instance
  output: list of (split, label)

reduceByKey
  input:  split, list of labels
  output: split, label histogram
Typical MapReduce Implementation: Example

Instance: hp = 76, weight = high, mileage = low

flatMap emits one (split, label) pair for every split the instance satisfies:
  (weight, high), low
  (hp, 76), low
  (hp, 86), low
  (hp, 88), low
  (hp, 90), low
  (hp, 95), low

reduceByKey collects the labels per split and builds label histograms:
  (weight, high), [low, low]  →  (weight, high), {low: 2, high: 0}
  (weight, !high)             →  (weight, !high), {low: 2, high: 2}
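A minimal Spark sketch of this naive approach on the toy data, assuming an illustrative `Instance` case class (not the MLlib representation):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark 1.x
import org.apache.spark.rdd.RDD

case class Instance(hp: Double, weightIsHigh: Boolean, label: String)

// Emit one (split, label) pair per satisfied split, then histogram by key.
def naiveHistograms(data: RDD[Instance],
                    hpCandidates: Seq[Double]): RDD[(String, Map[String, Int])] = {
  data.flatMap { x =>
    val hpPairs = hpCandidates.filter(x.hp <= _).map(t => (s"hp<=$t", x.label))
    val weightPair = (if (x.weightIsHigh) "weight==high" else "weight!=high", x.label)
    hpPairs :+ weightPair
  }.map { case (split, label) => (split, Map(label -> 1)) }
   .reduceByKey { (a, b) =>
     (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap
   }
}
```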
Typical MapReduce Implementation: Issues
• For k features, m splits/feature, and n instances, the map operation emits O(k*m*n) values per best-split computation at a node
• Communication overhead
• Can we do better?
Avoiding Map in MapReduce
• Map operation is essential when the keys are not known
  • e.g., words in word count
• Splits are known in advance
• No map
  • avoids object creation overhead
  • avoids communication overhead due to shuffle
Optimization 1: Aggregate (Distributed Fold)

[Figure: each RDD partition (1 … N) folds its rows into a partition-local statistics array (Agg 1 … Agg N), starting from an empty array; the partial statistics are then reduced into a single sufficient-statistics array]
Sufficient Statistics
• Left and right child node statistics for each split
• Classification: label counts
• Regression: count, sum, sum^2
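A minimal sketch of collecting these statistics with a single RDD.aggregate pass (classification case: per-(feature, bin, class) counts). `binIndex` is an assumed helper, and the flat-array layout is illustrative rather than MLlib's actual one:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// One pass over the data: fold instances into a flat counts array, then merge partitions.
def aggregateStats(data: RDD[LabeledPoint], numFeatures: Int, numBins: Int,
                   numClasses: Int, binIndex: (Int, Double) => Int): Array[Long] = {
  val zero = new Array[Long](numFeatures * numBins * numClasses)
  data.aggregate(zero)(
    // seqOp: add one instance to the partition-local statistics
    (stats, p) => {
      var f = 0
      while (f < numFeatures) {
        val b = binIndex(f, p.features(f))
        stats(f * numBins * numClasses + b * numClasses + p.label.toInt) += 1L
        f += 1
      }
      stats
    },
    // combOp: element-wise sum of partial statistics across partitions
    (a, b) => { var i = 0; while (i < a.length) { a(i) += b(i); i += 1 }; a }
  )
}
```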
Optimization 2: Binning

hp | mileage
95 |     low
90 |     low
70 |    high
86 |    high

[Figure: the horsepower axis divided into bins by the split candidates 70, 86, 90; each instance updates the label histogram of the single bin it falls into (e.g., bin (86, 90]: {low: 1, high: 0}); the left/right label histograms for every split are then obtained by accumulating bin histograms from the ends (e.g., hp <= 86: {low: 0, high: 2}, hp > 86: {low: 2, high: 0})]
Bin-based Info Gain
• m splits per feature
• Binning using binary search: log(m) versus m
• Bin histogram update: factor-of-m savings
• Significant savings in computation (see the sketch below)
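A minimal sketch of the binary-search binning step, assuming sorted split thresholds (an illustration, not the MLlib code):

```scala
// Return the bin index of `value` given sorted thresholds: O(log m) comparisons
// instead of scanning all m split candidates.
def findBin(value: Double, thresholds: Array[Double]): Int = {
  var lo = 0
  var hi = thresholds.length   // bin i covers values <= thresholds(i); bin m is the overflow bin
  while (lo < hi) {
    val mid = (lo + hi) / 2
    if (value <= thresholds(mid)) hi = mid else lo = mid + 1
  }
  lo
}

// findBin(88.0, Array(70.0, 86.0, 90.0))  // bin 2, i.e. 86 < 88 <= 90
```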
Optimization 3: Level-wise training

[Figure: the split candidates for Node 1, and then for Nodes 2 and 3 together, are evaluated against the same cached dataset]

Perform level-wise training of nodes
Level-wise training
• L passes instead of 2^L − 1 for a full tree
• Depth 4: 4 passes instead of 15
• Depth 10: 10 passes instead of 1023
• …

[Figure: the 15 nodes of a full binary tree of depth 4, numbered 1–15]
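A minimal sketch of the level-wise loop shape, with hypothetical `statsForNodes` and `growChildren` helpers standing in for the statistics aggregation and split selection (not the actual MLlib internals):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

case class NodeId(id: Int)

// One pass over the cached data per level, computing statistics for all nodes
// at that level at once, instead of one pass per node.
def trainLevelWise(
    data: RDD[LabeledPoint],
    maxDepth: Int,
    statsForNodes: (RDD[LabeledPoint], Seq[NodeId]) => Map[NodeId, Array[Long]],
    growChildren: (NodeId, Array[Long]) => Seq[NodeId]): Unit = {
  data.cache()                                 // reuse the dataset across levels
  var frontier: Seq[NodeId] = Seq(NodeId(1))   // start at the root
  for (_ <- 0 until maxDepth if frontier.nonEmpty) {
    val stats = statsForNodes(data, frontier)  // a single aggregate pass for the whole level
    frontier = frontier.flatMap(n => growChildren(n, stats(n)))
  }
}
```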
MLlib decision tree features
• Binary classification and regression (1.0)
• Categorical variable support (1.0)
• Arbitrarily deep trees (1.1)
• Multiclass classification* (under review for 1.1)
• Sample weights* (under review for 1.1)
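For reference, a small end-to-end run on the mileage data using the public MLlib API; the `trainClassifier` signature shown here is the one from the 1.x line, so treat the exact parameter list as an assumption for older releases:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object MileageTree {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "mileage-tree")
    // Label 1.0 = high mileage, 0.0 = low; feature 0 = horsepower, feature 1 = weight (0 = low, 1 = high).
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(95, 0)), LabeledPoint(0.0, Vectors.dense(90, 0)),
      LabeledPoint(1.0, Vectors.dense(70, 0)), LabeledPoint(1.0, Vectors.dense(86, 0)),
      LabeledPoint(0.0, Vectors.dense(76, 1)), LabeledPoint(0.0, Vectors.dense(88, 1))))
    val model = DecisionTree.trainClassifier(
      data,
      2,            // numClasses
      Map(1 -> 2),  // feature 1 (weight) is categorical with 2 values
      "gini",       // impurity
      3,            // maxDepth
      4)            // maxBins
    println(model.predict(Vectors.dense(80, 0)))  // hp = 80, weight = low
    sc.stop()
  }
}
```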
Overview
• Decision Tree 101
• Distributed Decision Trees in MLlib
• Experiments
• Ensembles
• Future work
Strong Scaling Experiment

[Figure: speedup vs. number of machines (2, 4, 8, 16) against the ideal linear speedup; baseline: 20M samples with 20 features on 2 workers; measured speedups of 1 (baseline), 1.84, 3.7, and 7.22 at 2, 4, 8, and 16 machines]
Strong Scaling Results
• Synthetic dataset
• 10 to 50 million instances
• 10 to 50 features
• 2 to 16 machines
• 700 MB to 18 GB dataset
• Average speedup from 2 to 16 machines was 6.6X!
Large-scale experiment

0.5 billion instances, 20 features, 90 GB dataset

[Figure: training time in seconds (log scale, ~100 to 10,000) vs. number of machines (30 to 600) for trees of depth 3, 5, and 10]

• Works on large datasets
• Deep trees require more computation
• Computation vs. communication trade-off
Overview
• Decision Tree 101
• Distributed Decision Trees in MLlib
• Experiments
• Ensembles
• Future work
Tree Ensembles
• Decision trees are building blocks
• Boosting
  • sequential
  • sample weights
• Random Forests
  • parallel construction
  • level-wise training extension to multiple trees
AdaBoost wrapper
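A hedged sketch of what such a wrapper could look like (discrete AdaBoost over MLlib trees). Since sample weights are still under review, weighted training is approximated here by resampling; this is an illustration, not the planned MLlib ensemble API:

```scala
import org.apache.spark.SparkContext._   // double/pair-RDD implicits for Spark 1.x
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Returns (alpha, tree) pairs; the final prediction is the alpha-weighted vote of the trees.
def adaBoost(data: RDD[LabeledPoint], rounds: Int): Seq[(Double, DecisionTreeModel)] = {
  val n = data.count().toDouble
  var weighted = data.map(p => (p, 1.0 / n))
  (1 to rounds).map { _ =>
    // Approximate weighted training by replicating points proportionally to their weights.
    val sample = weighted.flatMap { case (p, w) => Seq.fill(math.max(1, math.round(w * n).toInt))(p) }
    val tree = DecisionTree.trainClassifier(sample, 2, Map.empty[Int, Int], "gini", 3, 32)
    // Weighted training error of this round's tree.
    val err = math.max(1e-10,
      weighted.map { case (p, w) => if (tree.predict(p.features) != p.label) w else 0.0 }.sum())
    val alpha = 0.5 * math.log((1 - err) / err)
    // Boost the weights of misclassified points, then renormalize.
    val updated = weighted.map { case (p, w) =>
      (p, w * math.exp(if (tree.predict(p.features) != p.label) alpha else -alpha)) }
    val z = updated.map(_._2).sum()
    weighted = updated.map { case (p, w) => (p, w / z) }
    (alpha, tree)
  }
}
```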
Overview
• Decision Tree 101
• Distributed Decision Trees in MLlib
• Experiments
• Ensembles
• Future work
Future Work
• Ensembles (stretch goal for 1.1)
• Feature importances
• Decision tree visualizations
• Testing over a variety of user datasets
Requests

[Figure: the mileage table]
Thanks