
Scalable and Numerically Stable Descriptive Statistics in SystemML
Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald
IBM Almaden Research Center
© 2012 IBM Corporation
SystemML
• Pervasive need to enable machine learning (ML) on massive datasets
• Increasing interest in implementing ML algorithms on MapReduce
• Directly implementing ML algorithms on MapReduce is challenging
• Solution: SystemML – a declarative approach to machine learning on MapReduce
SystemML Overview
Language: high-level language with ML-specific constructs
• Syntax is similar to R and Matlab
• Matrix, vector, and scalar data types
• Linear algebra and mathematical operators

Example: PageRank in SystemML's language
  G = read("in/G", rows=1e6, cols=1e6);
  p = read("in/p", rows=1e6, cols=1);
  e = read("in/e", rows=1e6, cols=1);
  ut = read("in/ut", rows=1, cols=1e6);
  alpha = 0.85;
  max_iteration = 20;
  i = 0;
  while (i < max_iteration) {
    p = alpha * (G %*% p) + (1 - alpha) * (e %*% ut %*% p);
    i = i + 1;
  }

Optimizations: based on data and system characteristics
• Cost-based and rule-based optimization
• Job generation heuristics
• …

Scalable Operator Implementations: scalable implementations on Hadoop
• Handle large sparse data sets
• Parallelization is transparent to end users

[Figure: a SystemML program compiles into a sequence of MapReduce jobs MR1, MR2, …, MRn executed on Hadoop]

See the SystemML paper at ICDE 2011 for details.
Example ML Algorithms Supported in SystemML
• Classification: linear SVMs, logistic regression
• Regression: linear regression
• Matrix Factorization: NMF, SVD, PCA
• Clustering: k-means
• Ranking: PageRank, HITS
• Data Exploration: Descriptive Statistics (the focus of this talk)
  • Univariate Statistics:
    • Scale: Sum, Mean, Harmonic mean, Geometric mean, Min, Max, Range, Median, Quantile,
      Interquartile mean, Variance, Standard deviation, Coefficient of variation, Central moment,
      Skewness, Kurtosis, Standard error of mean, Standard error of skewness, Standard error of kurtosis
    • Categorical: Mode, Per-category frequencies
  • Bivariate Statistics:
    • Scale-scale: Covariance, Pearson correlation
    • Scale-categorical: Eta, ANOVA F measure
    • Categorical-categorical: Chi-square coefficient, Cramer's V, Spearman correlation
How hard is it?
• Seemingly trivial to implement in MapReduce: most descriptive statistics can be written in a summation form, e.g.
    Sum:      s = Σ xi
    Mean:     µ = (Σ xi) / n
    Variance: var = (Σ xi² − (Σ xi)² / n) / (n − 1)
• Pitfall: these straightforward implementations can lead to disasters in numerical accuracy
• The problem gets worse with increasing volumes of data
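To make the pitfall concrete, here is a small Python sketch (ours, not from the talk; function names are ours) contrasting the stable two-pass variance with the textbook one-pass summation-form rewrite, on data with a large mean and small spread:

  import random

  def variance_two_pass(xs):
      mean = sum(xs) / len(xs)                    # pass 1: the mean
      return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)   # pass 2

  def variance_textbook_one_pass(xs):
      n, s, sq = len(xs), 0.0, 0.0
      for x in xs:
          s += x                                  # running sum
          sq += x * x                             # running sum of squares
      return (sq - s * s / n) / (n - 1)           # catastrophic cancellation here

  random.seed(0)
  data = [1_000_000.0 + 0.5 * random.random() for _ in range(10_000_000)]
  print(variance_two_pass(data))           # close to the true variance (~0.0208)
  print(variance_textbook_one_pass(data))  # loses most digits; may even be negative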
Background: Floating Point Numbers
• Source of inaccuracy: finite-precision arithmetic on floating point numbers
• Floating point number system F ⊂ R: y = ±β^e × 0.d1d2…dt, with 0 ≤ di ≤ β − 1
  • base β
  • precision t
  • exponent range emin ≤ e ≤ emax
• IEEE double: β = 2, t = 53, emin = −1021, emax = 1024
• Floating point numbers are not equally spaced
• Example number system: β = 2, t = 3, emin = −1, emax = 3
  F = { 0, ±0.25, ±0.3125, ±0.375, ±0.4375, ±0.5, ±0.625, ±0.75, ±0.875, ±1.0,
        ±1.25, ±1.5, ±1.75, ±2.0, ±2.5, ±3.0, ±3.5, ±4.0, ±5.0, ±6.0, ±7.0 }
[Figure: the members of F plotted on a number line from 0.0 to 7.0; the spacing between adjacent values doubles with each power of 2]
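As a sanity check, a short Python sketch (ours, not from the talk) that enumerates the normalized values ±0.d1d2d3 × 2^e of this toy system reproduces the set F above exactly:

  # Enumerate the toy system: beta = 2, t = 3, emin = -1, emax = 3.
  # Normalized mantissas 0.d1d2d3 with d1 = 1, i.e. 0.5, 0.625, 0.75, 0.875.
  values = {0.0}
  for e in range(-1, 4):                  # emin = -1 .. emax = 3
      for bits in range(4, 8):            # d1d2d3 = 100 .. 111
          y = (bits / 8.0) * 2 ** e       # ±(0.d1d2d3) × 2^e
          values.update({y, -y})
  print(sorted(v for v in values if v >= 0))
  # [0.0, 0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75, 0.875, 1.0,
  #  1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0]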
Background: Numerical Accuracy
Example number system: β = 2, t = 3, emin = −1, emax = 3
F = { 0, ±0.25, ±0.3125, ±0.375, ±0.4375, ±0.5, ±0.625, ±0.75, ±0.875,
      ±1.0, ±1.25, ±1.5, ±1.75, ±2.0, ±2.5, ±3.0, ±3.5, ±4.0, ±5.0, ±6.0, ±7.0 }

Relative error = |x − x'| / |x|, where x is the true value and x' the floating point value.

• Round-off: 6.375 and 5.625
  • true values: 6.375 and 5.625; computed: 6.0 and 6.0
  • relative errors: 0.059 and 0.067
• Big number + many small numbers (pitfall for summation): 5.0 + 0.25 + 0.25 + … + 0.25 (eight 0.25s)
  • true value: 7.0; computed: 5.0, because each 5.0 + 0.25 = 5.25 rounds back to 5.0
  • relative error: 0.286
• Subtraction of two big, similar numbers (catastrophic cancellation): 6.375 − 5.625
  • true value: 0.75; computed: 0, because 6.375 and 5.625 both round to 6.0
  • relative error: 1.0
Background: Numerical Stability
• Algebraically equivalent algorithms for the same calculation produce different results on digital computers
  • some damp out errors, some magnify them
• Sum: 5.0 + 0.25 + 0.25 + … + 0.25 (eight 0.25s), in the example number system
  • Naïve recursive (original order): 5.0 + 0.25 = 5.25 rounds back to 5.0 at every step, so the computed sum is 5.0 ≠ 7.0
  • Sorted recursive (ascending order): the eight 0.25s first accumulate to 2.0 exactly, then 2.0 + 5.0 = 7.0 = the true value
• Numerical stability is a desirable property of numerical algorithms: algorithms that can be proven not to magnify approximation errors are called numerically stable
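The same order-dependence is easy to reproduce with IEEE doubles; a minimal Python sketch (ours):

  import math

  xs = [1e16] + [1.0] * 10        # one big number plus ten small ones

  naive = 0.0
  for x in xs:                    # naive recursive, original order:
      naive += x                  # each +1.0 is absorbed (ulp near 1e16 is 2.0)

  asc = 0.0
  for x in sorted(xs):            # sorted recursive, ascending order:
      asc += x                    # the 1.0s accumulate to 10.0 before meeting 1e16

  print(naive)                    # 1e+16            (all ten 1.0s lost)
  print(asc)                      # 1.000000000000001e+16
  print(math.fsum(xs))            # exactly rounded reference sum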
Importance of Numerical Stability
• The numerical stability issue has been largely ignored in big data processing
  • e.g., PIG and HIVE use well-known unstable algorithms for computing some basic statistics
• How about software floating point packages, e.g. BigDecimal?
  • arbitrary precision, but very slow: +, −, × are 2 orders of magnitude slower; ÷ is 5 orders of magnitude slower
• Goal of this talk: share our experience with descriptive statistics algorithms for big data
  • scalable – database people already understand this
  • numerically stable – needs more attention!
Numerically Stable Summation
• Naïve recursive: unstable
• Sorted recursive: better for nonnegative data, but needs an expensive sort
• Kahan: efficient and stable [Kahan 1965]
  • recursive summation with a correction term that compensates for rounding error
  • (s', c') = KahanAdd(s, c, a), where s/s' are the old/new sum, c/c' the old/new correction, and a the number to add:
      a' = a + c
      s' = s + a'
      c' = a' − (s' − s)
  • stability property: the relative error bound is independent of the problem size n, when n is less than O(10^16) for IEEE doubles
  • example trace for 5.0 + 0.25 + 0.25 + … + 0.25: the (sum, correction) pair goes
    (5.0, 0) → (5.0, 0.25) → (6.0, −0.5) → (6.0, −0.25) → (6.0, 0) → …
• MR Kahan in SystemML:
  • mapper: apply Kahan to compute a partial sum and correction
  • reducer: apply Kahan on the partial results to compute the total sum:
      (s, c) = KahanAdd(s1, c1 + c2, s2)
    where s1/s2 are partial sums, c1/c2 partial corrections, and s/c the total sum and correction
  • our proof: the relative error bound is independent of the problem size n, as long as each mapper processes fewer than O(10^16) numbers
• MR Kahan scales to larger problem sizes while preserving numerical accuracy
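A compact Python sketch of both pieces (our illustration; the function names are ours, the update rule is Kahan's and the merge rule is the one above):

  def kahan_add(s, c, a):
      a2 = a + c                  # fold the previous correction into the addend
      s2 = s + a2                 # new (rounded) sum
      c2 = a2 - (s2 - s)          # what the rounding just lost
      return s2, c2

  def kahan_sum(xs):              # one "mapper": partial sum plus correction
      s, c = 0.0, 0.0
      for x in xs:
          s, c = kahan_add(s, c, x)
      return s, c

  def merge(p1, p2):              # the "reducer": (s, c) = KahanAdd(s1, c1 + c2, s2)
      (s1, c1), (s2, c2) = p1, p2
      return kahan_add(s1, c1 + c2, s2)

  parts = [kahan_sum(chunk) for chunk in ([1e16], [1.0] * 10)]
  total, corr = parts[0]
  for p in parts[1:]:
      total, corr = merge((total, corr), p)
  print(total + corr)             # 1.000000000000001e+16; naive addition gives 1e+16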
Numerically Stable Mean
• Naïve sum/count: unstable
• Incremental: stable [Chan et al 1979]
  • given partial counts n1, n2 and partial means µ1, µ2:
      n = n1 + n2
      δ = µ2 − µ1
      µ = µ1 + δ·n2/n
• MR Incremental: adapt Incremental to MapReduce
      n = n1 + n2
      δ = µ2 − µ1
      µ = µ1 ¤ δ·n2/n    (¤ = KahanAdd, maintaining a correction term for µ)
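A minimal Python sketch of the merge (ours, not SystemML code; kahan_add is the update from the summation slide):

  def kahan_add(s, c, a):             # the correction-term update from the previous slide
      a2 = a + c
      s2 = s + a2
      return s2, a2 - (s2 - s)

  def merge_mean(n1, mu1, c1, n2, mu2):
      n = n1 + n2
      delta = mu2 - mu1
      mu, c = kahan_add(mu1, c1, delta * n2 / n)   # the slide's  µ = µ1 ¤ δ·n2/n
      return n, mu, c

  a, b = [1000.0, 1000.5], [1000.25, 1000.75]      # two "mapper" partitions
  n, mu, c = merge_mean(len(a), sum(a) / len(a), 0.0,
                        len(b), sum(b) / len(b))
  print(n, mu + c)                                 # 4 1000.375, the mean of a + b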
Numerically Stable Higher-Order Statistics
• Higher-order statistics: central moment, variance, standard deviation, skewness, kurtosis
  • the core computation is the p-th central moment mp = (1/n) Σ (xi − µ)^p
• 2-Pass: 1st pass to compute the mean µ, 2nd pass to compute mp
  • stable, but needs 2 scans of the data
• Textbook 1-Pass: textbook rewrites of mp in summation form
  • notoriously unstable (due to catastrophic cancellation); can even produce a negative result for mp when p % 2 = 0
  • unfortunately, widely used in practice
• Incremental: stable* and 1 pass [Bennett et al 2009]
  • merge per-partition results, e.g. for p = 2: m2 = m2,1 + m2,2 + δ²·n1·n2/n with δ = µ2 − µ1
  • the accumulations use ¤ (KahanAdd)
• MR Incremental: adapt Incremental to MapReduce, using KahanAdd
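A sketch of the p = 2 case (variance) under this pairwise merge (ours, not SystemML code; plain + is used where the slide applies ¤):

  def partial_m2(xs):                 # per-partition (n, mean, M2), M2 = sum((x - mean)^2)
      n = len(xs)
      mean = sum(xs) / n
      return n, mean, sum((x - mean) ** 2 for x in xs)

  def merge_m2(p1, p2):               # merge two partitions without a second pass
      (n1, mu1, m21), (n2, mu2, m22) = p1, p2
      n = n1 + n2
      delta = mu2 - mu1
      mu = mu1 + delta * n2 / n
      m2 = m21 + m22 + delta * delta * n1 * n2 / n
      return n, mu, m2

  n, mu, m2 = merge_m2(partial_m2([1.0, 2.0, 3.0]), partial_m2([4.0, 5.0]))
  print(m2 / (n - 1))                 # 2.5, the sample variance of [1, 2, 3, 4, 5]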
Numerically Stable Covariance
• 2-Pass: 1st pass to compute the means, 2nd pass to compute the covariance
  • stable, but needs 2 scans of the data
• Textbook 1-Pass: textbook rewrites of the covariance in summation form
  • notoriously unstable (catastrophic cancellation)
  • unfortunately, widely used in practice
• Incremental: practically stable and 1 pass [Bennett et al 2009]
  • merge per-partition co-moments: C = C1 + C2 + δx·δy·n1·n2/n, with δx and δy the differences of the partial means
  • the accumulations use ¤ (KahanAdd)
• MR Incremental: adapt Incremental to MapReduce, using KahanAdd
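The analogous sketch for the co-moment (ours; again plain + stands in for the slide's ¤):

  def partial_comoment(xs, ys):       # per-partition (n, mean_x, mean_y, C)
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      c = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      return n, mx, my, c

  def merge_comoment(p1, p2):
      (n1, mx1, my1, c1), (n2, mx2, my2, c2) = p1, p2
      n = n1 + n2
      dx, dy = mx2 - mx1, my2 - my1
      mx = mx1 + dx * n2 / n
      my = my1 + dy * n2 / n
      c = c1 + c2 + dx * dy * n1 * n2 / n
      return n, mx, my, c

  n, mx, my, c = merge_comoment(partial_comoment([1.0, 2.0], [2.0, 4.0]),
                                partial_comoment([3.0, 4.0], [6.0, 8.0]))
  print(c / (n - 1))                  # 3.33…, the sample covariance of x = [1..4], y = 2x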
Experiment Results
Data sets: uniform distributions in 3 ranges, modeled after the NIST StRD benchmark
• R1 = [1.0 – 1.5), R2 = [1000.0 – 1000.5), R3 = [1000000.0 – 1000000.5)
Accuracy measure:
• LRE = −log10(relative error): the number of significant digits that match between the computed value and the true value
• true value: produced by BigDecimal with precision 1000

Example Univariate Statistics (LRE; higher is better)

Range  Size       Sum            Mean           Variance         Standard Deviation
       (million)  SML    Naïve   SML    Naïve   SML   Textbook   SML   Textbook
R2     10         16.8   14.4    16.5   14.4    15.4  5.9        15.9  6.2
R2     100        16.1   13.4    16.9   13.4    15.6  5.3        15.8  5.6
R2     1000       16.6   13.1    16.4   13.1    16.2  4.9        16.4  5.2
R3     10         15.9   14.0    16.3   14.0    14.4  0          14.7  0
R3     100        16.0   13.1    16.9   13.1    12.9  Negative   13.2  NA
R3     1000       16.3   12.9    16.5   12.9    13.2  Negative   13.5  NA

Example Bivariate Statistics (LRE; higher is better)

Range     Size       Covariance       Pearson-R
          (million)  SML   Textbook   SML   Textbook
R1 vs R2  10         15.0  8.4        15.1  6.2
R1 vs R2  100        15.6  8.5        15.4  6.4
R1 vs R2  1000       16.0  8.7        15.7  6.2
R2 vs R3  10         13.5  3.0        13.5  3.0
R2 vs R3  100        12.8  2.8        12.7  NA
R2 vs R3  1000       13.6  3.9        13.8  NA

Significantly better accuracy, with no sacrifice in performance!
Lessons Learned (1/2)
• Many existing numerically stable techniques can be adapted to the distributed environment
• Divide-and-conquer design helps in scaling to larger data sets while achieving good numerical accuracy
  • e.g., MR Kahan can handle more data than single-node Kahan without losing accuracy
• The Kahan technique is useful beyond simple summation (LRE on 10 million points):

                     Variance (R1)   Std (R1)   Covariance (R1 vs R2)   Pearson-R (R1 vs R2)
  With ¤ (KahanAdd)  16.0            15.9       15.0                    15.1
  With plain +       13.5            13.8       14.2                    13.0
Lessons Learned (2/2)
• Shifting can be used for improved accuracy
  • accuracy degrades as the magnitude of the values increases
  • better accuracy is achieved by shifting the data points by a constant prior to the computation

LRE on 10 million points:

Range                    Sum            Mean           Variance        Standard Deviation
                         SML    Naïve   SML    Naïve   SML   Textbook  SML   Textbook
1000 – 1000.5            16.8   14.4    16.5   14.4    15.4  5.9       15.9  6.2
1,000,000 – 1,000,000.5  15.9   14.0    16.3   14.0    14.4  0         14.7  0

• Performance need not be sacrificed for accuracy
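A minimal sketch of the idea (ours, not SystemML code): since the variance is invariant under x → x − k, it can be computed entirely on shifted values and the shift never has to be undone:

  def variance_shifted(xs):
      k = xs[0]                       # any constant close to the data
      shifted = [x - k for x in xs]   # variance is invariant under x -> x - k
      mean = sum(shifted) / len(shifted)
      return sum((x - mean) ** 2 for x in shifted) / (len(shifted) - 1)

  data = [1_000_000.0 + 0.125 * i for i in range(5)]
  print(variance_shifted(data))       # 0.0390625, computed on small, well-behaved values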
Recommended Reading:
• Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002.

Thanks! Questions?