Scalable and Numerically Stable Descriptive Statistics in SystemML
Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald
IBM Almaden Research Center

SystemML
• Pervasive need to enable machine learning (ML) on massive datasets
• Increasing interest in implementing ML algorithms on MapReduce
• Directly implementing ML algorithms on MapReduce is challenging
• Solution: SystemML – a declarative approach to machine learning on MapReduce

SystemML Overview
• Language: high-level language with ML-specific constructs
  - Syntax similar to R and Matlab
  - Matrix, vector, and scalar data types
  - Linear algebra and mathematical operators
  - Example: PageRank

        G = read("in/G", rows=1e6, cols=1e6);
        p = read("in/p", rows=1e6, cols=1);
        e = read("in/e", rows=1e6, cols=1);
        ut = read("in/ut", rows=1, cols=1e6);
        alpha = 0.85;
        max_iteration = 20;
        i = 0;
        while (i < max_iteration) {
          p = alpha * (G %*% p) + (1 - alpha) * (e %*% ut %*% p);
          i = i + 1;
        }

• Optimizations: based on data and system characteristics
  - Cost-based and rule-based optimization
  - Job generation heuristics
  - …
• Scalable operator implementations: programs are compiled into a sequence of MapReduce jobs (MR1, MR2, …, MRn) on Hadoop
  - Handle large sparse data sets
  - Parallelization is transparent to end users
• Details in our ICDE 2011 paper

Example ML Algorithms Supported in SystemML
• Classification: linear SVMs, logistic regression
• Regression: linear regression
• Matrix factorization: NMF, SVD, PCA
• Clustering: k-means
• Ranking: PageRank, HITS
• Data exploration: descriptive statistics (the focus of this talk)
  - Univariate statistics
    · Scale: sum, mean, harmonic mean, geometric mean, min, max, range, median, quantile, interquartile mean, variance, standard deviation, coefficient of variation, central moment, skewness, kurtosis, standard error of mean, standard error of skewness, standard error of kurtosis
    · Categorical: mode, per-category frequencies
  - Bivariate statistics
    · Scale-scale: covariance, Pearson correlation
    · Scale-categorical: Eta, ANOVA F measure
    · Categorical-categorical: chi-square coefficient, Cramer's V, Spearman correlation

How Hard Is It?
• Seemingly trivial to implement in MapReduce: most descriptive statistics can be written in a summation form, e.g.,

    sum = Σ xi,   mean = (Σ xi)/n,   variance = (Σ xi²)/n − ((Σ xi)/n)²

• Pitfall: these straightforward implementations can lead to disasters in numerical accuracy, as the sketch below illustrates.
• The problem gets worse with increasing volumes of data.
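A minimal Java sketch (not from the talk) of this pitfall: the textbook one-pass variance (Σ xi²)/n − mean² cancels catastrophically when the values are large relative to their spread, while a two-pass computation stays accurate. The data range mirrors the R3 range [1000000.0, 1000000.5) used in the experiments later; all names and constants here are illustrative.

    import java.util.Random;

    public class VariancePitfall {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            int n = 10_000_000;                       // ~80 MB array
            double[] x = new double[n];
            // Values in [1000000.0, 1000000.5): true variance = 0.5^2/12 ~ 0.0208
            for (int i = 0; i < n; i++) x[i] = 1_000_000.0 + 0.5 * rnd.nextDouble();

            // Textbook one-pass rewrite: variance = E[x^2] - E[x]^2
            double sum = 0, sumSq = 0;
            for (double v : x) { sum += v; sumSq += v * v; }
            double mean = sum / n;
            double varTextbook = sumSq / n - mean * mean;  // two huge, nearly equal terms

            // Two-pass: compute the mean first, then the centered sum of squares
            double ss = 0;
            for (double v : x) { double d = v - mean; ss += d * d; }
            double varTwoPass = ss / n;

            System.out.println("textbook one-pass: " + varTextbook); // garbage, can be negative
            System.out.println("two-pass:          " + varTwoPass);  // close to 0.0208
        }
    }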
Background: Floating Point Numbers
• Source of inaccuracy: finite-precision arithmetic on floating point numbers.
• A floating point number system F ⊂ R consists of numbers of the form

    y = ±β^e × 0.d1d2…dt    (0 ≤ di ≤ β − 1)

  with base β, precision t, and exponent range emin ≤ e ≤ emax.
• IEEE double: β = 2, t = 53, emin = −1021, emax = 1024.
• Floating point numbers are not equally spaced. Example number system with β = 2, t = 3, emin = −1, emax = 3:

    F = { 0, ±0.25, ±0.3125, ±0.375, ±0.4375, ±0.5, ±0.625, ±0.75, ±0.875, ±1.0, ±1.25, ±1.5, ±1.75, ±2.0, ±2.5, ±3.0, ±3.5, ±4.0, ±5.0, ±6.0, ±7.0 }

  The gap between consecutive numbers doubles at each power of 2.

Background: Numerical Accuracy
Using the example number system above, with relative error = |x − x′| / |x|, where x is the true value and x′ the floating point result:
• Round-off: the true values 6.375 and 5.625 are both rounded to 6.0; relative errors 0.059 and 0.067.
• Big number + many small numbers (the pitfall for summation): 5.0 + 0.25 + 0.25 + … + 0.25 (eight 0.25s). True value: 7.0. Computed: 5.0, because each intermediate result 5.0 + 0.25 = 5.25 rounds back to 5.0. Relative error: 0.286.
• Subtraction of two big, similar numbers (catastrophic cancellation): 6.375 − 5.625. True value: 0.75. Computed: 0, because 6.375 and 5.625 both round to 6.0. Relative error: 1.0.

Background: Numerical Stability
• Algebraically equivalent algorithms for the same calculation produce different results on digital computers: some damp out errors, some magnify them.
• Example, sum 5.0 + 0.25 + 0.25 + … + 0.25 (eight 0.25s) in the example system:
  - Naive recursive summation (left to right): each 5.0 + 0.25 rounds back to 5.0, so the result is 5.0 ≠ 7.0.
  - Sorted (ordered) recursive summation: the 0.25s are added first and accumulate exactly to 2.0, and 5.0 + 2.0 = 7.0.
• Numerical stability is a desirable property of numerical algorithms: algorithms that can be proven not to magnify approximation errors are called numerically stable.

Importance of Numerical Stability
• The numerical stability issue has been largely ignored in big data processing: e.g., PIG and HIVE use well-known unstable algorithms for computing some basic statistics.
• What about software floating point packages, e.g., BigDecimal? Arbitrary precision, but very slow: +, −, × are 2 orders of magnitude slower, and ÷ is 5 orders of magnitude slower.
• Goal of this talk: share our experience with descriptive statistics algorithms for big data that are
  - Scalable – database people already understand this
  - Numerically stable – this needs more attention!

Numerically Stable Summation
• Naive recursive: unstable.
• Sorted recursive: better for nonnegative data, but needs an expensive sort.
• Kahan: efficient and stable [Kahan 1965]. Recursive summation with a correction term that compensates for rounding error:

    (s′, c′) = KahanAdd(s, c, a):
        a′ = a + c
        s′ = s + a′
        c′ = a′ − (s′ − s)

  where s/s′ are the old/new sum, c/c′ the old/new correction, and a the number to add.
• Stability property: the relative error bound is independent of the problem size n, when n is less than O(10^16) for IEEE doubles.
• Kahan on 5.0 + 0.25 + 0.25 + … in the example system:

    sum (s):         5.0    5.0    6.0    6.0    6.0   …
    correction (c):  0      0.25   −0.5   −0.25  0     …

  The remaining additions proceed the same way, and the final sum comes out exactly 7.0.
• MR Kahan in SystemML:
  - Mapper: apply Kahan to compute a partial sum and correction.
  - Reducer: apply Kahan on the partial results to compute the total sum,

        (s, c) = KahanAdd(s1, c1 + c2, s2)

    where s1/s2 are the partial sums, c1/c2 the partial corrections, and s/c the total sum and correction.
  - Our proof: the relative error bound is independent of the problem size n, as long as each mapper processes fewer than O(10^16) numbers. MR Kahan thus scales to larger problem sizes without losing numerical accuracy.
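A minimal Java sketch of KahanAdd and the MR Kahan merge step, directly following the update rules above; the class and method names are illustrative, not SystemML's actual operator code. The small main() shows the effect on an IEEE-double version of the "big number + many small numbers" pitfall.

    public class KahanSum {
        double sum = 0.0;        // running sum s
        double correction = 0.0; // running correction c

        // (s', c') = KahanAdd(s, c, a)
        void add(double a) {
            double corrected = a + correction;       // a' = a + c
            double newSum = sum + corrected;         // s' = s + a'
            correction = corrected - (newSum - sum); // c' = a' - (s' - s)
            sum = newSum;
        }

        // Reducer-side merge of two partial results: (s, c) = KahanAdd(s1, c1 + c2, s2)
        static KahanSum merge(KahanSum p1, KahanSum p2) {
            KahanSum total = new KahanSum();
            total.sum = p1.sum;
            total.correction = p1.correction + p2.correction;
            total.add(p2.sum);
            return total;
        }

        public static void main(String[] args) {
            double naive = 1e16;
            KahanSum k = new KahanSum();
            k.add(1e16);
            for (int i = 0; i < 1_000_000; i++) { naive += 1.0; k.add(1.0); }
            System.out.println(naive);  // 1.0E16: every 1.0 was rounded away
            System.out.println(k.sum);  // 1.0000000001E16: the exact result
        }
    }

In the MapReduce version, each mapper maintains one (sum, correction) pair for its partition and emits both; the reducer folds the pairs together with merge(), which is why the error bound depends only on the per-mapper input size.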
Numerically Stable Mean
• Naive sum/count: unstable.
• Incremental: stable [Chan et al 1979]. Given partial counts n1, n2 and partial means µ1, µ2:

    n = n1 + n2
    δ = µ2 − µ1
    µ = µ1 + δ·n2/n

• MR Incremental: adapt Incremental to MapReduce, performing the mean update with KahanAdd:

    µ = µ1 ¤ δ·n2/n    (¤ denotes KahanAdd; a correction term is maintained for µ)

Numerically Stable Higher-Order Statistics
• Higher-order statistics: central moment, variance, standard deviation, skewness, kurtosis; the core computation is the p-th central moment mp = (1/n) Σ (xi − µ)^p.
• 2-Pass: 1st pass computes the mean, 2nd pass computes mp.
  - Stable, but needs two scans of the data.
• Textbook 1-Pass: textbook rewrites such as variance = (Σ xi²)/n − µ².
  - Notoriously unstable (due to catastrophic cancellation); can even produce a negative result for mp when p is even.
  - Unfortunately, widely used in practice.
• Incremental: stable and one pass [Bennett et al 2009], merging partial results; e.g., for M2 = Σ (xi − µ)²:

    M2 = M2,1 ¤ M2,2 ¤ δ²·n1·n2/n

  with analogous merge formulas for the higher moments.
• MR Incremental: adapt Incremental to MapReduce, using KahanAdd (¤) for the additions.

Numerically Stable Covariance
• 2-Pass: 1st pass computes the means, 2nd pass computes the covariance.
  - Stable, but needs two scans of the data.
• Textbook 1-Pass: the rewrite cov = (Σ xi·yi)/n − µx·µy.
  - Notoriously unstable (catastrophic cancellation).
  - Unfortunately, widely used in practice.
• Incremental: practically stable and one pass [Bennett et al 2009]; with δx = µx,2 − µx,1 and δy = µy,2 − µy,1, the co-moment C = Σ (xi − µx)(yi − µy) merges as

    C = C1 ¤ C2 ¤ δx·δy·n1·n2/n

• MR Incremental: adapt Incremental to MapReduce, using KahanAdd (¤) for the additions.
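A minimal Java sketch of these incremental merge rules for the mean, M2, and the covariance co-moment; the per-element updates use the standard Welford-style recurrences (the n2 = 1 case of the merge formulas). Names are illustrative, plain '+' stands in where SystemML would apply KahanAdd (¤), and this is not SystemML's actual operator code.

    // Partial aggregate for one partition of (x, y) pairs.
    public class MomentAgg {
        long n;
        double meanX, meanY;
        double m2x;  // M2 = sum of (x - meanX)^2
        double cxy;  // co-moment C = sum of (x - meanX) * (y - meanY)

        // Mapper-side update: fold in one record (Welford-style recurrences).
        void add(double x, double y) {
            n++;
            double dx = x - meanX;
            double dy = y - meanY;
            meanX += dx / n;
            meanY += dy / n;
            m2x += dx * (x - meanX);
            cxy += dx * (y - meanY);
        }

        // Reducer-side merge, as on the slides:
        //   n = n1 + n2;  mu = mu1 + delta * n2 / n
        //   M2 = M2_1 + M2_2 + delta^2 * n1 * n2 / n
        //   C  = C1 + C2 + deltaX * deltaY * n1 * n2 / n
        static MomentAgg merge(MomentAgg a, MomentAgg b) {
            MomentAgg out = new MomentAgg();
            out.n = a.n + b.n;
            if (out.n == 0) return out;
            double dx = b.meanX - a.meanX;
            double dy = b.meanY - a.meanY;
            double w = (double) a.n * b.n / out.n;
            out.meanX = a.meanX + dx * b.n / out.n;
            out.meanY = a.meanY + dy * b.n / out.n;
            out.m2x = a.m2x + b.m2x + dx * dx * w;
            out.cxy = a.cxy + b.cxy + dx * dy * w;
            return out;
        }

        double variance()   { return m2x / n; }
        double covariance() { return cxy / n; }

        public static void main(String[] args) {
            MomentAgg left = new MomentAgg(), right = new MomentAgg();
            for (int i = 0; i < 500; i++)    left.add(1000.0 + i * 0.001, 1.0 + i * 0.0005);
            for (int i = 500; i < 1000; i++) right.add(1000.0 + i * 0.001, 1.0 + i * 0.0005);
            MomentAgg total = merge(left, right);
            System.out.println(total.variance() + " " + total.covariance());
        }
    }

Because the partial results merge associatively, any tree of combiners and reducers yields the same value, which is what makes this scheme a natural fit for MapReduce.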
Experiment Results
• Data sets: uniform distributions over three value ranges, modeled after the NIST StRD benchmark:

    R1 = [1.0, 1.5),  R2 = [1000.0, 1000.5),  R3 = [1000000.0, 1000000.5)

• Accuracy measure: LRE = −log10(relative error), i.e., the number of significant digits that match between the computed value and the true value; true values are produced by BigDecimal with precision 1000.

Example univariate statistics (LRE; higher is better):

    Range  Size (million)  Sum (SML / Naïve)  Mean (SML / Naïve)  Variance (SML / Textbook)  Std. dev. (SML / Textbook)
    R2     10              16.8 / 14.4        16.5 / 14.4         15.4 / 5.9                 15.9 / 6.2
    R2     100             16.1 / 13.4        16.9 / 13.4         15.6 / 5.3                 15.8 / 5.6
    R2     1000            16.6 / 13.1        16.4 / 13.1         16.2 / 4.9                 16.4 / 5.2
    R3     10              15.9 / 14.0        16.3 / 14.0         14.4 / 0                   14.7 / 0
    R3     100             16.0 / 13.1        16.9 / 13.1         12.9 / negative            13.2 / NA
    R3     1000            16.3 / 12.9        16.5 / 12.9         13.2 / negative            13.5 / NA

Example bivariate statistics (LRE):

    Range     Size (million)  Covariance (SML / Textbook)  Pearson-R (SML / Textbook)
    R1 vs R2  10              15.0 / 8.4                   15.1 / 6.2
    R1 vs R2  100             15.6 / 8.5                   15.4 / 6.4
    R1 vs R2  1000            16.0 / 8.7                   15.7 / 6.2
    R2 vs R3  10              13.5 / 3.0                   13.5 / 3.0
    R2 vs R3  100             12.8 / 2.8                   12.7 / NA
    R2 vs R3  1000            13.6 / 3.9                   13.8 / NA

• Significantly better accuracy, with no sacrifice in performance!

Lessons Learned (1/2)
• Many existing numerically stable techniques can be adapted to the distributed environment.
• A divide-and-conquer design helps in scaling to larger data sets while achieving good numerical accuracy; e.g., MR Kahan can handle more data than plain Kahan without losing accuracy.
• The Kahan technique is useful beyond simple summation (LRE with KahanAdd (¤) vs. plain +, 10 million points):

    Statistic   Data set   ¤      +
    Variance    R1         16.0   13.5
    Std. dev.   R1         15.9   13.8
    Covariance  R1 vs R2   15.0   14.2
    Pearson-R   R1 vs R2   15.1   13.0

Lessons Learned (2/2)
• Accuracy degrades as the magnitude of the values increases (LRE, 10 million points):

    Range                  Sum (SML / Naïve)  Mean (SML / Naïve)  Variance (SML / Textbook)  Std. dev. (SML / Textbook)
    1000 – 1000.5          16.8 / 14.4        16.5 / 14.4         15.4 / 5.9                 15.9 / 6.2
    1000000 – 1000000.5    15.9 / 14.0        16.3 / 14.0         14.4 / 0                   14.7 / 0

• Shifting can therefore be used for improved accuracy: shift the data points by a constant prior to computation (e.g., subtract 1000 before computing the variance of data in [1000.0, 1000.5), since shifting leaves the variance unchanged).
• Performance need not be sacrificed for accuracy.

Recommended Reading
• Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002.

Thanks! Questions?