Adaptive Data Analysis without Overfitting

Vitaly Feldman (Accelerated Discovery Lab, IBM Research - Almaden)
Cynthia Dwork (Microsoft Research), Moritz Hardt (Google Research), Toni Pitassi (U. of Toronto), Omer Reingold (Samsung Research), Aaron Roth (Penn, CS)

[Diagram: a distribution P over a domain X generates the dataset S = x_1, …, x_n ∼ P^n; the analysis turns S into results: parameter estimates, classifiers, clusterings, etc.]

Data Analysis 101
Does student nutrition affect academic performance?
[Table: normalized grade and normalized consumption of 50 foods for each of 100 students; all entries are standard-normal values.]
Check correlations
[Chart: empirical correlation of each of the 50 foods with grade, ranging from about -0.4 to 0.3.]

Pick candidate foods
[Same chart, with the most strongly correlated foods highlighted.]

Fit linear function of 3 selected foods
[Scatter plot: true vs. predicted grade.]

Regression summary output: Multiple R 0.445, R² 0.198, adjusted R² 0.173, standard error 1.004, 100 observations. ANOVA: regression df 3, SS 23.95, MS 7.98, F 7.92, significance F ≈ 9.0 × 10⁻⁵ < 0.0001; residual df 96, SS 96.81, MS 1.008; total df 99, SS 120.76.

Coefficient    Estimate    Std. Error    t Stat    P-value
Intercept      -0.0442     0.1005        -0.440    0.6609
Mushroom       -0.2961     0.1019        -2.905    0.0046
Pumpkin         0.2558     0.1084         2.359    0.0204
Nutella         0.2671     0.0952         2.806    0.0061

The true distribution P: every x_i and y is an independent N(0,1) draw, so nothing is correlated. The "significant" regression is an artifact of selecting variables and fitting on the same data: Freedman's Paradox [1983].

Statistical inference
"Fresh" i.i.d. samples → Procedure → Result + generalization guarantees
Techniques: CLT, VC dimension, Rademacher complexity, …
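The nutrition example above is easy to reproduce. A minimal simulation (sample sizes and the seed are illustrative, not from the slides): 100 samples of 50 independent N(0,1) "foods" plus an independent N(0,1) "grade", select the three most correlated foods, then fit a linear model on the same data. The in-sample R² comes out well above zero even though the true R² is exactly zero.

```python
# Freedman's paradox in miniature: select-then-fit on the same pure-noise data.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 50, 3

X = rng.standard_normal((n, d))   # "food consumption": pure noise
y = rng.standard_normal(n)        # "grades": independent of X

# Step 1: empirical correlation of each variable with the label; keep top k.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
top = np.argsort(-np.abs(corr))[:k]

# Step 2: least-squares fit of y on the selected variables (plus intercept),
# evaluated on the very same data used for selection.
A = np.column_stack([np.ones(n), X[:, top]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1 - np.sum((y - A @ coef) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"selected variables: {top}, in-sample R^2 = {r2:.3f}")
```

Selecting the k largest of d noisy correlations guarantees an optimistic in-sample fit; a fresh sample would show R² near zero.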
Stability rounds out that list of techniques. The procedures include hypothesis tests, regression, and learning, each coming with generalization guarantees.

Holdout validation
[Diagram: the data is split into a Training part, from which a predictor f is built, and a Holdout/Testing part, which yields an unbiased estimate of the test error Err(f).]

Data analysis is adaptive
The procedure at step t depends on the results of previous procedures on the same dataset.
• Exploratory data analysis
• Variable selection
• Hyper-parameter tuning
• Shared datasets
• …

Is this a real problem?
"Why Most Published Research Findings Are False" [Ioannidis 05]: 1,000,000+ downloads, 1,400+ citations. Adaptive data analysis is one of the causes:
• Researcher degrees of freedom [Simmons, Nelson, Simonsohn 11]
• Garden of forking paths [Gelman, Loken 15]
It also leads to invalid (cross-)validation error estimates [Reunanen 03; Rao, Fung 08; Cawley, Talbot 10].

Approaches to adaptive analysis
• Abstinence, e.g. pre-registration and data splitting
• We need safe techniques.

Adaptive analysis
The dataset is S = x_1, …, x_n ∼ P^n. The data analyst(s) adaptively choose algorithms A_1, …, A_m with A_i : Δ(X) → Y_i and receive answers v_1, …, v_m; each v_i should be "close" to A_i(P) with probability 1 − β.

Adaptive statistical queries [Kearns 93]
A statistical query oracle holds S = x_1, …, x_n ∼ P^n. The analyst(s) adaptively ask queries φ_1, …, φ_m, where φ_i : X → [0,1], and receive answers v_1, …, v_m satisfying
  |v_i − E_{x∼P}[φ_i(x)]| ≤ τ with probability 1 − β.
Such queries can measure correlations, moments, and accuracy/loss (φ(x) = ℓ(f, x)), so one can run any statistical-query algorithm, e.g. stochastic convex optimization [F., Guzman, Vempala 15].

Answering non-adaptive SQs
Given m non-adaptive query functions φ_1, …, φ_m and n i.i.d. samples from P, estimate E_{x∼P}[φ_i(x)] by the empirical mean
  E_S[φ_i] = (1/n) Σ_{x∈S} φ_i(x).
A Chernoff bound plus a union bound shows that n = O(log(m/β)/τ²) samples suffice.

Answering adaptive SQs
Now estimate the expectations of m adaptively chosen functions. What if we still answer with E_S[φ_i]? Adaptivity breaks the union-bound argument: for some constant β > 0 the empirical answers fail to be τ-accurate unless n is far larger, e.g.
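A minimal sketch of why the empirical mean fails under adaptivity (the attack, domain, and parameters here are illustrative, not from the slides): ask m non-adaptive queries, record which way each empirical answer deviates from the truth, then combine those deviations into one final adaptive query whose empirical mean is badly biased while its true expectation stays near 1/2.

```python
# Overfitting the empirical mean with one adaptive query.
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 1000   # n samples, m preliminary queries

# Each sample is m independent uniform bits; query j asks for the mean of bit j.
S = rng.integers(0, 2, size=(n, m))
true_mean = 0.5                          # E_P[phi_j] for every j

# Rounds 1..m: record the sign of each empirical deviation from the truth.
signs = np.sign(S.mean(axis=0) - true_mean)

# Final adaptive query: phi(x) = 1 iff the signed sum of x's bits is positive.
# Its true expectation is ~1/2, but on S it is strongly biased upward.
final_on_S = (S - 0.5) @ signs > 0
print("empirical mean of final query:", final_on_S.mean())

# On a fresh sample the same query is back near 1/2.
T = rng.integers(0, 2, size=(n, m))
fresh_mean = ((T - 0.5) @ signs > 0).mean()
print("fresh-sample mean of final query:", fresh_mean)
```

With m on the order of n or more such queries, the bias of the combined query becomes constant, which is exactly the failure the sample-size bounds below have to rule out.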
in Freedman's paradox above. Data splitting (a fresh subsample per query) needs n = O(m · log m / τ²).

Our results
• There is an algorithm that answers m adaptively chosen SQs with accuracy τ for
    n = O(√m · log^{3/2} m / τ^{2.5}),
  running in time O(n log m) per query. (Compare data splitting: n = O(m · log m / τ²).)
• There is an algorithm that answers m adaptively chosen SQs with accuracy τ for
    n = O(log m · √(log |X|) / τ⁴),
  running in time O(n · |X|) per query.

Algorithmic stability
"Small changes" to the dataset have a "small effect" on the output.
• [Rogers, Wagner 78; Devroye, Wagner 79]
• [Bousquet, Elisseeff 02; Kutin, Niyogi 02; Rakhlin, Mukherjee, Poggio 05; Shalev-Shwartz, Shamir, Srebro, Sridharan 10]

Algorithm A is α-strongly-uniformly-replace-one stable for a loss ℓ if for all S, S′ ∈ X^n with dist(S, S′) = 1 and all x ∈ X:
  |ℓ(A(S), x) − ℓ(A(S′), x)| ≤ α

On-average generalization:
  E_{S∼P^n}[ E_{z∈S} ℓ(A(S), z) − E_{z∼P} ℓ(A(S), z) ] ≤ α

High probability:
  Pr_{S∼P^n}[ E_{z∈S} ℓ(A(S), z) − E_{z∼P} ℓ(A(S), z) ≥ α · √(n · log(1/β)) ] ≤ β

Differential Privacy [Dwork, McSherry, Nissim, Smith 06]
A (randomized) algorithm A is (ε, δ)-differentially private if for any two datasets S, S′ such that dist(S, S′) = 1 and every O ⊆ range(A):
  Pr[A(S) ∈ O] ≤ e^ε · Pr[A(S′) ∈ O] + δ

Folklore: (ε, δ)-DP implies (e^ε − 1 + δ)-strongly-uniform-replace-one stability for any ℓ with range [0,1].

DP composes adaptively
Running an (ε₁, δ₁)-DP algorithm A and then an (ε₂, δ₂)-DP algorithm B (which may depend on A's output) on the same dataset is (ε₁ + ε₂, δ₁ + δ₂)-DP. Moreover, the composition of k ε-DP algorithms is, for every δ > 0, (ε · √(2k · log(1/δ)), δ)-DP [Dwork, Rothblum, Vadhan 10].

DP implies generalization
If an ε-DP algorithm A outputs a function φ : X → [0,1], then
  Pr_{A, S∼P^n}[ |E_S[A(S)] − E_P[A(S)]| ≥ ε ] ≤ 6 · e^{−ε² n}
Similarly, since (ε, δ)-DP gives replace-one stability:
  E_{A, S∼P^n}[ E_S[A(S)] − E_P[A(S)] ] ≤ e^ε − 1 + δ

Proof ideas
If an ε-DP algorithm A outputs a function, then |E_S[A(S)] − E_P[A(S)]| ≤ τ with high probability
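The DP machinery above suggests the simplest differentially private SQ oracle: perturb the empirical mean with Laplace noise. A sketch, not the paper's exact mechanism; the query, parameters, and seed are illustrative. Since the empirical mean of a [0,1]-valued query changes by at most 1/n when one sample is replaced, adding Laplace(1/(nε)) noise makes a single answer ε-DP, and k answers compose to kε-DP.

```python
# Laplace mechanism as an eps-DP statistical-query oracle (sketch).
import numpy as np

def dp_sq_answer(phi, samples, eps, rng):
    """eps-DP estimate of the empirical mean of phi: X -> [0,1]."""
    n = len(samples)
    emp = np.mean([phi(x) for x in samples])
    # Sensitivity of emp is 1/n, so Laplace noise of scale 1/(n*eps) suffices.
    return emp + rng.laplace(scale=1.0 / (n * eps))

rng = np.random.default_rng(2)
samples = rng.standard_normal(10_000)
# Query: fraction of the distribution above 0 (true value 0.5 for N(0,1)).
ans = dp_sq_answer(lambda x: float(x > 0), samples, eps=0.1, rng=rng)
print(f"noisy answer: {ans:.3f}")
```

The "DP implies generalization" bound is what turns this privacy guarantee into accuracy with respect to P even when the queries are chosen adaptively.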
over S ∼ P^n.

Max-information:
  I_∞(X; Y) = log max_{x,y} Pr[X = x | Y = y] / Pr[X = x] = D_∞((X, Y) ‖ X × Y)

Let A : X^n → Y be an algorithm such that I_∞(S; A(S)) = k. Then for any event O ⊆ X^n × Y and independent S, T ∼ P^n:
  Pr[(S, A(S)) ∈ O] ≤ 2^k · Pr[(T, A(S)) ∈ O]

Define O = {(S, φ) s.t. |E_S[φ] − E_P[φ]| > τ}. Then Pr[(T, A(S)) ∈ O] ≤ 2 exp(−2τ²n), and k ≤ τ²n gives
  Pr[ |E_S[A(S)] − E_P[A(S)]| > τ ] ≤ 2 exp(−τ²n).

Bounding I_∞(S; A(S))
By ε-DP, for any adjacent s, s′ and any φ:
  Pr[A(S) = φ | S = s] / Pr[A(S) = φ | S = s′] = Pr[A(s) = φ] / Pr[A(s′) = φ] ≤ e^ε
For any s₁, s₂ (which differ in at most n elements) and any φ:
  Pr[A(S) = φ | S = s₁] / Pr[A(S) = φ | S = s₂] ≤ e^{εn}
and so Pr[A(S) = φ | S = s₁] / Pr[A(S) = φ] ≤ e^{εn}.
Then I_∞(S; A(S)) ≤ εn. We need εn ≤ τ²n, and so require ε ≤ τ². Using McDiarmid's inequality one can instead bound the information cost by O(ε²n + ε√n) with high probability; this relies on approximate max-information.

Approximate max-information I_∞^β(A(S); S):
• ε-differential privacy: I_∞^β(A(S); S) ≤ ε²n + ε√(n · log(1/β))
• Description length: I_∞^β(A(S); S) ≤ log(|range(A)| / β)
Both bounds preserve generalization and compose adaptively.

Reusable holdout
[Diagram: the data is split into Training and a Holdout H; the analyst(s) use the training data freely and send adaptive queries φ_1, …, φ_m about H to the Reusable Holdout algorithm, which returns answers v_1, …, v_m.]

Given a holdout set of n i.i.d. samples, Thresholdout can τ-estimate the expectations of m adaptively chosen functions, as long as at most B of them overfit to the training set, for
  n ~ √B · log m / τ²
Overfitting means |E_S[φ] − E_P[φ]| > τ/2. The standard holdout (empirical mean) needs n ~ m · log m / τ².

Thresholdout algorithm
Input: training set S, holdout set H, query budget m, correction budget B
while m > 0 and B > 0:
  • Get query φ
  • If |E_H[φ] − E_S[φ]| < 3τ/4 + noise: output E_S[φ]
  • Else: output E_H[φ] + noise; B ← B − 1
  • m ← m − 1

Illustration
• Given points in R^d × {−1, +1}:
• Pick variables correlated with the label on the training set. Validate on the holdout set.
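The Thresholdout loop above can be sketched in a few lines. This is a loose rendering of the pseudocode, not the analyzed variant: the noise distribution (Laplace with scale sigma) and the way the budgets are tracked are illustrative assumptions.

```python
# Sketch of Thresholdout: answer from the training set unless the holdout
# disagrees, and spend the correction budget only on apparent overfitting.
import numpy as np

def thresholdout(queries, train, holdout, tau, budget_B, sigma, seed=0):
    """Answer adaptively chosen queries phi: X -> [0,1]; the query budget m
    is implicit in the length of `queries`."""
    rng = np.random.default_rng(seed)
    answers, B = [], budget_B
    for phi in queries:
        if B <= 0:               # correction budget exhausted
            break
        emp_train = np.mean([phi(x) for x in train])
        emp_hold = np.mean([phi(x) for x in holdout])
        # Compare train and holdout estimates against a noisy threshold.
        if abs(emp_hold - emp_train) < 3 * tau / 4 + rng.laplace(scale=sigma):
            answers.append(emp_train)                    # no sign of overfitting
        else:
            answers.append(emp_hold + rng.laplace(scale=sigma))
            B -= 1                                       # spend correction budget
    return answers
```

Because the holdout value is released only through a noisy threshold test and noisy corrections, the holdout set leaks little information per query and can be reused across many adaptive queries.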
• V is the set of variables correlated with the label on both the training and the holdout set.
• Check the prediction error of the linear classifier sign(Σ_{i∈V} s_i x_i), where s_i is the sign of the correlation.
Data distribution: 10,000 points drawn from N(0,1)^{10,000}, with uniformly random labels.

Further reading
• Preserving Statistical Validity in Adaptive Data Analysis. http://arxiv.org/abs/1411.2664 [STOC 2015]
• Generalization in Adaptive Data Analysis and Holdout Reuse. http://arxiv.org/abs/1506.02629 [NIPS 2015]
• Overview: The reusable holdout: Preserving validity in adaptive data analysis. [Science, 2015]
Come to the workshop on "Adaptive Data Analysis" at NIPS 2015! wadapt.org

• n = Ω(√m / τ) for poly-time algorithms, assuming one-way functions exist [Hardt, Ullman 14; Steinke, Ullman 15]
• Stronger (tight) generalization bounds for (ε, δ)-DP algorithms and any low-sensitivity queries [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]
• Application to maintaining an accurate leaderboard in ML competitions [Blum, Hardt 15]
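The illustration described earlier can be sketched as follows, scaled down from the slide's 10,000 x 10,000 setting for speed (the sizes, correlation threshold, and seed are illustrative). With uniformly random labels, selecting variables by training-set correlation and building the sign classifier looks good on the training set but is at chance on fresh data, which is exactly the gap a (reusable) holdout is meant to expose.

```python
# Variable selection on random labels: training accuracy vs. fresh accuracy.
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 500
thresh = 2.0 / np.sqrt(n)                    # ~2 std devs of a null correlation

X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)          # labels independent of X

corr = X.T @ y / n                           # empirical correlations
V = np.flatnonzero(np.abs(corr) > thresh)    # variables passing the cut
s = np.sign(corr[V])                         # sign of each selected correlation

pred = np.sign(X[:, V] @ s)                  # the slide's sign classifier
train_acc = np.mean(pred == y)

Xf = rng.standard_normal((n, d))             # fresh data, fresh random labels
yf = rng.choice([-1.0, 1.0], size=n)
fresh_acc = np.mean(np.sign(Xf[:, V] @ s) == yf)
print(f"{len(V)} selected; train acc {train_acc:.2f}, fresh acc {fresh_acc:.2f}")
```

Validating the selected variables through Thresholdout, instead of reusing the holdout naively, keeps the reported accuracy close to the fresh-data value.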