
Adaptive Data Analysis without Overfitting

Vitaly Feldman
Accelerated Discovery Lab, IBM Research - Almaden

Cynthia Dwork (Microsoft Research)
Moritz Hardt (Google Research)
Toni Pitassi (U. of Toronto)
Omer Reingold (Samsung Research)
Aaron Roth (Penn, CS)
Distribution $P$ over $X$ → Dataset $S = (x_1, \ldots, x_n) \sim P^n$ → Analysis → Results (parameter estimates, classifier, clustering, etc.)
Data Analysis 101
Does student nutrition affect academic performance?
[Data table: normalized grade and consumption of 50 foods for 100 students]
Check correlations

[Bar chart "Correlations with grade": foods 1-50, values ranging from about -0.4 to 0.3]
Pick candidate foods

[Same bar chart "Correlations with grade", with the candidate foods highlighted]
Fit linear function of 3 selected foods
[Scatter plot: true vs. predicted grade]

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.4453533
  R Square             0.1983396
  Adjusted R Square    0.1732877
  Standard Error       1.0041891
  Observations         100

ANOVA
              df    SS            MS        F         Significance F
  Regression   3    23.95086544   7.983622  7.917151  8.98706E-05 ≈ 0.0001
  Residual     96   96.80600126   1.008396
  Total        99   120.7568667

             Coefficients   Standard Error   t Stat     P-value
  Intercept  -0.044248      0.100545016      -0.44008   0.660868
  Mushroom   -0.296074      0.10193011       -2.90468   0.004563
  Pumpkin     0.255769      0.108443069       2.358555  0.020373
  Nutella     0.2671363     0.095186165       2.806462  0.006066

True $P$: $x_i, y \sim N(0,1)$, uncorrelated
Freedman's Paradox [1983]
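Freedman's paradox is easy to reproduce. The sketch below (our own Python, assuming numpy and statsmodels; all variable names are ours) generates a dataset like the one above, with 100 samples, 50 candidate features and a label, all i.i.d. $N(0,1)$, then keeps the 3 features most correlated with the label and fits OLS on them. Even though every feature is pure noise, the selected-then-fit model routinely reports individually "significant" coefficients and a small F-test p-value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, d, k = 100, 50, 3  # samples, candidate foods, selected foods

# All variables are independent N(0,1): no food truly affects the grade.
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Step 1 (adaptive): pick the k features most correlated with the label.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
selected = np.argsort(-np.abs(corr))[:k]

# Step 2: fit a linear model on the selected features only.
model = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
print(model.summary())  # typically shows "significant" t- and F-statistics
```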
Statistical inference

"Fresh" i.i.d. samples (Data) → Procedure (hypothesis tests, regression, learning) → Result + generalization guarantees

Techniques: CLT, VC dimension, Rademacher complexity, stability, ...

[Diagram: confidence interval around the true value $\mu$]
Holdout validation

Data is split: the training data produce a predictor $f$; the holdout/testing data give the test error of $f$.

[Diagram: confidence interval around $\mathrm{Err}(f)$]
Data analysis is adaptive

Procedure at step $t$ depends on the results of previous procedures on the same dataset.

[Diagram: procedures A and B run on the same data, with B depending on A's output]

• Exploratory data analysis
• Variable selection
• Hyper-parameter tuning
• Shared datasets
• ...
Is this a real problem?

"Why Most Published Research Findings Are False" [Ioannidis 05]
1,000,000+ downloads; 1,400+ citations

Adaptive data analysis is one of the causes:

• Researcher degrees of freedom [Simmons, Nelson, Simonsohn 11]
• Garden of forking paths [Gelman, Loken 15]
• Leads to invalid (cross-)validation error estimates [Reunanen 03; Rao, Fung 08; Cawley, Talbot 10]
Approaches to adaptive analysis

• Abstinence, e.g. pre-registration and data splitting
• Need safe techniques
Adaptive analysis

$S = (x_1, \ldots, x_n) \sim P^n$

Data analyst(s) adaptively issue algorithms $A_1, A_2, \ldots, A_m$ with $A_i: \Delta(X) \to Y_i$, and receive answers $v_1, v_2, \ldots, v_m$.

Requirement: $v_i$ is "close" to $A_i(P)$ with prob. $1 - \beta$.
Adaptive statistical queries

Statistical query oracle [Kearns 93], holding $S = (x_1, \ldots, x_n) \sim P^n$

Data analyst(s) adaptively issue queries $\phi_1, \phi_2, \ldots, \phi_m$ with $\phi_i: X \to [0,1]$, and receive answers $v_1, v_2, \ldots, v_m$ satisfying

$|v_i - \mathbf{E}_{x \sim P}[\phi_i(x)]| \leq \tau$ with prob. $1 - \beta$

Can measure correlations, moments, accuracy/loss: $\phi(x) = \ell(f, x)$

Run any statistical query algorithm!
• Stochastic convex optimization [F., Guzman, Vempala 15]
Answering non-adaptive SQs

Given $m$ non-adaptive query functions $\phi_1, \ldots, \phi_m$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x \sim P}[\phi_i(x)]$.

Use the empirical mean: $\mathbf{E}_S[\phi_i] = \frac{1}{n} \sum_{x \in S} \phi_i(x)$

By Hoeffding's bound plus a union bound over the $m$ queries,

$n = O\!\left(\frac{\log(m/\beta)}{\tau^2}\right)$
Answering adaptive SQs

Estimate expectations of $m$ adaptively chosen functions.

What if we use $\mathbf{E}_S[\phi_i]$? For some constant $\beta > 0$, accuracy $\tau$ requires $n \geq m/\tau^2$ (e.g. Freedman's paradox).

Data splitting: $n = O\!\left(\frac{m \cdot \log m}{\tau^2}\right)$
Our results

There exists an algorithm that can answer $m$ adaptively chosen SQs with accuracy $\tau$ for

$n = O\!\left(\frac{\sqrt{m} \cdot \log^{3/2} m}{\tau^{2.5}}\right)$

in time $O(n \log|X|)$. (Compare data splitting: $O\!\left(\frac{m \cdot \log m}{\tau^2}\right)$.)

There exists an algorithm that can answer $m$ adaptively chosen SQs with accuracy $\tau$ for

$n = O\!\left(\frac{\log m \cdot \log|X|}{\tau^4}\right)$

in time $O(n \cdot |X|)$.
Algorithmic stability

"Small changes" to the dataset have "small effect" on the output.

• [Rogers, Wagner 78; Devroye, Wagner 79]
• [Bousquet, Elisseeff 02; Kutin, Niyogi 02; Rakhlin, Mukherjee, Poggio 05; Shalev-Shwartz, Shamir, Srebro, Sridharan 10]

Algorithm $A$ is $\alpha$-strongly-uniformly-replace-one stable for $\ell$ if for all $S, S' \in X^n$ such that $\mathrm{dist}(S, S') = 1$ and all $x \in X$:

$|\ell(A(S), x) - \ell(A(S'), x)| \leq \alpha$

On-average generalization:

$\mathbf{E}_{\boldsymbol{S} \sim P^n}\!\left[ \frac{1}{n}\sum_{\boldsymbol{x} \in \boldsymbol{S}} \ell(A(\boldsymbol{S}), \boldsymbol{x}) - \mathbf{E}_{\boldsymbol{x} \sim P}[\ell(A(\boldsymbol{S}), \boldsymbol{x})] \right] \leq \alpha$

High probability:

$\Pr_{\boldsymbol{S} \sim P^n}\!\left[ \left| \frac{1}{n}\sum_{\boldsymbol{x} \in \boldsymbol{S}} \ell(A(\boldsymbol{S}), \boldsymbol{x}) - \mathbf{E}_{\boldsymbol{x} \sim P}[\ell(A(\boldsymbol{S}), \boldsymbol{x})] \right| \geq \alpha \sqrt{n \log(1/\beta)} \right] \leq \beta$
Differential Privacy [Dwork, McSherry, Nissim, Smith 06]

[Diagram: two datasets $S$, $S'$ differing in one record; running $A$ on either yields output distributions $\Pr[r]$ with bounded ratio]

A (randomized) algorithm $A$ is $(\epsilon, \delta)$-differentially private if for any two datasets $S, S'$ such that $\mathrm{dist}(S, S') = 1$:

$\forall Z \subseteq \mathrm{range}(A), \quad \Pr[A(S) \in Z] \leq e^{\epsilon} \cdot \Pr[A(S') \in Z] + \delta$

Folklore: $(\epsilon, \delta)$-DP implies $(e^{\epsilon} - 1 + \delta)$-strongly-uniform-replace-one stability for any $\ell$ with range $[0,1]$.
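As a concrete instance, the classic Laplace mechanism makes the empirical mean of a $[0,1]$-valued query $\epsilon$-DP: replacing one of the $n$ records changes the mean by at most $1/n$, so Laplace noise of scale $1/(\epsilon n)$ suffices. A minimal sketch (our own Python; `laplace_mean` is a hypothetical name):

```python
import numpy as np

def laplace_mean(values, epsilon, rng=None):
    """epsilon-DP release of the mean of values in [0, 1].

    Replacing one record changes the mean by at most 1/n (the sensitivity),
    so Laplace noise with scale sensitivity/epsilon gives epsilon-DP.
    """
    rng = rng or np.random.default_rng()
    n = len(values)
    sensitivity = 1.0 / n
    return float(np.mean(values) + rng.laplace(scale=sensitivity / epsilon))
```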
DP composes adaptively

If $A: X^n \to Y$ is $(\epsilon_1, \delta_1)$-DP, and $B: X^n \times Y \to Y'$ is $(\epsilon_2, \delta_2)$-DP in its first argument for every fixed $y \in Y$, then the composition $S \mapsto B(S, A(S))$ is $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-DP.
Advanced composition: the composition of $m$ $\epsilon$-DP algorithms is, for every $\delta > 0$, $\left(\epsilon\sqrt{2m \log(1/\delta)}, \delta\right)$-DP [Dwork, Rothblum, Vadhan 10].
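To see the gain, compare the two bounds numerically (our own Python; the function names are ours). Composing $m = 10{,}000$ algorithms each with $\epsilon = 0.01$ gives $\epsilon = 100$ under basic composition, but only about $5.3$ under advanced composition with $\delta = 10^{-6}$:

```python
import math

def basic_composition(m, eps):
    # Basic composition: epsilons add up.
    return m * eps

def advanced_composition(m, eps, delta):
    # [Dwork, Rothblum, Vadhan 10]: (eps * sqrt(2 m log(1/delta)), delta)-DP
    return eps * math.sqrt(2 * m * math.log(1 / delta))

print(basic_composition(10_000, 0.01))           # 100.0
print(advanced_composition(10_000, 0.01, 1e-6))  # ≈ 5.26
```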
DP implies generalization

If a $\tau$-DP algorithm $A$ outputs a function $\phi: X \to [0,1]$ then

$\Pr_{\boldsymbol{A}, \boldsymbol{S} \sim P^n}\!\left[ \left| \mathbf{E}_{\boldsymbol{S}}[\boldsymbol{A}(\boldsymbol{S})] - \mathbf{E}_P[\boldsymbol{A}(\boldsymbol{S})] \right| \geq \tau \right] \leq 6 e^{-\tau^2 n}$

Similarly, for $(\tau, e^{-\tau})$-DP algorithms, replace-one (RO) stability:

$\mathbf{E}_{\boldsymbol{A}, \boldsymbol{S} \sim P^n}\!\left[ \mathbf{E}_{\boldsymbol{S}}[\boldsymbol{A}(\boldsymbol{S})] - \mathbf{E}_P[\boldsymbol{A}(\boldsymbol{S})] \right] \leq \tau$
Proof ideas

If a $\tau$-DP algorithm $A$ outputs a function then $|\mathbf{E}_{\boldsymbol{S}}[\boldsymbol{A}(\boldsymbol{S})] - \mathbf{E}_P[\boldsymbol{A}(\boldsymbol{S})]| \leq \tau$ with high prob. over $\boldsymbol{S} \sim P^n$.

Max-information:

$I_\infty(\boldsymbol{X}; \boldsymbol{Y}) = \log \max_{x,y} \frac{\Pr[\boldsymbol{X} = x \mid \boldsymbol{Y} = y]}{\Pr[\boldsymbol{X} = x]} = D_\infty\!\left( (\boldsymbol{X}, \boldsymbol{Y}) \,\|\, \boldsymbol{X} \times \boldsymbol{Y} \right)$

Let $A: X^n \to Y$ be an algorithm s.t. $I_\infty(\boldsymbol{S}; \boldsymbol{A}(\boldsymbol{S})) = k$. Then for any event $R \subseteq X^n \times Y$, and $\boldsymbol{S}, \boldsymbol{T} \sim P^n$:

$\Pr[(\boldsymbol{S}, \boldsymbol{A}(\boldsymbol{S})) \in R] \leq e^k \cdot \Pr[(\boldsymbol{T}, \boldsymbol{A}(\boldsymbol{S})) \in R]$

Define $R = \{(S, \phi) \text{ s.t. } |\mathbf{E}_S[\phi] - \mathbf{E}_P[\phi]| > \tau\}$. Then $\Pr[(\boldsymbol{T}, \boldsymbol{A}(\boldsymbol{S})) \in R] \leq 2 \exp(-2\tau^2 n)$ (Hoeffding, since $\boldsymbol{T}$ is independent of $\boldsymbol{A}(\boldsymbol{S})$).

$k \leq \tau^2 n$ gives $\Pr[|\mathbf{E}_{\boldsymbol{S}}[\boldsymbol{A}(\boldsymbol{S})] - \mathbf{E}_P[\boldsymbol{A}(\boldsymbol{S})]| > \tau] \leq 2 \exp(-\tau^2 n)$.
Bounding $I_\infty(\boldsymbol{S}; \boldsymbol{A}(\boldsymbol{S}))$

By $\epsilon$-DP, for any adjacent $S, S'$ and any $\phi$:

$\frac{\Pr[\boldsymbol{A}(\boldsymbol{S}) = \phi \mid \boldsymbol{S} = S]}{\Pr[\boldsymbol{A}(\boldsymbol{S}) = \phi \mid \boldsymbol{S} = S']} = \frac{\Pr[\boldsymbol{A}(S) = \phi]}{\Pr[\boldsymbol{A}(S') = \phi]} \leq e^{\epsilon}$

For any $S_1, S_2$ (which differ in at most $n$ elements) and any $\phi$:

$\frac{\Pr[\boldsymbol{A}(\boldsymbol{S}) = \phi \mid \boldsymbol{S} = S_1]}{\Pr[\boldsymbol{A}(\boldsymbol{S}) = \phi \mid \boldsymbol{S} = S_2]} \leq e^{\epsilon n}$

and so

$\frac{\Pr[\boldsymbol{A}(\boldsymbol{S}) = \phi \mid \boldsymbol{S} = S_1]}{\Pr[\boldsymbol{A}(\boldsymbol{S}) = \phi]} \leq e^{\epsilon n}$

Then $I_\infty(\boldsymbol{S}; \boldsymbol{A}(\boldsymbol{S})) = \epsilon n$. Need $\epsilon n \leq \tau^2 n$ and so require $\epsilon \leq \tau^2$.

Use McDiarmid's inequality to bound the ratio by $\exp\!\left(O\!\left(\epsilon^2 n + \epsilon \sqrt{n}\right)\right)$ w.h.p. Relies on approximate max-information.
Approximate max-information: $I_\infty^{\beta}(\boldsymbol{A}(\boldsymbol{S}); \boldsymbol{S})$

$\epsilon$-differential privacy:

$I_\infty^{\beta}(\boldsymbol{A}(\boldsymbol{S}); \boldsymbol{S}) \leq \epsilon^2 n + \epsilon \sqrt{n \log(1/\beta)}$

Description length:

$I_\infty^{\beta}(\boldsymbol{A}(\boldsymbol{S}); \boldsymbol{S}) \leq \log\!\left(|\mathrm{Range}(A)| / \beta\right)$

• Preserves generalization
• Composes adaptively
Reusable holdout

Data is split into a training set $T$ and a holdout set $H$. The analyst(s) work with $T$ directly, but access $H$ only through the Reusable Holdout algorithm: they submit queries $\phi_1, \phi_2, \ldots, \phi_m$ and receive answers $v_1, v_2, \ldots, v_m$.
Reusable holdout

Given a holdout set of $n$ i.i.d. samples, Thresholdout can $\tau$-estimate the expectation of $m$ adaptively-chosen functions, as long as at most $B$ of them overfit to the training set, for

$n \sim \frac{\sqrt{B} \cdot \log m}{\tau^2}$

Overfitting: $|\mathbf{E}_T[\phi] - \mathbf{E}_P[\phi]| > \tau/2$

Standard holdout (empirical mean): $n \sim \frac{m \log m}{\tau^2}$
Thresholdout algorithm

• Input: training set $T$, holdout set $H$, query budget $m$, correction budget $B$
• while ($m > 0$ and $B > 0$):
  – Get query $\phi$
  – If $|\mathbf{E}_H[\phi] - \mathbf{E}_T[\phi]| < 3\tau/4 + \eta$ ($\eta$ is noise): output $\mathbf{E}_T[\phi]$
  – Else:
    • Output $\mathbf{E}_H[\phi] + \zeta$ ($\zeta$ is noise)
    • $B \leftarrow B - 1$
  – $m \leftarrow m - 1$
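A compact sketch of Thresholdout in Python (our own code; the Laplace noise scales follow the spirit of the pseudocode above but the exact constants are illustrative, not the paper's):

```python
import numpy as np

class Thresholdout:
    """Sketch of the Thresholdout reusable holdout (constants illustrative)."""

    def __init__(self, train, holdout, tau, budget_B, rng=None):
        self.T = list(train)
        self.H = list(holdout)
        self.tau = tau                    # target accuracy
        self.B = budget_B                 # overfitting-correction budget
        self.rng = rng or np.random.default_rng()
        self.sigma = tau / 8              # illustrative Laplace scale
        self.eta = self.rng.laplace(scale=2 * self.sigma)

    def query(self, phi):
        """phi: data point -> [0, 1]. Returns an estimate of E_P[phi]."""
        if self.B <= 0:
            raise RuntimeError("correction budget B exhausted")
        mean_T = float(np.mean([phi(x) for x in self.T]))
        mean_H = float(np.mean([phi(x) for x in self.H]))
        # If training and holdout agree, releasing the training mean is safe.
        if abs(mean_H - mean_T) < 3 * self.tau / 4 + self.eta:
            return mean_T
        # Otherwise pay one unit of budget, refresh the threshold noise,
        # and release a noisy holdout estimate.
        self.B -= 1
        self.eta = self.rng.laplace(scale=2 * self.sigma)
        return mean_H + float(self.rng.laplace(scale=self.sigma))
```

The training set is used freely; only the answers that disagree with the holdout consume budget, which is what yields the $\sqrt{B}$ dependence via adaptive composition.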
Illustration

• Given points in $\mathbf{R}^d \times \{-1, +1\}$
• Pick variables correlated with the label on the training set. Validate on the holdout set.
• $V$ is the set of variables correlated on both training and holdout
• Check the prediction error of the linear classifier given by $\mathrm{sign}\!\left(\sum_{i \in V} s_i x_i\right)$, where $s_i$ is the sign of the correlation

Data distribution: 10,000 points from $N(0,1)^{10{,}000}$, randomly labeled. (See the sketch after this list.)
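A sketch of this experiment (our own Python; the sizes and the selection threshold are illustrative, smaller than the slide's 10,000 × 10,000 for speed). Because the labels are independent of the features, the true accuracy of any classifier is 50%; selecting variables using both splits makes the naively reused holdout estimate look much better than that:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2_000, 2_000
X, Xf = rng.standard_normal((2, n, d))        # data and a fresh copy
y, yf = rng.choice([-1.0, 1.0], size=(2, n))  # labels independent of the data

half = n // 2
tr, ho = slice(0, half), slice(half, n)
thresh = 1 / np.sqrt(half)                    # illustrative cutoff

# Adaptive selection: keep variables correlated with the label (same sign)
# on BOTH the training and the holdout split.
c_tr = X[tr].T @ y[tr] / half
c_ho = X[ho].T @ y[ho] / half
V = np.flatnonzero((np.abs(c_tr) > thresh) & (np.abs(c_ho) > thresh)
                   & (np.sign(c_tr) == np.sign(c_ho)))
s = np.sign(c_tr[V])

pred = lambda Z: np.sign(Z[:, V] @ s)
print("holdout accuracy:", np.mean(pred(X[ho]) == y[ho]))  # looks > 0.5
print("fresh-data accuracy:", np.mean(pred(Xf) == yf))     # ≈ 0.5 (truth)
```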
Further reading

Preserving Statistical Validity in Adaptive Data Analysis,
http://arxiv.org/abs/1411.2664 [STOC 2015]

Generalization in Adaptive Data Analysis and Holdout Reuse,
http://arxiv.org/abs/1506.02629 [NIPS 2015]

Overview: The reusable holdout: Preserving validity in adaptive data analysis [Science, 2015]

Come to the workshop on "Adaptive Data Analysis" at NIPS 2015! wadapt.org

• $n = \Omega(\sqrt{m}/\tau)$ for poly-time algorithms assuming OWFs exist [Hardt, Ullman 14; Steinke, Ullman 15]
• Stronger (tight) generalization for $(\epsilon, \delta)$-DP algorithms and any low-sensitivity queries [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]
• Application to maintaining an accurate leaderboard in ML competitions [Blum, Hardt 15]