MichaelHay.pdf

Enabling Accurate Analysis of
Private Network Data
Michael Hay
Joint work with
Gerome Miklau, David Jensen, Chao Li, Don Towsley
University of Massachusetts, Amherst
Vibhor Rastogi, Dan Suciu
University of Washington
October 8, 2009
1
Analysis of social networks
Social network derived
from mobile phone call
records [Onnela, PNAS 07]
4.6M nodes
7.0M edges
edge if reciprocal phone calls
during 18 week interval
2
Analysis of social networks
Social network derived
from mobile phone call
records [Onnela, PNAS 07]
4.6M nodes
7.0M edges
Can we enable analysts to study networks in a way
edge if reciprocal phone calls
that protects sensitive information about
participants?
during 18 week interval
2
How to achieve both privacy and utility?
DATA OWNER
ANALYST
A
B
C
W
D
V
X
K
J
E
O
M
Q
I
R
S
G
N
P
H
Z
U
Aa
Bb
Ee
Hh
Ff
Cc
what is diameter?
Q
what is maximum degree?
Q
how many 3 cliques?
Q
is Alice connected to Bob?
L
F
Y
Q
Gg
Dd
Private network
T
X
Allow aggregate statistics
provided facts about individuals are not disclosed
3
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Q
A = Q(G) + noise
Harry
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Q how many edges?
A = Q(G) + noise
Harry
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Harry
Q how many edges?
A = Q(G) + noise
11
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Harry
Q how many edges?
A = Q(G) + noise
11
+
2.3
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Harry
Q how many edges?
A = Q(G) + noise
11
+
2.3
13.3
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Harry
Q how many edges?
A = Q(G) + noise
11
+
2.3
13.3
• Dwork, McSherry, Nissim, Smith [Dwork, TCC 06] have described an
answer perturbation mechanism satisfying differential privacy.
• Comparatively few results for these techniques applied to graphs.
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Harry
Q how many edges?
A = Q(G) + noise
11
+
2.3
13.3
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
ANALYST
Q how many edges?
A = Q(G) + noise
Harry
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Dave
Fred
Alice
G’
Bob
Bob
Dave
Fred
Carol
Ed
Greg
Q how many edges?
A = Q(G) + noise
Harry
Carol
Ed
Greg
ANALYST
Q
A = Q(G’) + noise
Harry
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Dave
Fred
Alice
G’
Bob
Bob
Dave
Fred
Carol
Ed
Greg
Harry
Q how many edges?
A = Q(G) + noise
Pr[ A = x | μ = Q(G) ]
Carol
Ed
Greg
ANALYST
Q
A = Q(G’) + noise
Harry
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Dave
Fred
Alice
G’
Bob
Bob
Dave
Fred
Carol
Ed
Greg
Harry
Q how many edges?
A = Q(G) + noise
Pr[ A = x | μ = Q(G) ]
=p
Carol
Ed
Greg
ANALYST
Q
A = Q(G’) + noise
Harry
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Dave
Fred
Alice
G’
Bob
Bob
Dave
Fred
Carol
Ed
Greg
Harry
Q how many edges?
A = Q(G) + noise
Pr[ A = x | μ = Q(G) ]
=p
Carol
Ed
Greg
ANALYST
Harry
Q
A = Q(G’) + noise
Pr[ A = x | μ = Q(G’) ]
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Dave
Fred
Alice
G’
Bob
Bob
Dave
Fred
Carol
Ed
Greg
Harry
Q how many edges?
A = Q(G) + noise
Pr[ A = x | μ = Q(G) ]
=p
Carol
Ed
Greg
ANALYST
Harry
Q
A = Q(G’) + noise
Pr[ A = x | μ = Q(G’) ] = q
4
[Dwork, TCC 06]
Query answer perturbation
DATA OWNER
Alice
G
Dave
Fred
Alice
G’
Bob
Bob
Dave
Fred
Carol
Ed
Greg
Harry
Carol
Ed
Greg
ANALYST
Harry
Q how many edges?
A = Q(G) + noise
Pr[ A = x | μ = Q(G) ]
=p
differ by at most
ε
factor of e
Q
A = Q(G’) + noise
Pr[ A = x | μ = Q(G’) ] = q
4
[Dwork, TCC 06]
Calibrating noise
• The following algorithm for answering Q is ε-differentially private:
A
Q
Q(G) + Noise(ΔQ / ε)
ΔQ: Max change in Q, over any two graphs differing by single edge
5
[Dwork, TCC 06]
Calibrating noise
• The following algorithm for answering Q is ε-differentially private:
A
Q
Q(G) + Noise(ΔQ / ε)
true
answer
ΔQ: Max change in Q, over any two graphs differing by single edge
5
[Dwork, TCC 06]
Calibrating noise
• The following algorithm for answering Q is ε-differentially private:
A
Q
Q(G) + Noise(ΔQ / ε)
true
answer
sample from
Laplace distribution
ΔQ: Max change in Q, over any two graphs differing by single edge
5
[Dwork, TCC 06]
Calibrating noise
• The following algorithm for answering Q is ε-differentially private:
privacy
parameter
A
Q
Q(G) + Noise(ΔQ / ε)
true
answer
sample from
Laplace distribution
ΔQ: Max change in Q, over any two graphs differing by single edge
5
[Dwork, TCC 06]
Calibrating noise
• The following algorithm for answering Q is ε-differentially private:
sensitivity of Q
A
privacy
parameter
Q
Q(G) + Noise(ΔQ / ε)
true
answer
sample from
Laplace distribution
ΔQ: Max change in Q, over any two graphs differing by single edge
5
Positive results in differential privacy
• Some common analyses have low sensitivity: contingency tables,
histograms [Dwork, TCC 06]
• Data mining algorithms implemented using only low sensitivity queries:
PCA, k-Means, Decision Trees [Blum, PODS 05]
• Learning theory: possible to learn any concept class with polynomial VC
dimension; half-space queries can be learned efficiently [Blum, STOC 08]
• Many challenges remain...
• Beyond tabular data
• Optimal query strategies?
6
[ICDM 09]
Accurate degree distribution estimation is possible
• Degree distribution: the frequency of each degree in graph.
• A widely studied property of networks.
Histogram
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
4
3
2
1
0
0 1 2 3 4 5
Degree sequence
as a vector
Harry
[1,1,2,2,4,4,4,4]
CCDF
8
6
4
2
0
0 1 2 3 4 5
Degree sequence
4
3
2
1
0
1 2 3 4 5 6 7 8
7
[ICDM 09]
Accurate degree distribution estimation is possible
• Degree distribution: the frequency of each degree in graph.
• A widely studied property of networks.
Histogram
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
4
3
2
1
0
0 1 2 3 4 5
Degree sequence
as a vector
Harry
[1,1,2,2,4,4,4,4]
CCDF
8
6
4
2
0
0 1 2 3 4 5
Degree sequence
4
3
2
1
0
1 2 3 4 5 6 7 8
7
[ICDM 09]
Accurate degree distribution estimation is possible
• Degree distribution: the frequency of each degree in graph.
• A widely studied property of networks.
Histogram
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
4
3
2
1
0
0 1 2 3 4 5
Degree sequence
as a vector
Harry
[1,1,2,2,4,4,4,4]
CCDF
8
6
4
2
0
0 1 2 3 4 5
Degree sequence
4
3
2
1
0
1 2 3 4 5 6 7 8
7
[ICDM 09]
Accurate degree distribution estimation is possible
• Degree distribution: the frequency of each degree in graph.
• A widely studied property of networks.
Histogram
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
4
3
2
1
0
0 1 2 3 4 5
Degree sequence
as a vector
Harry
[1,1,2,2,4,4,4,4]
CCDF
8
6
4
2
0
0 1 2 3 4 5
Degree sequence
4
3
2
1
0
1 2 3 4 5 6 7 8
7
[ICDM 09]
Accurate degree distribution estimation is possible
• Degree distribution: the frequency of each degree in graph.
• A widely studied property of networks.
Histogram
Alice
G
Bob
Dave
Fred
Carol
Ed
Greg
4
3
2
1
0
0 1 2 3 4 5
Degree sequence
as a vector
Harry
[1,1,2,2,4,4,4,4]
CCDF
8
6
4
2
0
0 1 2 3 4 5
Degree sequence
4
3
2
1
0
1 2 3 4 5 6 7 8
7
Using the sort constraint
S(G) = [10, 10, ....10, 10, 14, 18,18,18,18]
8
20
Using the sort constraint
●
15
● ● ● ●
●
10
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
5
●
0
5
10
15
20
25
S(G) = [10, 10, ....10, 10, 14, 18,18,18,18]
8
20
Using the sort constraint
●
15
● ● ● ●
●
10
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
5
●
0
5
10
15
20
25
8
20
Using the sort constraint
●
●
15
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
●
● ● ● ●
●
●
0
5
10
●
●
● ●
●
●
●
●
● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ●
●
●
●
●
●
●
●
0
5
10
15
20
25
• The output of the sorted degree query is not (in general) sorted.
• We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance.
8
20
Using the sort constraint
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
●
●
15
●
10
●
●
●
●
●
● ●
● ●
● ●
●
● ●
●
●
● ● ●
●
●
●
●
●
0
5
●
0
5
10
15
20
25
• The output of the sorted degree query is not (in general) sorted.
• We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance.
8
20
Using the sort constraint
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
●
●
15
●
10
●
●
●
●
●
● ●
● ●
● ●
●
● ●
●
●
● ● ●
●
●
●
●
●
5
●
0
= 19th smallest
degree + noise
0
5
10
15
20
25
• The output of the sorted degree query is not (in general) sorted.
• We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance.
8
20
Using the sort constraint
●
●
15
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ●
● ● ● ●
5
10
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ●
● ● ● ● ●
● ●
●
● ● ●
●
●
●
●
● ●
●
●
●
0
= 19th smallest
degree + noise
0
5
10
15
20
25
• The output of the sorted degree query is not (in general) sorted.
• We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance.
8
original
noisy
inferred
Experimental results
9
19
49
0
1
4
9
19
49
1.0
0.8
0.4
0.2
0.0
0
1
4
9
19
49
99
0
1
4
9
19
49
99
4
9
49
499
0
4
9
49
499
0.8
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
0
1.0
4
1.0
1
1.0
0
0.0
ε=.01
0.6
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.0
0.2
ε=.001
powerlaw
α=1.5, n=5M
1.0
orkut
n=3.1M
1.0
livejournal
n=5.3M
9
20
After inference, noise only where needed
●
15
● ● ● ●
●
10
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
5
●
0
5
10
15
20
25
• Standard Laplace noise is sufficient but not necessary for differential privacy.
• By using inference, effectively apply a different noise distribution -- more noise
where it is needed, less otherwise.
• Improvement in accuracy will depend on sequence
10
20
After inference, noise only where needed
●
15
● ● ● ●
●
10
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
5
●
0
5
10
15
20
25
• Standard Laplace noise is sufficient but not necessary for differential privacy.
• By using inference, effectively apply a different noise distribution -- more noise
where it is needed, less otherwise.
• Improvement in accuracy will depend on sequence
10
20
After inference, noise only where needed
●
15
● ● ● ●
●
10
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
5
●
0
5
10
15
20
25
• Standard Laplace noise is sufficient but not necessary for differential privacy.
• By using inference, effectively apply a different noise distribution -- more noise
where it is needed, less otherwise.
• Improvement in accuracy will depend on sequence
10
20
After inference, noise only where needed
●
15
● ● ● ●
●
10
●
S(G) true degree sequence
noisy observations (ε = 2)
inferred degree sequence
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0
5
●
0
5
10
15
20
25
• Standard Laplace noise is sufficient but not necessary for differential privacy.
• By using inference, effectively apply a different noise distribution -- more noise
where it is needed, less otherwise.
• Improvement in accuracy will depend on sequence
10
20
After inference, noise only where needed
●
●
10
15
●
S(G) true degree sequence
noisy observations (ε = 2) Mean Squared
inferred
degree
sequence
Before
inference
Θ(n/!2 )
● ● ●
inference
● After
● ● ●
● ● ● ●
Error
● ●
● ●
●
3
● ●
2
O(dlog
n/!
● ● ● ● ● ● ●)
0
5
number of distinct degrees
0
5
10
15
20
25
• Standard Laplace noise is sufficient but not necessary for differential privacy.
• By using inference, effectively apply a different noise distribution -- more noise
where it is needed, less otherwise.
• Improvement in accuracy will depend on sequence
10
Conclusion
• Possible to accurately estimate degree distributions while providing
strong guarantees of privacy
• Other findings
• Some network analyses cannot be accurately answered under
differential privacy (clustering coefficient, motif analysis [Nissim,
STOC 07] [PODS 09] )
• Apply inference to other queries (e.g. histograms [CoRR 09])
• Future work: generate accurate synthetic networks under differential
privacy?
11
Questions?
Additional details on our work may be found here:
• [ICDM 09] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree
distribution of private networks. In International Conference on Data Mining (ICDM), To
appear, 2009.
• [CoRR 09] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of
differentially-private queries through consistency. CoRR, abs/0904.0942, 2009. (under
review)
• [PODS 09] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: Output
perturbation for queries with joins. In Principles of Database Systems (PODS), 2009.
• [VLDB 08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural
identification in anonymized social networks. In VLDB Conference, 2008.
http://www.cs.umass.edu/~mhay/
12
References
• [Blum, PODS 05] A. Blum C. Dwork, F. McSherry, and K. Nissim. Practical Privacy: The
SuLQ Framework. PODS, 2005.
• [Blum, STOC 08] A. Blum, K. Ligett, and A. Roth. Learning Theory Approach to NonInteractive Database Privacy, STOC 2008.
• [Onnela, PNAS 07] J. Onnela et al. Structure and tie strengths in mobile
communication networks, PNAS 2007.
• [Liu, SIGMOD 08] K. Liu and E. Terzi. Towards identity anonymization on graphs. In
SIGMOD, 2008.
• [Zhou, ICDE 08] B. Zhou and J. Pei. Preserving privacy in social networks against
neighborhood attacks. In ICDE, 2008.
• [Zou, VLDB 09] L. Zou, L. Chen, and T. Ozsu. K-automorphism: A general framework
for privacy preserving network publication. In Proceedings of VLDB Conference, 2009.
• [Ying, SDM 2008] X. Ying and X. Wu. Randomizing social networks: a spectrum
preserving approach. In SIAM International Conference on Data Mining, 2008.
13
References (con’t)
• [Cormode, VLDB 09] G. Cormode, D. Srivastava, S. Bhagat, and B. Krishnamurthy.
Class-based graph anonymization for social network data. In VLDB Conference, 2009.
• [Campan, PinKDD 08] A. Campan and T. M. Truta. A clustering approach for data and
structural anonymity in social networks. In PinKDD, 2008.
• [Rastogi, VLDB 07] V. Rastogi, S. Hong, and D. Suciu. The boundary between privacy
and utility in data publishing. In VLDB, pages 531–542, 2007.
• [Dwork, TCC 06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to
sensitivity in private data analysis. In Third Theory of Cryptography Conference, 2006.
• [Nissim, STOC 07] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and
sampling in private data analysis. In STOC, pages 75–84, 2007.
14