Enabling Accurate Analysis of Private Network Data Michael Hay Joint work with Gerome Miklau, David Jensen, Chao Li, Don Towsley University of Massachusetts, Amherst Vibhor Rastogi, Dan Suciu University of Washington October 8, 2009 1 Analysis of social networks Social network derived from mobile phone call records [Onnela, PNAS 07] 4.6M nodes 7.0M edges edge if reciprocal phone calls during 18 week interval 2 Analysis of social networks Social network derived from mobile phone call records [Onnela, PNAS 07] 4.6M nodes 7.0M edges Can we enable analysts to study networks in a way edge if reciprocal phone calls that protects sensitive information about participants? during 18 week interval 2 How to achieve both privacy and utility? DATA OWNER ANALYST A B C W D V X K J E O M Q I R S G N P H Z U Aa Bb Ee Hh Ff Cc what is diameter? Q what is maximum degree? Q how many 3 cliques? Q is Alice connected to Bob? L F Y Q Gg Dd Private network T X Allow aggregate statistics provided facts about individuals are not disclosed 3 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Q A = Q(G) + noise Harry 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Q how many edges? A = Q(G) + noise Harry 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Harry Q how many edges? A = Q(G) + noise 11 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Harry Q how many edges? A = Q(G) + noise 11 + 2.3 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Harry Q how many edges? A = Q(G) + noise 11 + 2.3 13.3 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Harry Q how many edges? A = Q(G) + noise 11 + 2.3 13.3 • Dwork, McSherry, Nissim, Smith [Dwork, TCC 06] have described an answer perturbation mechanism satisfying differential privacy. • Comparatively few results for these techniques applied to graphs. 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Harry Q how many edges? A = Q(G) + noise 11 + 2.3 13.3 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Bob Dave Fred Carol Ed Greg ANALYST Q how many edges? A = Q(G) + noise Harry 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Dave Fred Alice G’ Bob Bob Dave Fred Carol Ed Greg Q how many edges? A = Q(G) + noise Harry Carol Ed Greg ANALYST Q A = Q(G’) + noise Harry 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Dave Fred Alice G’ Bob Bob Dave Fred Carol Ed Greg Harry Q how many edges? A = Q(G) + noise Pr[ A = x | μ = Q(G) ] Carol Ed Greg ANALYST Q A = Q(G’) + noise Harry 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Dave Fred Alice G’ Bob Bob Dave Fred Carol Ed Greg Harry Q how many edges? A = Q(G) + noise Pr[ A = x | μ = Q(G) ] =p Carol Ed Greg ANALYST Q A = Q(G’) + noise Harry 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Dave Fred Alice G’ Bob Bob Dave Fred Carol Ed Greg Harry Q how many edges? A = Q(G) + noise Pr[ A = x | μ = Q(G) ] =p Carol Ed Greg ANALYST Harry Q A = Q(G’) + noise Pr[ A = x | μ = Q(G’) ] 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Dave Fred Alice G’ Bob Bob Dave Fred Carol Ed Greg Harry Q how many edges? A = Q(G) + noise Pr[ A = x | μ = Q(G) ] =p Carol Ed Greg ANALYST Harry Q A = Q(G’) + noise Pr[ A = x | μ = Q(G’) ] = q 4 [Dwork, TCC 06] Query answer perturbation DATA OWNER Alice G Dave Fred Alice G’ Bob Bob Dave Fred Carol Ed Greg Harry Carol Ed Greg ANALYST Harry Q how many edges? A = Q(G) + noise Pr[ A = x | μ = Q(G) ] =p differ by at most ε factor of e Q A = Q(G’) + noise Pr[ A = x | μ = Q(G’) ] = q 4 [Dwork, TCC 06] Calibrating noise • The following algorithm for answering Q is ε-differentially private: A Q Q(G) + Noise(ΔQ / ε) ΔQ: Max change in Q, over any two graphs differing by single edge 5 [Dwork, TCC 06] Calibrating noise • The following algorithm for answering Q is ε-differentially private: A Q Q(G) + Noise(ΔQ / ε) true answer ΔQ: Max change in Q, over any two graphs differing by single edge 5 [Dwork, TCC 06] Calibrating noise • The following algorithm for answering Q is ε-differentially private: A Q Q(G) + Noise(ΔQ / ε) true answer sample from Laplace distribution ΔQ: Max change in Q, over any two graphs differing by single edge 5 [Dwork, TCC 06] Calibrating noise • The following algorithm for answering Q is ε-differentially private: privacy parameter A Q Q(G) + Noise(ΔQ / ε) true answer sample from Laplace distribution ΔQ: Max change in Q, over any two graphs differing by single edge 5 [Dwork, TCC 06] Calibrating noise • The following algorithm for answering Q is ε-differentially private: sensitivity of Q A privacy parameter Q Q(G) + Noise(ΔQ / ε) true answer sample from Laplace distribution ΔQ: Max change in Q, over any two graphs differing by single edge 5 Positive results in differential privacy • Some common analyses have low sensitivity: contingency tables, histograms [Dwork, TCC 06] • Data mining algorithms implemented using only low sensitivity queries: PCA, k-Means, Decision Trees [Blum, PODS 05] • Learning theory: possible to learn any concept class with polynomial VC dimension; half-space queries can be learned efficiently [Blum, STOC 08] • Many challenges remain... • Beyond tabular data • Optimal query strategies? 6 [ICDM 09] Accurate degree distribution estimation is possible • Degree distribution: the frequency of each degree in graph. • A widely studied property of networks. Histogram Alice G Bob Dave Fred Carol Ed Greg 4 3 2 1 0 0 1 2 3 4 5 Degree sequence as a vector Harry [1,1,2,2,4,4,4,4] CCDF 8 6 4 2 0 0 1 2 3 4 5 Degree sequence 4 3 2 1 0 1 2 3 4 5 6 7 8 7 [ICDM 09] Accurate degree distribution estimation is possible • Degree distribution: the frequency of each degree in graph. • A widely studied property of networks. Histogram Alice G Bob Dave Fred Carol Ed Greg 4 3 2 1 0 0 1 2 3 4 5 Degree sequence as a vector Harry [1,1,2,2,4,4,4,4] CCDF 8 6 4 2 0 0 1 2 3 4 5 Degree sequence 4 3 2 1 0 1 2 3 4 5 6 7 8 7 [ICDM 09] Accurate degree distribution estimation is possible • Degree distribution: the frequency of each degree in graph. • A widely studied property of networks. Histogram Alice G Bob Dave Fred Carol Ed Greg 4 3 2 1 0 0 1 2 3 4 5 Degree sequence as a vector Harry [1,1,2,2,4,4,4,4] CCDF 8 6 4 2 0 0 1 2 3 4 5 Degree sequence 4 3 2 1 0 1 2 3 4 5 6 7 8 7 [ICDM 09] Accurate degree distribution estimation is possible • Degree distribution: the frequency of each degree in graph. • A widely studied property of networks. Histogram Alice G Bob Dave Fred Carol Ed Greg 4 3 2 1 0 0 1 2 3 4 5 Degree sequence as a vector Harry [1,1,2,2,4,4,4,4] CCDF 8 6 4 2 0 0 1 2 3 4 5 Degree sequence 4 3 2 1 0 1 2 3 4 5 6 7 8 7 [ICDM 09] Accurate degree distribution estimation is possible • Degree distribution: the frequency of each degree in graph. • A widely studied property of networks. Histogram Alice G Bob Dave Fred Carol Ed Greg 4 3 2 1 0 0 1 2 3 4 5 Degree sequence as a vector Harry [1,1,2,2,4,4,4,4] CCDF 8 6 4 2 0 0 1 2 3 4 5 Degree sequence 4 3 2 1 0 1 2 3 4 5 6 7 8 7 Using the sort constraint S(G) = [10, 10, ....10, 10, 14, 18,18,18,18] 8 20 Using the sort constraint ● 15 ● ● ● ● ● 10 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 S(G) = [10, 10, ....10, 10, 14, 18,18,18,18] 8 20 Using the sort constraint ● 15 ● ● ● ● ● 10 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 8 20 Using the sort constraint ● ● 15 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● 0 5 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 20 25 • The output of the sorted degree query is not (in general) sorted. • We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance. 8 20 Using the sort constraint S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● 15 ● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 • The output of the sorted degree query is not (in general) sorted. • We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance. 8 20 Using the sort constraint S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● 15 ● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● 0 = 19th smallest degree + noise 0 5 10 15 20 25 • The output of the sorted degree query is not (in general) sorted. • We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance. 8 20 Using the sort constraint ● ● 15 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● 5 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 = 19th smallest degree + noise 0 5 10 15 20 25 • The output of the sorted degree query is not (in general) sorted. • We derive a new sequence by computing the closest nondecreasing sequence: i.e. minimizing L2 distance. 8 original noisy inferred Experimental results 9 19 49 0 1 4 9 19 49 1.0 0.8 0.4 0.2 0.0 0 1 4 9 19 49 99 0 1 4 9 19 49 99 4 9 49 499 0 4 9 49 499 0.8 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 0.6 0.4 0.2 0 1.0 4 1.0 1 1.0 0 0.0 ε=.01 0.6 0.8 0.0 0.2 0.4 0.6 0.8 0.6 0.4 0.0 0.2 ε=.001 powerlaw α=1.5, n=5M 1.0 orkut n=3.1M 1.0 livejournal n=5.3M 9 20 After inference, noise only where needed ● 15 ● ● ● ● ● 10 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 • Standard Laplace noise is sufficient but not necessary for differential privacy. • By using inference, effectively apply a different noise distribution -- more noise where it is needed, less otherwise. • Improvement in accuracy will depend on sequence 10 20 After inference, noise only where needed ● 15 ● ● ● ● ● 10 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 • Standard Laplace noise is sufficient but not necessary for differential privacy. • By using inference, effectively apply a different noise distribution -- more noise where it is needed, less otherwise. • Improvement in accuracy will depend on sequence 10 20 After inference, noise only where needed ● 15 ● ● ● ● ● 10 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 • Standard Laplace noise is sufficient but not necessary for differential privacy. • By using inference, effectively apply a different noise distribution -- more noise where it is needed, less otherwise. • Improvement in accuracy will depend on sequence 10 20 After inference, noise only where needed ● 15 ● ● ● ● ● 10 ● S(G) true degree sequence noisy observations (ε = 2) inferred degree sequence ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 ● 0 5 10 15 20 25 • Standard Laplace noise is sufficient but not necessary for differential privacy. • By using inference, effectively apply a different noise distribution -- more noise where it is needed, less otherwise. • Improvement in accuracy will depend on sequence 10 20 After inference, noise only where needed ● ● 10 15 ● S(G) true degree sequence noisy observations (ε = 2) Mean Squared inferred degree sequence Before inference Θ(n/!2 ) ● ● ● inference ● After ● ● ● ● ● ● ● Error ● ● ● ● ● 3 ● ● 2 O(dlog n/! ● ● ● ● ● ● ●) 0 5 number of distinct degrees 0 5 10 15 20 25 • Standard Laplace noise is sufficient but not necessary for differential privacy. • By using inference, effectively apply a different noise distribution -- more noise where it is needed, less otherwise. • Improvement in accuracy will depend on sequence 10 Conclusion • Possible to accurately estimate degree distributions while providing strong guarantees of privacy • Other findings • Some network analyses cannot be accurately answered under differential privacy (clustering coefficient, motif analysis [Nissim, STOC 07] [PODS 09] ) • Apply inference to other queries (e.g. histograms [CoRR 09]) • Future work: generate accurate synthetic networks under differential privacy? 11 Questions? Additional details on our work may be found here: • [ICDM 09] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In International Conference on Data Mining (ICDM), To appear, 2009. • [CoRR 09] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. CoRR, abs/0904.0942, 2009. (under review) • [PODS 09] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: Output perturbation for queries with joins. In Principles of Database Systems (PODS), 2009. • [VLDB 08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural identification in anonymized social networks. In VLDB Conference, 2008. http://www.cs.umass.edu/~mhay/ 12 References • [Blum, PODS 05] A. Blum C. Dwork, F. McSherry, and K. Nissim. Practical Privacy: The SuLQ Framework. PODS, 2005. • [Blum, STOC 08] A. Blum, K. Ligett, and A. Roth. Learning Theory Approach to NonInteractive Database Privacy, STOC 2008. • [Onnela, PNAS 07] J. Onnela et al. Structure and tie strengths in mobile communication networks, PNAS 2007. • [Liu, SIGMOD 08] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD, 2008. • [Zhou, ICDE 08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, 2008. • [Zou, VLDB 09] L. Zou, L. Chen, and T. Ozsu. K-automorphism: A general framework for privacy preserving network publication. In Proceedings of VLDB Conference, 2009. • [Ying, SDM 2008] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In SIAM International Conference on Data Mining, 2008. 13 References (con’t) • [Cormode, VLDB 09] G. Cormode, D. Srivastava, S. Bhagat, and B. Krishnamurthy. Class-based graph anonymization for social network data. In VLDB Conference, 2009. • [Campan, PinKDD 08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008. • [Rastogi, VLDB 07] V. Rastogi, S. Hong, and D. Suciu. The boundary between privacy and utility in data publishing. In VLDB, pages 531–542, 2007. • [Dwork, TCC 06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference, 2006. • [Nissim, STOC 07] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75–84, 2007. 14