How to exploit network properties to improve learning in relational domains
Jennifer Neville
Departments of Computer Science and Statistics, Purdue University
(joint work with Brian Gallagher, Timothy La Fond, Sebastian Moreno, Joseph Pfeiffer and Rongjing Xiang)

Relational network classification examples
• Predict organizational roles from communication patterns (email networks)
• Predict protein function from interaction patterns (gene/protein networks)
• Predict paper topics from properties of cited papers (scientific networks)
• Predict content changes from properties of hyperlinked pages (world wide web)
• Predict personal preferences from characteristics of friends (social networks)
• Predict group effectiveness from communication patterns (organizational networks)

Network data is heterogeneous and interdependent, partially observed/labeled, dynamic and/or non-stationary, and often drawn from a single network
...thus many traditional ML methods developed for i.i.d. data do not apply

Machine learning 101
Building a learner means choosing and combining four components:
1. Data representation
2. Knowledge representation: defines the model space. A generic form is the linear equation $y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_0$; an example for the Alzheimer's data would be $\Delta CDR = 0.12\,MM1 + 0.34\,SBScore + \ldots + \beta_0$
3. Objective function, e.g., squared loss: $L_{sq}(D) = \frac{1}{N} \sum_{i=1}^{|D|} (f(x_i) - y_i)^2$
4. Search algorithm (e.g., optimization)
Learning identifies the model that maximizes the objective function on the training data; the model is then applied for prediction on new data from the same distribution.

Relational learning
The same four components, instantiated for relational data (email, social, scientific, gene/protein, web, and organizational networks) and relational models.

1. Data representation: relational data
[Figure: example relational schema over Firm, Broker (Bk), Branch (Bn), and Disclosure entities, with attributes such as Size, Region, Area, Layoffs, Type, Problem In Past Year, Has Business, Is Problem, and On Watchlist]

2. Knowledge representation: relational models. There has been a great deal of work on templated graphical model representations for relational data: RBNs, RMNs, MLNs, GMNs, RDNs, IHRMs, DAPER models, PRMs. Since these models are learned from and rolled out over data networks, we also need notation that distinguishes the model network from the data network.

Data network
[Figure: social network whose nodes carry attributes Gender, Married, Politics, Religion; some nodes are labeled, others are not]
Relational learning task: e.g., predict political views based on a user's intrinsic attributes and the political views of their friends. Estimate the joint distribution over the attributed network, $P(Y \mid \{X\}^n, G)$, or the conditional distribution $P(Y_i \mid X_i, X_R, Y_R)$.
Note: we often have only a single network to learn from.

Define structure of graphical model
Relational templates specify the dependencies: across each edge, a template linking Politics_i and Politics_j; within each node, templates linking Politics_i to Gender_i, Married_i, and Religion_i. Generically, a relational template linking $Y_i$ and $Y_j$ combines with attribute templates linking $Y_i$ to $X_i^1$, $X_i^2$, $X_i^3$ to form the model template. Applying the model template to the data network "rolls out" the model network (the graphical model).
[Figure: model template rolled out over a small data network, with one $Y_i$-$Y_j$ factor per edge and $Y_i$-$X_i^k$ factors per node]
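To make the rollout concrete, here is a minimal sketch (illustrative code, not from the talk; the node and factor names are invented) of how a pairwise model template instantiates factors over a given data network:

    # Minimal sketch: roll out a model template over a data network.
    # The template contributes one relational factor (Y_i, Y_j) per edge
    # and one attribute factor (Y_i, X_i) per node.

    def roll_out(nodes, edges):
        """Instantiate template factors over a data network."""
        factors = []
        for i, j in edges:          # relational template: Y_i -- Y_j
            factors.append(("rel", i, j))
        for i in nodes:             # attribute template: Y_i -- X_i
            factors.append(("attr", i))
        return factors

    # The same template yields different model structures on different networks:
    print(roll_out(nodes=[0, 1, 2], edges=[(0, 1), (1, 2)]))   # chain
    print(roll_out(nodes=[0, 1, 2], edges=[(0, 1), (0, 2)]))   # star

Even in this toy version, the point of the next slides is visible: the factor structure of the rolled-out model is determined by the edge structure of the data network.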
3. Objective function
4. Search algorithm: e.g., convex optimization

Learn the model parameters from a fully labeled network:
$P(y_G \mid x_G) = \frac{1}{Z(\theta, x_G)} \prod_T \prod_{C \in C(T(G))} \phi_T(x_C, y_C; \theta_T)$
Then apply the model to make predictions in another network drawn from the same distribution.
[Figure: model template rolled out over a separate test network]

Collective classification uses the full joint, rolled-out model for inference... but labeled nodes impact the final model structure.
[Figure: rolled-out model network in which some nodes are labeled; the labeled nodes alter the effective structure used for inference]

The structure of "rolled-out" relational graphical models is determined by the structure of the underlying data network, including the location and availability of labels. ...this can impact the performance of learning and inference methods, via the representation, the objective function, and the search algorithm. And networks are much, much larger in practice...

Finding 1: Representation
The implicit assumption is that nodes of the same type should be identically distributed, but many relational representations cannot ensure this holds for varying graph structures.

I.I.D. assumption revisited
• Current relational models do not impose the same marginal invariance condition that is assumed for IID models, which can impair generalization: $p(y_A \mid x_A) \neq p(y_E \mid x_E)$ due to varying graph structure
[Figure: six-node network A-F in which nodes A and E have different local structure, hence different marginals]
• The Markov relational network representation does not allow us to explicitly specify the form of the marginal probability distributions, thus it is difficult to impose any equality constraints on the marginals

Is there an alternative approach?
• Goal: combine the marginal invariance advantages of IID models with the ability to model relational dependence
• Incorporate node attributes in a general way (similar to IID classifiers)
• Idea: apply copulas to combine marginal models with dependence structure
Copula theory: we can construct an n-dimensional vector with arbitrary marginals while preserving the desired dependence structure. Sample $(t_1, t_2, \ldots, t_n)$ jointly from the dependence model, then set $z_i = F_i^{-1}(\Phi_i(t_i))$, so that each $z_i$ is marginally distributed as $F_i$.

Let's start with a reformulation of IID classifiers...
• General form of probabilistic binary classification: $p(y_i = 1) = F(\eta(x_i))$, e.g., logistic regression
• Now view $F$ as the CDF of a distribution symmetric around 0 to obtain a latent variable formulation: $z_i \sim p(z_i = z \mid x_i = x) = f(z - \eta(x_i))$, with $y_i = \mathrm{sign}(z_i)$
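As a quick check on this latent-variable view, here is a small sketch (assuming logistic noise and an illustrative score value, not anything from the talk) showing that thresholding $z_i$ at zero recovers the logistic regression probability:

    import numpy as np

    rng = np.random.default_rng(0)
    eta = 0.7   # illustrative score eta(x_i) for a single instance

    # Latent-variable view: z_i = eta(x_i) + noise, with standard logistic noise
    # (symmetric around 0; its CDF is the sigmoid function)
    z = eta + rng.logistic(loc=0.0, scale=1.0, size=1_000_000)
    y = np.sign(z)

    # Direct view: p(y_i = 1) = F(eta(x_i)) with F = sigmoid
    print((y > 0).mean())             # ~0.668
    print(1 / (1 + np.exp(-eta)))     # 0.668...

The noise term is the random effect discussed next: independent per instance in the IID setting, but available to be coupled across linked instances.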
• $z$ is a continuous variable, capturing random effects that are not present in $x$; $f$ is the PDF corresponding to $F$
• In IID models, the random effect for each instance is independent and thus can be integrated out
• When links among instances are observed, the correlations among their class labels can be modeled through dependence among the $z$'s
• Key question: how to model the dependence among the $z$'s while preserving the marginals?

Copula Latent Markov Network (CLMN) (Xiang and N., WSDM'13)
[Figure: IID classifiers vs. CLMN; in CLMN the latent variables $z_i$ are coupled across observed links]

The CLMN model
• Sample $t$ from the desired joint dependency: $(t_1, t_2, \ldots, t_n) \sim$ dependence model
• Apply the marginal transformation to obtain the latent variable $z$: $z_i = F_i^{-1}(\Phi_i(t_i))$. The marginal CDF $\Phi_i$ transforms $t_i$ to a uniform [0,1] r.v. $u_i$; the quasi-inverse of the CDF $F_i$ is used to obtain $z_i$ from $u_i$; the attributes moderate the corresponding pdf $f_i$
• Classification: $y_i = \mathrm{sign}(z_i)$
A sketch of this sampling recipe follows below.
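Here is a minimal sketch of that recipe (a Gaussian copula with logistic marginals; the correlation matrix and scores are illustrative stand-ins, not values from the paper):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Illustrative dependence structure: Gaussian copula with correlation R
    R = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.6],
                  [0.3, 0.6, 1.0]])
    # Illustrative marginals F_i: logistic, centered at eta(x_i) from the marginal model
    etas = np.array([0.5, -1.0, 2.0])

    # Step 1: sample t jointly from the dependence model
    t = rng.multivariate_normal(mean=np.zeros(3), cov=R, size=100_000)

    # Step 2: u_i = Phi(t_i) is uniform on [0,1]; z_i = F_i^{-1}(u_i) is marginally F_i
    u = stats.norm.cdf(t)
    z = stats.logistic.ppf(u, loc=etas)

    # Step 3: classify by the sign of the latent variable
    y = np.sign(z)

    # The z_i inherit the dependence from R, but each keeps its own marginal:
    print(np.corrcoef(z, rowvar=False).round(2))   # off-diagonals near 0.6 / 0.3
    print((y > 0).mean(axis=0).round(3))           # ~ sigmoid(etas) = [0.622, 0.269, 0.881]

This is the marginal-invariance property in action: changing R (the analogue of varying graph structure) changes the dependence but leaves each node's marginal $P(y_i = 1)$ untouched.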
CLMN implementation
Estimation:
• First, learn the marginal model (logistic regression) as if the instances were IID
• Next, learn the dependence model (a Gaussian Markov network) conditioned on the marginal model... but the GMN has no parameters to learn
Inference:
• Conditional inference in copulas had not previously been considered for large-scale networks
• For efficient inference, we developed a message-passing algorithm based on EP

Experimental results
[Figure: classification performance of CLMN vs. SocDim, RMN, GMN, and LR on Facebook, IMDB, and Gene data]
Key idea: ensuring that nodes with varying graph structure have identical marginals improves learning.

Finding 2: Search
The graph+attribute space is too large to sample thoroughly, but efficient generative graph models can be exploited to search more effectively.

How can we efficiently generate attributed graph samples from the underlying joint distribution $P(X, Y, G)$? The space is $O(2^{V^2 + V \cdot p})$, so effective sampling from the joint is difficult.

Naive sampling approach: assume independence between the graph and the attributes:
$P_E(X, E \mid \Theta_E, \Theta_X) = P_E(E \mid \Theta_E) \, P(X \mid \Theta_X)$ (graph model x attribute model)

Problem with the naive approach: although the graph structure and the marginal attribute distribution can each be captured, graphs sampled this way do not reproduce the correlation of attribute values across edges.
[Figure: frequencies of edge-endpoint attribute-value combinations (0-0, 1-1, 0-1) in the original vs. naively sampled graphs]

Solution: use the graph model to propose edges, but sample conditional on the node attribute values:
$P_E(X, E \mid \Theta_E, \Theta_X) = P_E(E \mid X, \Theta_E, \Theta_X) \, P(X \mid \Theta_E, \Theta_X)$
i.e., use conditional probabilities in an accept-reject process on the proposed edges. This exploits an efficient generative graph model as the proposal distribution.
• What to use as acceptance probabilities? The ratio of the observed probabilities in the original data to the sampled probabilities resulting from the naive approach
[Figure: attribute-value combination frequencies (0-0, 1-1, 0-1) for the original vs. naively sampled graphs, whose ratio gives the acceptance probabilities]
• This corresponds to rejection sampling
• Proposal distribution: $P_E(E_{ij} = 1 \mid \Theta_E)$
• True distribution: $P_o(E_{ij} = 1 \mid f(x_i, x_j), \Theta_E, \Theta_X)$

Attributed graph models (AGM) (Pfeiffer, La Fond, Moreno, N. & Gallagher WWW'14)
1. Learn the attribute and graph models
2. Generate a graph with the naive approach
3. Compute the acceptance ratios A
4. Sample node attributes
5. Sample edges:

       while not enough edges:
           draw (vi, vj) from Q'  (the graph model)
           U ~ Uniform(0, 1)
           if U < A(xi, xj):
               add (vi, vj) to the edges
       return edges

[Figure: possible edges among example nodes a-i, grouped by attribute-value combination (0-0, 1-1, 0-1)]

Theorem 1: AGM samples from the joint distribution of edges and attributes, $P(E_{ij} = 1 \mid f(x_i, x_j), \Theta_E, \Theta_X) \, P(x_i, x_j \mid \Theta_X)$.
Corollary 2: The expected AGM degree equals the expected degree of the structural graph model.

Empirical results on Facebook data
[Figure: attribute correlation (political views) captured by AGM-TCL, AGM-FCL, AGM-KPGM (2x2), and AGM-KPGM (3x3) vs. TCL, FCL, KPGM (2x2), and KPGM (3x3); the AGM variants match the Facebook correlation (~0.4), the non-AGM variants do not]
AGM preserves the characteristics of the underlying graph model while also capturing the attribute correlation.
Key idea: statistical models of graphs can be exploited to improve sampling from the full joint $P_E(E, X \mid \Theta_E, \Theta_X)$.

Relational learning: summary
1. Data representation
2. Knowledge representation: representations affect our ability to enforce invariance assumptions
3. Objective function: conventional objective functions do not behave as expected in partially labeled networks (not in this talk)
4. Search algorithm: simpler (graph) models can be used to statistically "prune" the search space

Conclusion
• Relational models have been shown to significantly improve predictions through the use of joint modeling and collective inference
• But since the (rolled-out) model structure depends on the structure of the underlying data network...
• ...we need to understand how the data graph affects model/algorithm characteristics in order to better exploit relational information for learning and prediction
• A careful consideration of the interactions among data representation, knowledge representation, objective function, and search algorithm will improve our understanding of the mechanisms that impact performance, and this will form the foundation for improved algorithms and methodology

Thanks to alumni and group members: Hoda Eldardiry, Rongjing Xiang, Chris Mayfield, Karthik Nagaraj, Umang Sharan, Sebastian Moreno, Nesreen Ahmed, Hyokun Yun, Suvidha Kancharla, Tao Wang, Timothy La Fond, Joel Pfeiffer, Ellen Lai, Pablo Granda, Hogun Park

Questions?
[email protected]
www.cs.purdue.edu/~neville