Biomedical Data Mining with Matrix Models SDM 2016 Tutorial Part I May 5, 2016 Fei Wang Associate Professor Department of Computer Science and Engineering School of Engineering University of Connecticut [email protected] Biomedical Data EHR Electronic Health Record (EHR) is an evolving concept defined as a systematic collection of electronic health information about individual patients or populations Jensen, Peter B., Lars J. Jensen, and SØren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012). Medical Imaging X-ray Computed Tomography (CT) Positron Emission Tomography (PET) Magnetic Resonance Imaging (MRI) Drug A chemical compound is a pure chemical substance consisting of two or more different chemical elements that can be separated into simpler substances by chemical reactions https://pubchem.ncbi.nlm.nih.gov/ Drug The term biological target is frequently used in pharmaceutical research to describe the native protein in the body whose activity is modified by a drug resulting in a desirable therapeutic effect. In this context, the biological target is often referred to as a drug target. http://www.uniprot.org/help/uniprotkb Gene DNA: A long molecule that looks like a twisted ladder. It is made of four types of simple units and the sequence of these units carries information, just as the sequence of letters carries information on a page. Gene expression: The process in which the information encoded in a gene is converted into a form useful for the cell. The first step is transcription, which produces a messenger RNA molecule complementary to the DNA molecule on which a gene is encoded. For protein-coding genes, the second step is translation, in which the messenger RNA is read by the ribosome to produce a protein. All definitions from Wikipedia Gene: A segment of DNA. Genes are like sentences made of the "letters" of the nucleotide alphabet, between them genes direct the physical development and behavior of an organism. Genes are like a recipe or instruction book, providing information that an organism needs so it can build or do something like making an eye or a leg, or repairing a wound. A Single Nucleotide Polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring commonly within a population (e.g. 1%) in which a single nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. Physiology Data Physiology is the scientific study of function in living systems. A sub-discipline of biology, its focus is in how organisms, organ systems, organs, cells, and bio-molecules carry out the chemical or physical functions that exist in a living system Patient Survey Online Social Media Environmental Data 1. Clinicalquestions 2. Insightsfromthedata 3. Predictivemodels From Nigam Shah’s slides Matrix? Vector Collection Feature x1 x2 x3 x4 ... ... ... xN Longitudinal Records Wang, Fei, et al. "Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach." Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012. Graph Heart Attack Expanded Wheezing Expanded Cough Ankle Edema Initial Fever Expanded Chest Pain Rales Initial Sondhi, Parikshit, et al. "SympGraph: a framework for mining clinical notes through symptom relation graphs." Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012. Graph Tom, Male,30 Hypertension John, Male,53 HeartFailure Roy,Male,40 Sepsis Sara,Female,28 Asthma Natasha,Female,42 Hyperlipidemia Jack,Male,30 Pneumonia Wang, Fei, and Jimeng Sun. "PSF: A Unified Patient Similarity Evaluation Framework Through Metric Learning With Weak Supervision." Biomedical and Health Informatics, IEEE Journal of 19.3 (2015): 1053-1060. Graph Liu, Chuanren, Fei Wang, Jianying Hu, and Hui Xiong. "Temporal Phenotyping from Longitudinal Electronic Health Records: A Graph Based Framework." In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 705-714. ACM, 2015. Drug Graph Precision Medicine Drug Repurposing Prognosis/Risk Prediction Patient Graph Drug-Target Interaction DNA Sequencing Disease-Gene Association Disease Graph 19 Gene Graph Matrix Methods Spectrum Analysis Spectral Clustering •Algorithms that cluster points using eigenvectors of matrices derived from the data •Obtain data representation in the low-dimensional space that can be easily clustered •Variety of methods that use the eigenvectors differently •Difficult to understand…. Graph Theory Basics • A graph G = (V,E) consists of a vertex set V and an edge set E. • If G is a directed graph, each edge is an ordered pair of vertices • A bipartite graph is one in which the vertices can be divided into two groups, so that all edges join vertices in different groups. Similarity Graph •Distance decrease similarity increase •Represent dataset as a weighted graph G(V,E) •Wij represent similarity between vertex •If Wij=0 where isn’t similarity; Wii=0 V={xi} E={Wij} Set of n vertices representing data points Set of weighted edges indicating pair-wise similarity between points 0.1 0.8 5 1 0.8 0.8 0.6 2 6 4 0.8 3 0.2 0.7 Graph Partitioning •Clustering can be viewed as partitioning a similarity graph •Bi-partitioning task: •Divide vertices into two disjoint groups (A,B) A 1 2 4 3 B 5 6 Partitioning Criterion • Traditional definition of a “good” clustering: • Points assigned to same cluster should be highly similar. • Points assigned to different clusters should be highly dissimilar 0.8 0.1 1 5 0.8 0.6 2 0.8 0.8 3 6 4 0.2 0.7 Graph Cut •Express partitioning objectives as a function of the “edge cut” of the partition. •Cut: Set of edges with only one vertex in a group.we wants to find the minimal cut between groups. The groups that has the minimal cut would be the partition A B 0.1 0.8 1 0.8 0.6 2 0.8 3 5 6 4 0.2 0.8 0.7 Min-Cut Minimize weight of connections between groups Optimal cut Minimum cut Normalized Cut •Consider the connectivity between groups relative to the density of each group •Normalize the association between groups by volume. • Vol(A): The total weight of the edges originating from group A. • Why use this criterion? • Minimizing the normalized cut is equivalent to maximizing normalized association. • Produces more balanced partitions. Spectral Graph Theory •Possible • • approach Represent a similarity graph as a matrix Apply knowledge from Linear Algebra… • The eigenvalues and eigenvectors of a matrix provide global information about its structure. • Spectral Graph Theory • Analyze the “spectrum” of matrix representing a graph. • Spectrum : The eigenvectors of a graph, ordered by the magnitude(strength) of their corresponding eigenvalues Matrix Representation 0.1 0.8 5 1 0.8 0.6 2 0.8 0.8 6 4 0.7 0.2 3 0.1 0.8 5 1 0.8 0.6 2 0.8 3 6 4 0.2 0.8 0.7 x1 x2 x3 x4 x5 x6 x1 0 0.8 0.6 0 0.1 0 x2 0.8 0 0.8 0 0 0 x3 0.6 0.8 0 0.2 0 0 x4 0 0 0.2 0 0.8 0.7 x5 0.1 0 0 0.8 0 0.8 x6 0 0 0 0.7 0.8 0 x1 x2 x3 x4 x5 x6 x1 1.5 0 0 0 0 0 x2 0 1.6 0 0 0 0 x3 0 0 1.6 0 0 0 x4 0 0 0 1.7 0 0 x5 0 0 0 0 1.7 0 x6 0 0 0 0 0 1.5 Laplacian Matrix x1 x2 x3 x4 x5 x6 x1 0 0.8 0.6 0 0.1 0 x2 0.8 0 0.8 0 0 0 x3 0.6 0.8 0 0.2 0 0 x4 0 0 0.2 0 0.8 0.7 0 x5 0.1 0 0 0.8 0 0.8 1.5 x6 0 0 0 0.7 0.8 0 x1 x2 x3 x4 x5 x6 x1 1.5 0 0 0 0 0 x2 0 1.6 0 0 0 0 x3 0 0 1.6 0 0 0 x4 0 0 0 1.7 0 0 x5 0 0 0 0 1.7 x6 0 0 0 0 0 - 0.1 0.8 5 1 0.8 0.6 2 0.8 6 4 0.7 3 0.2 0.8 = x1 x2 x3 x4 x5 x6 x1 1.5 -0.8 -0.6 0 -0.1 0 x2 -0.8 1.6 -0.8 0 0 0 x3 -0.6 -0.8 1.6 -0.2 0 0 x4 0 0 -0.2 1.7 -0.8 -0.7 x5 -0.1 0 0 0.8- 1.7 -0.8 x6 0 0 0 -0.7 -0.8 1.5 Normalized Laplacian 0.1 0.8 5 1 0.8 0.6 2 0.8 0.8 6 4 1.00 -0.52 -0.39 0.00 -0.06 0.00 -0.52 1.00 -0.50 0.00 0.00 0.00 -0.39 -0.50 1.00 0.00 0.00 0.00 0.00 -0.12 1.00 -0.47 -0.44 -0.06 0.00 0.00 0.47- 1.00 -0.50 0.00 0.00 0.00 -0.44 -0.50 1.00 -0.12 0.7 3 0.2 Spectral Clustering • Three basic stages: • Pre-processing • Construct a matrix representation of the dataset. • Decomposition • Compute eigenvalues and eigenvectors of the matrix. • Map each point to a lower-dimensional representation based on one or more eigenvectors. • Grouping • Assign points to two or more clusters, based on the new representation. 625 25 Magic Sigma σ = 0.035897 σ = 0.015625 = 0.054409 σ =σ 0.035897 σσ = 0.041235 = 0.054409 σ = 0.054409 σ0.015625 = 0.35355 σσ==0.35355 σ = 0.03125 σσ == 0.035897 0.35355 σ = 0.03125 σ=1 σσ==0.35355 1 σ=1 Figure 1: Spectral clustering without local scaling (usin When the data incorporates multiple scales standard spec the optimal σ for each example (displayed on each figure) t row: Clustering results for the top-left point-set with diffe the high impact σ has on the clustering quality. In all the was set manually. The data points were normalized to occ Local Scaling Instead of selecting a single scaling parameter σ, calculate a local scaling parameter σi for each data point Zelnik-Manor, Lihi, and Pietro Perona. "Self-tuning spectral clustering." Advances in neural information processing systems. 2004. Wine Spill Spill a drop of wine on the cloth Spread/diffuse to the neighborhood 37 38 Label Propagation 5 4 7 6 8 3 10 1 9 2 j and i are linked by an edge Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. "Learning with local and global consistency." Advances in neural information processing systems 16, no. 16 (2004): 321-328. Perron-Frobenius Theorem Label Propagation … If W is a stochastic matrix, its spectral radius will be no larger than 1 Will this procedure converge? 39 40 Stochastic Matrix We can use the row-normalized similarity matrix as the new similarity matrix Row Normalization The similarities are no longer symmetric Xiaojin Zhu, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002. 41 Bi-Stochastic Matrix We want a stochastic matrix which is symmetric A right stochastic matrix is a real nonnegative square matrix, with each row summing to 1. A left stochastic matrix is a real nonnegative square matrix, with each column summing to 1. row normalization column normalization A bi-(doubly) stochastic matrix is a real nonnegative square matrix, with both row and column summing to 1. row normalization …… Sinkhorn-Knopp Algorithm This procedure will converge to a bi-stochastic matrix The solution that the SK algorithm converges to is the bi-stochastic matrix closest to W under KL-divergence Sinkhorn, Richard, & Knopp, Paul. (1967). "Concerning nonnegative matrices and doubly stochastic matrices". Pacific J. Math. 21, 343–348. Fei Wang, Ping Li, Arnd Christian König. "Learning a bi-stochastic data similarity matrix." IEEE ICDM (2010). Factorization Matrix Factorization Matrix Factorization X X ~ ~ F x F G x G Probabilistic Matrix Factorization Salakhutdinov, Ruslan, and Andriy Mnih. "Probabilistic matrix factorization." NIPS, 2011. Nonnegative Matrix Factorization Nonnegative Matrix Factorization • Factorizinganonnegativematrixtothe productoftwolow-rankmatrices Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (21 October 1999) Learning Medical Concepts Zhou, Jiayu, Fei Wang, Jianying Hu, and Jieping Ye. "From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records." In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 135-144. ACM, 2014. Learning Medical Concepts ▪ Assume that each patient has different medical concepts from other patients Xi Xi Xi Ui Ui Ui Vi Vi Vi ▪ Formulation Matrix Completion Completion via matrix factorization. Enforce a low rank factorization Ui, Vi and encourage the values of UiVi at the observed location to be close to the observed ones. Meaningful Medical Concepts Medical concepts involve nonnegative components of medical features. Sparsity Controllable sparsity that encourages a few medical features in each concept. Temporal Smoothness Couple the columns of Vi and force them to close to each other. Overfitting Control Prevent Vi from overfitting. Zhou, Jiayu, Fei Wang, Jianying Hu, and Jieping Ye. "From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records." In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 135-144. ACM, 2014. Dictionary Learning Seek for a set of basis which can sparsely represent the data set Data matrix Coding coefficient matrix, n by k Basis matrix Non-negativity Enforce a sparse representation of the i-th data object Dictionary Learning Seek for a sparse representation of the data in the c-th group A common dictionary shared over all C groups of data The data groups are pre-defined The dictionary is shared over all data groups Bengio, Samy, et al. "Group sparse coding." Advances in neural information processing systems. 2009. Dictionary Learning Learn both shared and group-specific dictionaries Shared dictionary Group specific dictionary for group c Wang, Fei, et al. "Automatic Group Sparse Coding." AAAI. 2011. Longitudinal Records Wang, Fei, et al. "Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach." Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012. Temporal Patterns F. Wang et al. Towards Heterogeneous Temporal Clinical Event Pattern Discovery: A Convolutional Approach. KDD 2012. One-Side Convolution One-Side Convolutional NMF Longitudinal Records How$to$interpret$ and$make$use$of$the$ sequentiality$of$the$ events?$ The$sequentiality$of$those$ events$may$indicate$some$$ impending$disease$conditions$ Diagnosis(A( Diagnosis(D( Medication(B( Medication(B( Lab(Test(C( ... to a variety of data and application domains identification quence in Figure 1. In the sequence, we have 6 observations 4 montheventlarge-scale longitudinal data.we present the basic definition of 4 of unique events. We show the interval between Figure 1. from patient EHRs. First Overview Temporal graph graph weconstruction propose this provides an happening edge between the temporal graph and in how it paper is constructed. eoral observed event sequences, we construct the fol-timestamps along the ordered Diagnosis D width to Medica way to represent the temporal knowledges conobservations. In the graph, we use the edge sig1 day 3.1time Temporal graph construction mporal graph for each sequence sn : screte data. Theconstruction temporal graphsI capture nify the weighted computed with the input sequence. As Temporal graph Suppose havesequences a set of in event sequences : n be = seen, the parameters , r control the Blocality uctures hiddenwe in the a more com- {sn can of the Figure 1: One example of medical event sequenc A Graph Based Formulation ·1 · · nodes , N (Temporal } where is theare number of appeared sequences. Eachedge event ion1,the graph). The temporal graph here in theNgraph events computation in(potential the temporal graph. Namely, a larger subject patient). is denoted by sn = l = 1, · ·r· captures , Ln ) nl , tnl ) : reandsequence thesn directed edges encode the((x temporal the similarities among events in a longer tempoence is a directed and weighted graph with our wherepairwise Ln is the lengthInofthis sn . representation, In other words, we can obbetween events. ral range, which potentially increase the connectivity of the s its node set {1, · · · , M }, where the weight of the serve event x at time t in the sequence s . We let the nl nl Definition (Temporal graph) missing in patient EHRs will not appear in then temporal graph, while a small r only considers closely adevents xnlnode 2 {1, · · · is , M defined } and tnp tnq , for all p < q. We node i to as he repeated pairwise jevents with the same orderjacent symbols as similar, which make A the temporal graph n have one example of the medical event sequences of potential The temporal graph G With of sequence sn is a directed and weighted appear once in the graph. this represenmore spread. In the extreme case when r approaches infinity, C X patients in Figure 1. n 1emporal graphgraph with is our eventand setresistant as its node set {1, · · · W , M}, where an thealmost constant matrix, since all appearing robust to sparse, becomes With the observed event sequences, we (t construct [xedge i ^node xnqthis =torepresenj] tthe ,aspairs (1)will be fully and equally connected. np = nq np )folrregular observations. Moreover, event weight of the from i node j is defined n lowing temporal graph for each sequence sn : 1pqL y intuitive andnhighly interpretable, because one The parameters ( Figure and r) canThe be selected according to temporal B 2: D graph X Definition 1 1 (Temporal graph). nderstand the temporal relationships amongThe dif-temporal thegraph application. For example, if there is little correlation n n W = [x = i ^ x = j] (t ) , happened with a time interval larger than np nq graph with r between nq G of sequence snEHRs. is a directed and advantage weighted our tnp ij is events a non-increasing function. al in patient Another events Ln set {1, · · · , M }, where the weight of the event set as its node 1pqL graph based representation, then detected pheno3 months, then we can let = 3 months. The scaling edgeis from node node j isofdefined as whichthus parameter tterns) will beiintothe form graphs, (·) a also non-increasing function, the morer can also be empirically setC to be the average X where r 1is a non-increasing function and r > 0time is a interval parameter. d as a nature aggregation of sequential patterns. n t i and event j appear toj]each in sn , between consecutive events. In the example, [xnp = close i ^ xnq = (tnq other tnp ) , (1) Wij = we can e↵ectively alleviate the pattern explosion we use = 3 months and r = 5 days. In ourconstructed empirical study t With allgraph the n Ln 1pqL n Figure 2: The temporal of event sequence in the Wij is. In this paper, we use the exceedance on real-world EHR data warehouse, we optimize r based on identify the temporal phenoty where (·)distribution is a non-increasing phenotyping performance in specific applications. ponential tofunction. construct the the temporal poral graph framework for Records phenotype emporal Phenotyping frombased Longitudinal Electronic Health 3.2 Temporal phenoty explain the observations. Our HODOLOGY Note that (·) is a non-increasing function, thus the more 3.2 Temporal phenotyping 4 / 29 basis as the phenoty often i and appear of close each other inDiagnosis sn , A Medication B temporal Lab Test C we ction we event will introduce thej details ourtotem( nevent With all the constructed temporal graphs, 1 day 1 week the higher the Wfor In this paper, we use the exceedance ij is. construct the observed based framework phenotype identification identify the temporal phenotypes that can temp be used exp( /r) of the Exponential distribution to construct the temporal 4Our month ) = (2) EHRs.(First we present the basic definition of explain the observations. idea isexample, to compute tw have one simplified graph: 0 > aph and how it is constructed. basis as the temporal phenotypes which can be us ( and one observed graph ca Diagnosis D temporal Medication constructsis, the observed graphs. InBFigu exp( /r) 1 day ( ) a = smaller edge weight for a(2) poralwe graph construction have one simplified example, where we have g ords, compute larger of the first two basis. In three pract 0 > sis, and one observed graph can be expressed as the we have a set ofevent. Otherwise, sequences {sn we : n ignore = val , when the event intwo the beginning, and our tem example ofbasis. medical event sequence ofknow one words, we compute a smaller edgeevent weight for aFigure larger 1: One of the first In practice, we do not hereInNother is the number of sequences. Each (potential time intervalhappened , when with . Otherwise, we interval ignore thesubject event in the patient). beginning, and our identifying temporal phenotyping the events a time larger the process the ui An Example Graph An#$inflammatories. Glucocor#coids. Pulmonary.Disease. Asthma. Diure#cs. ACE.Inhibitors. Conges#ve.Heart.Failure. Mineral.Replacement. An#anginal.Agents. Laxa#ves. We call the resultant graph basis as temporal ph since they are derived from the temporal graphs temporal phenotypes capture evolving patterns of Figure 3: Example of temporal phenotyping. Graph Basis conditions hidden in the event sequences. To b wh suppose we have constructed the temporal graph G Graph bases as temporal phenotypes n sim sequence s , and G is associated with the adjacen n 1 tions as ⌦(A) to n M ⇥M n 1 matrix W 1 2 R . To reconstruct G , we we assu patients under stu k 2 3 4 We call the resultant graph basis temporal are as K graph basis Bphenotypes, 2 RM ⇥M for k = 1, 2, · · · , cha 2 4 can be used to approximate the adjacency matrix since they are derived from the temporal graphs, and the 5 3.3.1 Similarit K temporal phenotypes capture evolving patterns of ntheX healthk W = Ank B , conditions hidden in the event sequences. To be specific, k=1 In the first case, n tween patients ind suppose we have1 constructed the temporal graph G for each N ⇥K wh where A 2 R is the matrix of reconstruction co n orbasis control). We c sequence sn , and G is associatedTo with thethe adjacency weight compute optimal graph and the recon tha similar phenotypin coefficients, minimize the total reconstruction matrix W n2 2 R3M ⇥M4 . To reconstruct Gn ,wewe assume there ma regularization: N K X are K graph basis B k 2 RM ⇥M for k = 1, 2, · · · 1, X K, which n k the 2 J (A, B) = kW A B k nk F, n 5 2 n=1 can be used to approximate the adjacency matrix W : k=1 ⌦(A) = 4 5 5 3 …… 3. where k · kF is the matrix Frobenius norm. Figure 3: Example ofntemporal phenotyping. N ⇥N To make the solutions morewhere interpretable, we als k S 2 R W = Ank , twoBconstraints on the reconstruction coefficients similarity informa k the graph basis B for k = 1, 2, · · · , K. The first I k=1 we can k just equiva is about the non-negativity, i.e., B 0 for all k, We call the resultant graph basis as temporal phenotypes, ma changing ⌦(A). It N ⇥K original temporal graphs are non-negative. The se since they the temporal graphs, and thecoefficients. where A 2areRderivedisfrom the matrix of reconstruction P K X we introduced earlier, the reconstruction coefficients in consider the hin nd k Ank = 1, for n As = 1, · · · , N, 1 A can be used for a various of applications. In particular, for Pr(An , Yn |H) = ,loss(A A to be valid multinomial distributhe medical diagnosis application, our goal is to derive1 infor+ exp( Yn f (An )) quantify each patient by the tem- to improve the diagnosis performance, i.e., mative features and where the linear model H : A ! 7 f (A probabilities can in turn knowledge n ) = An ⇥ + as ⌦(A) to which incorporate additional on groups the for the patients.nTo the be classification of control/case (⇥, ✓) are parameters model H. It follows that⌦ edicine, patient ntsRegularization under study.segmentation, this end, weand extend the temporal phenotyping in forthe temporal 1 graphs with regularization ⌦(A) Similarity based regularization N 0: log Pr(A n , Yn |H) = log(1 + exp( Yn f (An ))). K 3.4 Implem X X 1 n n addition to k 2 In the log-loss for probabilistic model, J (A, B) = kW Ank B kF + ⌦(A), (4) the first case, we may have implicit 2similarity links can also be used with the linear model H n=1 loss terms k=1 beWe give the n patients indicating they are from the same group (case er, the reconstructionwhere coefficients in consider the hinge loss for (A , Yn ): 0 is the parameter controlling the degree ofnregphenotyping p ntrol). We can encourage the linked patients to have us of applications. In ularization. particular, In forthe following, we propose several regularizaspecial case loss(An , Yn |H) = max{0, 1 Yare n f (An )}, ar phenotyping representations in A using the following plication, our goal is to derive inforarization: ve the diagnosis performance, i.e., and ol/case groups1forX the1patients. To 2 1 X ⌦(A) = loss(An , Yn |H). ⌦(A) = kAtemporal A n2 k S n 1 n 2 , n1 emporal phenotyping for |L| n2L 2 n ,n 2 1 2 n ⌦(A) 0: e S X 2K RN ⇥N > 0 is symmetric matrix 3.4 encoding the Implementation n arity information. that, when Ank B k k2F +Note ⌦(A), (4) S is asymmetric, 0 an just k=1equivalently replace S with (S + S )/2 without We give the implementation of the regularized tem ging ⌦(A). It follows that Sn1 n2 = Sn2 n1 and meter controlling the degree of regphenotyping problems, since the un-regularized pro ing, we propose several1regularizaare special cases with = 0. ⌦(A) = tr(A0 LA), (5) 2 e L = DP S and D is the diagonal degree matrix such Dnn = n0 Snn0 . Note that, some rows/columns of S be completely zero if we do not have knowledge about Chuanren Liu, Fei Wang, Jianying Hu and Hui Xiong. Temporal Phenotyping with Electronic Health Records: A Graph Based orresponding patients, e.g., the instances in the test set. Approach. KDD 2015. Clustering Ding, Chris, et al. "Orthogonal nonnegative matrix t-factorizations for clustering." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006. Questions? [email protected]