foundations

Biomedical Data Mining
with Matrix Models
SDM 2016 Tutorial Part I
May 5, 2016
Fei Wang
Associate Professor
Department of Computer Science and Engineering
School of Engineering
University of Connecticut
[email protected]
Biomedical Data
EHR
Electronic Health Record (EHR) is an evolving concept defined as a systematic collection of
electronic health information about individual patients or populations
Jensen, Peter B., Lars J. Jensen, and SØren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).
Medical Imaging
X-ray Computed
Tomography (CT)
Positron Emission
Tomography (PET)
Magnetic Resonance
Imaging (MRI)
Drug
A chemical compound is a pure chemical substance
consisting of two or more different chemical elements that can
be separated into simpler substances by chemical reactions
https://pubchem.ncbi.nlm.nih.gov/
Drug
The term biological target is frequently used in pharmaceutical research to
describe the native protein in the body whose activity is modified by a drug
resulting in a desirable therapeutic effect. In this context, the biological
target is often referred to as a drug target.
http://www.uniprot.org/help/uniprotkb
Gene
DNA: A long molecule that looks
like a twisted ladder. It is made of
four types of simple units and the
sequence of these units carries
information, just as the sequence
of letters carries information on a
page.
Gene expression: The process in
which the information encoded in a
gene is converted into a form useful
for the cell. The first step is
transcription, which produces a
messenger RNA molecule
complementary to the DNA molecule
on which a gene is encoded. For
protein-coding genes, the second
step is translation, in which the
messenger RNA is read by the
ribosome to produce a protein.
All definitions from Wikipedia
Gene: A segment of DNA. Genes are
like sentences made of the "letters" of
the nucleotide alphabet, between them
genes direct the physical development
and behavior of an organism. Genes
are like a recipe or instruction book,
providing information that an organism
needs so it can build or do something like making an eye or a leg, or
repairing a wound.
A Single Nucleotide
Polymorphism (SNP,
pronounced snip; plural snips) is
a DNA sequence variation
occurring commonly within a
population (e.g. 1%) in which a
single nucleotide — A, T, C or G
— in the genome (or other shared
sequence) differs between
members of a biological species
or paired chromosomes.
Physiology Data
Physiology is the scientific study of function in living systems. A
sub-discipline of biology, its focus is in how organisms, organ
systems, organs, cells, and bio-molecules carry out the chemical
or physical functions that exist in a living system
Patient Survey
Online Social Media
Environmental Data
1. Clinicalquestions
2. Insightsfromthedata
3. Predictivemodels
From Nigam Shah’s slides
Matrix?
Vector Collection
Feature
x1
x2
x3
x4
...
...
...
xN
Longitudinal Records
Wang, Fei, et al. "Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach." Proceedings of the
18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012.
Graph
Heart
Attack
Expanded
Wheezing
Expanded
Cough
Ankle
Edema
Initial
Fever
Expanded
Chest
Pain
Rales
Initial
Sondhi, Parikshit, et al. "SympGraph: a framework for mining clinical notes through symptom relation graphs." Proceedings of the
18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012.
Graph
Tom, Male,30
Hypertension
John, Male,53
HeartFailure
Roy,Male,40
Sepsis
Sara,Female,28
Asthma
Natasha,Female,42
Hyperlipidemia
Jack,Male,30
Pneumonia
Wang, Fei, and Jimeng Sun. "PSF: A Unified Patient Similarity Evaluation Framework Through Metric Learning With Weak
Supervision." Biomedical and Health Informatics, IEEE Journal of 19.3 (2015): 1053-1060.
Graph
Liu, Chuanren, Fei Wang, Jianying Hu, and Hui Xiong. "Temporal Phenotyping from Longitudinal Electronic Health Records: A Graph Based
Framework." In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 705-714. ACM, 2015.
Drug Graph
Precision
Medicine
Drug
Repurposing
Prognosis/Risk
Prediction
Patient Graph
Drug-Target
Interaction
DNA
Sequencing
Disease-Gene
Association
Disease Graph
19
Gene Graph
Matrix Methods
Spectrum Analysis
Spectral Clustering
•Algorithms
that cluster points using eigenvectors of
matrices derived from the data
•Obtain
data representation in the low-dimensional
space that can be easily clustered
•Variety
of methods that use the eigenvectors differently
•Difficult
to understand….
Graph Theory Basics
•
A graph G = (V,E) consists of a vertex set V and an
edge set E.
•
If G is a directed graph, each edge is an ordered pair
of vertices
•
A bipartite graph is one in which the vertices can be
divided into two groups, so that all edges join
vertices in different groups.
Similarity Graph
•Distance
decrease similarity increase
•Represent dataset as a weighted graph G(V,E)
•Wij represent similarity between vertex
•If Wij=0 where isn’t similarity; Wii=0
V={xi}
E={Wij}
Set of n vertices representing data points
Set of weighted edges indicating pair-wise
similarity between points
0.1
0.8
5
1
0.8
0.8
0.6
2
6
4
0.8
3
0.2
0.7
Graph Partitioning
•Clustering
can be viewed as partitioning a similarity graph
•Bi-partitioning
task:
•Divide vertices into two disjoint groups (A,B)
A
1
2
4
3
B
5
6
Partitioning Criterion
•
Traditional definition of a “good” clustering:
• Points assigned to same cluster should be highly similar.
• Points assigned to different clusters should be highly dissimilar
0.8
0.1
1
5
0.8
0.6
2
0.8
0.8
3
6
4
0.2
0.7
Graph Cut
•Express
partitioning objectives as a function of the “edge cut” of the
partition.
•Cut: Set of edges with only one vertex in a group.we wants to find the
minimal cut between groups. The groups that has the minimal cut
would be the partition
A
B
0.1
0.8
1
0.8
0.6
2
0.8
3
5
6
4
0.2
0.8
0.7
Min-Cut
Minimize weight of connections between groups
Optimal cut
Minimum cut
Normalized Cut
•Consider
the connectivity between groups relative to the density
of each group
•Normalize the association between groups by volume.
• Vol(A): The total weight of the edges originating from group A.
• Why use this criterion?
• Minimizing the normalized cut is equivalent to maximizing
normalized association.
• Produces more balanced partitions.
Spectral Graph Theory
•Possible
•
•
approach
Represent a similarity graph as a matrix
Apply knowledge from Linear Algebra…
•
The eigenvalues and eigenvectors of a matrix provide global
information about its structure.
•
Spectral Graph Theory
• Analyze the “spectrum” of matrix representing a graph.
• Spectrum : The eigenvectors of a graph, ordered by the
magnitude(strength) of their corresponding eigenvalues
Matrix Representation
0.1
0.8
5
1
0.8
0.6
2
0.8
0.8
6
4
0.7
0.2
3
0.1
0.8
5
1
0.8
0.6
2
0.8
3
6
4
0.2
0.8
0.7
x1
x2
x3
x4
x5
x6
x1
0
0.8
0.6
0
0.1
0
x2
0.8
0
0.8
0
0
0
x3
0.6
0.8
0
0.2
0
0
x4
0
0
0.2
0
0.8
0.7
x5
0.1
0
0
0.8
0
0.8
x6
0
0
0
0.7
0.8
0
x1
x2
x3
x4
x5
x6
x1
1.5
0
0
0
0
0
x2
0
1.6
0
0
0
0
x3
0
0
1.6
0
0
0
x4
0
0
0
1.7
0
0
x5
0
0
0
0
1.7
0
x6
0
0
0
0
0
1.5
Laplacian Matrix
x1
x2
x3
x4
x5
x6
x1
0
0.8
0.6
0
0.1
0
x2
0.8
0
0.8
0
0
0
x3
0.6
0.8
0
0.2
0
0
x4
0
0
0.2
0
0.8
0.7
0
x5
0.1
0
0
0.8
0
0.8
1.5
x6
0
0
0
0.7
0.8
0
x1
x2
x3
x4
x5
x6
x1
1.5
0
0
0
0
0
x2
0
1.6
0
0
0
0
x3
0
0
1.6
0
0
0
x4
0
0
0
1.7
0
0
x5
0
0
0
0
1.7
x6
0
0
0
0
0
-
0.1
0.8
5
1
0.8
0.6
2
0.8
6
4
0.7
3
0.2
0.8
=
x1
x2
x3
x4
x5
x6
x1
1.5
-0.8
-0.6
0
-0.1
0
x2
-0.8
1.6
-0.8
0
0
0
x3
-0.6
-0.8
1.6
-0.2
0
0
x4
0
0
-0.2
1.7
-0.8
-0.7
x5
-0.1
0
0
0.8-
1.7
-0.8
x6
0
0
0
-0.7
-0.8
1.5
Normalized Laplacian
0.1
0.8
5
1
0.8
0.6
2
0.8
0.8
6
4
1.00
-0.52
-0.39
0.00
-0.06
0.00
-0.52
1.00
-0.50
0.00
0.00
0.00
-0.39
-0.50
1.00
0.00
0.00
0.00
0.00
-0.12
1.00
-0.47
-0.44
-0.06
0.00
0.00
0.47-
1.00
-0.50
0.00
0.00
0.00
-0.44
-0.50
1.00
-0.12
0.7
3
0.2
Spectral Clustering
•
Three basic stages:
•
Pre-processing
• Construct a matrix representation of the dataset.
•
Decomposition
• Compute eigenvalues and eigenvectors of the
matrix.
• Map each point to a lower-dimensional
representation based on one or more
eigenvectors.
•
Grouping
• Assign points to two or more clusters, based on the
new representation.
625
25
Magic Sigma
σ = 0.035897
σ
=
0.015625
= 0.054409
σ =σ 0.035897
σσ
= 0.041235
= 0.054409
σ = 0.054409
σ0.015625
= 0.35355
σσ==0.35355
σ = 0.03125
σσ == 0.035897
0.35355
σ = 0.03125
σ=1
σσ==0.35355
1
σ=1
Figure 1: Spectral clustering without local scaling (usin
When the data incorporates multiple scales standard spec
the optimal σ for each example (displayed on each figure) t
row: Clustering results for the top-left point-set with diffe
the high impact σ has on the clustering quality. In all the
was set manually. The data points were normalized to occ
Local Scaling
Instead of selecting a single scaling parameter σ, calculate a
local scaling parameter σi for each data point
Zelnik-Manor, Lihi, and Pietro Perona. "Self-tuning spectral clustering." Advances in neural information processing systems. 2004.
Wine Spill
Spill a drop of wine on the cloth
Spread/diffuse to the neighborhood
37
38
Label Propagation
5
4
7
6
8
3
10
1
9
2
j and i are linked by an edge
Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. "Learning with local and global consistency." Advances in neural
information processing systems 16, no. 16 (2004): 321-328.
Perron-Frobenius Theorem
Label Propagation
…
If W is a stochastic
matrix, its spectral radius
will be no larger than 1
Will this procedure
converge?
39
40
Stochastic Matrix
We can use the row-normalized similarity
matrix as the new similarity matrix
Row Normalization
The similarities are no
longer symmetric
Xiaojin Zhu, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon
University, 2002.
41
Bi-Stochastic Matrix
We want a stochastic matrix which is symmetric
A right stochastic
matrix is a real
nonnegative square
matrix, with each row
summing to 1.
A left stochastic
matrix is a real
nonnegative square
matrix, with each
column summing to 1.
row
normalization
column
normalization
A bi-(doubly) stochastic
matrix is a real
nonnegative square
matrix, with both row and
column summing to 1.
row
normalization
……
Sinkhorn-Knopp Algorithm
This procedure will converge
to a bi-stochastic matrix
The solution that the SK algorithm
converges to is the bi-stochastic matrix
closest to W under KL-divergence
Sinkhorn, Richard, & Knopp, Paul. (1967). "Concerning nonnegative matrices and doubly stochastic matrices". Pacific J. Math. 21, 343–348.
Fei Wang, Ping Li, Arnd Christian König. "Learning a bi-stochastic data similarity matrix." IEEE ICDM (2010).
Factorization
Matrix Factorization
Matrix Factorization
X
X
~
~
F
x
F
G
x
G
Probabilistic Matrix Factorization
Salakhutdinov, Ruslan, and Andriy Mnih. "Probabilistic matrix factorization." NIPS, 2011.
Nonnegative Matrix Factorization
Nonnegative Matrix Factorization
• Factorizinganonnegativematrixtothe
productoftwolow-rankmatrices
Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (21 October 1999)
Learning Medical Concepts
Zhou, Jiayu, Fei Wang, Jianying Hu, and Jieping Ye. "From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records."
In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 135-144. ACM, 2014.
Learning Medical Concepts
▪ Assume that each patient has different medical concepts from other patients
Xi
Xi
Xi
Ui
Ui
Ui
Vi
Vi
Vi
▪ Formulation
Matrix Completion
Completion via matrix
factorization. Enforce a low
rank factorization Ui, Vi and
encourage the values of UiVi
at the observed location to
be close to the observed
ones.
Meaningful
Medical Concepts
Medical concepts
involve nonnegative
components of
medical features.
Sparsity
Controllable
sparsity that
encourages a
few medical
features in
each concept.
Temporal
Smoothness
Couple the columns of
Vi and force them to
close to each other.
Overfitting Control
Prevent Vi from
overfitting.
Zhou, Jiayu, Fei Wang, Jianying Hu, and Jieping Ye. "From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records."
In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 135-144. ACM, 2014.
Dictionary Learning
Seek for a set of basis which can sparsely represent the data set
Data matrix
Coding coefficient
matrix, n by k
Basis matrix
Non-negativity
Enforce a sparse
representation of
the i-th data object
Dictionary Learning
Seek for a sparse
representation of the
data in the c-th group
A common dictionary shared
over all C groups of data
The data groups are pre-defined
The dictionary is shared over all data groups
Bengio, Samy, et al. "Group sparse coding." Advances in neural information processing systems. 2009.
Dictionary Learning
Learn both shared and group-specific dictionaries
Shared
dictionary
Group specific
dictionary for
group c
Wang, Fei, et al. "Automatic Group Sparse Coding." AAAI. 2011.
Longitudinal Records
Wang, Fei, et al. "Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach." Proceedings of the
18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012.
Temporal Patterns
F. Wang et al. Towards Heterogeneous Temporal Clinical Event Pattern Discovery: A Convolutional Approach. KDD 2012.
One-Side Convolution
One-Side Convolutional NMF
Longitudinal Records
How$to$interpret$
and$make$use$of$the$
sequentiality$of$the$
events?$
The$sequentiality$of$those$
events$may$indicate$some$$
impending$disease$conditions$
Diagnosis(A(
Diagnosis(D(
Medication(B(
Medication(B(
Lab(Test(C(
...
to a variety of data and application domains identification
quence in Figure 1. In the sequence, we have 6 observations
4 montheventlarge-scale
longitudinal
data.we present the basic definition
of 4 of
unique events. We show the interval between
Figure
1.
from patient
EHRs. First
Overview
Temporal
graph
graph
weconstruction
propose
this
provides
an
happening
edge between the
temporal
graph
and in
how
it paper
is constructed.
eoral
observed
event
sequences,
we construct
the fol-timestamps along the ordered
Diagnosis
D width to
Medica
way to represent the temporal knowledges conobservations. In the graph, we use
the edge
sig1
day
3.1time
Temporal
graph
construction
mporal
graph
for
each
sequence
sn :
screte
data.
Theconstruction
temporal
graphsI capture
nify the weighted computed with the input sequence. As
Temporal
graph
Suppose
havesequences
a set of in
event
sequences
: n be
= seen, the parameters , r control the
Blocality
uctures
hiddenwe
in the
a more
com- {sn can
of the
Figure
1:
One
example
of
medical
event
sequenc
A
Graph
Based
Formulation
·1
· · nodes
, N (Temporal
} where
is theare
number
of appeared
sequences.
Eachedge
event
ion1,the
graph).
The temporal
graph
here
in theNgraph
events
computation
in(potential
the temporal
graph. Namely, a larger
subject
patient).
is denoted
by sn =
l = 1, · ·r· captures
, Ln )
nl , tnl ) : reandsequence
thesn
directed
edges encode
the((x
temporal
the similarities among events in a longer tempoence
is
a
directed
and
weighted
graph
with
our
wherepairwise
Ln is the
lengthInofthis
sn . representation,
In other words, we can
obbetween
events.
ral range,
which potentially increase the connectivity of the
s
its
node
set
{1,
·
·
·
,
M
},
where
the
weight
of
the
serve
event
x
at
time
t
in
the
sequence
s
.
We
let
the
nl
nl
Definition
(Temporal
graph)
missing
in patient
EHRs will
not appear in then
temporal graph, while a small r only considers closely adevents
xnlnode
2 {1, · · · is
, M defined
} and tnp 
tnq , for all p < q. We
node
i to
as
he repeated
pairwise jevents
with
the
same
orderjacent symbols as similar, which make A
the temporal graph
n
have
one
example
of
the
medical
event
sequences
of
potential
The temporal
graph
G With
of sequence
sn is a directed
and weighted
appear
once
in
the
graph.
this
represenmore
spread.
In the extreme case when r approaches infinity, C
X
patients
in
Figure
1.
n
1emporal
graphgraph
with is
our
eventand
setresistant
as its node
set {1, · · · W
, M},
where an
thealmost constant matrix, since all appearing
robust
to
sparse,
becomes
With the observed
event
sequences,
we (t
construct
[xedge
i ^node
xnqthis
=torepresenj]
tthe
,aspairs
(1)will be fully and equally connected.
np =
nq
np )folrregular
observations.
Moreover,
event
weight
of
the
from
i
node
j
is
defined
n lowing temporal graph for each sequence sn :
1pqL
y intuitive
andnhighly interpretable, because one
The parameters ( Figure
and r)
canThe
be selected
according
to
temporal
B 2:
D graph
X
Definition
1 1 (Temporal
graph).
nderstand
the
temporal
relationships
amongThe
dif-temporal
thegraph
application. For example, if there is little correlation
n
n
W
=
[x
=
i
^
x
=
j]
(t
) , happened with a time interval larger than
np
nq graph with
r between
nq
G
of sequence
snEHRs.
is a directed
and advantage
weighted
our tnp
ij
is events
a non-increasing
function.
al
in patient
Another
events
Ln set {1, · · · , M }, where the weight of the
event
set
as
its
node
1pqL
graph based representation,
then detected pheno3 months, then we can let
= 3 months. The scaling
edgeis
from
node
node
j isofdefined
as whichthus parameter
tterns)
will
beiintothe
form
graphs,
(·)
a also
non-increasing
function,
the morer can also be empirically setC to be the average
X
where
r 1is
a non-increasing
function
and r > 0time
is a interval
parameter.
d
as
a
nature
aggregation
of
sequential
patterns.
n
t i and
event j appear
toj]each
in sn , between consecutive events. In the example,
[xnp = close
i ^ xnq =
(tnq other
tnp ) , (1)
Wij =
we can e↵ectively
alleviate the pattern explosion
we use = 3 months and
r = 5 days.
In ourconstructed
empirical study t
With
allgraph
the
n Ln 1pqL
n
Figure
2:
The
temporal
of
event
sequence in
the Wij is. In this paper, we use the exceedance
on real-world EHR data
warehouse,
we
optimize
r
based on
identify
the
temporal
phenoty
where (·)distribution
is a non-increasing
phenotyping performance in specific applications.
ponential
tofunction.
construct the the
temporal
poral
graph
framework
for Records
phenotype
emporal
Phenotyping
frombased
Longitudinal
Electronic Health
3.2 Temporal phenoty
explain the observations. Our
HODOLOGY
Note that (·) is a non-increasing function, thus the more
3.2 Temporal
phenotyping
4 / 29
basis
as the
phenoty
often
i and
appear of
close
each other inDiagnosis
sn ,
A
Medication
B temporal
Lab Test
C we
ction
we event
will introduce
thej details
ourtotem( nevent
With
all
the
constructed
temporal
graphs,
1 day
1 week
the higher
the Wfor
In this paper,
we use the exceedance
ij is.
construct
the
observed
based
framework
phenotype
identification
identify the temporal phenotypes
that can temp
be used
exp(
/r)

of
the
Exponential
distribution
to
construct
the
temporal
4Our
month
)
=
(2)
EHRs.(First
we present the basic definition of
explain the
observations.
idea isexample,
to compute tw
have
one
simplified
graph:
0
>
aph and how it is constructed.
basis as the temporal phenotypes which can be us
(
and
one
observed
graph
ca
Diagnosis
D temporal
Medication
constructsis,
the
observed
graphs.
InBFigu
exp( /r)

1 day
( ) a
= smaller edge weight for a(2)
poralwe
graph
construction
have
one
simplified
example,
where
we have
g
ords,
compute
larger
of
the
first
two
basis.
In three
pract
0
>
sis, and one observed graph can be expressed as the
we have
a set ofevent. Otherwise,
sequences {sn we
: n ignore
=
val
,
when
the
event
intwo
the
beginning,
and
our
tem
example
ofbasis.
medical
event sequence
ofknow
one
words,
we compute
a smaller
edgeevent
weight for aFigure
larger 1: One
of the
first
In practice,
we
do not
hereInNother
is the
number
of sequences.
Each
(potential
time
intervalhappened
, when  with
. Otherwise,
we interval
ignore thesubject
event
in the patient).
beginning,
and our identifying
temporal phenotyping
the
events
a time
larger
the process
the ui
An Example Graph
An#$inflammatories.
Glucocor#coids.
Pulmonary.Disease.
Asthma.
Diure#cs.
ACE.Inhibitors.
Conges#ve.Heart.Failure.
Mineral.Replacement.
An#anginal.Agents.
Laxa#ves.
We call the resultant graph basis as temporal ph
since they are derived from the temporal graphs
temporal phenotypes capture evolving patterns of
Figure
3: Example of temporal
phenotyping.
Graph
Basis
conditions
hidden in the event sequences. To
b
wh
suppose we have constructed the temporal graph G
Graph bases as temporal phenotypes
n
sim
sequence
s
,
and
G
is associated
with
the
adjacen
n
1
tions as ⌦(A)
to
n
M ⇥M
n
1
matrix W 1 2 R
. To reconstruct G , we we
assu
patients under stu
k
2
3
4
We call the resultant
graph
basis
temporal
are as
K graph
basis Bphenotypes,
2 RM ⇥M for k = 1, 2, · · · ,
cha
2
4
can
be
used
to
approximate
the
adjacency
matrix
since they are derived from
the temporal graphs, and the
5
3.3.1 Similarit
K
temporal phenotypes capture evolving patterns of ntheX
healthk
W =
Ank B ,
conditions hidden in the event sequences. To be specific,
k=1 In the first case,
n
tween
patients ind
suppose we have1 constructed the temporal
graph
G
for
each
N
⇥K
wh
where A 2 R
is the matrix of reconstruction
co
n
orbasis
control).
We c
sequence sn , and G is associatedTo
with
thethe
adjacency
weight
compute
optimal graph
and the recon
tha
similar
phenotypin
coefficients,
minimize
the total
reconstruction
matrix W n2 2 R3M ⇥M4 . To reconstruct
Gn ,wewe
assume
there
ma
regularization:
N
K
X
are K graph basis B k 2 RM ⇥M for k = 1, 2, · · · 1, X
K,
which
n
k the
2
J
(A,
B)
=
kW
A
B
k
nk
F,
n
5
2 n=1
can be used to approximate the adjacency matrix
W : k=1 ⌦(A) =
4
5
5
3
……
3.
where k · kF is the matrix Frobenius norm.
Figure 3: Example ofntemporal phenotyping.
N ⇥N
To make
the solutions morewhere
interpretable,
we
als
k
S
2
R
W =
Ank
,
twoBconstraints
on the reconstruction coefficients
similarity informa
k
the graph basis B for k = 1, 2, · · · , K. The first I
k=1
we can
k just equiva
is
about
the
non-negativity,
i.e.,
B
0 for all k,
We call the resultant graph basis as temporal phenotypes,
ma
changing
⌦(A).
It
N ⇥K
original
temporal
graphs
are
non-negative.
The
se
since they
the temporal
graphs, and thecoefficients.
where
A 2areRderivedisfrom
the matrix
of reconstruction
P
K
X
we introduced earlier, the reconstruction coefficients in
consider the hin
nd k Ank = 1, for n As
= 1,
· · · , N,
1
A can be used for a various of applications.
In
particular,
for
Pr(An , Yn |H) =
,loss(A
A to be valid multinomial
distributhe medical diagnosis application, our goal is to derive1 infor+ exp( Yn f (An ))
quantify each patient
by the
tem- to improve the diagnosis performance, i.e.,
mative
features
and
where
the
linear
model
H
:
A
!
7
f
(A
probabilities
can
in turn knowledge
n ) = An ⇥ +
as ⌦(A) to which
incorporate
additional
on groups
the for the patients.nTo
the be
classification
of control/case
(⇥,
✓) are parameters
model H. It follows that⌦
edicine,
patient
ntsRegularization
under
study.segmentation,
this end, weand
extend the
temporal
phenotyping in
forthe
temporal
1
graphs with regularization ⌦(A)
Similarity based regularization N
0:
log Pr(A
n , Yn |H) =
log(1 + exp( Yn f (An ))).
K
3.4 Implem
X
X
1
n
n addition to
k 2
In
the
log-loss
for
probabilistic
model,
J (A, B) =
kW
Ank B kF + ⌦(A),
(4)
the first case, we may have implicit 2similarity
links
can also be used with the linear model H
n=1 loss terms
k=1 beWe give the
n
patients
indicating
they
are
from
the
same
group
(case
er, the reconstructionwhere
coefficients
in
consider the hinge loss for (A , Yn ):
0 is the parameter controlling the degree ofnregphenotyping p
ntrol).
We
can
encourage
the
linked
patients
to
have
us of applications. In ularization.
particular, In
forthe following, we propose several regularizaspecial case
loss(An , Yn |H) = max{0, 1 Yare
n f (An )},
ar phenotyping
representations
in A using the following
plication,
our goal
is to derive inforarization:
ve the diagnosis performance, i.e.,
and
ol/case groups1forX
the1patients. To 2
1 X
⌦(A) =
loss(An , Yn |H).
⌦(A)
=
kAtemporal
A n2 k S n 1 n 2 ,
n1
emporal
phenotyping
for
|L| n2L
2 n ,n 2
1 2
n ⌦(A) 0:
e S X
2K RN ⇥N > 0 is symmetric matrix 3.4
encoding
the
Implementation
n
arity
information.
that, when
Ank B k k2F +Note
⌦(A),
(4) S is asymmetric,
0
an just
k=1equivalently replace S with (S + S )/2 without
We give the implementation of the regularized tem
ging ⌦(A). It follows that Sn1 n2 = Sn2 n1 and
meter controlling the degree of regphenotyping problems, since the un-regularized pro
ing, we propose several1regularizaare special cases with = 0.
⌦(A) = tr(A0 LA),
(5)
2
e L = DP S and D is the diagonal degree matrix such
Dnn = n0 Snn0 . Note that, some rows/columns of S
be completely zero if we do not have knowledge about
Chuanren Liu, Fei Wang, Jianying Hu and Hui Xiong. Temporal Phenotyping with Electronic Health Records: A Graph Based
orresponding
patients,
e.g., the instances in the test set.
Approach. KDD
2015.
Clustering
Ding, Chris, et al. "Orthogonal nonnegative matrix t-factorizations for clustering." Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining. ACM, 2006.
Questions?
[email protected]