
Factorized Asymptotic Bayesian Inference
for Mixture Modeling
Ryohei Fujimaki
NEC Laboratories America
Department of Media Analytics
[email protected]
Satoshi Morinaga
NEC Corporation
Information and Media Processing Research Laboratories
[email protected]
Abstract
This paper proposes a novel Bayesian approximate inference method for mixture modeling. Our key idea is to factorize the marginal log-likelihood using a variational distribution over latent variables. An asymptotic approximation, the factorized information criterion (FIC), is obtained by applying the Laplace method to each of the factorized components. In order to evaluate FIC, we propose factorized asymptotic Bayesian inference (FAB), which maximizes an asymptotically-consistent lower bound of FIC. FIC and FAB have several desirable properties: 1) asymptotic consistency with the marginal log-likelihood, 2) automatic component selection on the basis of an intrinsic shrinkage mechanism, and 3) parameter identifiability in mixture modeling. Experimental results show that FAB outperforms state-of-the-art VB methods.
1 Introduction
Model selection is one of the most difficult and important challenges in machine learning, in which we optimize the model representation as well as the model parameters. Bayesian learning provides a natural and sophisticated way to address this issue [1, 3]. A central task in Bayesian model selection is the evaluation of the marginal log-likelihood. Since exact evaluation is often computationally and analytically intractable, a number of approximation algorithms have been studied.
One well-known precursor is the Bayesian information criterion (BIC) [18], an asymptotic second-order approximation using the Laplace method. Since BIC assumes regularity conditions that ensure the asymptotic normality of the maximum likelihood (ML) estimator, it loses its theoretical justification for non-regular models, including mixture models, hidden Markov models, and multi-layer neural networks. Further, it is known that ML estimation does not have a unique solution for non-identifiable models, in which the mapping between parameters and functions is not one-to-one [22, 23] (such equivalent models are said to be in an equivalent class). For such non-regular and non-identifiable models, the generalization error of the ML estimator degrades significantly [24].
Among recent advanced Bayesian methods, we focus
on variational Bayesian (VB) inference. A VB method
maximizes a computationally-tractable lower bound of
the marginal log-likelihood by applying variational distributions on latent variables and parameters. Previous studies have investigated VB methods for a variety
of models [2, 6, 5, 12, 15, 16, 20] and have demonstrated their practicality. Further, theoretical studies [17, 21] have shown that VB methods can resolve
the non-regularity and non-identifiability issues. One disadvantage, however, is that the latent variables and the model parameters are partitioned into sub-groups that are assumed to be independent of each other in the variational distributions. While this independence makes the computation tractable, the lower bound can be loose, because the dependency between the latent variables and the parameters is essential in the true posterior.
This paper proposes a novel Bayesian approximate inference method for a family of mixture models, which are among the most significant examples of non-regular and non-identifiable model families. Our ideas and contributions are summarized as follows:
Factorized Information Criterion  We derive a new approximation of the marginal log-likelihood and refer to it as a factorized information criterion (FIC). A key observation is that, using a variational distribution over latent variables, the marginal log-likelihood can be rewritten as a factorized representation in which the Laplace approximation is applicable to each of the factorized components. FIC is unique in the sense that it takes into account the dependencies among latent variables and parameters, and it is asymptotically consistent with the marginal log-likelihood.
Factorized Asymptotic Bayesian Inference  We propose a new approximate inference algorithm and refer to it as factorized asymptotic Bayesian inference (FAB). Since FIC is defined on the basis of both observed and latent variables, the latter of which are not available, FAB maximizes a lower bound of FIC through an iterative optimization similar to the EM algorithm [8] and VB methods [2]. This FAB optimization is proved to monotonically increase the lower bound of FIC. It is worth noting that FAB provides a natural way to control model complexity not only in terms of the number of components but also in terms of the types of the individual components (e.g., the degrees of the components in a polynomial curve mixture (PCM)).
FIC and FAB offer the following desirable properties:

1. The FIC lower bound is asymptotically consistent with FIC, and therefore with the marginal log-likelihood.

2. FAB automatically selects components on the basis of an intrinsic shrinkage mechanism that is different from prior-based regularization. FAB therefore mitigates overfitting even though it ignores the prior effect in the asymptotic sense, and even if we apply a non-informative prior [13].

3. FAB addresses the non-identifiability issue. Within an equivalent class, FAB automatically selects the model which maximizes the entropy of the distribution over latent variables.
2 Preliminaries
This paper considers mixture models $p(X|\theta) = \sum_{c=1}^{C} \alpha_c p_c(X|\phi_c)$ for a D-dimensional random variable X. C and α = (α_1, ..., α_C) are the number of components and the mixture ratios, and θ = (α, φ_1, ..., φ_C). We allow different components p_c(X|φ_c) to have different model representations from one another (so-called "heterogeneous mixture models" [11]; in a PCM, for example, the degree of p_1(X|φ_1) may be two while that of p_2(X|φ_2) is one). We assume that each p_c(X|φ_c) satisfies the regularity conditions (i.e., the Fisher information matrix of p_c(X|φ_c) is non-singular around the maximum likelihood estimator) while p(X|θ) is non-regular (A1). Assumption A1 is less stringent than the regularity conditions on p(X|θ), and many models (e.g., Gaussian mixture models (GMM), PCM, and autoregressive mixture models) satisfy A1.
Let us denote the latent variable of X as Z = (Z_1, ..., Z_C). Z is a component assignment vector: Z_c = 1 if X is generated from the c-th component, and Z_c = 0 otherwise. The marginal distribution of Z and the conditional distribution of X given Z can be described as $p(Z|\alpha) = \prod_{c=1}^{C} \alpha_c^{Z_c}$ and $p(X|Z, \phi) = \prod_{c=1}^{C} (p_c(X|\phi_c))^{Z_c}$. The observed data and their latent variables are denoted as x^N = x_1, ..., x_N and z^N = z_1, ..., z_N, respectively, where z_n = (z_{n1}, ..., z_{nC}) and z^N_c = (z_{1c}, ..., z_{Nc}). We make another assumption, namely that log p(X, Z|θ) < ∞ holds (A2). Condition A2 is discussed in Section 4.5.
A model M of p(X|θ) is specified by C and by the models S_c of p_c(X|φ_c), i.e., M = (C, S_1, ..., S_C). Although the representations of θ and φ_c depend on M and S_c, respectively, we omit them for notational simplicity.
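To make the notation concrete, the following minimal sketch in Python (using numpy) draws data from a mixture by first sampling the assignment vector z_n from p(Z|α) and then x_n from the selected component p_c(X|φ_c). The two Gaussian components and all parameter values are illustrative assumptions, not anything specified in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative two-component 1-D Gaussian mixture: alpha, means, stds are assumptions.
    alpha = np.array([0.7, 0.3])           # mixture ratios, sum to 1
    means, stds = np.array([-2.0, 3.0]), np.array([1.0, 0.5])

    N = 5
    Z = rng.multinomial(1, alpha, size=N)  # z_n is a one-of-C assignment vector (z_nc in {0, 1})
    c = Z.argmax(axis=1)                   # index of the generating component
    x = rng.normal(means[c], stds[c])      # x_n drawn from the c-th component p_c(X|phi_c)

    print(Z)  # each row contains exactly one 1, as in the definition of Z
    print(x)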
3 Factorized Information Criterion for Mixture Models
A Bayesian selects the model which maximizes the model posterior p(M|x^N) ∝ p(M)p(x^N|M). With a uniform model prior p(M), we are particularly interested in p(x^N|M), which is referred to as the marginal likelihood or Bayesian evidence. VB methods for latent variable models [2, 5, 6, 21] consider a lower bound of the marginal log-likelihood of the form:

$$\log p(x^N|M) \ge \sum_{z^N} \int q(z^N)\, q_\theta(\theta) \log \frac{p(x^N, z^N|\theta)\, p(\theta|M)}{q(z^N)\, q_\theta(\theta)}\, d\theta, \qquad (1)$$

where q and q_θ are variational distributions on z^N and θ, respectively. In q and q_θ, z^N and θ are assumed to be independent of each other in order to make the lower bound computationally and analytically tractable. This, however, ignores the significant dependency between z^N and θ, and in general the equality does not hold.
In contrast to this, we consider the lower bound over q(z^N):

$$\log p(x^N|M) \ge \sum_{z^N} q(z^N) \log \frac{p(x^N, z^N|M)}{q(z^N)} \equiv VLB(q, x^N, M). \qquad (2), (3)$$

Lemma 1 [2] guarantees that max_q {VLB(q, x^N, M)} is exactly consistent with log p(x^N|M).

Lemma 1  The inequality (3) holds for an arbitrary distribution q on z^N, and the equality is satisfied by q(z^N) = p(z^N|M, x^N).
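As an informal numerical check of Lemma 1, the sketch below (Python with numpy and itertools) enumerates all assignments z^N for a tiny problem and confirms that VLB(q, x^N, M) matches log p(x^N|M) when q(z^N) is the exact posterior, and falls below it otherwise. The two-component, three-point example with fixed, known parameters (i.e., a point-mass prior on θ, so that p(x^N, z^N|M) factorizes over n) is an assumption made purely for illustration.

    import itertools
    import numpy as np

    # Toy setup (assumed values): theta is treated as known here.
    alpha = np.array([0.6, 0.4])                 # mixture ratios
    lik = np.array([[0.5, 0.1],                  # lik[n, c] = p_c(x_n | phi_c) for 3 points, 2 components
                    [0.2, 0.3],
                    [0.05, 0.4]])
    N, C = lik.shape

    log_evidence = np.log((alpha * lik).sum(axis=1)).sum()  # log p(x^N | M) by direct marginalization

    def vlb(q_nc):
        """VLB(q, x^N, M) for a q that factorizes over data points (rows of q_nc)."""
        total = 0.0
        for zs in itertools.product(range(C), repeat=N):     # enumerate all assignments z^N
            q_z = np.prod([q_nc[n, c] for n, c in enumerate(zs)])
            log_joint = sum(np.log(alpha[c] * lik[n, c]) for n, c in enumerate(zs))
            total += q_z * (log_joint - np.log(q_z))
        return total

    posterior = (alpha * lik) / (alpha * lik).sum(axis=1, keepdims=True)  # p(z_n | x_n)
    uniform = np.full((N, C), 1.0 / C)

    print(log_evidence, vlb(posterior))   # equal (Lemma 1: equality at q = posterior)
    print(vlb(uniform))                   # strictly smaller for a different q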
Ryohei Fujimaki, Satoshi Morinaga
Since this paper handles mixture models, we further assume mutual independence over z^N, i.e., $q(z^N) = \prod_{c=1}^{C} q(z^N_c)$ and $q(z^N_c) = \prod_{n=1}^{N} q(z_{nc})^{z_{nc}}$. The lower bound (2) is the same as the one considered by the collapsed variational Bayesian (CVB) method [15, 20]. A key difference is that FAB employs an asymptotic second-order approximation of (2), which produces the desirable properties of FAB described in Section 1, while CVB methods employ second-order approximations of the variational parameters in each iterative update step.
Note that the numerator of (3) has the form of the parameter integration of p(x^N, z^N|M):

$$p(x^N, z^N|M) = \int p(z^N|\alpha) \prod_{c=1}^{C} p_c(x^N|z^N_c, \phi_c)\, p(\theta|M)\, d\theta. \qquad (4)$$

A key idea in FIC is to apply the Laplace method to the individual factorized distributions p(z^N|α) and p_c(x^N|z^N_c, φ_c) (the Laplace approximation is not applicable to p(x^N|θ) itself because of its non-regularity):

$$\log p(x^N, z^N|\theta) \approx \log p(x^N, z^N|\bar{\theta}) - \frac{N}{2}\big[\bar{F}_Z, (\alpha - \bar{\alpha})\big] - \sum_{c=1}^{C} \frac{\sum_{n=1}^{N} z_{nc}}{2}\big[\bar{F}_c, (\phi_c - \bar{\phi}_c)\big], \qquad (5)$$

where [A, a] represents the quadratic form a^T A a for a matrix A and a vector a. Here we denote the ML estimator of log p(x^N, z^N|θ) as θ̄ = (ᾱ, φ̄_1, ..., φ̄_C). We discuss application of the Laplace method around the maximum a posteriori (MAP) estimator in Section 4.5. F̄_Z and F̄_c are factorized sample approximations of Fisher information matrices, defined as follows:

$$\bar{F}_Z = -\frac{1}{N}\frac{\partial^2 \log p(z^N|\alpha)}{\partial \alpha\, \partial \alpha^T}\bigg|_{\alpha=\bar{\alpha}}, \qquad (6)$$

$$\bar{F}_c = -\frac{1}{\sum_{n=1}^{N} z_{nc}}\frac{\partial^2 \log p_c(x^N|z^N_c, \phi_c)}{\partial \phi_c\, \partial \phi_c^T}\bigg|_{\phi_c=\bar{\phi}_c}.$$

The following lemma is satisfied for F̄_Z and F̄_c.

Lemma 2  With N → ∞, F̄_Z and F̄_c respectively converge to the Fisher information matrices described as:

$$F_Z = -\int \frac{\partial^2 \log p(Z|\alpha)}{\partial \alpha\, \partial \alpha^T}\, p(Z|\alpha)\, dZ\, \bigg|_{\alpha=\bar{\alpha}}, \qquad (7)$$

$$F_c = -\int \frac{\partial^2 \log p_c(X|\phi_c)}{\partial \phi_c\, \partial \phi_c^T}\, p_c(X|\phi_c)\, dX\, \bigg|_{\phi_c=\bar{\phi}_c}. \qquad (8)$$

The proof follows from the law of large numbers, taking into account that the effective number of samples for the c-th component is $\sum_{n=1}^{N} z_{nc}$.

Then, by applying a prior with log p(θ|M) = O(1), p(x^N, z^N|M) can be asymptotically approximated as:

$$p(x^N, z^N|M) \approx p(x^N, z^N|\bar{\theta}) \times \frac{(2\pi)^{D_\alpha/2}}{N^{D_\alpha/2}|\bar{F}_Z|^{1/2}} \times \prod_{c=1}^{C} \frac{(2\pi)^{D_c/2}}{(\sum_{n=1}^{N} z_{nc})^{D_c/2}|\bar{F}_c|^{1/2}}. \qquad (9)$$

Here, D_α ≡ D(α) = C − 1 and D_c ≡ D(φ_c), in which D(•) is the dimensionality of •.

On the basis of A1 and Lemma 2, both log|F̄_Z|^{1/2} and log|F̄_c|^{1/2} are O(1). By substituting (9) into (3) and ignoring asymptotically small terms, we obtain an asymptotic approximation of log p(x^N|M) = max_q{VLB(q, x^N, M)} (i.e., FIC) as follows:

$$FIC(x^N, M) = \max_q\big\{J(q, \bar{\theta}, x^N)\big\}, \qquad (10)$$

$$J(q, \bar{\theta}, x^N) = \sum_{z^N} q(z^N)\Big[\log p(x^N, z^N|\bar{\theta}) - \frac{D_\alpha}{2}\log N - \sum_{c=1}^{C}\frac{D_c}{2}\log\Big(\sum_{n=1}^{N} z_{nc}\Big) - \log q(z^N)\Big].$$

The following theorem justifies FIC as an approximation of the marginal log-likelihood.

Theorem 3  FIC(x^N, M) is asymptotically consistent with log p(x^N|M) under A1 and log p(θ|M) = O(1).

Sketch of Proof 3  Since both p_c(x^N|z^N_c, φ_c) and p(z^N|α) satisfy the regularity conditions (A1 and A2), their Laplace approximations appearing in (9) have asymptotic consistency (for the same reason as with BIC [7, 18, 19]). Therefore, their product (9) asymptotically agrees with p(x^N, z^N|M). By applying Lemma 1, the theorem is proved.
Despite our ignoring the prior effect, FIC still has the regularization term $\frac{D_c}{2}\sum_{z^N_c} q(z^N_c)\log(\sum_{n=1}^{N} z_{nc})$. This term has several interesting properties. First, a dependency between z^N and θ, which most VB methods ignore, explicitly appears. Like BIC, it is not the parameter or function representation of p(X|φ_c) but only its dimensionality D_c that plays an important role in the bridge between the latent variables and the parameters. Second, the complexity of the c-th component is adjusted by the value of $\sum_{z^N_c} q(z^N_c)\log(\sum_{n=1}^{N} z_{nc})$. Therefore, components of different sizes are automatically regularized at their individually appropriate levels. Third, roughly speaking, with respect to these terms it is preferable for the entropy of q(z^N) to be large. This makes the parameter estimation in FAB identifiable. More details are discussed in Sections 4.4 and 4.5.
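To see the size-dependent penalty at work, the following sketch (Python with numpy; the counts and dimensionalities are illustrative assumptions) compares the FIC penalty of (10), evaluated for a hard assignment z^N, with a single BIC-style penalty that charges log N for every parameter.

    import numpy as np

    N, C = 1000, 3
    D_alpha = C - 1
    D_c = np.array([9, 9, 9])            # assumed parameter dimensionalities D(phi_c)
    sizes = np.array([700, 250, 50])     # assumed hard counts sum_n z_nc (they sum to N)

    fic_penalty = 0.5 * D_alpha * np.log(N) + 0.5 * (D_c * np.log(sizes)).sum()
    bic_penalty = 0.5 * (D_alpha + D_c.sum()) * np.log(N)   # BIC charges log N for every parameter

    print(fic_penalty, bic_penalty)      # FIC charges small components less: log(sum_n z_nc) < log N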
4 Factorized Asymptotic Bayesian Inference Algorithm

4.1 Lower bound of FIC

Since θ̄ is not available in practice, we cannot evaluate FIC itself. Instead, FAB maximizes an asymptotically-consistent lower bound of FIC. We first derive this lower bound.

Lemma 4  Let us define L(a, b) ≡ log b + (a − b)/b. FIC(x^N, M) is lower bounded as follows:

$$FIC(x^N, M) \ge G(q, \tilde{q}, \theta, x^N) \equiv \sum_{z^N} q(z^N)\Big[\log p(x^N, z^N|\theta) - \frac{D_\alpha}{2}\log N - \sum_{c=1}^{C}\frac{D_c}{2}\, L\Big(\sum_{n=1}^{N} z_{nc}, \sum_{n=1}^{N}\tilde{q}(z_{nc})\Big) - \log q(z^N)\Big], \qquad (11)$$

with arbitrary choices of q, θ, and a distribution q̃ on z^N (with $\sum_{n=1}^{N}\tilde{q}(z_{nc}) > 0$).

In addition to replacing the unavailable ML estimator θ̄ with an arbitrary θ, the computationally intractable $\log(\sum_{n=1}^{N} z_{nc})$ is linearized around $\sum_{n=1}^{N}\tilde{q}(z_{nc})$ with a new parameter (distribution) q̃.

Sketch of Proof 4  Since θ̄ is the ML estimator of log p(x^N, z^N|θ), log p(x^N, z^N|θ̄) ≥ log p(x^N, z^N|θ) is satisfied. Further, on the basis of the concavity of the logarithm function, $\log(\sum_{n=1}^{N} z_{nc}) \le L(\sum_{n=1}^{N} z_{nc}, \sum_{n=1}^{N}\tilde{q}(z_{nc}))$ is satisfied. By substituting these two inequalities into (10), we obtain (11).

The following lemma gives us a guide for optimizing the newly introduced distribution q̃. We omit the proof because of space limitations.

Lemma 5  If we fix q and θ, then q̃ = q maximizes G(q, q̃, θ, x^N).

Now our problem is to solve the following maximization problem (note that q, θ, and q̃ are functions of M):

$$M^*, q^*, \theta^*, \tilde{q}^* = \arg\max_{M, q, \theta, \tilde{q}} G(q, \tilde{q}, \theta, x^N). \qquad (12)$$

With a finite number of model candidates M, we can use a two-stage algorithm in which we first solve the maximization of (12) for all candidates in M and then select the best model. When we optimize only the number of components C (e.g., as in a GMM), we need to solve the inner maximization C_max times (the maximum search number of components). However, if we intend to optimize S_1, ..., S_C as well (e.g., as in PCM or an autoregressive mixture), we must avoid a combinatorial scalability issue. FAB provides a natural way to do this because the objective function separates into independent parts in terms of S_1, ..., S_C.
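The only analytic ingredient of Lemma 4 beyond the ML inequality is the tangent-line bound log(a) ≤ L(a, b) = log b + (a − b)/b; the sketch below (Python with numpy; the values of a and b are arbitrary) checks it numerically. Because the penalized term enters (11) with a negative sign, over-estimating it is exactly what turns the objective of (10) into the lower bound G.

    import numpy as np

    def L(a, b):
        """Linearization of log at b: by concavity, log(a) <= L(a, b), with equality at a = b."""
        return np.log(b) + (a - b) / b

    a = np.array([1.0, 5.0, 40.0, 300.0])   # assumed values of sum_n z_nc
    b = 50.0                                # assumed expansion point sum_n q~(z_nc)

    print(np.log(a))                        # exact
    print(L(a, b))                          # upper bound; equal only where a == b
    print(np.all(np.log(a) <= L(a, b)))     # True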
4.2 Iterative Optimization Algorithm

Let us first fix C and consider the following optimization problem:

$$S^*, q^*, \theta^*, \tilde{q}^* = \arg\max_{S, q, \theta, \tilde{q}} G(q, \tilde{q}, \theta, x^N), \qquad (13)$$

where S = (S_1, ..., S_C). As the inner maximization, FAB solves (13) through iterations of two sub-steps (the V-step and the M-step). Let the superscript (t) represent the t-th iteration.

V-step  The V-step optimizes the variational distribution q as follows:

$$q^{(t)} = \arg\max_{q}\big\{G(q, \tilde{q} = q^{(t-1)}, \theta^{(t-1)}, x^N)\big\}. \qquad (14)$$

More specifically, q^{(t)} is obtained as:

$$q^{(t)}(z_{nc}) \propto \alpha_c^{(t-1)}\, p(x_n|\phi_c^{(t-1)})\, \exp\Big(\frac{-D_c}{2\alpha_c^{(t-1)} N}\Big), \qquad (15)$$

where we use $\sum_{n=1}^{N} q^{(t-1)}(z_{nc}) = \alpha_c^{(t-1)} N$. The regularization terms discussed in Section 3 appear here as exponentiated update factors, i.e., $\exp(-D_c/(2\alpha_c^{(t-1)}N))$. Roughly speaking, (15) indicates that smaller and more complex components are likely to become even smaller through the iterations. The V-step differs from the E-step of the EM algorithm in the factor $\exp(-D_c/(2\alpha_c^{(t-1)}N))$. This makes an essential difference between FAB inference and ML estimation with the EM algorithm (Sections 4.4 and 4.5).

M-step  The M-step optimizes the models of the individual components S and the parameters θ as follows:

$$S^{(t)}, \theta^{(t)} = \arg\max_{S, \theta} G(q^{(t)}, \tilde{q} = q^{(t)}, \theta, x^N). \qquad (16)$$

G(q, q̃, θ, x^N) has a significant advantage in that it avoids the combinatorial scalability issue in the selection of S. Since G(q, q̃, θ, x^N) has no cross terms between components (if we fix q̃), we can separately optimize S and θ as follows:

$$\alpha_c^{(t)} = \sum_{n=1}^{N} q^{(t)}(z_{nc})/N, \qquad (17)$$

$$S_c^{(t)}, \phi_c^{(t)} = \arg\max_{S_c, \phi_c} H_c(q^{(t)}, q^{(t)}, \phi_c, x^N), \qquad (18)$$

$$H_c(q^{(t)}, q^{(t)}, \phi_c, x^N) = \sum_{n=1}^{N} q^{(t)}(z_{nc}) \log p(x_n|\phi_c) - \frac{D_c}{2}\, L\Big(\sum_{n=1}^{N} q^{(t)}(z_{nc}), \sum_{n=1}^{N} q^{(t)}(z_{nc})\Big).$$

H_c(q^{(t)}, q^{(t)}, φ_c, x^N) is the part of G(q^{(t)}, q^{(t)}, θ, x^N) that is related to the c-th component. With a finite set of component candidates, in (18) we first optimize φ_c for each candidate S_c and then select the optimal one by comparing them. Note that, if we use a single component type (e.g., GMM), our M-step eventually becomes the same as that in the EM algorithm.
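The following sketch (Python with numpy) shows where (15) and (17) enter in practice for the simplest case of a one-dimensional GMM with a single component type. The synthetic data, the initialization, the iteration count, the crude numerical floors, and the absence of the thresholding step (22) are all assumptions of this illustration, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic 1-D data from two Gaussians (assumed ground truth), fit with C = 3 candidate components.
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
    N, C = x.size, 3
    D_c = np.full(C, 2)                   # a 1-D Gaussian component has 2 free parameters (mean, variance)

    # Initialization (assumed): equal weights, means drawn from the data, unit variances.
    alpha = np.full(C, 1.0 / C)
    mu, var = rng.choice(x, C), np.ones(C)

    def log_gauss(x, mu, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x[:, None] - mu) ** 2 / var)

    for t in range(50):
        # V-step, eq. (15): responsibilities with the FAB shrinkage factor exp(-D_c / (2 alpha_c N)).
        log_r = np.log(alpha) + log_gauss(x, mu, var) - D_c / (2 * alpha * N)
        log_r -= log_r.max(axis=1, keepdims=True)
        q = np.exp(log_r)
        q /= q.sum(axis=1, keepdims=True)

        # M-step, eqs. (17)-(18): weighted ML updates (single component type, so same as EM's M-step).
        Nc = q.sum(axis=0)
        alpha = np.maximum(Nc / N, 1e-8)  # crude floor; the paper instead removes tiny components via (22)
        mu = (q * x[:, None]).sum(axis=0) / np.maximum(Nc, 1e-8)
        var = np.maximum((q * (x[:, None] - mu) ** 2).sum(axis=0) / np.maximum(Nc, 1e-8), 1e-6)

    print(np.round(alpha, 3), np.round(mu, 2))  # components receiving little data are further down-weighted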
4.3 Convergence and Consistency

After the t-th iteration, FIC is lower bounded as:

$$FIC(x^N, M) \ge FIC_{LB}^{(t)}(x^N, M) \equiv G(q^{(t)}, q^{(t)}, \theta^{(t)}, x^N). \qquad (19)$$

We have the following theorem, which guarantees that FAB monotonically increases $FIC_{LB}^{(t)}(x^N, M)$ over the iterative optimization and converges to a local optimum under certain regularity conditions.

Theorem 6  For the iteration of the V-step and the M-step, the following inequality is satisfied:

$$FIC_{LB}^{(t)}(x^N, M) \ge FIC_{LB}^{(t-1)}(x^N, M). \qquad (20)$$

Sketch of Proof 6  The theorem is proved as follows: G(q^{(t)}, q^{(t)}, θ^{(t)}, x^N) ≥ G(q^{(t)}, q^{(t)}, θ^{(t−1)}, x^N) ≥ G(q^{(t)}, q^{(t−1)}, θ^{(t−1)}, x^N) ≥ G(q^{(t−1)}, q^{(t−1)}, θ^{(t−1)}, x^N). The first and the third inequalities arise from (16) and (14), respectively. The second inequality arises from Lemma 5.

We then use the following stopping criterion for the FAB iteration steps:

$$FIC_{LB}^{(t)}(x^N, M) - FIC_{LB}^{(t-1)}(x^N, M) \le \varepsilon, \qquad (21)$$

where ε is an optimization tolerance parameter and is set to 1e-6 in Section 5.

In addition to the monotonic property of FAB, we present the theorem below, which provides asymptotic theoretical support for FAB. Let us denote M^{(t)} = (C, S^{(t)}), and let the superscripts T and ⋆ represent the number of steps at convergence and the true model/parameters, respectively.

Theorem 7  $FIC_{LB}^{(T)}(x^N, M^{(T)})$ is asymptotically consistent with $FIC(x^N, M^{(T)})$.

Sketch of Proof 7  1) In (15), $\exp(-D_c/(2\alpha_c^{(T)}N))$ converges to one, and $S^{(T-1)} = S^{(T)}$ holds without loss of generality. Therefore, the FAB algorithm is asymptotically reduced to the EM algorithm. This means that θ^{(T)} converges to the ML estimator θ̂ of log p(x^N|θ). Then, $|\sum_{z^N} q^{(T)}(z^N)(\log p(x^N, z^N|\bar{\theta}) - \log p(x^N, z^N|\theta^{(T)}))|/N \to 0$ holds. 2) On the basis of the law of large numbers, $\sum_{n=1}^{N} z_{nc}/N$ converges to $\alpha_c^\star$. Then, $|\sum_{z^N} q^{(T)}(z^N)\sum_{c=1}^{C}(\log(\sum_{n=1}^{N} z_{nc}) - \log(\sum_{n=1}^{N} q^{(T)}(z_{nc})))|/N \to 0$ holds. Substituting 1) and 2) into (11) proves the theorem.

The following theorem arises from Theorems 3 and 7 and guarantees that FAB is asymptotically capable of evaluating the marginal log-likelihood itself:

Theorem 8  $FIC_{LB}^{(T)}(x^N, M^{(T)})$ is asymptotically consistent with $\log p(x^N|M^{(T)})$.
4.4 Shrinkage Mechanism

As has been noted in Sections 3 and 4.2, the factor $\exp(-D_c/(2\alpha_c^{(t)}N))$ in (15), which arises from the term $\frac{D_c}{2}\sum_{z^N_c} q(z^N_c)\log(\sum_{n=1}^{N} z_{nc})$ in (10), plays a significant role in the control of model complexity.

Fig. 1 illustrates the shrinkage effects of FAB in the case of a GMM. For D = 10 and N = 50, for example, a component with α_c < 0.2 is severely regularized, and such components are automatically shrunk (see the left graph). With increasing N, the shrinkage effect decreases, and small components manage to survive in the FAB optimization process. The right graph shows the shrinkage effect in terms of the dimensionality. In a low-dimensional space (e.g., D = 1, 3), small components are not strictly regularized. For D = 20 and N = 100, however, a component with α_c < 0.4 may be shrunk, and only one component might survive because of the constraint $\sum_{c=1}^{C}\alpha_c = 1$.

Let us consider the gradient of $\exp(-D_c/(2\alpha_c N))$, which is derived as $\partial \exp(-D_c/(2\alpha_c N))/\partial \alpha_c = \frac{D_c}{2\alpha_c^2 N}\exp(-D_c/(2\alpha_c N)) \ge 0$. This indicates that the FAB iterations accelerate the shrinkage at a rate of order $O(\alpha_c^{-2})$: intuitively, the smaller a component is, the more likely it is to be shrunk further. These observations explain the overfitting-mitigation mechanism in FAB.

The shrinkage effect of FAB is different from the prior-based regularization of MAP estimation. In fact, it appears despite our asymptotically ignoring the prior effect, and even if a non-informative or flat prior is applied.

In terms of practical implementation, α_c does not become exactly zero because (15) takes an exponentiated update. Therefore, in our FAB procedure, we apply a thresholding-based shrinkage of small components after the V-step, as follows:

$$q^{(t)}(z_{nc}) = \begin{cases} 0 & \text{if } \sum_{n=1}^{N} q^{(t)}(z_{nc}) < \delta, \\ q^{(t)}(z_{nc})/Q_c^{(t)} & \text{otherwise}, \end{cases} \qquad (22)$$

where δ and $Q_c^{(t)}$ are a threshold value and a normalization constant ensuring $\sum_{c=1}^{C} q^{(t)}(z_{nc}) = 1$, respectively. We used the threshold value δ = 0.01N in Section 5. With this shrinkage operation, FAB does not require the two-stage optimization in (12). By starting from a sufficiently large number of components C_max, FAB iteratively and simultaneously optimizes all of M = (C, S), θ, and q. Since FAB cannot revive shrunk components, it is a greedy algorithm, like the least angle regression method [9], which does not revive shrunk features.
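A minimal sketch of the two quantities just discussed (Python with numpy; D, N, δ, and the responsibilities are illustrative assumptions): the shrinkage factor exp(−D_c/(2α_c N)) with the GMM parameter count D_c = D + D(D + 1)/2, and the thresholding step (22).

    import numpy as np

    def shrink_factor(alpha_c, D, N):
        """exp(-D_c / (2 alpha_c N)) with the GMM parameter count D_c = D + D(D+1)/2."""
        D_c = D + D * (D + 1) / 2
        return np.exp(-D_c / (2 * alpha_c * N))

    alpha_grid = np.array([0.05, 0.1, 0.2, 0.4])
    print(shrink_factor(alpha_grid, D=10, N=50))    # small components are heavily down-weighted
    print(shrink_factor(alpha_grid, D=10, N=5000))  # the effect fades as N grows

    def threshold_step(q, delta):
        """Eq. (22): zero out components whose expected size falls below delta, then renormalize rows."""
        q = np.where(q.sum(axis=0) < delta, 0.0, q)
        return q / q.sum(axis=1, keepdims=True)

    N = 200
    q = np.tile([0.4975, 0.4975, 0.005], (N, 1))            # assumed responsibilities; third column is tiny
    print(threshold_step(q, delta=0.01 * N).sum(axis=0))    # the tiny third component is removed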
Figure 1: Shrinkage effects of FAB in GMM (D_c = D + D(D + 1)/2, where D and D_c are, respectively, the data dimensionality and the parameter dimensionality). The horizontal and vertical axes represent α_c and exp(−D_c/(α_c N)), respectively. The effect is evident when the number of data is not sufficient for the parameter dimensionality.

The two-stage algorithm and the shrinkage algorithm are shown as Algorithms 1 and 2. S is a set of component candidates (e.g., S = {Gaussian} for GMM; S = {0, ..., K_max} for PCM, where K_max is the maximum degree of the curves).

Algorithm 1  FAB_two: Two-stage FAB
input: x^N, C_max, S, ε
output: C*, S*, θ*, q*(z^N), FIC*_LB
1: FIC*_LB = −∞
2: for C = 1, ..., C_max do
3:    Calculate FIC_LB^(T), C, S^(T), θ^(T), and q^(T)(z^N) by FAB_shrink(x^N, C_max = C, S, ε, δ = 0).
4: end for
5: Choose C*, S*, θ*, and q*(z^N) by (12).

Algorithm 2  FAB_shrink: Shrinkage FAB
input: x^N, C_max, S, ε, δ
output: FIC_LB^(T), C, S^(T), θ^(T), q^(T)(z^N)
1: t = 0, FIC_LB^(0) = −∞
2: randomly initialize q^(0)(z^N)
3: while not converged do
4:    Calculate S^(t) and θ^(t) by (17) and (18).
5:    Check convergence by (21).
6:    Calculate q^(t+1) by (15).
7:    Shrink components by (22).
8:    t = t + 1
9: end while
10: T = t
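Algorithm 1 is simply an outer loop over C that calls the shrinkage routine with δ = 0 and keeps the model with the largest FIC_LB; a schematic sketch (Python; fab_shrink is a hypothetical callable standing in for an implementation of Algorithm 2, not code from the paper):

    def fab_two(x, C_max, S, eps, fab_shrink):
        """Two-stage FAB (Algorithm 1): run the shrinkage routine with delta = 0 for each C."""
        best = (-float("inf"), None)                      # (FIC_LB, fitted model and parameters)
        for C in range(1, C_max + 1):
            fic_lb, model = fab_shrink(x, C_max=C, S=S, eps=eps, delta=0.0)
            if fic_lb > best[0]:                          # model selection as in (12)
                best = (fic_lb, model)
        return best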
4.5 Identifiability
A well-known difficulty in ML estimation for mixture models is non-identifiability, as we have noted in Section 1. Let us consider a simple mixture model, p(X|a, b, c) = a p(X|b) + (1 − a) p(X|c). All models p(X|a, b, c = b) with arbitrary a are equivalent (i.e., they are in an equivalent class), and therefore the ML estimator is not unique. Theoretical studies have shown that Bayesian estimators and VB estimators avoid this issue and significantly outperform the ML estimator in terms of generalization error [17, 21, 24].

In FAB, at the stationary (convergence) point of (15), the following equality is satisfied: $\sum_{n=1}^{N} q^*(z_{nc}) = \alpha_c^* N = \sum_{n=1}^{N} \alpha_c^*\, p(x_n|\phi_c^*)\, \exp\big(\frac{-D_c}{2\alpha_c^* N}\big)/Z_n$, where $Z_n = \sum_{c=1}^{C} \alpha_c^*\, p(x_n|\phi_c^*)\, \exp\big(\frac{-D_c}{2\alpha_c^* N}\big)$ is the normalization constant. For all pairs $\alpha_c^*, \alpha_{c'}^* > 0$, we therefore have the condition $\sum_{n=1}^{N} p(x_n|\phi_c^*)\exp\big(\frac{-D_c}{2\alpha_c^* N}\big) = \sum_{n=1}^{N} p(x_n|\phi_{c'}^*)\exp\big(\frac{-D_{c'}}{2\alpha_{c'}^* N}\big)$. Then, a necessary condition for $\phi_c^* = \phi_{c'}^*$ is $\alpha_c^* = \alpha_{c'}^*$. Therefore, FAB chooses, within an equivalent class, the model which maximizes the entropy of q, and thereby addresses the non-identifiability issue.

Unfortunately, FIC and FAB do not resolve another non-identifiability issue, i.e., divergence of the criterion. Without assumption A2, (10) and (11) can diverge to infinity (e.g., for a GMM with a zero-variance component; practically speaking, with the shrinkage (22), FAB does not suffer from this divergence because such small components are automatically removed). We can address this issue by applying the Laplace approximation around the MAP estimator θ̄_MAP of p(x^N, z^N|θ)p(θ) rather than around the ML estimator θ̄. In such a case, a flat conjugate prior might be used in order to avoid the divergence. While such priors do not provide regularization effects, FAB has an overfitting-mitigation mechanism, as has been discussed above. Because of space limitations, we omit a detailed discussion of FIC and FAB with prior distributions.

5 Experiments

5.1 Polynomial Curve Mixture

We first evaluated FAB_shrink for PCM to demonstrate how its model selection works. We used the settings N = 300 and C_max = K_max = 10 in this evaluation (larger values of C_max and K_max did not make a large difference in the results, but they were hard to visualize). Let Y be a random dependent variable of X.
Figure 2: Model selection procedure of FAB_shrink in application to PCM. The true curves are: 1) Y = 5 + ε, 2) Y = X² + 5 + ε, 3) Y = X + ε, and 4) Y = −0.5X³ + 2X + ε, where ε ∼ N(0, 1). The top shows the change in FIC_LB^(t) over iteration t. The bottom three graphs are the estimated curves at t = 1, 6, and 44.
In PCM, the c-th mixed component is p_c(Y|X, φ_c) = N(Y; X(S_c)β_c, σ_c²), where X(S_c) = (1, X, ..., X^{S_c}). The true model has four curves, as shown in Fig. 2. We used the setting α = (0.3, 0.2, 0.3, 0.2) for the curves 1)-4). x^N was uniformly sampled from [−5, 5].

Fig. 2 illustrates FIC_LB^(t) (top) and intermediate estimation results (bottom). From the top figure, we can confirm that FAB monotonically increases FIC_LB^(t), except at the points (dashed circles) of the shrinkage operation. In the initial state t = 1 (bottom left), q^(0)(z^N) is randomly initialized, and therefore the degrees of all ten curves are zero (S_c = 0). Over the iterations, FAB_shrink simultaneously searches C, S, and θ. For example, at t = 15 (bottom center), we have one curve of S_c = 9, two curves of S_c = 2, and six curves of S_c = 0; here two curves have already been shrunk. Finally, FAB_shrink selects the model consistent with the true model at t = 44 (bottom right). These results demonstrate the powerful simultaneous model selection procedure of FAB_shrink.
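For PCM, the component model p_c(Y|X, φ_c) = N(Y; X(S_c)β_c, σ_c²) makes the per-component part of the M-step a weighted least-squares fit compared across candidate degrees. The sketch below (Python with numpy; the data, the responsibilities, and the simplified penalty, which uses L(a, a) = log a, are illustrative assumptions) shows how a degree S_c would be selected by maximizing H_c in (18).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-5, 5, 120)
    y = x + rng.normal(0, 1, x.size)        # assumed data roughly following the degree-1 curve Y = X + eps
    q_c = rng.uniform(0.5, 1.0, x.size)     # assumed responsibilities q(z_nc) for one component

    def component_score(degree):
        """H_c of (18) for one candidate degree: weighted Gaussian log-likelihood minus (D_c/2) log(sum_n q)."""
        X = np.vander(x, degree + 1, increasing=True)                   # design matrix (1, X, ..., X^degree)
        W = q_c
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))   # weighted least squares
        resid = y - X @ beta
        sigma2 = (W * resid ** 2).sum() / W.sum()
        loglik = -0.5 * (W * (np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)).sum()
        D_c = degree + 2                                                # coefficients plus the noise variance
        return loglik - 0.5 * D_c * np.log(W.sum())

    scores = {k: component_score(k) for k in range(0, 6)}
    print(max(scores, key=scores.get), scores)                          # the selected degree maximizes H_c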
5.2 Gaussian Mixture Model
We applied FAB_two and FAB_shrink to GMM on artificial datasets and UCI datasets [10], and compared them with the variational Bayesian EM algorithm for GMM (VBEM) described in Section 10.2 of [3] and the variational Bayesian Dirichlet process GMM (VBDP) [5]. The former and the latter use a Dirichlet prior and a Dirichlet process prior for α, respectively. We used the implementation by M. E. Khan (http://www.cs.ubc.ca/~emtiyaz/software.html) for VBEM and that by K. Kurihara (http://sites.google.com/site/kenichikurihara/) for VBDP, with the default hyperparameter values in the software. In the following experiments, C_max was set to 20 for all methods.
5.2.1 Artificial Data
We generated the true model with D = 15, C⋆ = 5, and φ_c⋆ = (μ_c⋆, Σ_c⋆) (mean and covariance), where the superscript ⋆ denotes the true model. The parameter values were randomly sampled as α_c⋆ ∼ [0.4, 0.6] (before normalization) and μ_c⋆ ∼ [−5, 5]. Σ_c⋆ was generated as a_c Σ̄_c⋆, where a_c ∼ [0.5, 1.5] is a scale parameter and Σ̄_c⋆ is generated using the "gallery" function in Matlab. The results are averages over ten runs.
The number of data required is an important measure for evaluating asymptotic methods, and it is usually assessed in terms of the empirical convergence of their criteria. Table 1 shows how the values of FIC_LB^(T)/N converge over N. Roughly speaking, N = 1000-2000 was sufficient for convergence (depending on the data dimensionality D). Our results indicate that FAB can be applied in practice to recent data analysis scenarios with large N.
We next compared the four methods on the basis of three evaluation metrics: 1) the Kullback-Leibler divergence (KL) between the true distribution and the estimated distribution (estimation error), 2) |C* − C⋆| (model selection error), and 3) CPU time. For N ≤ 500, VBDP performed better than or comparably with FAB_two and FAB_shrink w.r.t. KL and |C* − C⋆| (left and middle graphs of Fig. 3). With increasing N, the errors of the FABs decrease significantly while those of VBDP do not. This might be because the lower bound in (1) generally does not reach the marginal log-likelihood, as has been noted. VBEM was significantly inferior for both metrics.
Table 1: Convergence of FIC*_LB over data size N. The standard deviations are in parentheses.

N       FAB_two          FAB_shrink
50      -10.31 (3.51)    -12.78 (2.61)
100     -16.27 (2.51)    -17.22 (1.56)
250     -15.46 (1.19)    -15.66 (1.24)
500     -15.14 (1.05)    -15.16 (1.02)
1000    -14.63 (1.10)    -14.62 (0.85)
2000    -14.52 (0.50)    -14.68 (0.74)
3000    -14.34 (0.39)    -14.39 (0.39)
Table 2: Predictive likelihood per data point. The standard deviations are in parentheses. The maximum number of training samples was limited to 2000 (parentheses in the N column), and the rest were used for test. The best results and those not significantly worse than them are highlighted in boldface (one-sided t-test with 95% confidence).

data                   N               D    FAB_two         FAB_shrink      VBEM            VBDP
iris                   150             4    -1.92 (0.76)    -2.11 (0.68)    -3.05 (0.64)    -1.65 (0.57)
yeast                  1394            13   18.79 (1.81)    10.86 (3.38)    3.52 (3.80)     15.94 (0.72)
cloud                  2048            10   7.89 (2.81)     9.09 (2.62)     5.47 (2.62)     2.67 (2.28)
wine quality           6497 (2000)     11   -2.68 (0.22)    -2.68 (0.17)    -10.49 (0.17)   -16.52 (0.06)
color moments          68040 (2000)    9    -6.97 (0.20)    -6.95 (0.21)    -7.58 (0.21)    -7.66 (0.11)
cooc texture           68040 (2000)    16   14.45 (0.22)    14.45 (0.46)    13.45 (0.13)    6.78 (0.28)
spoken arabic digits   350319 (2000)   13   -12.94 (0.11)   -12.90 (0.09)   -13.81 (0.09)   -14.34 (0.19)
Figure 3: Comparisons of the four methods in terms of KL (left), |C* − C⋆| (middle), and CPU time (right).
Another observation is that VBDP and VBEM were likely to under-fit and over-fit their models, respectively, while we did not find such biases in the FABs. While the KLs of the FABs are similar, FAB_shrink performed better in terms of |C* − C⋆| because small components survived in FAB_two. Further, both FABs had a significant advantage over VBDP and VBEM in CPU time (right graph). In particular, while the CPU time of VBDP grew explosively with N, those of the FABs remained in a practical range. Since FAB_shrink does not use a loop to optimize C, it was significantly faster than FAB_two.
5.2.2 UCI Data
For the UCI datasets summarized in Table 2, we do not know the true distributions and therefore employed predictive log-likelihood as the evaluation metric. For the large-scale datasets, we randomly selected two thousand data points for training and used the rest for prediction, because of the scalability issue in VBDP. As shown in Table 2, the FABs gave better performance than VBDP and VBEM with a sufficient number of data (the scale O(10³) agrees with the results in the previous section). The trend of VBDP (under-fitting) and VBEM (over-fitting) was the same as in the previous section. Also, while the predictive log-likelihoods of FAB_two and FAB_shrink are competitive, FAB_shrink obtained more compact models (smaller C* values) than FAB_two. Unfortunately, we have no space to show those results here.
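For reference, the metric reported in Table 2 is the average held-out log-likelihood under the fitted mixture. A minimal sketch of how such a number is computed for a 1-D GMM (Python with numpy; the fitted parameters and test data are illustrative assumptions):

    import numpy as np

    def predictive_loglik_per_point(x_test, alpha, means, vars_):
        """Average log p(x | theta_hat) over held-out points for a fitted 1-D GMM."""
        comp = -0.5 * (np.log(2 * np.pi * vars_) + (x_test[:, None] - means) ** 2 / vars_)
        log_mix = np.log((alpha * np.exp(comp)).sum(axis=1))
        return log_mix.mean()

    rng = np.random.default_rng(3)
    x_test = rng.normal(0, 2, 1000)
    print(predictive_loglik_per_point(x_test, np.array([0.5, 0.5]),
                                      np.array([-1.0, 1.0]), np.array([1.5, 1.5])))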
6 Summary and Future Work
We have proposed an approximation of the marginal log-likelihood (FIC) and an inference method (FAB) for Bayesian model selection in mixture models, as an alternative to VB inference. We have given their justifications (asymptotic consistency, convergence, etc.) and analyzed the FAB mechanisms in terms of overfitting mitigation (shrinkage) and identifiability. Experimental results have shown that FAB outperforms state-of-the-art VB methods for a practical number of data in terms of both model selection performance and computational efficiency.

Our next step is to extend FAB to more general non-regular and non-identifiable models which contain continuous latent variables (e.g., factor analyzer models [12] and matrix factorization models [17]), time-dependent latent variables (e.g., hidden Markov models [16]), hierarchical latent variables [14, 4], etc. We believe that our idea, a factorized representation of the marginal log-likelihood in which the asymptotic approximation works, is widely applicable to them.
References
[1] T. Ando. Bayesian Model Selection and Statistical Modeling. Chapman and Hall/CRC, 2010.
[2] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
[4] C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):281–293, 1998.
[5] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.
[6] A. Corduneanu and C. M. Bishop. Variational Bayesian model selection for mixture distributions. In Proceedings of Artificial Intelligence and Statistics, pages 27–34, 2001.
[7] I. Csiszar and P. C. Shields. The consistency of the BIC Markov order estimator. Annals of Statistics, 28(6):1601–1619, 2000.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1):1–38, 1977.
[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[10] A. Frank and A. Asuncion. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010.
[11] R. Fujimaki, Y. Sogawa, and S. Morinaga. Online heterogeneous mixture modeling with marginal and copula selection. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 645–653, 2011.
[12] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, pages 449–455. MIT Press, 2000.
[13] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 186:453–461, 1946.
[14] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
[15] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pages 2796–2801, 2007.
[16] D. J. MacKay. Ensemble learning for hidden Markov models. Technical report, University of Cambridge, 1997.
[17] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2579–2644, 2011.
[18] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[19] M. Stone. Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society, Series B, 41(2):276–278, 1979.
[20] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1378–1385, 2006.
[21] K. Watanabe and S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625–644, 2006.
[22] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13:899–933, 2001.
[23] S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009.
[24] K. Yamazaki and S. Watanabe. Mixture models and upper bounds of stochastic complexity. Neural Networks, 16:1029–1038, 2003.