Factorized Asymptotic Bayesian Inference for Mixture Modeling

Ryohei Fujimaki
NEC Laboratories America, Department of Media Analytics
[email protected]

Satoshi Morinaga
NEC Corporation, Information and Media Processing Research Laboratories
[email protected]

Abstract

This paper proposes a novel Bayesian approximate inference method for mixture modeling. Our key idea is to factorize the marginal log-likelihood using a variational distribution over latent variables. An asymptotic approximation, the factorized information criterion (FIC), is obtained by applying the Laplace method to each of the factorized components. In order to evaluate FIC, we propose factorized asymptotic Bayesian inference (FAB), which maximizes an asymptotically-consistent lower bound of FIC. FIC and FAB have several desirable properties: 1) asymptotic consistency with the marginal log-likelihood, 2) automatic component selection on the basis of an intrinsic shrinkage mechanism, and 3) parameter identifiability in mixture modeling. Experimental results show that FAB outperforms state-of-the-art VB methods.

1 Introduction

Model selection is one of the most difficult and important challenges in machine learning, in which we optimize a model representation as well as model parameters. Bayesian learning provides a natural and sophisticated way to address this issue [1, 3]. A central task in Bayesian model selection is the evaluation of the marginal log-likelihood. Since exact evaluation is often computationally and analytically intractable, a number of approximation algorithms have been studied. One well-known precursor is the Bayes information criterion (BIC) [18], an asymptotic second-order approximation using the Laplace method. Since BIC assumes regularity conditions that ensure the asymptotic normality of the maximum likelihood (ML) estimator, it loses its theoretical justification for non-regular models, including mixture models, hidden Markov models, multi-layer neural networks, etc. Further, it is known that ML estimation does not have a unique solution for non-identifiable models, in which the mapping between parameters and functions is not one-to-one [22, 23] (such equivalent models are said to be in an equivalent class). For such non-regular and non-identifiable models, the generalization error of the ML estimator degrades significantly [24].

Among recent advanced Bayesian methods, we focus on variational Bayesian (VB) inference. A VB method maximizes a computationally-tractable lower bound of the marginal log-likelihood by applying variational distributions to latent variables and parameters. Previous studies have investigated VB methods for a variety of models [2, 6, 5, 12, 15, 16, 20] and have demonstrated their practicality. Further, theoretical studies [17, 21] have shown that VB methods can resolve the non-regularity and non-identifiability issues. One disadvantage, however, is that latent variables and model parameters are partitioned into sub-groups that are independent of each other in the variational distributions. While this independence makes the computation tractable, the lower bound can be loose, since the dependency between latent variables and parameters is essential in the true distribution.
This paper proposes a novel Bayesian approximate inference method for a family of mixture models, which is one of the most significant examples of non-regular and non-identifiable model families. Our ideas and contributions are summarized as follows:

Factorized Information Criterion  We derive a new approximation of the marginal log-likelihood and refer to it as the factorized information criterion (FIC). A key observation is that, using a variational distribution over latent variables, the marginal log-likelihood can be rewritten in a factorized representation in which the Laplace approximation is applicable to each of the factorized components. FIC is unique in the sense that it takes into account dependencies among latent variables and parameters, and FIC is asymptotically consistent with the marginal log-likelihood.

Factorized Asymptotic Bayesian Inference  We propose a new approximate inference algorithm and refer to it as factorized asymptotic Bayesian inference (FAB). Since FIC is defined on the basis of both observed and latent variables, the latter of which are not available, FAB maximizes a lower bound of FIC through an iterative optimization similar to the EM algorithm [8] and VB methods [2]. This FAB optimization is proved to monotonically increase the lower bound of FIC. It is worth noting that FAB provides a natural way to control model complexity not only in terms of the number of components but also in terms of the types of the individual components (e.g., the degrees of the components in a polynomial curve mixture (PCM)).

FIC and FAB offer the following desirable properties:

1. The FIC lower bound is asymptotically consistent with FIC, and therefore with the marginal log-likelihood.

2. FAB automatically selects components on the basis of an intrinsic shrinkage mechanism that differs from prior-based regularization. FAB therefore mitigates overfitting even though it ignores the prior effect in the asymptotic sense, and even if we apply a non-informative prior [13].

3. FAB addresses the non-identifiability issue. Within an equivalent class, FAB automatically selects the model which maximizes the entropy of the distribution over latent variables.

2 Preliminaries

This paper considers mixture models $p(X|\theta) = \sum_{c=1}^{C} \alpha_c p_c(X|\phi_c)$ for a $D$-dimensional random variable $X$. $C$ and $\alpha = (\alpha_1, \ldots, \alpha_C)$ are the number of components and the mixture ratio, and $\theta = (\alpha, \phi_1, \ldots, \phi_C)$. We allow different components $p_c(X|\phi_c)$ to differ in their model representations from one another¹. We assume that each $p_c(X|\phi_c)$ satisfies the regularity conditions² while $p(X|\theta)$ is non-regular (A1). Assumption A1 is less stringent than the regularity conditions on $p(X|\theta)$, and many models (e.g., Gaussian mixture models (GMM), PCM, and autoregressive mixture models) satisfy A1.

¹ So-called "heterogeneous mixture models" [11]. In the example of PCM, the degree of $p_1(X|\phi_1)$ will be two while that of $p_2(X|\phi_2)$ will be one.
² The Fisher information matrix of $p_c(X|\phi_c)$ is non-singular around the maximum likelihood estimator.

Let us denote the latent variable of $X$ as $Z = (Z_1, \ldots, Z_C)$. $Z$ is a component assignment vector, with $Z_c = 1$ if $X$ is generated from the $c$-th component and $Z_c = 0$ otherwise. The marginal distribution of $Z$ and the conditional distribution of $X$ given $Z$ can be written as $p(Z|\alpha) = \prod_{c=1}^{C} \alpha_c^{Z_c}$ and $p(X|Z, \phi) = \prod_{c=1}^{C} (p_c(X|\phi_c))^{Z_c}$. The observed data and their latent variables are denoted as $x^N = x_1, \ldots, x_N$ and $z^N = z_1, \ldots, z_N$, respectively, where $z_n = (z_{n1}, \ldots, z_{nC})$ and $z^N_c = (z_{1c}, \ldots, z_{Nc})$. We make another assumption, that $\log p(X, Z|\theta) < \infty$ holds (A2). Condition A2 is discussed in Section 4.5.
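To make the setup above concrete, the following minimal sketch (ours, not part of the paper) evaluates the observed-data log-likelihood $\log p(x^N|\theta) = \sum_n \log\sum_c \alpha_c p_c(x_n|\phi_c)$ and the complete-data log-likelihood $\log p(x^N, z^N|\theta) = \sum_n\sum_c z_{nc}(\log\alpha_c + \log p_c(x_n|\phi_c))$ for a heterogeneous mixture; the particular Gaussian and Laplace components are arbitrary placeholders chosen only to illustrate assumption A1.

import numpy as np
from scipy.stats import norm, laplace

# Hypothetical heterogeneous mixture: the components have different model
# representations p_c(x | phi_c) (here a Gaussian and a Laplace density).
components = [norm(loc=0.0, scale=1.0), laplace(loc=3.0, scale=0.5)]
alpha = np.array([0.7, 0.3])                        # mixture ratio alpha_c

def observed_loglik(x):
    # log p(x^N | theta) = sum_n log sum_c alpha_c p_c(x_n | phi_c)
    dens = np.stack([comp.pdf(x) for comp in components], axis=1)        # N x C
    return np.sum(np.log(dens @ alpha))

def complete_loglik(x, z):
    # log p(x^N, z^N | theta) = sum_n sum_c z_nc (log alpha_c + log p_c(x_n | phi_c))
    logdens = np.stack([comp.logpdf(x) for comp in components], axis=1)  # N x C
    return np.sum(z * (np.log(alpha) + logdens))

np.random.seed(0)
z = np.random.multinomial(1, alpha, size=100)       # one-hot assignment vectors z_n
draws = np.stack([comp.rvs(100) for comp in components], axis=1)
x = np.sum(z * draws, axis=1)                       # each x_n drawn from its component
print(observed_loglik(x), complete_loglik(x, z))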
A model $M$ of $p(X|\theta)$ is specified by $C$ and the models $S_c$ of $p_c(X|\phi_c)$, i.e., $M = (C, S_1, \ldots, S_C)$. Although the representations of $\theta$ and $\phi_c$ depend on $M$ and $S_c$, respectively, we omit this dependence for notational simplicity.

3 Factorized Information Criterion for Mixture Models

A Bayesian selects the model which maximizes the model posterior $p(M|x^N) \propto p(M) p(x^N|M)$. With a uniform model prior $p(M)$, we are particularly interested in $p(x^N|M)$, which is referred to as the marginal likelihood or Bayesian evidence. VB methods for latent variable models [2, 5, 6, 21] consider the following lower bound of the marginal log-likelihood:

$$\log p(x^N|M) \ge \sum_{z^N} \int q(z^N)\, q_\theta(\theta) \log \frac{p(x^N, z^N|\theta)\, p(\theta|M)}{q(z^N)\, q_\theta(\theta)}\, d\theta, \qquad (1)$$

where $q$ and $q_\theta$ are variational distributions on $z^N$ and $\theta$, respectively. Under $q$ and $q_\theta$, $z^N$ and $\theta$ are assumed to be independent of each other in order to make the lower bound computationally and analytically tractable. This, however, ignores a significant dependency between $z^N$ and $\theta$, and in general the equality does not hold. In contrast, we consider a lower bound that involves only a variational distribution $q(z^N)$:

$$\log p(x^N|M) \ge \sum_{z^N} q(z^N) \log \frac{p(x^N, z^N|M)}{q(z^N)} \equiv VLB(q, x^N, M). \qquad (2), (3)$$

Lemma 1 [2] guarantees that $\max_q \{VLB(q, x^N, M)\}$ is exactly consistent with $\log p(x^N|M)$.

Lemma 1  The inequality (3) holds for an arbitrary distribution $q$ on $z^N$, and the equality is satisfied by $q(z^N) = p(z^N|M, x^N)$.

Since this paper handles mixture models, we further assume mutual independence over $z^N$, i.e., $q(z^N) = \prod_{c=1}^{C} q(z^N_c)$ and $q(z^N_c) = \prod_{n=1}^{N} q(z_{nc})^{z_{nc}}$. The lower bound (2) is the same as the one considered by the collapsed variational Bayesian (CVB) method [15, 20]. A key difference is that FAB employs an asymptotic second-order approximation of (2), which produces the desirable properties of FAB described in Section 1, while CVB methods employ second-order approximations of the variational parameters in each iterative update step.
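For completeness, the bound (2)-(3) of Lemma 1 follows from a single application of Jensen's inequality; the paper cites [2] for the proof, so the derivation below is only a reminder of the standard argument and is not part of the original text:

$$\log p(x^N|M) = \log \sum_{z^N} q(z^N)\,\frac{p(x^N, z^N|M)}{q(z^N)} \;\ge\; \sum_{z^N} q(z^N)\,\log\frac{p(x^N, z^N|M)}{q(z^N)} = VLB(q, x^N, M),$$

with equality if and only if $p(x^N, z^N|M)/q(z^N)$ is constant over $z^N$, i.e., $q(z^N) = p(z^N|M, x^N)$.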
Note that the numerator in (3), $p(x^N, z^N|M)$, has the form of a parameter integration:

$$p(x^N, z^N|M) = \int p(z^N|\alpha) \prod_{c=1}^{C} p_c(x^N|z^N_c, \phi_c)\, p(\theta|M)\, d\theta. \qquad (4)$$

A key idea in FIC is to apply the Laplace method to the individual factorized distributions $p(z^N|\alpha)$ and $p_c(x^N|z^N_c, \phi_c)$ as follows³:

$$\log p(x^N, z^N|\theta) \approx \log p(x^N, z^N|\bar\theta) - \frac{N}{2}\big[\bar F_Z, (\alpha - \bar\alpha)\big] - \sum_{c=1}^{C} \frac{\sum_{n=1}^{N} z_{nc}}{2}\big[\bar F_c, (\phi_c - \bar\phi_c)\big], \qquad (5)$$

where $[A, a]$ represents the quadratic form $a^T A a$ for a matrix $A$ and a vector $a$. Here we denote the ML estimator of $\log p(x^N, z^N|\theta)$ as $\bar\theta = (\bar\alpha, \bar\phi_1, \ldots, \bar\phi_C)$. We discuss the application of the Laplace method around the maximum a posteriori (MAP) estimator in Section 4.5. $\bar F_Z$ and $\bar F_c$ are factorized sample approximations of Fisher information matrices, defined as follows:

$$\bar F_Z = -\frac{1}{N}\, \frac{\partial^2 \log p(z^N|\alpha)}{\partial \alpha\, \partial \alpha^T}\bigg|_{\alpha = \bar\alpha}, \qquad \bar F_c = \frac{-1}{\sum_{n=1}^{N} z_{nc}}\, \frac{\partial^2 \log p_c(x^N|z^N_c, \phi_c)}{\partial \phi_c\, \partial \phi_c^T}\bigg|_{\phi_c = \bar\phi_c}. \qquad (6)$$

³ The Laplace approximation is not applicable to $p(x^N|\theta)$ because of its non-regularity.

The following lemma is satisfied for $\bar F_Z$ and $\bar F_c$.

Lemma 2  As $N \to \infty$, $\bar F_Z$ and $\bar F_c$ respectively converge to the Fisher information matrices

$$F_Z = -\int \frac{\partial^2 \log p(Z|\alpha)}{\partial \alpha\, \partial \alpha^T}\bigg|_{\alpha = \bar\alpha} p(Z|\alpha)\, dZ, \qquad (7)$$

$$F_c = -\int \frac{\partial^2 \log p_c(X|\phi_c)}{\partial \phi_c\, \partial \phi_c^T}\bigg|_{\phi_c = \bar\phi_c} p_c(X|\phi_c)\, dX. \qquad (8)$$

The proof follows from the law of large numbers, taking into account that the effective number of samples for the $c$-th component is $\sum_{n=1}^{N} z_{nc}$.

Then, by applying a prior with $\log p(\theta|M) = O(1)$, $p(x^N, z^N|M)$ can be asymptotically approximated as:

$$p(x^N, z^N|M) \approx p(x^N, z^N|\bar\theta) \times \frac{(2\pi)^{D_\alpha/2}}{N^{D_\alpha/2}\, |\bar F_Z|^{1/2}} \times \prod_{c=1}^{C} \frac{(2\pi)^{D_c/2}}{(\sum_{n=1}^{N} z_{nc})^{D_c/2}\, |\bar F_c|^{1/2}}. \qquad (9)$$

Here, $D_\alpha \equiv D(\alpha) = C - 1$ and $D_c \equiv D(\phi_c)$, in which $D(\bullet)$ is the dimensionality of $\bullet$.

On the basis of A1 and Lemma 2, both $\log|\bar F_c|^{1/2}$ and $\log|\bar F_Z|^{1/2}$ are $O(1)$. By substituting (9) into (3) and ignoring asymptotically small terms, we obtain an asymptotic approximation of $\log p(x^N|M) = \max_q\{VLB(q, x^N, M)\}$ (i.e., FIC) as follows:

$$FIC(x^N, M) = \max_q \big\{ J(q, \bar\theta, x^N) \big\}, \qquad (10)$$

$$J(q, \bar\theta, x^N) = \sum_{z^N} q(z^N)\Big[\log p(x^N, z^N|\bar\theta) - \frac{D_\alpha}{2}\log N - \sum_{c=1}^{C}\frac{D_c}{2}\log\Big(\sum_{n=1}^{N} z_{nc}\Big) - \log q(z^N)\Big].$$

The following theorem justifies FIC as an approximation of the marginal log-likelihood.

Theorem 3  $FIC(x^N, M)$ is asymptotically consistent with $\log p(x^N|M)$ under A1 and $\log p(\theta|M) = O(1)$.

Sketch of Proof 3  Since both $p_c(x^N|z^N_c, \phi_c)$ and $p(z^N|\alpha)$ satisfy the regularity conditions (A1 and A2), their Laplace approximations appearing in (9) have asymptotic consistency (for the same reason as with BIC [7, 18, 19]). Therefore, their product (9) asymptotically agrees with $p(x^N, z^N|M)$. By applying Lemma 1, the theorem is proved.

Despite our ignoring the prior effect, FIC still has the regularization term $\sum_{c=1}^{C}\frac{D_c}{2}\sum_{z^N_c} q(z^N_c)\log(\sum_{n=1}^{N} z_{nc})$. This term has several interesting properties. First, a dependency between $z^N$ and $\theta$, which most VB methods ignore, appears explicitly. Like BIC, it is not the parameter or function representation of $p(X|\phi_c)$ but only its dimensionality $D_c$ that plays an important role in the bridge between the latent variables and the parameters. Second, the complexity of the $c$-th component is adjusted by the value of $\sum_{z^N_c} q(z^N_c)\log(\sum_{n=1}^{N} z_{nc})$. Therefore, components of different sizes are automatically regularized at their individually appropriate levels. Third, roughly speaking, with respect to these terms it is preferable for the entropy of $q(z^N)$ to be large. This makes the parameter estimation in FAB identifiable. More details are discussed in Sections 4.4 and 4.5.

4 Factorized Asymptotic Bayesian Inference

4.1 Lower bound of FIC

Since $\bar\theta$ is not available in practice, we cannot evaluate FIC itself. Instead, FAB maximizes an asymptotically-consistent lower bound of FIC. We first derive the lower bound as follows:

Lemma 4  Let us define $L(a, b) \equiv \log b + (a - b)/b$. $FIC(x^N, M)$ is lower bounded as

$$FIC(x^N, M) \ge G(q, \tilde q, \theta, x^N) \equiv \sum_{z^N} q(z^N)\Big[\log p(x^N, z^N|\theta) - \frac{D_\alpha}{2}\log N - \sum_{c=1}^{C}\frac{D_c}{2} L\Big(\sum_{n=1}^{N} z_{nc}, \sum_{n=1}^{N}\tilde q(z_{nc})\Big) - \log q(z^N)\Big], \qquad (11)$$

for arbitrary choices of $q$, $\theta$, and a distribution $\tilde q$ on $z^N$ (with $\sum_{n=1}^{N}\tilde q(z_{nc}) > 0$).

Sketch of Proof 4  Since $\bar\theta$ is the ML estimator of $\log p(x^N, z^N|\theta)$, $\log p(x^N, z^N|\bar\theta) \ge \log p(x^N, z^N|\theta)$ is satisfied. Further, on the basis of the concavity of the logarithm, $\log(\sum_{n=1}^{N} z_{nc}) \le L(\sum_{n=1}^{N} z_{nc}, \sum_{n=1}^{N}\tilde q(z_{nc}))$ is satisfied. By substituting these two inequalities into (10), we obtain (11).

In addition to the replacement of the unavailable estimator $\bar\theta$ with $\theta$, the computationally intractable $\log(\sum_{n=1}^{N} z_{nc})$ is linearized around $\sum_{n=1}^{N}\tilde q(z_{nc})$ with a new parameter (distribution) $\tilde q$. Now our problem is to solve the following maximization problem (note that $q$, $\theta$ and $\tilde q$ are functions of $M$):

$$M^*, q^*, \theta^*, \tilde q^* = \arg\max_{M, q, \theta, \tilde q} G(q, \tilde q, \theta, x^N). \qquad (12)$$

The following lemma gives us a guide for optimizing the newly introduced distribution $\tilde q$. We omit the proof because of space limitations.

Lemma 5  If we fix $q$ and $\theta$, then $\tilde q = q$ maximizes $G(q, \tilde q, \theta, x^N)$.

With a finite number of model candidates $\mathcal M$, we can use a two-stage algorithm in which we first solve the inner maximization for every candidate in $\mathcal M$ and then select the best model by (12). When we optimize only the number of components $C$ (e.g., as in a GMM), we need to solve the inner maximization $C_{max}$ times (the maximum search number of components). However, if we intend to optimize $S_1, \ldots, S_C$ (e.g., as in PCM or an autoregressive mixture), we must avoid a combinatorial scalability issue. FAB provides a natural way to do this because the objective function separates into independent parts in terms of $S_1, \ldots, S_C$.

4.2 Iterative Optimization Algorithm

Let us first fix $C$ and consider the following optimization problem:

$$S^*, q^*, \theta^*, \tilde q^* = \arg\max_{S, q, \theta, \tilde q} G(q, \tilde q, \theta, x^N), \qquad (13)$$

where $S = (S_1, \ldots, S_C)$. As the inner maximization, FAB solves (13) by iterating two sub-steps (the V-step and the M-step). Let the superscript $(t)$ represent the $t$-th iteration.
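Before detailing the two sub-steps, it may help to see how the objective $G$ of (11) is evaluated in the special case $\tilde q = q$, which is the optimal choice by Lemma 5. The sketch below (our illustration, not the authors' implementation) assumes a full-covariance GMM, so that $D_\alpha = C - 1$ and $D_c = D + D(D+1)/2$, and uses the fact that with $\tilde q = q$ the expectation of $L(\sum_n z_{nc}, \sum_n \tilde q(z_{nc}))$ under $q$ reduces to $\log\sum_n q(z_{nc})$.

import numpy as np
from scipy.stats import multivariate_normal

def fic_lower_bound(x, r, alpha, means, covs):
    # G(q, q~ = q, theta, x^N) of (11) for a full-covariance GMM.
    # x: (N, D) data; r[n, c] = q(z_nc = 1); alpha: (C,); means: (C, D); covs: (C, D, D).
    N, D = x.shape
    C = len(alpha)
    Dc = D + D * (D + 1) // 2                      # parameter dimensionality per component
    Nc = r.sum(axis=0)                             # sum_n q(z_nc)
    logdens = np.stack([multivariate_normal(means[c], covs[c]).logpdf(x)
                        for c in range(C)], axis=1)            # (N, C)
    expected_joint = np.sum(r * (np.log(alpha) + logdens))     # E_q[log p(x^N, z^N | theta)]
    # With q~ = q, E_q[L(sum_n z_nc, sum_n q(z_nc))] = log(sum_n q(z_nc)).
    penalty = 0.5 * (C - 1) * np.log(N) + 0.5 * Dc * np.sum(np.log(Nc))
    entropy = -np.sum(r * np.log(np.clip(r, 1e-12, None)))     # -E_q[log q(z^N)]
    return expected_joint - penalty + entropy

Evaluated at $(q^{(t)}, \theta^{(t)})$, this quantity is the iteration-wise lower bound that the V- and M-steps below monotonically increase (Section 4.3).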
V-step  The V-step optimizes the variational distribution $q$ as follows:

$$q^{(t)} = \arg\max_q \big\{ G(q, \tilde q = q^{(t-1)}, \theta^{(t-1)}, x^N) \big\}. \qquad (14)$$

More specifically, $q^{(t)}$ is obtained as

$$q^{(t)}(z_{nc}) \propto \alpha_c^{(t-1)}\, p(x_n|\phi_c^{(t-1)})\, \exp\Big(\frac{-D_c}{2\alpha_c^{(t-1)} N}\Big), \qquad (15)$$

where we use $\sum_{n=1}^{N} q^{(t-1)}(z_{nc}) = \alpha_c^{(t-1)} N$. The regularization terms discussed in Section 3 appear here as exponentiated update terms, i.e., $\exp(\frac{-D_c}{2\alpha_c^{(t-1)} N})$. Roughly speaking, (15) indicates that smaller and more complex components are likely to become even smaller through the iterations. The V-step differs from the E-step of the EM algorithm in the term $\exp(\frac{-D_c}{2\alpha_c^{(t-1)} N})$. This makes an essential difference between FAB inference and ML estimation with the EM algorithm (Sections 4.4 and 4.5).

M-step  The M-step optimizes the models of the individual components $S$ and the parameter $\theta$ as follows:

$$S^{(t)}, \theta^{(t)} = \arg\max_{S, \theta} G(q^{(t)}, \tilde q = q^{(t)}, \theta, x^N). \qquad (16)$$

$G(q, \tilde q, \theta, x^N)$ has a significant advantage in that it avoids the combinatorial scalability issue in the selection of $S$. Since $G(q, \tilde q, \theta, x^N)$ has no cross terms between components (once $\tilde q$ is fixed), we can optimize $S$ and $\theta$ separately as follows:

$$\alpha_c^{(t)} = \sum_{n=1}^{N} q^{(t)}(z_{nc})/N, \qquad (17)$$

$$S_c^{(t)}, \phi_c^{(t)} = \arg\max_{S_c, \phi_c} H_c(q^{(t)}, q^{(t)}, \phi_c, x^N), \qquad (18)$$

$$H_c(q^{(t)}, q^{(t)}, \phi_c, x^N) = \sum_{n=1}^{N} q^{(t)}(z_{nc}) \log p(x_n|\phi_c) - \frac{D_c}{2}\, L\Big(\sum_{n=1}^{N} q^{(t)}(z_{nc}), \sum_{n=1}^{N} q^{(t)}(z_{nc})\Big).$$

$H_c(q^{(t)}, q^{(t)}, \phi_c, x^N)$ is the part of $G(q^{(t)}, q^{(t)}, \theta, x^N)$ that is related to the $c$-th component. With a finite set of component candidates, in (18) we first optimize $\phi_c$ for each candidate $S_c$ and then select the optimal one by comparing them. Note that, if we use a single component type (e.g., a GMM), our M-step eventually becomes the same as that in the EM algorithm.
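To make the two sub-steps concrete, here is a minimal sketch of one FAB iteration for a full-covariance GMM (our illustration under that assumption, not the authors' code; $D_c = D + D(D+1)/2$ as in Fig. 1). The only difference from an EM iteration is the shrinkage factor $\exp(-D_c/(2\alpha_c^{(t-1)} N))$ in the V-step; with a single component type, the M-step is the usual weighted ML update, as noted above.

import numpy as np
from scipy.stats import multivariate_normal

def fab_v_step(x, alpha, means, covs):
    # V-step (15): EM-style responsibilities shrunk by exp(-D_c / (2 alpha_c N)).
    N, D = x.shape
    C = len(alpha)
    Dc = D + D * (D + 1) // 2
    logdens = np.stack([multivariate_normal(means[c], covs[c]).logpdf(x)
                        for c in range(C)], axis=1)             # (N, C)
    log_r = np.log(alpha) + logdens - Dc / (2.0 * alpha * N)    # unnormalized log q(z_nc)
    log_r -= log_r.max(axis=1, keepdims=True)                   # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def fab_m_step(x, r):
    # M-step (17)-(18) for a GMM: weighted ML updates (identical to EM's M-step).
    N, D = x.shape
    Nc = r.sum(axis=0)                                          # effective sample sizes
    alpha = Nc / N                                              # (17)
    means = (r.T @ x) / Nc[:, None]
    covs = np.empty((len(Nc), D, D))
    for c in range(len(Nc)):
        d = x - means[c]
        covs[c] = (r[:, c, None] * d).T @ d / Nc[c] + 1e-6 * np.eye(D)  # small jitter
    return alpha, means, covs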
4.3 Convergence and Consistency

After the $t$-th iteration, FIC is lower bounded as

$$FIC(x^N, M) \ge FIC_{LB}^{(t)}(x^N, M) \equiv G(q^{(t)}, q^{(t)}, \theta^{(t)}, x^N). \qquad (19)$$

We have the following theorem, which guarantees that FAB monotonically increases $FIC_{LB}^{(t)}(x^N, M)$ over the iterative optimization and converges to a local optimum under certain regularity conditions.

Theorem 6  For the iteration of the V-step and the M-step, the following inequality is satisfied:

$$FIC_{LB}^{(t)}(x^N, M) \ge FIC_{LB}^{(t-1)}(x^N, M). \qquad (20)$$

Sketch of Proof 6  The theorem is proved as follows: $G(q^{(t)}, q^{(t)}, \theta^{(t)}, x^N) \ge G(q^{(t)}, q^{(t)}, \theta^{(t-1)}, x^N) \ge G(q^{(t)}, q^{(t-1)}, \theta^{(t-1)}, x^N) \ge G(q^{(t-1)}, q^{(t-1)}, \theta^{(t-1)}, x^N)$. The first and the third inequalities arise from (16) and (14), respectively. The second inequality arises from Lemma 5.

We then use the following stopping criterion for the FAB iteration steps:

$$FIC_{LB}^{(t)}(x^N, M) - FIC_{LB}^{(t-1)}(x^N, M) \le \varepsilon, \qquad (21)$$

where $\varepsilon$ is an optimization tolerance parameter and is set to $10^{-6}$ in Section 5.

In addition to the monotonicity of FAB, we present the theorem below, which provides asymptotic theoretical support for FAB. Let us denote $M^{(t)} = (C, S^{(t)})$, and let the superscripts $T$ and $\star$ represent the number of steps at convergence and the true model/parameters, respectively.

Theorem 7  $FIC_{LB}^{(T)}(x^N, M^{(T)})$ is asymptotically consistent with $FIC(x^N, M^{(T)})$.

Sketch of Proof 7  1) In (15), $\exp(\frac{-D_c}{2\alpha_c^{(T-1)} N})$ converges to one, and $S^{(T)} = S^\star$ holds without loss of generality. Therefore, the FAB algorithm asymptotically reduces to the EM algorithm. This means $\theta^{(T)}$ converges to the ML estimator $\hat\theta$ of $\log p(x^N|\theta)$. Then, $|\sum_{z^N} q^{(T)}(z^N)(\log p(x^N, z^N|\bar\theta) - \log p(x^N, z^N|\theta^{(T)}))|/N \to 0$ holds. 2) On the basis of the law of large numbers, $\sum_{n=1}^{N} z_{nc}/N$ converges to $\alpha_c^\star$. Then, $|\sum_{z^N} q^{(T)}(z^N)\sum_{c=1}^{C}(\log(\sum_{n=1}^{N} z_{nc}) - \log(\sum_{n=1}^{N} q^{(T)}(z_{nc})))|/N \to 0$ holds. The substitution of 1) and 2) into (11) proves the theorem.

The following theorem arises from Theorems 3 and 7 and guarantees that FAB is asymptotically capable of evaluating the marginal log-likelihood itself:

Theorem 8  $FIC_{LB}^{(T)}(x^N, M^{(T)})$ is asymptotically consistent with $\log p(x^N|M^{(T)})$.

4.4 Shrinkage Mechanism

As noted in Sections 3 and 4.2, the term $\exp(\frac{-D_c}{2\alpha_c^{(t-1)} N})$ in (15), which arises from $\sum_{c=1}^{C}\frac{D_c}{2}\sum_{z^N_c} q(z^N_c)\log(\sum_{n=1}^{N} z_{nc})$ in (10), plays a significant role in the control of model complexity. Fig. 1 illustrates the shrinkage effects of FAB in the case of a GMM. For $D = 10$ and $N = 50$, for example, a component with $\alpha_c < 0.2$ is severely regularized, and such components are automatically shrunk (see the left graph). With increasing $N$, the shrinkage effect decreases, and small components manage to survive in the FAB optimization process. The right graph shows the shrinkage effect in terms of the dimensionality. In a low-dimensional space (e.g., $D = 1, 3$), small components are not strictly regularized. For $D = 20$ and $N = 100$, however, a component with $\alpha_c < 0.4$ may be shrunk, and only one component might survive because of the constraint $\sum_{c=1}^{C}\alpha_c = 1$.

Figure 1: Shrinkage effects of FAB in a GMM ($D_c = D + D(D+1)/2$, where $D$ and $D_c$ are, respectively, the data dimensionality and the parameter dimensionality). The horizontal and vertical axes represent $\alpha_c$ and $\exp(-D_c/(2\alpha_c N))$, respectively. The effect is evident when the number of data is not sufficient for the parameter dimensionality.

Let us consider the gradient of $\exp(\frac{-D_c}{2\alpha_c N})$, which is $\partial \exp(\frac{-D_c}{2\alpha_c N})/\partial \alpha_c = \frac{D_c}{2\alpha_c^2 N}\exp(\frac{-D_c}{2\alpha_c N}) \ge 0$. This indicates that the FAB iterations accelerate the shrinkage at a rate of order $O(\alpha_c^{-2})$: intuitively speaking, the smaller a component is, the more likely it is to be shrunk further. The above observations explain the overfitting mitigation mechanism in FAB.

The shrinkage effect of FAB is different from the prior-based regularization of MAP estimation. In fact, it appears despite our asymptotically ignoring the prior effect, and even if a non-informative prior or a flat prior is applied.
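To put rough numbers on the effect described above, the following snippet (ours) evaluates the factor $\exp(-D_c/(2\alpha_c N))$ under the Fig. 1 setting $D_c = D + D(D+1)/2$; for $D = 10$ and $N = 50$ the factor at $\alpha_c = 0.2$ is already below $0.04$, while for $N = 1000$ it is above $0.8$, matching the behaviour discussed above.

import numpy as np

# Shrinkage factor exp(-D_c / (2 alpha_c N)) for a full-covariance GMM,
# with D_c = D + D(D+1)/2 as in Fig. 1.
for D, N in [(10, 50), (10, 1000), (20, 100)]:
    Dc = D + D * (D + 1) // 2
    for alpha_c in [0.05, 0.2, 0.4]:
        factor = np.exp(-Dc / (2.0 * alpha_c * N))
        print(f"D={D:2d}  N={N:4d}  alpha_c={alpha_c:.2f}  ->  {factor:.4f}")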
In terms of practical implementation, $\alpha_c$ does not reach exactly zero because (15) takes an exponentiated update. Therefore, in our FAB procedure, we apply a thresholding-based shrinkage of small components after the V-step, as follows:

$$q^{(t)}(z_{nc}) = \begin{cases} 0 & \text{if } \sum_{n=1}^{N} q^{(t)}(z_{nc}) < \delta, \\ q^{(t)}(z_{nc})/Q_c^{(t)} & \text{otherwise,} \end{cases} \qquad (22)$$

where $\delta$ and $Q_c^{(t)}$ are a threshold value and a normalization constant ensuring $\sum_{c=1}^{C} q^{(t)}(z_{nc}) = 1$, respectively. We used the threshold value $\delta = 0.01N$ in Section 5. With the shrinkage operation, FAB does not require the two-stage optimization in (12). By starting from a sufficiently large number of components $C_{max}$, FAB iteratively and simultaneously optimizes all of $M = (C, S)$, $\theta$, and $q$. Since FAB cannot revive shrunk components, it is a greedy algorithm, like the least angle regression method [9], which does not revive shrunk features. The two-stage algorithm and the shrinkage algorithm are shown here as Algorithms 1 and 2. $S$ is a set of component candidates (e.g., $S = \{\text{Gaussian}\}$ for a GMM; $S = \{0, \ldots, K_{max}\}$ for PCM, where $K_{max}$ is the maximum degree of the curves).

Algorithm 1  $FAB_{two}$: Two-stage FAB
input: $x^N$, $C_{max}$, $S$, $\varepsilon$
output: $C^*$, $S^*$, $\theta^*$, $q^*(z^N)$, $FIC^*_{LB}$
1: $FIC^*_{LB} = -\infty$
2: for $C = 1, \ldots, C_{max}$ do
3:   Calculate $FIC^{(T)}_{LB}$, $C$, $S^{(T)}$, $\theta^{(T)}$, and $q^{(T)}(z^N)$ by $FAB_{shrink}(x^N, C_{max} = C, S, \varepsilon, \delta = 0)$.
4: end for
5: Choose $C^*$, $S^*$, $\theta^*$, and $q^*(z^N)$ by (12).

Algorithm 2  $FAB_{shrink}$: Shrinkage FAB
input: $x^N$, $C_{max}$, $S$, $\varepsilon$, $\delta$
output: $FIC^{(T)}_{LB}$, $C$, $S^{(T)}$, $\theta^{(T)}$, $q^{(T)}(z^N)$
1: $t = 0$, $FIC^{(0)}_{LB} = -\infty$
2: Randomly initialize $q^{(0)}(z^N)$.
3: while not converged do
4:   Calculate $S^{(t)}$ and $\theta^{(t)}$ by (17) and (18).
5:   Check convergence by (21).
6:   Calculate $q^{(t+1)}$ by (15).
7:   Shrink components by (22).
8:   $t = t + 1$
9: end while
10: $T = t$

4.5 Identifiability

A well-known difficulty in ML estimation for mixture models is non-identifiability, as we have noted in Section 1. Let us consider a simple mixture model, $p(X|a, b, c) = a\, p(X|b) + (1 - a)\, p(X|c)$. All models $p(X|a, b, c = b)$ with arbitrary $a$ are equivalent (i.e., they are in an equivalent class). Therefore, the ML estimator is not unique. Theoretical studies have shown that Bayesian estimators and VB estimators avoid this issue and significantly outperform the ML estimator in terms of generalization error [17, 21, 24].

In FAB, at the stationary (convergence) point of (15), the following equality is satisfied: $\sum_{n=1}^{N} q^*(z_{nc}) = \alpha_c^* N = \sum_{n=1}^{N} \alpha_c^*\, p(x_n|\phi_c^*)\exp(\frac{-D_c}{2\alpha_c^* N})/Z_n$, where $Z_n = \sum_{c=1}^{C}\alpha_c^*\, p(x_n|\phi_c^*)\exp(\frac{-D_c}{2\alpha_c^* N})$ is the normalization constant. For all pairs of $\alpha_c^*, \alpha_{c'}^* > 0$, we therefore have the condition $\sum_{n=1}^{N} p(x_n|\phi_c^*)\exp(\frac{-D_c}{2\alpha_c^* N})/Z_n = \sum_{n=1}^{N} p(x_n|\phi_{c'}^*)\exp(\frac{-D_{c'}}{2\alpha_{c'}^* N})/Z_n$. Then, a necessary condition for $\phi_c^* = \phi_{c'}^*$ is $\alpha_c^* = \alpha_{c'}^*$. Therefore, FAB chooses, within an equivalent class, the model which maximizes the entropy of $q$, and thus addresses the non-identifiability issue.

Unfortunately, FIC and FAB do not resolve another non-identifiability issue, i.e., divergence of the criterion. Without assumption A2, (10) and (11) can diverge to infinity (e.g., for a GMM with a zero-variance component⁴). We can address this issue by applying the Laplace approximation around the MAP estimator $\bar\theta_{MAP}$ of $p(x^N, z^N|\theta)p(\theta)$ rather than around the ML estimator $\bar\theta$. In such a case, a flat conjugate prior might be used in order to avoid the divergence. While such priors do not provide regularization effects, FAB has an overfitting mitigation mechanism, as discussed above. Because of space limitations, we omit a detailed discussion of FIC and FAB with prior distributions.

⁴ Practically speaking, with the shrinkage (22), FAB does not have the divergence issue because small components are automatically removed.
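Before moving to the experiments, the following minimal sketch (ours; the function name and the toy responsibilities are hypothetical) spells out the shrinkage operation (22) used in step 7 of Algorithm 2: components whose effective size $\sum_n q^{(t)}(z_{nc})$ falls below $\delta$ are removed, and the remaining responsibilities are renormalized so that $\sum_c q^{(t)}(z_{nc}) = 1$ for every $n$, which is the role of the normalization constant in (22).

import numpy as np

def shrink_components(r, delta):
    # Shrinkage operation (22): drop components with sum_n q(z_nc) < delta, renormalize.
    keep = r.sum(axis=0) >= delta                  # surviving components
    r = r[:, keep]
    return r / r.sum(axis=1, keepdims=True), keep  # responsibilities again sum to one over c

# Example with delta = 0.01 N as in Section 5 (hypothetical responsibilities r).
np.random.seed(1)
N = 200
r = np.random.dirichlet([5.0, 5.0, 5.0, 0.01, 0.01], size=N)
r_new, keep = shrink_components(r, delta=0.01 * N)
print("surviving components:", int(keep.sum()))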
5 Experiments

5.1 Polynomial Curve Mixture

We first evaluated $FAB_{shrink}$ for PCM to demonstrate how its model selection works. We used the settings $N = 300$ and $C_{max} = K_{max} = 10$ in this evaluation⁵. Let $Y$ be a random variable depending on $X$. PCM has the mixed component $p_c(Y|X, \phi_c) = \mathcal N(Y; X(S_c)\beta_c, \sigma_c^2)$, where $X(S_c) = (1, X, \ldots, X^{S_c})$. The true model has the four curves shown in Fig. 2. We used the setting $\alpha = (0.3, 0.2, 0.3, 0.2)$ for curves 1)-4). $x^N$ was uniformly sampled from $[-5, 5]$.

Figure 2: Model selection procedure of $FAB_{shrink}$ applied to PCM. The true curves are: 1) $Y = 5 + \epsilon$, 2) $Y = X^2 + 5 + \epsilon$, 3) $Y = X + \epsilon$, and 4) $Y = -0.5X^3 + 2X + \epsilon$, where $\epsilon \sim \mathcal N(0, 1)$. The top shows the change in $FIC_{LB}^{(t)}$ over iteration $t$. The bottom three graphs are the estimated curves at $t = 1$, $15$, and $44$.

Fig. 2 illustrates $FIC_{LB}^{(t)}$ (top) and intermediate estimation results (bottom). From the top figure, we can confirm that FAB monotonically increases $FIC_{LB}^{(t)}$, except at the points (dashed circles) of the shrinkage operation. In the initial state $t = 1$ (bottom left), $q^{(0)}(z^N)$ is randomly initialized, and therefore the degrees of all ten curves are zero ($S_c = 0$). Over the iterations, $FAB_{shrink}$ simultaneously searches $C$, $S$, and $\theta$. For example, at $t = 15$ (bottom center), we have one curve with $S_c = 9$, two curves with $S_c = 2$, and six curves with $S_c = 0$; two curves have already been shrunk. Finally, $FAB_{shrink}$ selects the model consistent with the true model at $t = 44$ (bottom right). These results demonstrate the powerful simultaneous model selection procedure of $FAB_{shrink}$.

⁵ Larger values of $C_{max}$ and $K_{max}$ did not make a large difference in the results, but they were hard to visualize.
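The per-component degree selection illustrated in Fig. 2 is an instance of the M-step (18): for each candidate degree $S_c$, $\phi_c = (\beta_c, \sigma_c^2)$ is fit by responsibility-weighted least squares, and the candidate with the largest $H_c$ is kept. The sketch below is our own illustration; the function name and the assumption that $D_c$ counts the $S_c + 1$ regression coefficients plus the variance are ours, not the paper's.

import numpy as np

def pcm_m_step_component(x, y, r_c, degrees):
    # M-step (18) for one PCM component: weighted LS fit for each candidate degree S_c,
    # then select the degree maximizing H_c = weighted log-likelihood - (D_c/2) log(sum_n r_nc).
    Nc = r_c.sum()
    best = None
    for S in degrees:
        X = np.vander(x, S + 1, increasing=True)            # design matrix (1, x, ..., x^S)
        w = np.sqrt(r_c)
        beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
        resid = y - X @ beta
        sigma2 = np.sum(r_c * resid**2) / Nc                 # weighted ML variance
        loglik = np.sum(r_c * (-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * resid**2 / sigma2))
        Dc = (S + 1) + 1                                     # assumed: beta coefficients + variance
        Hc = loglik - 0.5 * Dc * np.log(Nc)
        if best is None or Hc > best[0]:
            best = (Hc, S, beta, sigma2)
    return best                                              # (H_c, selected S_c, beta_c, sigma_c^2)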
5.2 Gaussian Mixture Model

We applied $FAB_{two}$ and $FAB_{shrink}$ to GMMs on artificial datasets and UCI datasets [10], and compared them with the variational Bayesian EM algorithm for GMMs (VBEM) described in Section 10.2 of [3] and the variational Bayesian Dirichlet process GMM (VBDP) [5]. The former and the latter use a Dirichlet prior and a Dirichlet process prior for $\alpha$, respectively. We used the implementations by M. E. Khan⁶ for VBEM and by K. Kurihara⁷ for VBDP, with the default hyperparameter values in the software. In the following experiments, $C_{max}$ was set to 20 for all methods.

⁶ http://www.cs.ubc.ca/~emtiyaz/software.html
⁷ http://sites.google.com/site/kenichikurihara/

5.2.1 Artificial Data

We generated the true model with $D = 15$, $C^\star = 5$ and $\phi_c^\star = (\mu_c^\star, \Sigma_c^\star)$ (mean and covariance), where the superscript $\star$ denotes the true model. The parameter values were randomly sampled as $\alpha_c^\star \sim [0.4, 0.6]$ (before normalization) and $\mu_c^\star \sim [-5, 5]$. $\Sigma_c^\star$ was generated as $a_c \bar\Sigma_c^\star$, where $a_c \sim [0.5, 1.5]$ is a scale parameter and $\bar\Sigma_c^\star$ is generated using the "gallery" function in Matlab. The results are averages over ten runs.

A sufficient number of data is an important consideration when evaluating asymptotic methods, and it is usually assessed in terms of the empirical convergence of their criteria. Table 1 shows how the values of $FIC_{LB}^{(T)}/N$ converge over $N$. Roughly speaking, $N = 1000 \sim 2000$ was sufficient for convergence (depending on the data dimensionality $D$). Our results indicate that FAB can be applied in actual practice to recent data analysis scenarios with large $N$ values.

Table 1: Convergence of $FIC_{LB}^{*}$ over data size $N$. Standard deviations are in parentheses.

N    | FAB_two       | FAB_shrink
50   | -10.31 (3.51) | -12.78 (2.61)
100  | -16.27 (2.51) | -17.22 (1.56)
250  | -15.46 (1.19) | -15.66 (1.24)
500  | -15.14 (1.05) | -15.16 (1.02)
1000 | -14.63 (1.10) | -14.62 (0.85)
2000 | -14.52 (0.50) | -14.68 (0.74)
3000 | -14.34 (0.39) | -14.39 (0.39)

We next compared the four methods on the basis of three evaluation metrics: 1) the Kullback-Leibler divergence (KL) between the true distribution and the estimated distribution (estimation error), 2) $|C^* - C^\star|$ (model selection error), and 3) CPU time. For $N \le 500$, VBDP performed better than or comparably to $FAB_{two}$ and $FAB_{shrink}$ w.r.t. KL and $|C^* - C^\star|$ (left and middle graphs of Fig. 3). With increasing $N$, the errors of the FABs decreased significantly while those of VBDP did not. This might be because the lower bound in (1) generally does not reach the marginal log-likelihood, as has been noted. VBEM was significantly inferior on both metrics. Another observation is that VBDP and VBEM were likely to under-fit and over-fit their models, respectively, while we did not find such biases in the FABs. While the KLs of the FABs are similar, $FAB_{shrink}$ performed better in terms of $|C^* - C^\star|$ because small components survived in $FAB_{two}$. Further, both FABs had a significant advantage over VBDP and VBEM in CPU time (right graph). In particular, while the CPU time of VBDP grew explosively with $N$, those of the FABs remained in a practical range. Since $FAB_{shrink}$ does not use a loop to optimize $C$, it was significantly faster than $FAB_{two}$.

Figure 3: Comparison of the four methods in terms of KL (left), $|C^* - C^\star|$ (middle), and CPU time (right).

5.2.2 UCI data

For the UCI datasets summarized in Table 2, we do not know the true distributions and therefore employed the predictive log-likelihood as an evaluation metric.

Table 2: Predictive likelihood per data point. Standard deviations are in parentheses. The maximum number of training samples was limited to 2000 (shown in parentheses in the N column), and the rest were used for testing. The best results and those not significantly worse than them are highlighted in boldface (one-sided t-test with 95% confidence).

data                 | N             | D  | FAB_two       | FAB_shrink    | VBEM          | VBDP
iris                 | 150           | 4  | -1.92 (0.76)  | -2.11 (0.68)  | -3.05 (0.64)  | -1.65 (0.57)
yeast                | 1394          | 13 | 18.79 (1.81)  | 10.86 (3.38)  | 3.52 (3.80)   | 15.94 (0.72)
cloud                | 2048          | 10 | 7.89 (2.81)   | 9.09 (2.62)   | 5.47 (2.62)   | 2.67 (2.28)
wine quality         | 6497 (2000)   | 11 | -2.68 (0.22)  | -2.68 (0.17)  | -10.49 (0.17) | -16.52 (0.06)
color moments        | 68040 (2000)  | 9  | -6.97 (0.20)  | -6.95 (0.21)  | -7.58 (0.21)  | -7.66 (0.11)
cooc texture         | 68040 (2000)  | 16 | 14.45 (0.22)  | 14.45 (0.46)  | 13.45 (0.13)  | 6.78 (0.28)
spoken arabic digits | 350319 (2000) | 13 | -12.94 (0.11) | -12.90 (0.09) | -13.81 (0.09) | -14.34 (0.19)
For the large-scale datasets, we randomly selected two thousand data points for training and used the rest of the data for prediction, because of the scalability issue in VBDP. As shown in Table 2, the FABs gave better performance than VBDP and VBEM given a sufficient number of data (the scale $O(10^3)$ agrees with the results in the previous section). The trends of VBDP (under-fitting) and VBEM (over-fitting) were the same as in the previous section. Also, while the predictive log-likelihoods of $FAB_{two}$ and $FAB_{shrink}$ are competitive, $FAB_{shrink}$ obtained more compact models (smaller $C^*$ values) than $FAB_{two}$. Unfortunately, we have no space to show the results here.

6 Summary and Future Work

We have proposed an approximation of the marginal log-likelihood (FIC) and an inference method (FAB) for Bayesian model selection in mixture models, as an alternative to VB inference. We have given their justifications (asymptotic consistency, convergence, etc.) and analyzed the FAB mechanisms in terms of overfitting mitigation (shrinkage) and identifiability. Experimental results have shown that FAB outperforms state-of-the-art VB methods for a practical number of data in terms of both model selection performance and computational efficiency.

Our next step is to extend FAB to more general non-regular and non-identifiable models which contain continuous latent variables (e.g., factor analyzer models [12] and matrix factorization models [17]), time-dependent latent variables (e.g., hidden Markov models [16]), hierarchical latent variables [14, 4], etc. We believe that our idea, a factorized representation of the marginal log-likelihood in which the asymptotic approximation works, is widely applicable to them.

References

[1] T. Ando. Bayesian Model Selection and Statistical Modeling. Chapman and Hall/CRC, 2010.

[2] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.

[4] C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):281–293, 1998.

[5] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.

[6] A. Corduneanu and C. M. Bishop. Variational Bayesian model selection for mixture distributions. In Proceedings of Artificial Intelligence and Statistics, pages 27–34, 2001.

[7] I. Csiszar and P. C. Shields. The consistency of the BIC Markov order estimator. Annals of Statistics, 28(6):1601–1619, 2000.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1):1–38, 1977.

[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[10] A. Frank and A. Asuncion. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010.

[11] R. Fujimaki, Y. Sogawa, and S. Morinaga. Online heterogeneous mixture modeling with marginal and copula selection. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 645–653, 2011.

[12] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, pages 449–455. MIT Press, 2000.

[13] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 186:453–461, 1946.

[14] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.

[15] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pages 2796–2801, 2007.

[16] D. J. MacKay. Ensemble learning for hidden Markov models. Technical report, University of Cambridge, 1997.

[17] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2579–2644, 2011.

[18] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[19] M. Stone. Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society: Series B, 41(2):276–278, 1979.

[20] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1378–1385, 2006.

[21] K. Watanabe and S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625–644, 2006.

[22] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13:899–933, 2001.

[23] S. Watanabe. Algebraic geometry and statistical learning. Cambridge University Press, 2009.

[24] K. Yamazaki and S. Watanabe. Mixture models and upper bounds of stochastic complexity. Neural Networks, 16:1029–1038, 2003.