Variational sampling approaches to word confusability

John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath
{jrhershe,pederao,rameshg}@us.ibm.com
IBM T. J. Watson Research Center
Information Theory and Applications, 2/1-2007

Abstract

In speech recognition it is often useful to determine how confusable two words are. For speech models this comes down to computing the Bayes error between two HMMs. This problem is analytically and numerically intractable. A common alternative that is numerically approachable uses the KL divergence in place of the Bayes error. We present new approaches to approximating the KL divergence that combine variational methods with importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error, turns out to be even more amenable to our approach. Our experiments demonstrate an improvement of orders of magnitude in accuracy over conventional methods.

Outline

• Acoustic Confusability
• Divergence Measures for Distributions
• KL Divergence: Prior Art
• KL Divergence: Variational Approximations
• KL Divergence: Empirical Evaluations
• Bhattacharyya: Monte Carlo Approximation
• Bhattacharyya: Variational Approximation
• Bhattacharyya: Variational Monte Carlo Approximation
• Bhattacharyya: Empirical Evaluations
• Future Directions
A Toy Version of the Confusability Problem

[Figure: probability density plot of N(x;−2,1), N(x;2,1) and their geometric mean (fg)^{1/2}.]

• X is an acoustic feature vector representing a speech class, say "E", with pdf $f(x) = \mathcal{N}(x; -2, 1)$.
• Another speech class, "O", is described by the pdf $g(x) = \mathcal{N}(x; 2, 1)$.
• The asymmetric error is the probability of deciding "O" when the true class is "E" under maximum-likelihood classification: $A_e(f,g) = \int f(x)\, \mathbf{1}_{g(x) \ge f(x)}(x)\, dx$.
• The Bayes error is the total classification error, $B_e(f,g) = \int \min(f(x), g(x))\, dx$.

Word Models

A word is modeled using its pronunciation(s) and an HMM.
• As an example, the word DIAL has the pronunciation D AY AX L and HMM F.
• CALL has the pronunciation K AO L and HMM G.
• Each node in the HMM has a GMM associated with it.
• The word confusability is the Bayes error $B_e(F, G)$. This quantity is too hard to compute!!
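Before moving on, note that the toy example is easy to check numerically in one dimension, in contrast to the HMM-level quantity. A minimal sketch (the helper and variable names are ours, not from the slides) that evaluates the three quantities above on a grid:

```python
import numpy as np

def npdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
f = npdf(x, -2.0, 1.0)   # class "E"
g = npdf(x,  2.0, 1.0)   # class "O"

asym_error  = np.sum(f * (g >= f)) * dx      # A_e(f,g): decide "O" when the truth is "E"
bayes_error = np.sum(np.minimum(f, g)) * dx  # B_e(f,g) = int min(f,g) dx
bhatt_coeff = np.sum(np.sqrt(f * g)) * dx    # B(f,g)   = int sqrt(f g) dx

# Closed-form checks for this symmetric pair:
# A_e = Phi(-2) ~ 0.0228, B_e = 2*Phi(-2) ~ 0.0455, B = exp(-2) ~ 0.1353.
print(asym_error, bayes_error, bhatt_coeff)
```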
The Edit Distance

    DIAL   CALL   edit op.       cost
    D      K      substitution   1
    AY            ins/del        1
    AX     AO     substitution   1
    L      L      none           0
                  Total cost     3

The edit distance is the shortest path in the graph:

[Figure: edit-distance lattice aligning D AY AX L against K AO L, with cost 1 on insertion, deletion and substitution arcs and cost 0 on the L:L match.]

Better ways

Other techniques use weights on the edges. Acoustic perplexity and Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as weights.

Bayes Error

We use Bayes error approximations for each pair of GMMs in the Cartesian HMM product:

[Figure: product lattice of the DIAL and CALL HMMs, with one node per state pair (D:K, AY:K, ..., AX:AO, L:L).]

Gaussian Mixture Models

Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write
$$f(x) = \sum_a \pi_a f_a(x), \quad \text{where } f_a(x) = \mathcal{N}(x; \mu_a, \Sigma_a),$$
and
$$g(x) = \sum_b \omega_b g_b(x), \quad \text{where } g_b(x) = \mathcal{N}(x; \mu_b, \Sigma_b).$$
The high dimensionality of $x \in \mathbb{R}^d$, $d = 39$, makes numerical integration difficult.

Bayes, Bhattacharyya, Chernoff and Kullback-Leibler

• Bayes error: $B_e(f,g) = \int \min(f(x), g(x))\, dx$
• Kullback-Leibler divergence: $D(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx$
• Bhattacharyya distance: $B(f,g) = \int \sqrt{f(x) g(x)}\, dx$
• Chernoff distance: $C(f,g) = \min_{0 \le s \le 1} C_s(f,g)$, where $C_s(f,g) = \int f(x)^s g(x)^{1-s}\, dx$

Why these? For a pair of single Gaussians f and g we can compute $D(f\|g)$, $B(f,g)$ and $C_s(f,g)$ analytically.
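These single-Gaussian closed forms are the building blocks for everything that follows. For reference, a sketch of the standard formulas for full-covariance Gaussians (the helper names are ours, not from the slides):

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form D( N(mu0,S0) || N(mu1,S1) ) for full-covariance Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def gaussian_bhattacharyya(mu0, S0, mu1, S1):
    """Closed-form B = int sqrt(N0 N1) dx, the coefficient these slides call B(f,g)."""
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    # db is the classical Bhattacharyya "distance" -log B for a Gaussian pair.
    db = 0.125 * diff @ np.linalg.inv(S) @ diff + 0.5 * np.log(
        np.linalg.det(S) / np.sqrt(np.linalg.det(S0) * np.linalg.det(S1)))
    return np.exp(-db)
```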
Connections

• Perimeter divergence (power mean):
  $$P^\alpha(f,g) = \int \left( \tfrac{1}{2} f(x)^\alpha + \tfrac{1}{2} g(x)^\alpha \right)^{1/\alpha} dx.$$
  We have $B_e(f,g) = P^{-\infty}(f,g)$ and $B(f,g) = P^{0}(f,g)$.
• Rényi generalised divergence of order s:
  $$D_s(f\|g) = \frac{1}{s-1} \log \int f(x)^s g(x)^{1-s}\, dx.$$
  We have $D_1(f\|g) = D(f\|g)$ and $D_s(f\|g) = \frac{1}{s-1} \log C_s(f,g)$.
• Generalisation:
  $$G_{\alpha,s}(f\|g) = \frac{1}{s-1} \log \int \left( s f(x)^\alpha + (1-s) g(x)^\alpha \right)^{1/\alpha} dx.$$
  $G_{\alpha,s}(f\|g)$ connects $\log B_e(f,g)$, $D(f\|g)$, $\log B(f,g)$ and $C_s(f,g)$.

Relations and Inequalities

• $G_{\alpha,\frac{1}{2}}(f\|g) = -2 \log P^\alpha(f,g)$, $G_{0,s}(f\|g) = D_s(f\|g)$ and $G_{-\infty,s}(f\|g) = \frac{1}{s-1} \log B_e(f,g)$.
• $P^\alpha(f,g) \le P^\beta(f,g)$ for $\alpha \le \beta$, and $B(f,g) \le \sqrt{P^\alpha(f,g)\, P^{-\alpha}(f,g)}$.
• $P^\alpha(f,f) = 1$, $G_{\alpha,s}(f\|f) = 0$, $P^{-\infty}(f,g) + P^{\infty}(f,g) = 2$, so $B_e(f,g) = 2 - P^{\infty}(f,g)$.
• $B(f,g) = B(g,f)$ and $B_e(f,g) = B_e(g,f)$, but $D(f\|g) \ne D(g\|f)$.
• $D(f\|g) \ge 2 \log B(f,g) \ge 2 \log B_e(f,g)$; $B_e(f,g) \le B(f,g) \le \sqrt{B_e(f,g)\,(2 - B_e(f,g))}$; $B_e(f,g) \le C(f,g) \le B(f,g)$; $C_0(f,g) = C_1(f,g) = 1$ and $B(f,g) = C_{1/2}(f,g)$; and so on.

The KL Divergence of a GMM

• Monte Carlo sampling: draw n samples $\{x_i\}_{i=1}^n$ from f. Then
  $$D(f\|g) \approx \frac{1}{n} \sum_{i=1}^n \log \frac{f(x_i)}{g(x_i)},$$
  with error $O(1/\sqrt{n})$.
• Gaussian approximation: approximate f with a Gaussian $\hat f$ whose mean and covariance match the total mean and covariance of f; do the same for g and $\hat g$, then use $D(f\|g) \approx D(\hat f\|\hat g)$, where
  $$\mu_{\hat f} = \sum_a \pi_a \mu_a, \qquad \Sigma_{\hat f} = \sum_a \pi_a \left( \Sigma_a + (\mu_a - \mu_{\hat f})(\mu_a - \mu_{\hat f})^T \right).$$
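Both of these baselines are short to implement. A minimal sketch (the GMM container layout and helper names are assumptions, not from the slides) of the Monte Carlo estimator and the moment-matched Gaussian:

```python
import numpy as np

# A GMM is represented here as a triple (w, mu, S):
# weights w (K,), means mu (K, d), covariances S (K, d, d), all numpy arrays.

def gmm_logpdf(x, w, mu, S):
    """log-density of the mixture at the rows of x, via log-sum-exp over components."""
    K, d = mu.shape
    comp = np.empty((x.shape[0], K))
    for k in range(K):
        diff = x - mu[k]
        Sinv = np.linalg.inv(S[k])
        _, logdet = np.linalg.slogdet(S[k])
        comp[:, k] = (np.log(w[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet
                      + np.einsum('ij,jk,ik->i', diff, Sinv, diff)))
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()

def gmm_sample(n, w, mu, S, rng):
    """Draw n samples from the mixture."""
    ks = rng.choice(len(w), size=n, p=w)
    return np.array([rng.multivariate_normal(mu[k], S[k]) for k in ks])

def kl_monte_carlo(n, f, g, rng):
    """D(f||g) ~ (1/n) sum_i log f(x_i)/g(x_i), x_i ~ f; error O(1/sqrt(n))."""
    x = gmm_sample(n, *f, rng)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))

def moment_matched_gaussian(w, mu, S):
    """Single Gaussian matching the mixture's total mean and covariance."""
    m = w @ mu
    C = sum(wk * (Sk + np.outer(mk - m, mk - m)) for wk, mk, Sk in zip(w, mu, S))
    return m, C
```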
Unscented Approximation

It is possible to pick 2d "sigma" points $\{x_{a,k}\}_{k=1}^{2d}$ such that $\int f_a(x) h(x)\, dx = \frac{1}{2d} \sum_{k=1}^{2d} h(x_{a,k})$ is exact for all quadratic functions h. One choice of sigma points is
$$x_{a,k} = \mu_a + \sqrt{d\, \lambda_{a,k}}\, e_{a,k}, \qquad x_{a,d+k} = \mu_a - \sqrt{d\, \lambda_{a,k}}\, e_{a,k},$$
where $\lambda_{a,k}$ and $e_{a,k}$ are the eigenvalues and eigenvectors of $\Sigma_a$. This is akin to Gaussian quadrature.

Matched Bound Approximation

Match the closest pairs of Gaussians, $m(a) = \arg\min_b \left( D(f_a\|g_b) - \log \omega_b \right)$. Goldberger's approximate formula is
$$D(f\|g) \approx D_{\text{Goldberger}}(f\|g) = \sum_a \pi_a \left( D(f_a\|g_{m(a)}) + \log \frac{\pi_a}{\omega_{m(a)}} \right),$$
which is analogous to the chain rule for relative entropy.

Min Approximation: $D(f\|g) \approx \min_{a,b} D(f_a\|g_b)$ is an approximation in the same spirit.

Variational Approximation

Introduce free variational parameters $\phi_{b|a} \ge 0$ with $\sum_b \phi_{b|a} = 1$:
$$\int f \log g = \sum_a \pi_a \int f_a \log \sum_b \omega_b g_b = \sum_a \pi_a \int f_a \log \sum_b \phi_{b|a} \frac{\omega_b g_b}{\phi_{b|a}}.$$
Use Jensen's inequality to interchange $\log$ and $\sum_b \phi_{b|a}$:
$$\int f \log g \ge \sum_a \pi_a \int f_a \sum_b \phi_{b|a} \log \frac{\omega_b g_b}{\phi_{b|a}} = \sum_a \pi_a \sum_b \phi_{b|a} \left( \log \frac{\omega_b}{\phi_{b|a}} + \int f_a \log g_b \right).$$
Maximize over $\phi_{b|a}$, and do the same for $\int f \log f$, to get
$$D(f\|g) \approx D_{\text{var}}(f\|g) = \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'}\, e^{-D(f_a\|f_{a'})}}{\sum_b \omega_b\, e^{-D(f_a\|g_b)}}.$$
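The closed form $D_{\text{var}}$ only needs the pairwise Gaussian KLs. A sketch (the function and container names are ours; pass in a closed-form Gaussian KL such as the gaussian_kl helper sketched earlier):

```python
import numpy as np

def kl_variational(f, g, pair_kl):
    """D_var(f||g) = sum_a pi_a log( sum_a' pi_a' exp(-D(f_a||f_a'))
                                     / sum_b  w_b  exp(-D(f_a||g_b)) ).
       f, g are (weights, means, covariances) triples; pair_kl(mu0, S0, mu1, S1)
       is the closed-form KL between two Gaussians."""
    (wf, muf, Sf), (wg, mug, Sg) = f, g
    total = 0.0
    for a in range(len(wf)):
        num = sum(wf[a2] * np.exp(-pair_kl(muf[a], Sf[a], muf[a2], Sf[a2]))
                  for a2 in range(len(wf)))
        den = sum(wg[b] * np.exp(-pair_kl(muf[a], Sf[a], mug[b], Sg[b]))
                  for b in range(len(wg)))
        total += wf[a] * np.log(num / den)
    return total
```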
Variational Upper Bound

Introduce variational parameters $\phi_{b|a} \ge 0$ and $\psi_{a|b} \ge 0$ with $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$, so that each mixture can be written over replicated Gaussians:
$$f = \sum_a \pi_a f_a = \sum_{ab} \phi_{b|a} f_a, \qquad g = \sum_b \omega_b g_b = \sum_{ab} \psi_{a|b} g_b.$$
Then
$$D(f\|g) = \int f \log \frac{f}{g} = -\int f \log \left( \sum_{ab} \frac{\phi_{b|a} f_a}{f} \cdot \frac{\psi_{a|b} g_b}{\phi_{b|a} f_a} \right) dx.$$
Use Jensen's inequality to interchange $\log$ and $\sum_{ab} \phi_{b|a} f_a / f$:
$$D(f\|g) \le -\sum_{ab} \int \phi_{b|a} f_a \log \frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}\, dx = D(\phi\|\psi) + \sum_{ab} \phi_{b|a} D(f_a\|g_b).$$
This is the chain rule for relative entropy, for mixtures with unequal numbers of components!!

Optimize the variational bound
$$D(f\|g) \le D(\phi\|\psi) + \sum_{ab} \phi_{b|a} D(f_a\|g_b)$$
with respect to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$.
• Fix $\phi$ and find the optimal value for $\psi$:
  $$\psi_{a|b} = \frac{\omega_b\, \phi_{b|a}}{\sum_{a'} \phi_{b|a'}}.$$
• Fix $\psi$ and find the optimal value for $\phi$:
  $$\phi_{b|a} = \frac{\pi_a\, \psi_{a|b}\, e^{-D(f_a\|g_b)}}{\sum_{b'} \psi_{a|b'}\, e^{-D(f_a\|g_{b'})}}.$$
Iterate a few times to find the optimal solution!
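A sketch of the coupled updates, iterated from a simple initialization (the matrix layout, the initialization and the helper name are ours, not from the slides):

```python
import numpy as np

def kl_variational_upper_bound(pi, omega, D_ab, n_iter=10):
    """Optimize D(f||g) <= D(phi||psi) + sum_ab phi_{b|a} D(f_a||g_b), where
       pi (A,), omega (B,) are the mixture weights and D_ab[a, b] = D(f_a||g_b)
       holds the pairwise Gaussian KLs. Returns the optimized bound."""
    phi = np.outer(pi, omega)              # phi[a, b] = phi_{b|a}; rows sum to pi_a
    for _ in range(n_iter):
        # psi_{a|b} = omega_b phi_{b|a} / sum_a' phi_{b|a'}
        psi = omega[None, :] * phi / phi.sum(axis=0, keepdims=True)
        # phi_{b|a} propto psi_{a|b} exp(-D(f_a||g_b)), normalized so sum_b phi_{b|a} = pi_a
        u = psi * np.exp(-D_ab)
        phi = pi[:, None] * u / u.sum(axis=1, keepdims=True)
    psi = omega[None, :] * phi / phi.sum(axis=0, keepdims=True)
    # Bound value: D(phi||psi) + sum_ab phi_{b|a} D(f_a||g_b).
    return float(np.sum(phi * (np.log(phi / psi) + D_ab)))
```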
Comparison of KL divergence methods (1)

Histograms of the difference between the closed-form approximations and Monte Carlo sampling with 1 million samples:

[Figure: histograms of the deviation from D_MC(1M) for the zero, product, min, gaussian and variational approximations.]

Comparison of KL divergence methods (2)

Histograms of the difference between the various methods and Monte Carlo sampling with 1 million samples:

[Figure: histograms of the deviation from D_MC(1M) for the MC(2dn), unscented, goldberger, variational and upper-bound approximations.]

Summary of KL divergence methods

• Monte Carlo Sampling – arbitrary accuracy, arbitrary cost
• Gaussian Approximation – not cheap, not good, closed form
• Min Approximation – cheap, but not good
• Unscented Approximation – not as good as MC per cost
• Goldberger Approximation – cheap, good, not closed form
• Variational Approximation – cheap, good, closed form
• Variational Upper Bound – cheap, good, upper bound

Comparison of KL, Bhattacharyya, Bayes

[Figure-only slides; no text content.]

Bhattacharyya distance

The Bhattacharyya distance, $B(f,g) = \int \sqrt{f g}$, can be estimated using Monte Carlo sampling from an arbitrary distribution h:
$$B(f,g) \approx \hat B_h = \frac{1}{n} \sum_{i=1}^n \frac{\sqrt{f(x_i)\, g(x_i)}}{h(x_i)},$$
where $\{x_i\}_{i=1}^n$ are sampled from h. The estimators are unbiased, $E[\hat B_h] = B(f,g)$, with variance
$$\operatorname{var}(\hat B_h) = \frac{1}{n} \left( \int \frac{f g}{h} - (B(f,g))^2 \right).$$
• $h = f$ gives $\operatorname{var}(\hat B_f) = \frac{1 - (B(f,g))^2}{n}$.
• $h = \frac{f+g}{2}$ gives $\operatorname{var}(\hat B_{\frac{f+g}{2}}) = \frac{1}{n} \left( \int \frac{2 f g}{f+g} - (B(f,g))^2 \right)$.
• And $\operatorname{var}(\hat B_{\frac{f+g}{2}}) \le \operatorname{var}(\hat B_f)$ (harmonic–arithmetic mean inequality).
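For reference, a sketch of the importance-sampled estimator with $h = (f+g)/2$; it reuses the gmm_sample and gmm_logpdf helpers assumed earlier, and the number of draws taken from f is binomial so that the pooled samples are distributed as the equal-weight mixture:

```python
import numpy as np

def bhattacharyya_mc(n, f, g, rng):
    """B(f,g) ~ (1/n) sum_i sqrt(f(x_i) g(x_i)) / h(x_i) with x_i ~ h = (f+g)/2.
       f and g are (weights, means, covariances) triples as in the earlier sketches."""
    nf = int(rng.binomial(n, 0.5))           # choose f or g for each sample with prob 1/2
    parts = []
    if nf > 0:
        parts.append(gmm_sample(nf, *f, rng))
    if n - nf > 0:
        parts.append(gmm_sample(n - nf, *g, rng))
    x = np.vstack(parts)
    log_f = gmm_logpdf(x, *f)
    log_g = gmm_logpdf(x, *g)
    log_h = np.logaddexp(log_f, log_g) - np.log(2.0)
    # Importance weights sqrt(f g)/h, computed in the log domain for stability.
    return float(np.mean(np.exp(0.5 * (log_f + log_g) - log_h)))
```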
Best sampling distribution

We can find the "best" sampling distribution h by minimizing the variance of $\hat B_h$ subject to the constraints $h \ge 0$ and $\int h = 1$. The solution is
$$h = \frac{\sqrt{f g}}{\int \sqrt{f g}}, \qquad \operatorname{var}(\hat B_h) = 0.$$
Unfortunately, using this h requires
• computing the quantity $B(f,g) = \int \sqrt{f g}$ that we are trying to compute in the first place, and
• sampling from $\sqrt{f g}$.
We will instead use variational techniques to approximate $\sqrt{f g}$ with some unnormalized $\hat h$ that can be analytically integrated to give a genuine pdf h.

Bhattacharyya Variational Upper Bound

As before, write $f = \sum_a \pi_a f_a = \sum_{ab} \phi_{b|a} f_a$ and $g = \sum_b \omega_b g_b = \sum_{ab} \psi_{a|b} g_b$, with $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$. Then
$$B(f,g) = \int f \sqrt{\frac{g}{f}} = \int f \sqrt{ \sum_{ab} \frac{\phi_{b|a} f_a}{f} \cdot \frac{\psi_{a|b} g_b}{\phi_{b|a} f_a} }\, dx.$$
Use Jensen's inequality to interchange $\sqrt{\cdot}$ and $\sum_{ab} \phi_{b|a} f_a / f$:
$$B(f,g) \ge \int \sum_{ab} \phi_{b|a} f_a \sqrt{\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}\, dx = \sum_{ab} \sqrt{\phi_{b|a}\, \psi_{a|b}}\; B(f_a, g_b).$$
An inequality linking the mixture Bhattacharyya distance to the component distances!!

Optimize the variational bound
$$B(f,g) \ge \sum_{ab} \sqrt{\phi_{b|a}\, \psi_{a|b}}\; B(f_a, g_b)$$
with respect to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$.
• Fix $\phi$ and find the optimal value for $\psi$:
  $$\psi_{a|b} = \frac{\omega_b\, \phi_{b|a}\, (B(f_a, g_b))^2}{\sum_{a'} \phi_{b|a'}\, (B(f_{a'}, g_b))^2}.$$
• Fix $\psi$ and find the optimal value for $\phi$:
  $$\phi_{b|a} = \frac{\pi_a\, \psi_{a|b}\, (B(f_a, g_b))^2}{\sum_{b'} \psi_{a|b'}\, (B(f_a, g_{b'}))^2}.$$
Iterate to find the optimal solution!
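The same iteration pattern as in the KL case applies, now driven by the squared pairwise Bhattacharyya values. A sketch (matrix layout, initialization and helper name are ours):

```python
import numpy as np

def bhattacharyya_variational(pi, omega, B_ab, n_iter=10):
    """Optimize the lower bound B(f,g) >= sum_ab sqrt(phi_{b|a} psi_{a|b}) B(f_a,g_b),
       where B_ab[a, b] = int sqrt(f_a g_b) dx (closed form for a Gaussian pair).
       Returns the optimized bound and the variational parameters."""
    phi = np.outer(pi, omega)                  # phi[a, b] = phi_{b|a}; rows sum to pi_a
    B2 = B_ab ** 2
    for _ in range(n_iter):
        # psi_{a|b} propto phi_{b|a} B(f_a,g_b)^2, normalized so sum_a psi_{a|b} = omega_b
        u = phi * B2
        psi = omega[None, :] * u / u.sum(axis=0, keepdims=True)
        # phi_{b|a} propto psi_{a|b} B(f_a,g_b)^2, normalized so sum_b phi_{b|a} = pi_a
        v = psi * B2
        phi = pi[:, None] * v / v.sum(axis=1, keepdims=True)
    u = phi * B2
    psi = omega[None, :] * u / u.sum(axis=0, keepdims=True)
    bound = float(np.sum(np.sqrt(phi * psi) * B_ab))
    return bound, phi, psi
```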
Variational Monte Carlo Sampling

Write the variational estimate
$$V(f,g) = \sum_{ab} \sqrt{\phi_{b|a}\, \psi_{a|b}}\; B(f_a, g_b) = \int \sum_{ab} \sqrt{\phi_{b|a}\, \psi_{a|b}\, f_a g_b} = \int \hat h.$$
Here $\hat h = \sum_{ab} \sqrt{\phi_{b|a}\, \psi_{a|b}\, f_a g_b}$ is an unnormalized approximation of the optimal sampling distribution $\sqrt{f g} / \int \sqrt{f g}$. Since each $h_{ab} = \sqrt{f_a g_b} / \int \sqrt{f_a g_b}$ is a Gaussian, $h = \hat h / \int \hat h$ is a GMM:
$$h = \sum_{ab} \pi_{ab} h_{ab}, \qquad \pi_{ab} = \frac{\sqrt{\phi_{b|a}\, \psi_{a|b}}\, \int \sqrt{f_a g_b}}{V(f,g)}.$$
Thus, drawing samples $\{x_i\}_{i=1}^n$ from h, the estimate
$$\hat V_n = \frac{1}{n} \sum_{i=1}^n \frac{\sqrt{f(x_i)\, g(x_i)}}{h(x_i)}$$
is unbiased, and in experiments it is seen to be far superior to sampling from $(f+g)/2$.
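A sketch of how this sampler can be assembled and used. The mean and covariance of the geometric-mean component $h_{ab}$ follow from completing the square in the exponent of $\sqrt{f_a g_b}$; that derivation and the helper names are ours rather than from the slides, and the gmm_sample and gmm_logpdf helpers from the earlier sketch are assumed:

```python
import numpy as np

def geometric_mean_gaussian(mu_a, S_a, mu_b, S_b):
    """sqrt(N(x;mu_a,S_a) N(x;mu_b,S_b)) is an unnormalized Gaussian; return its
       mean and covariance (its normalizer is the pairwise coefficient B(f_a,g_b))."""
    Pa, Pb = np.linalg.inv(S_a), np.linalg.inv(S_b)
    S = np.linalg.inv(0.5 * (Pa + Pb))        # covariance: inverse of the average precision
    mu = S @ (0.5 * (Pa @ mu_a + Pb @ mu_b))  # precision-weighted average of the means
    return mu, S

def variational_sampler(f, g, phi, psi, B_ab):
    """Build the GMM h = sum_ab pi_ab h_ab approximating sqrt(f g) / int sqrt(f g),
       with pi_ab propto sqrt(phi_{b|a} psi_{a|b}) * B(f_a, g_b)."""
    (wf, muf, Sf), (wg, mug, Sg) = f, g
    weights, means, covs = [], [], []
    for a in range(len(wf)):
        for b in range(len(wg)):
            m, C = geometric_mean_gaussian(muf[a], Sf[a], mug[b], Sg[b])
            weights.append(np.sqrt(phi[a, b] * psi[a, b]) * B_ab[a, b])
            means.append(m)
            covs.append(C)
    w = np.array(weights)
    return w / w.sum(), np.array(means), np.array(covs)

def bhattacharyya_variational_mc(n, f, g, h, rng):
    """Unbiased estimate (1/n) sum_i sqrt(f(x_i) g(x_i)) / h(x_i), x_i ~ h."""
    x = gmm_sample(n, *h, rng)
    return float(np.mean(np.exp(0.5 * (gmm_logpdf(x, *f) + gmm_logpdf(x, *g))
                                - gmm_logpdf(x, *h))))
```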
Bhattacharyya Distance: Monte Carlo estimation

[Figures: histograms of the deviation from the Bhattacharyya estimate with 1M samples, for 10, 100, 1000, 10K and 100K samples, when importance sampling from f(x), from (f(x)+g(x))/2, and with variational importance sampling.]

• Importance sampling from $f(x)$: slow convergence.
• Importance sampling from $\frac{f(x)+g(x)}{2}$: better convergence.
• Variational importance sampling: fast convergence.
Comparison of KL, Bhattacharyya, Bayes

[Figure-only slides; no text content.]

Variational Monte Carlo Sampling: KL-Divergence

[Figure-only slide; no text content.]

Future Directions

• HMM variational KL divergence
• HMM variational Bhattacharyya
• Variational Chernoff distance
• Variational sampling of Bayes error using the Chernoff approximation
• Discriminative training using the Bhattacharyya divergence
• Acoustic confusability using the Bhattacharyya divergence
• Clustering of HMMs