Variational sampling approaches to word confusability
John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath
{jrhershe,pederao,rameshg}@us.ibm.com
IBM, T. J. Watson Research Center
Information Theory and Applications, February 2007
Abstract
In speech recognition it is often useful to determine how confusable two words are. For speech models this comes down to computing the Bayes error between two HMMs. This problem is analytically and numerically intractable. A common alternative that is numerically tractable uses the KL divergence in place of the Bayes error. We present new approaches to approximating the KL divergence that combine variational methods with importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error, turns out to be even more amenable to our approach. Our experiments demonstrate an improvement of orders of magnitude in accuracy over conventional methods.
Outline
• Acoustic Confusability
• Divergence Measures for Distributions
• KL Divergence: Prior Art
• KL Divergence: Variational Approximations
• KL Divergence: Empirical Evaluations
• Bhattacharyya: Monte Carlo Approximation
• Bhattacharyya: Variational Approximation
• Bhattacharyya: Variational Monte Carlo Approximation
• Bhattacharyya: Empirical Evaluations
• Future Directions
A Toy Version of the Confusability Problem
[Figure: probability density plot of $N(x; -2, 1)$, $N(x; 2, 1)$, and $(fg)^{1/2}$ as functions of $x$.]
X is an acoustic feature vector representing a speech class, say "E", with pdf $f(x) = N(x; -2, 1)$.
Another speech class, "O", is described by pdf $g(x) = N(x; 2, 1)$.
The asymmetric error is the probability that one class, "O", will be mistaken for the other, "E", when classifying according to $A_e(f,g) = \int f(x)\, 1_{g(x) \ge f(x)}(x)\, dx$.
The Bayes error is the total classification error $B_e(f,g) = \int \min(f(x), g(x))\, dx$.
Word Models
A word is modeled using its pronunciation(s) and an HMM. As an example, the word DIAL has a pronunciation D AY AX L and HMM F:
[HMM F: D → AY → AX → L]
CALL has a pronunciation K AO L and HMM G:
[HMM G: K → AO → L]
Each node in the HMM has a GMM associated with it. The word confusability is the Bayes error $B_e(F, G)$. This quantity is too hard to compute!!
The Edit Distance
DIAL   CALL   edit op.       cost
D      K      substitution   1
AY     -      ins/del        1
AX     AO     substitution   1
L      L      none           0
              Total cost     3
The edit distance is the shortest path in the graph:
[Graph: edit-distance lattice from initial node I to final node F, aligning D AY AX L (DIAL) against K AO L (CALL); every arc has cost 1 except the L:L match, which has cost 0.]
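As a concrete illustration, the following is a minimal dynamic-programming sketch of the phone-level edit distance (equivalently, the shortest path in the lattice above). The function name and the unit costs are illustrative, not taken from the talk.

```python
# A minimal dynamic-programming sketch of the phone-level edit distance (the
# shortest path in the lattice above); function name and unit costs are
# illustrative, not taken from the talk.

def edit_distance(src, dst, sub_cost=1, indel_cost=1):
    """Edit distance between two phone sequences."""
    n, m = len(src), len(dst)
    dist = [[0] * (m + 1) for _ in range(n + 1)]   # dist[i][j]: src[:i] vs dst[:j]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == dst[j - 1] else sub_cost
            dist[i][j] = min(dist[i - 1][j - 1] + sub,     # substitution / match
                             dist[i - 1][j] + indel_cost,  # deletion
                             dist[i][j - 1] + indel_cost)  # insertion
    return dist[n][m]

print(edit_distance(["D", "AY", "AX", "L"], ["K", "AO", "L"]))  # -> 3, as in the table above
```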
Better ways
Other techniques use weights on the edges. Acoustic perplexity and Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as edge weights.
Bayes Error
We use Bayes error approximations for each pair of GMMs in the Cartesian HMM product:
[Graph: the Cartesian product of HMMs F and G, with a node for each pair of states, labeled D:K, AY:K, ..., AX:AO, L:AO, L:L, from initial node I to final node F.]
Gaussian Mixture Models
Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write
$f(x) = \sum_a \pi_a f_a(x)$, where $f_a(x) = N(x; \mu_a, \Sigma_a)$,
and
$g(x) = \sum_b \omega_b g_b(x)$, where $g_b(x) = N(x; \mu_b, \Sigma_b)$.
The high dimensionality of $x \in \mathbb{R}^d$, $d = 39$, makes numerical integration difficult.
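Since every approximation that follows needs the GMM densities f(x) and g(x), here is a small NumPy sketch (not the authors' code) of evaluating a diagonal-covariance GMM log-density; the d = 39 test values are made up and only illustrate the feature dimension used in the talk.

```python
# A small sketch (not the authors' code) of evaluating a diagonal-covariance GMM
# log-density, f(x) = sum_a pi_a N(x; mu_a, Sigma_a); the d = 39 test values
# below are made up and only illustrate the feature dimension used in the talk.
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM at the points x, shape (n, d)."""
    x = np.atleast_2d(x)
    log_comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
        log_comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(np.vstack(log_comp), axis=0)

rng = np.random.default_rng(0)
d = 39
weights = np.array([0.6, 0.4])
means = rng.normal(size=(2, d))
variances = np.ones((2, d))
print(gmm_logpdf(rng.normal(size=(5, d)), weights, means, variances))
```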
Bayes, Bhattacharyya, Chernoff and Kullback-Leibler
• Bayes error: $B_e(f,g) = \int \min(f(x), g(x))\, dx$
• Kullback-Leibler divergence: $D(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx$
• Bhattacharyya distance: $B(f,g) = \int \sqrt{f(x) g(x)}\, dx$
• Chernoff distance: $C(f,g) = \max_{0 \le s \le 1} C_s(f,g)$, where $C_s(f,g) = \int f(x)^s g(x)^{1-s}\, dx$, $0 \le s \le 1$.
Why these? For a pair of single Gaussians f and g we can compute $D(f\|g)$, $B(f,g)$ and $C_s(f,g)$ analytically.
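For reference, here is a sketch of the standard closed-form expressions for a pair of single full-covariance Gaussians: the KL divergence and the quantity $B(f,g) = \int \sqrt{fg}$ used throughout the talk. The function names are mine; the formulas are the usual Gaussian identities.

```python
# A sketch of the standard closed-form expressions for a pair of single
# full-covariance Gaussians: the KL divergence and B(f,g), the integral of
# sqrt(f*g) (often called the Bhattacharyya coefficient). Names are mine.
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """D(N(mu0,S0) || N(mu1,S1)) in nats."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def bhattacharyya_gauss(mu0, S0, mu1, S1):
    """B(f,g), the integral of sqrt(f*g), for two Gaussians."""
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    dist = (0.125 * diff @ np.linalg.solve(S, diff)
            + 0.5 * np.log(np.linalg.det(S)
                           / np.sqrt(np.linalg.det(S0) * np.linalg.det(S1))))
    return np.exp(-dist)

mu0, mu1 = np.array([-2.0]), np.array([2.0])
S0 = S1 = np.array([[1.0]])
print(kl_gauss(mu0, S0, mu1, S1))              # 8.0 for the toy example above
print(bhattacharyya_gauss(mu0, S0, mu1, S1))   # exp(-2), roughly 0.135
```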
Connections
• Perimeter divergence (power mean):
$P^\alpha(f,g) = \int \left( \tfrac{1}{2} f(x)^\alpha + \tfrac{1}{2} g(x)^\alpha \right)^{1/\alpha} dx.$
We have $B_e(f,g) = P^{-\infty}(f,g)$ and $B(f,g) = P^{0}(f,g)$.
• Rényi generalised divergence of order s:
$D_s(f\|g) = \frac{1}{s-1} \log \int f(x)^s g(x)^{1-s}\, dx.$
We have $D_1(f\|g) = D(f\|g)$ and $D_s(f\|g) = \frac{1}{s-1} \log C_s(f\|g)$.
• Generalisation:
$G_{\alpha,s}(f\|g) = \frac{1}{s-1} \log \int \left( s f(x)^\alpha + (1-s) g(x)^\alpha \right)^{1/\alpha} dx.$
$G_{\alpha,s}(f\|g)$ connects $\log B_e(f,g)$, $D(f\|g)$, $\log B(f,g)$ and $C_s(f,g)$.
Relations and Inequalities
• $G_{\alpha,\frac{1}{2}}(f\|g) = -2 \log P^\alpha(f,g)$, $G_{0,s}(f\|g) = D_s(f\|g)$ and $G_{-\infty,s}(f\|g) = \frac{1}{s-1} \log B_e(f,g)$.
• $P^\alpha(f,g) \le P^\beta(f,g)$ for $\alpha \le \beta$, and $B(f,g) \le \sqrt{P^\alpha(f,g)\, P^{-\alpha}(f,g)}$.
• $P^\alpha(f,f) = 1$, $G_{\alpha,s}(f,f) = 0$, $P^{-\infty}(f,g) + P^{\infty}(f,g) = 2$, $B_e(f,g) = 2 - P^{\infty}(f,g)$.
• $B(f,g) = B(g,f)$, $B_e(f,g) = B_e(g,f)$, $D(f\|g) \ne D(g\|f)$.
• $D(f\|g) \ge 2 \log B(f,g) \ge 2 \log B_e(f,g)$, $B_e(f,g) \le B(f,g) \le \sqrt{B_e(f,g)\,(2 - B_e(f,g))}$, $B_e(f,g) \le C(f,g) \le B(f,g)$, $C_0(f,g) = C_1(f,g) = 1$ and $B(f,g) = C_{1/2}(f,g)$; and so on.
The KL Divergence of a GMM
• Monte Carlo sampling: Draw n samples $\{x_i\}_{i=1}^n$ from f. Then
$D(f\|g) \approx \frac{1}{n} \sum_{i=1}^n \log \frac{f(x_i)}{g(x_i)}$,
with error $O(1/\sqrt{n})$.
• Gaussian approximation: Approximate f with a Gaussian $\hat f$ whose mean and covariance match the total mean and covariance of f. Same for g and $\hat g$, then use $D(f\|g) \approx D(\hat f\|\hat g)$, where
$\mu_{\hat f} = \sum_a \pi_a \mu_a$,
$\Sigma_{\hat f} = \sum_a \pi_a \left( \Sigma_a + (\mu_a - \mu_{\hat f})(\mu_a - \mu_{\hat f})^T \right)$.
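A sketch of these two baselines for one-dimensional GMMs, with made-up mixture parameters: the Monte Carlo estimate of $D(f\|g)$ from samples drawn from f, and the moment-matched Gaussian approximation.

```python
# A sketch (with made-up parameters) of the two baselines above for 1-D GMMs:
# the Monte Carlo estimate of D(f||g) with samples drawn from f, and the
# moment-matched Gaussian approximation D(f_hat||g_hat).
import numpy as np

def gmm_logpdf(x, w, mu, var):
    comp = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)
    return np.logaddexp.reduce(comp, axis=1)

def gmm_sample(n, w, mu, var, rng):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

rng = np.random.default_rng(0)
f = (np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5]))
g = (np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0]))

# Monte Carlo: D(f||g) ~ (1/n) sum_i log f(x_i)/g(x_i), x_i ~ f, error O(1/sqrt(n)).
x = gmm_sample(100_000, *f, rng)
print("Monte Carlo:", np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g)))

# Gaussian approximation: match the total mean and variance of each mixture.
def moments(w, mu, var):
    m = np.sum(w * mu)
    return m, np.sum(w * (var + (mu - m) ** 2))

mf, vf = moments(*f)
mg, vg = moments(*g)
print("Gaussian approx:", 0.5 * (vf / vg + (mg - mf) ** 2 / vg - 1 + np.log(vg / vf)))
```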
Unscented Approximation
It is possible to pick 2d "sigma" points $\{x_{a,k}\}_{k=1}^{2d}$ such that
$\int f_a(x) h(x)\, dx = \frac{1}{2d} \sum_{k=1}^{2d} h(x_{a,k})$
is exact for all quadratic functions h. One choice of sigma points is
$x_{a,k} = \mu_a + \sqrt{d \lambda_{a,k}}\, e_{a,k}$, $\quad x_{a,d+k} = \mu_a - \sqrt{d \lambda_{a,k}}\, e_{a,k}$,
where $\lambda_{a,k}$ and $e_{a,k}$ are the eigenvalues and eigenvectors of $\Sigma_a$. This is akin to Gaussian quadrature.
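A sketch of the sigma-point construction for one Gaussian component, assuming the eigendecomposition choice above; the test checks exactness on a quadratic function.

```python
# A sketch of the sigma-point construction for one Gaussian component, assuming
# the eigendecomposition choice above; the test checks exactness on a quadratic.
import numpy as np

def sigma_points(mu, Sigma):
    d = len(mu)
    lam, E = np.linalg.eigh(Sigma)               # Sigma = E diag(lam) E^T
    offsets = (np.sqrt(d * lam) * E).T           # row k is sqrt(d*lam_k) e_k
    return np.concatenate([mu + offsets, mu - offsets], axis=0)   # 2d points

def unscented_expectation(mu, Sigma, h):
    pts = sigma_points(mu, Sigma)
    return np.mean([h(x) for x in pts])          # exact for all quadratic h

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
# E[x^T x] = ||mu||^2 + tr(Sigma) = 2 + 3 = 5, recovered exactly by the 2d points:
print(unscented_expectation(mu, Sigma, lambda x: x @ x))
```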
Matched Bound Approximation
Match the closest pairs of Gaussians:
$m(a) = \arg\min_b \left( D(f_a\|g_b) - \log \omega_b \right)$.
Goldberger's approximate formula is
$D(f\|g) \approx D_{\mathrm{Goldberger}}(f\|g) = \sum_a \pi_a \left( D(f_a\|g_{m(a)}) + \log \frac{\pi_a}{\omega_{m(a)}} \right)$.
This is analogous to the chain rule for relative entropy.
Min approximation: $D(f\|g) \approx \min_{a,b} D(f_a\|g_b)$ is an approximation in the same spirit.
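A sketch of Goldberger's matched-bound approximation and the min approximation for one-dimensional GMMs, using the closed-form component KL divergence; the mixture parameters are made up for illustration.

```python
# A sketch of Goldberger's matched-bound approximation and the min approximation
# for one-dimensional GMMs; the mixture parameters are made up for illustration.
import numpy as np

def kl_gauss1d(m0, v0, m1, v1):
    return 0.5 * (v0 / v1 + (m1 - m0) ** 2 / v1 - 1 + np.log(v1 / v0))

def goldberger(pi, mu_f, var_f, omega, mu_g, var_g):
    D = kl_gauss1d(mu_f[:, None], var_f[:, None], mu_g, var_g)   # D(f_a||g_b)
    m = np.argmin(D - np.log(omega), axis=1)                     # m(a)
    return float(np.sum(pi * (D[np.arange(len(pi)), m] + np.log(pi / omega[m]))))

def min_approx(pi, mu_f, var_f, omega, mu_g, var_g):
    return float(np.min(kl_gauss1d(mu_f[:, None], var_f[:, None], mu_g, var_g)))

pi, mu_f, var_f = np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5])
om, mu_g, var_g = np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0])
print("Goldberger:", goldberger(pi, mu_f, var_f, om, mu_g, var_g))
print("Min approximation:", min_approx(pi, mu_f, var_f, om, mu_g, var_g))
```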
Variational Approximation
Let $\phi_{b|a} \ge 0$ with $\sum_b \phi_{b|a} = 1$ be free variational parameters. Then
$\int f \log g = \sum_a \pi_a \int f_a \log \sum_b \phi_{b|a} \frac{\omega_b g_b}{\phi_{b|a}}$
$\ge \sum_a \pi_a \int f_a \sum_b \phi_{b|a} \log \frac{\omega_b g_b}{\phi_{b|a}}$   (Jensen's inequality)
$= \sum_a \pi_a \sum_b \phi_{b|a} \left( \log \frac{\omega_b}{\phi_{b|a}} + \int f_a \log g_b \right)$.
Maximize over $\phi_{b|a}$, and do the same for $\int f \log f$, to get
$D(f\|g) \approx D_{\mathrm{var}}(f\|g) = \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'} e^{-D(f_a\|f_{a'})}}{\sum_b \omega_b e^{-D(f_a\|g_b)}}$.
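A sketch of $D_{\mathrm{var}}(f\|g)$ for one-dimensional GMMs, computed directly from the closed-form component KL divergences; all parameter values are illustrative.

```python
# A sketch of D_var(f||g) for one-dimensional GMMs, computed directly from the
# closed-form component KL divergences; all parameter values are illustrative.
import numpy as np

def kl_gauss1d(m0, v0, m1, v1):
    return 0.5 * (v0 / v1 + (m1 - m0) ** 2 / v1 - 1 + np.log(v1 / v0))

def d_variational(pi, mu_f, var_f, omega, mu_g, var_g):
    d_ff = kl_gauss1d(mu_f[:, None], var_f[:, None], mu_f, var_f)   # D(f_a||f_a')
    d_fg = kl_gauss1d(mu_f[:, None], var_f[:, None], mu_g, var_g)   # D(f_a||g_b)
    num = np.sum(pi * np.exp(-d_ff), axis=1)      # sum_a' pi_a' exp(-D(f_a||f_a'))
    den = np.sum(omega * np.exp(-d_fg), axis=1)   # sum_b  omega_b exp(-D(f_a||g_b))
    return float(np.sum(pi * np.log(num / den)))

pi, mu_f, var_f = np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5])
om, mu_g, var_g = np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0])
print(d_variational(pi, mu_f, var_f, om, mu_g, var_g))
```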
Variational Upper Bound
Variational parameters: $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$.
Gaussian replication: $f = \sum_a \pi_a f_a = \sum_{ab} \phi_{b|a} f_a$ and $g = \sum_b \omega_b g_b = \sum_{ab} \psi_{a|b} g_b$.
$D(f\|g) = -\int f \log \frac{\sum_{ab} \psi_{a|b} g_b}{f}$
$= -\int f \log \sum_{ab} \frac{\phi_{b|a} f_a}{f} \cdot \frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}\, dx$
$\le -\int \sum_{ab} \phi_{b|a} f_a \log \frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}\, dx$   (Jensen's inequality)
$= D(\phi\|\psi) + \sum_{ab} \phi_{b|a} D(f_a\|g_b)$, where $D(\phi\|\psi) = \sum_{ab} \phi_{b|a} \log \frac{\phi_{b|a}}{\psi_{a|b}}$.
The chain rule for relative entropy, for mixtures with unequal numbers of components!!
Variational Upper Bound
Optimize the variational bound
$D(f\|g) \le D(\phi\|\psi) + \sum_{ab} \phi_{b|a} D(f_a\|g_b)$
subject to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$.
Fix $\phi$ and find the optimal $\psi$:
$\psi_{a|b} = \frac{\omega_b \phi_{b|a}}{\sum_{a'} \phi_{b|a'}}$.
Fix $\psi$ and find the optimal $\phi$:
$\phi_{b|a} = \frac{\pi_a \psi_{a|b}\, e^{-D(f_a\|g_b)}}{\sum_{b'} \psi_{a|b'}\, e^{-D(f_a\|g_{b'})}}$.
Iterate a few times to find the optimal solution!
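A sketch of these alternating updates and the resulting upper bound for one-dimensional GMMs; the initialization $\phi_{b|a} = \pi_a \omega_b$ is one feasible choice, not necessarily the authors', and the mixture parameters are made up.

```python
# A sketch of the alternating updates for the variational upper bound
# D(f||g) <= D(phi||psi) + sum_ab phi_{b|a} D(f_a||g_b), for 1-D GMMs with
# made-up parameters and the feasible initialization phi = outer(pi, omega).
import numpy as np

def kl_gauss1d(m0, v0, m1, v1):
    return 0.5 * (v0 / v1 + (m1 - m0) ** 2 / v1 - 1 + np.log(v1 / v0))

def variational_upper_bound(pi, mu_f, var_f, omega, mu_g, var_g, iters=10):
    D = kl_gauss1d(mu_f[:, None], var_f[:, None], mu_g, var_g)   # D(f_a||g_b)
    phi = np.outer(pi, omega)                   # rows a, columns b; sum_b phi = pi
    for _ in range(iters):
        # psi_{a|b} = omega_b phi_{b|a} / sum_a' phi_{b|a'}
        psi = omega * phi / phi.sum(axis=0, keepdims=True)
        # phi_{b|a} = pi_a psi_{a|b} exp(-D_ab) / sum_b' psi_{a|b'} exp(-D_ab')
        w = psi * np.exp(-D)
        phi = pi[:, None] * w / w.sum(axis=1, keepdims=True)
    return float(np.sum(phi * (np.log(phi / psi) + D)))          # value of the bound

pi, mu_f, var_f = np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5])
om, mu_g, var_g = np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0])
print(variational_upper_bound(pi, mu_f, var_f, om, mu_g, var_g))
```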
Comparison of KL divergence methods (1)
Histograms of the difference between closed-form approximations and Monte Carlo sampling with 1 million samples:
[Histogram: deviation from D_MC(1M) for the zero, product, min, Gaussian, and variational approximations.]
Comparison of KL divergence methods (2)
Histograms of the difference between various methods and Monte Carlo sampling with 1 million samples:
[Histogram: deviation from D_MC(1M) for the MC(2dn), unscented, Goldberger, variational, and upper-bound approximations.]
Summary of KL divergence methods
• Monte Carlo Sampling – arbitrary accuracy, arbitrary cost
• Gaussian Approximation – not cheap, not good, closed form
• Min Approximation – cheap, but not good
• Unscented Approximation – not as good as MC per cost
• Goldberger Approximation – cheap, good, not closed form
• Variational Approximation – cheap, good, closed form
• Variational Upper Bound – cheap, good, upper-bound
Comparison of KL, Bhattacharyya, Bayes
[Figure: empirical comparison of the KL divergence, Bhattacharyya, and Bayes error measures.]
Bhattacharyya distance
The Bhattacharyya distance, $B(f,g) = \int \sqrt{f g}$, can be estimated using Monte Carlo sampling from an arbitrary distribution h:
$B(f,g) \approx \hat B_h = \frac{1}{n} \sum_{i=1}^n \frac{\sqrt{f(x_i) g(x_i)}}{h(x_i)}$,
where $\{x_i\}_{i=1}^n$ are sampled from h. The estimators are unbiased, $E[\hat B_h] = B(f,g)$, with variance
$\mathrm{var}(\hat B_h) = \frac{1}{n} \left( \int \frac{f g}{h} - B(f,g)^2 \right)$.
$h = f$ gives $\mathrm{var}(\hat B_f) = \frac{1 - B(f,g)^2}{n}$.
$h = \frac{f+g}{2}$ gives $\mathrm{var}(\hat B_{\frac{f+g}{2}}) = \frac{\int \frac{2fg}{f+g} - B(f,g)^2}{n}$.
And $\mathrm{var}(\hat B_{\frac{f+g}{2}}) \le \mathrm{var}(\hat B_f)$ (harmonic–arithmetic mean inequality).
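A sketch of the importance-sampling estimator $\hat B_h$ for two one-dimensional GMMs, comparing h = f with h = (f+g)/2; the mixture parameters are made up.

```python
# A sketch of the importance-sampling estimator B_hat_h = (1/n) sum_i sqrt(f g)/h,
# comparing h = f with h = (f+g)/2, for two one-dimensional GMMs with made-up
# parameters.
import numpy as np

def gmm_pdf(x, w, mu, var):
    comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return comp @ w

def gmm_sample(n, w, mu, var, rng):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], np.sqrt(var[idx]))

rng = np.random.default_rng(0)
f = (np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5]))
g = (np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0]))
n = 100_000

# h = f
x = gmm_sample(n, *f, rng)
print(np.mean(np.sqrt(gmm_pdf(x, *f) * gmm_pdf(x, *g)) / gmm_pdf(x, *f)))

# h = (f + g)/2: pick each sample from f or g with probability 1/2
x = np.where(rng.random(n) < 0.5, gmm_sample(n, *f, rng), gmm_sample(n, *g, rng))
h = 0.5 * (gmm_pdf(x, *f) + gmm_pdf(x, *g))
print(np.mean(np.sqrt(gmm_pdf(x, *f) * gmm_pdf(x, *g)) / h))
```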
Best sampling distribution
We can find the "best" sampling distribution h by minimizing the variance of $\hat B_h$ subject to the constraints $h \ge 0$ and $\int h = 1$. The solution is $h = \frac{\sqrt{f g}}{\int \sqrt{f g}}$, which gives $\mathrm{var}(\hat B_h) = 0$.
Unfortunately, using this h requires:
• Computing the quantity $B(f,g) = \int \sqrt{f g}$ that we are trying to compute in the first place.
• Sampling from $\sqrt{f g}$.
We will use variational techniques to "approximate" $\sqrt{f g}$ with some unnormalized $\hat h$ that can be analytically integrated to give a genuine pdf h.
Bhattacharyya Variational Upper Bound
$f = \sum_a \pi_a f_a = \sum_{ab} \phi_{b|a} f_a$, with $\sum_b \phi_{b|a} = \pi_a$;
$g = \sum_b \omega_b g_b = \sum_{ab} \psi_{a|b} g_b$, with $\sum_a \psi_{a|b} = \omega_b$.
$B(f,g) = \int f \sqrt{\frac{g}{f}} = \int f \sqrt{\frac{\sum_{ab} \psi_{a|b} g_b}{f}}$
$= \int f \sqrt{\sum_{ab} \frac{\phi_{b|a} f_a}{f} \cdot \frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}\, dx$
$\ge \int \sum_{ab} \phi_{b|a} f_a \sqrt{\frac{\psi_{a|b} g_b}{\phi_{b|a} f_a}}\, dx$   (Jensen's inequality, $\sqrt{\cdot}$ is concave)
$= \sum_{ab} \sqrt{\phi_{b|a} \psi_{a|b}}\, B(f_a, g_b)$.
An inequality linking the mixture Bhattacharyya distance to the component distances!!
Bhattacharyya Variational Upper Bound
Optimize the variational bound
$B(f,g) \ge \sum_{ab} \sqrt{\phi_{b|a} \psi_{a|b}}\, B(f_a, g_b)$
subject to the constraints $\sum_b \phi_{b|a} = \pi_a$ and $\sum_a \psi_{a|b} = \omega_b$.
Fix $\phi$ and find the optimal $\psi$:
$\psi_{a|b} = \frac{\omega_b \phi_{b|a}\, B(f_a, g_b)^2}{\sum_{a'} \phi_{b|a'}\, B(f_{a'}, g_b)^2}$.
Fix $\psi$ and find the optimal $\phi$:
$\phi_{b|a} = \frac{\pi_a \psi_{a|b}\, B(f_a, g_b)^2}{\sum_{b'} \psi_{a|b'}\, B(f_a, g_{b'})^2}$.
Iterate to find the optimal solution!
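A sketch of these alternating updates for one-dimensional GMMs, using the closed-form component Bhattacharyya values; the initialization and the mixture parameters are illustrative.

```python
# A sketch of the alternating updates for the bound
# B(f,g) >= sum_ab sqrt(phi_{b|a} psi_{a|b}) B(f_a,g_b), for 1-D GMMs with
# made-up parameters and the feasible initialization phi = outer(pi, omega).
import numpy as np

def bhatt_gauss1d(m0, v0, m1, v1):
    # B(f_a, g_b) = integral of sqrt(f_a * g_b) for scalar Gaussians.
    return (np.sqrt(2 * np.sqrt(v0 * v1) / (v0 + v1))
            * np.exp(-0.25 * (m0 - m1) ** 2 / (v0 + v1)))

def variational_bhatt(pi, mu_f, var_f, omega, mu_g, var_g, iters=10):
    B = bhatt_gauss1d(mu_f[:, None], var_f[:, None], mu_g, var_g)   # B(f_a,g_b)
    phi = np.outer(pi, omega)                       # rows a, columns b
    for _ in range(iters):
        # psi_{a|b} = omega_b phi_{b|a} B_ab^2 / sum_a' phi_{b|a'} B_a'b^2
        psi = omega * phi * B**2 / (phi * B**2).sum(axis=0, keepdims=True)
        # phi_{b|a} = pi_a psi_{a|b} B_ab^2 / sum_b' psi_{a|b'} B_ab'^2
        phi = pi[:, None] * psi * B**2 / (psi * B**2).sum(axis=1, keepdims=True)
    return float(np.sum(np.sqrt(phi * psi) * B))    # the bound on B(f,g)

pi, mu_f, var_f = np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5])
om, mu_g, var_g = np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0])
print(variational_bhatt(pi, mu_f, var_f, om, mu_g, var_g))
```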
Variational Monte Carlo Sampling
Write the variational estimate as
$V(f,g) = \sum_{ab} \sqrt{\phi_{b|a} \psi_{a|b}}\, B(f_a, g_b) = \int \sum_{ab} \sqrt{\phi_{b|a} \psi_{a|b}} \sqrt{f_a g_b} = \int \hat h$.
Here $\hat h = \sum_{ab} \sqrt{\phi_{b|a} \psi_{a|b}} \sqrt{f_a g_b}$ is an unnormalized approximation of the optimal sampling distribution $\sqrt{f g} / \int \sqrt{f g}$.
$h = \hat h / \int \hat h$ is a GMM, since $h_{ab} = \sqrt{f_a g_b} / \int \sqrt{f_a g_b}$ is a Gaussian and
$h = \sum_{ab} \pi_{ab} h_{ab}$, where $\pi_{ab} = \frac{\sqrt{\phi_{b|a} \psi_{a|b}} \int \sqrt{f_a g_b}}{V(f,g)}$.
Thus, drawing samples $\{x_i\}_{i=1}^n$ from h, the estimate
$\hat V_n = \frac{1}{n} \sum_{i=1}^n \frac{\sqrt{f(x_i) g(x_i)}}{h(x_i)}$
is unbiased, and in experiments it is far superior to sampling from $(f+g)/2$.
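A sketch, in one dimension, of the full recipe: build the GMM h from feasible variational parameters (here simply $\phi_{b|a} = \psi_{a|b} = \pi_a \omega_b$ rather than the iterated optimum from the previous slide), sample from it, and form $\hat V_n$. All parameter values are made up.

```python
# A sketch, in one dimension, of variational importance sampling: build the GMM h
# from feasible variational parameters (here simply phi = psi = outer(pi, omega)
# rather than the iterated optimum), sample from it, and form the estimate V_hat_n.
import numpy as np

def bhatt_gauss1d(m0, v0, m1, v1):
    # B(f_a, g_b) = integral of sqrt(f_a * g_b) for scalar Gaussians.
    return (np.sqrt(2 * np.sqrt(v0 * v1) / (v0 + v1))
            * np.exp(-0.25 * (m0 - m1) ** 2 / (v0 + v1)))

def gmm_pdf(x, w, mu, var):
    comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return comp @ w

pi, mu_f, var_f = np.array([0.7, 0.3]), np.array([-2.0, 0.0]), np.array([1.0, 0.5])
om, mu_g, var_g = np.array([0.5, 0.5]), np.array([1.0, 3.0]), np.array([1.0, 2.0])
phi = psi = np.outer(pi, om)                    # feasible variational parameters

# Components of h: h_ab is the normalized Gaussian proportional to sqrt(f_a g_b).
B_ab = bhatt_gauss1d(mu_f[:, None], var_f[:, None], mu_g, var_g)
v_ab = 2 * var_f[:, None] * var_g / (var_f[:, None] + var_g)
m_ab = (mu_f[:, None] * var_g + mu_g * var_f[:, None]) / (var_f[:, None] + var_g)
w_ab = np.sqrt(phi * psi) * B_ab                # proportional to pi_ab
w_ab = w_ab / w_ab.sum()

# Draw from h and form the unbiased estimate V_hat_n.
rng = np.random.default_rng(0)
n = 10_000
idx = rng.choice(w_ab.size, size=n, p=w_ab.ravel())
x = rng.normal(m_ab.ravel()[idx], np.sqrt(v_ab.ravel()[idx]))
h = gmm_pdf(x, w_ab.ravel(), m_ab.ravel(), v_ab.ravel())
num = np.sqrt(gmm_pdf(x, pi, mu_f, var_f) * gmm_pdf(x, om, mu_g, var_g))
print("Variational importance-sampling estimate of B(f,g):", np.mean(num / h))
```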
Bhattacharyya Distance: Monte Carlo estimation
[Histograms: deviation from the Bhattacharyya estimate with 1M samples, for importance sampling from f(x), from (f(x)+g(x))/2, and from the variational distribution, with 10, 100, 1000, 10K, and 100K samples.]
• Importance sampling from f(x): Slow convergence.
• Importance sampling from (f(x)+g(x))/2: Better convergence.
• Variational importance sampling: Fast convergence.
Comparison of KL, Bhattacharyya, Bayes
[Figure: empirical comparison of the KL divergence, Bhattacharyya, and Bayes error measures.]
Variational Monte Carlo Sampling: KL-Divergence
[Figure: variational Monte Carlo sampling results for the KL divergence.]
Future Directions
• HMM variational KL divergence
• HMM variational Bhattacharyya
• Variational Chernoff distance
• Variational sampling of Bayes error using the Chernoff approximation
• Discriminative training using Bhattacharyya divergence
• Acoustic confusability using Bhattacharyya divergence
• Clustering of HMMs