EECS E6870-Speech Recognition Lecture 11

Outline of Today’s Lecture
EECS E6870 - Speech Recognition
Lecture 11
n
Administrivia
n
L ine ar D isc riminant Analy sis
n
M ax imu m M u tu al Info rmatio n T raining
Stanley F. Chen, Michael A. Picheny and Bhuvana Ramabhadran
n
R O VER
IBM T.J. Watson Research Center
n
C o nse nsu s D e c o ding
Yorktown Heights, NY, USA
Columbia University
[email protected], [email protected],
[email protected]
24 November 2009
E E CS E 6 8 7 0: Advanced Sp eech Recognition
E E C S E 6 8 7 0 : Advanc e d S p e e c h R e c o g nitio n
1
Administrivia
Linear Discriminant Analysis
http://www.ee.columbia.edu/ stanchen/fall09/e6870/readings/project f09.html
A way to achieve robustness is to extract features that
emphasize sound discriminability and ignore irrelevant sources of
information. LDA tries to achieve this via a linear transform of the
feature data.
If the main sources of class variation lie along the coordinate
axes there is no need to do anything even if assuming a diagonal
covariance matrix (as in most HMM models):
for suggested readings and presentation guidelines for fi nal
project.
E E C S E 6870: A dv anced Speech R ecognition
2
See
E E C S E 6 8 7 0 : Advanced S peech R ecognition
3
E E C S E 6 8 7 0 : A dvanced S p eech R ecognition
If the main sources of class variation do NOT lie along the main
source of variation we need to find the best directions:
If the main sources of class variation lie along the main source
of variation we may want to rotate the coordinate axis (if using
diagonal covariances):
Linear Discriminant Analysis - Motivation
Principle Component Analysis-Motivation
4
Eigenvectors and Eigenvalues
A key concept in feature selection are the eigenvalues and
eigenvectors of a matrix.
The eigenvalues and eigenvectors of a matrix are defined by the
following matrix equation:
E E C S E 6 8 7 0 : A dvanced S p eech R ecognition
5
this matrix is a polynomial (called the characteristic polynomial)
p(λ). The roots of this polynomial will be the eigenvalues of A.
For example, let us say
2 −4
.
A=
−1 −1
In such a case,
Ax = λx
= (2 − λ)(−1 − λ) − (−4)(−1)
= λ2 − λ − 6
= (λ − 3)(λ + 2)
Therefore, λ1 = 3 and λ2 = −2 are the eigenvalues of A.
To find the eigenvectors, we simply plug in the eigenvalues into
If xis non-zero, the only way this equation can be solved is if the
determinant of the matrix (A − λI) is zero. The determinant of
EEC S E6 8 7 0 : Advanced S peech R ecognition
6
For a given matrix A the eigenvectors are defined as those
vectors x for which the above statement is true. Each eigenvector
has an associated eigenvalue, λ. To solve this equation, we can
rewrite it as
(A − λI)x = 0
2−λ
−4
p(λ) = −1 −1 − λ
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
7
(A − λI)x = 0 and solve for x. For example, for λ1 = 3 we get
2−3
−4
−1 −1 − 3
x1
x2
=
0
0
Principle Component Analysis-Derivation
Solving this, we find that x1 = −4x2, so all the eigenvector
corresponding to λ1 = 3 is a multiple of [−4 1]T . Similarly, we
find that the eigenvector corresponding to λ1 = −2 is a multiple of
[1 1]T .
First consider the problem of best representing a set of vectors
x1, x2, . . . , xn by a single vector x0. More specifically let us try to
minimize the sum of the squared distances from x0
J0(x0) =
N
X
|xk − x0|2
k=1
It is easy to show that the sample mean, m, minimizes J0, where
m is given by
m = x0 =
N
1 X
xk
N
E E C S E 6 8 7 0 : A dvanced Speech R ecognition
k=1
8
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
9
mean square error:
J1(a1, a2, . . . , aN , e) =
N
X
|xk − (m + ak e)|2
k=1
If we differentiate the above with respect to ak we get
ak = eT (xk − m)
Now, let e be a unit vector in an arbitrary direction. In such a case,
we can express a vector x as
i.e. we project xk onto the line in the direction of e that passes
through the sample mean m. How do we find the best direction
e? If we substitute the above solution for ak into the formula for
the overall mean square error we get after some manipulation:
x = m + ae
T
J1(e) = −e Se +
E E C S E 6 8 7 0 : A dvanced S peech R ecog nition
10
|xk − m|2
k=1
For the vectors xk we can find a set of ak s that minimizes the
N
X
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
11
So to maximize eT Se we want to select the eigenvector of S
corresponding to the largest eigenvalue of S.
where S is called the Scatter matrix and is given by:
S=
N
X
(xk − m)(xk − m)T
k=1
Notice the scatter matrix just looks like N times the sample
covariance matrix of the data. To minimize J1 we want to
maximize eT Se subject to the constraint that |e| = eT e = 1. Using
Lagrange multipliers we write
u = eT Se − λeT e
. Differentiating u w.r.t e and setting to zero we get:
If we now want to fi nd the b est d directions, the prob lem is now to
express x as
2Se − 2λe = 0
or
x=m+
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
12
E E C S E 6 8 7 0 : A dvanced Speech R ecognition
In this case, we can write the mean square error as
Jd =
N
X
|(m +
k=1
d
X
a i ei
i=1
Se = λe
d
X
13
Linear Discriminant Analysis - Derivation
akiei) − xk |2
Let us say we have vectors corresponding to c classes of data.
We can define a set of scatter matrices as above as
i=1
and it is not hard to show that Jd is minimized when the vectors
e1, e2, . . . , ed correspond to the d largest eigenvectors of the
scatter matrix S.
Si =
X
(x − mi)(x − mi)T
x∈Di
where mi is the mean of class i. In this case we can define
the within-class scatter (essentially the average scatter across the
classes relative to the mean of each class) as just:
SW =
c
X
Si
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
14
i=1
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
15
Another useful scatter matrix is the between class scatter matrix,
defined as
SB =
c
X
We would like to determine a set of projection directions V
such that the classes c are maximally discriminable in the new
coordinate space given by
(mi − m)(mi − m)T
x̃ = Vx
E E C S E 6 8 7 0 : Adv anced S p eech R ecog nition
i=1
16
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
17
as:
S̃B =
c
X
=
c
X
(m̃i − m̃)(m̃i − m̃)T
i=1
V(mi − m)(mi − m)T VT
i=1
= VSB VT
A reasonable measure of discriminability is the ratio of the
volumes represented by the scatter matrices.
Since the
determinant of a matrix is a measure of the corresponding volume,
we can use the ratio of determinants as a measure:
J=
|SB |
|SW |
and similarly for SW so the discriminability measure becomes
J(V) =
W ith a little bit of manip ulation similar to that in P C A , it turns out
that the solution are the eig env ectors of the matrix
18
S−1
W SB
So we want to fi nd a set of directions that maximiz e this
expression. In the new space, we can write the above expression
E E C S E 6 8 7 0 : Advanced Speech R ecognition
|VSB VT |
|
|VSW VT
E E C S E 6 8 7 0 : A dv anced S p eech R ecog nition
19
which can be generated by most common mathematical
packages.
Linear Discriminant Analysis in Speech
Recognition
The most successful uses of LDA in speech recognition are
achieved in an interesting fashion.
n
n
E E C S E 6 8 7 0 : A dv anced S peech R ecognition
S peech recognition training data are aligned against the
underly ing w ords using the V iterb i alignment algorithm
describ ed in Lecture 4 .
U sing this alignment, each cepstral vector is tagged w ith a
different phone or sub -phone. F or E nglish this ty pically results
in a set of 1 5 6 (5 2 x 3 ) classes.
F or each time t the cepstral vector xt is spliced together w ith
N/2 vectors on the left and right to form a “supervector” of
N cepstral vectors. (N is ty pically 5 -9 frames.) C all this
“supervector” yt.
n
20
E E C S E 6 8 7 0 : Advanced S peech R ecognition
21
Training via Maximum Mutual Information
The F und ame ntal E q uation of S p e e c h R e c ognition states that
p(S|O) = p(O|S)p(S)/P (O)
where S is the sentence and O are our observations. We
model p(O|S) using Hidden Markov Models (HMMs). The HMMs
themselves have a set of parameters θ that are estimated from a
set of training data, so it is convenient to write this dependence
explicitly: pθ (O|S).
The LDA procedure is applied to the supervectors yt.
n
The top M direction s (usually 4 0 -6 0 ) are chosen an d the
supervectors yt are projected in to this low er dim en sion al space.
n
The recog n ition sy stem is retrain ed on these low er dim en sion al
vectors.
n
We estimate the parameters θ to maximize the likelihood of the
training data. Although this seems to make some intuitive sense,
is this what we are after?
Not really! (Why?). So then, why is ML estimation a good thing?
E E C S E 6 8 7 0 : Advan ced S peech R ecog n ition
22
P erform an ce im provem en ts of 1 0 % -1 5 % are ty pical.
n
E E C S E 6 8 7 0 : Advanced Speech R ecognition
23
Maximum Likelihood Estimation Redux
ML estimation results in a function that allows us to estimate
parameters of the desired distribution from observed samples of
the distribution. For example, in the Gaussian case:
1X
xk
µ̂MLE =
n
n
(xk − µ̂MLE)(xk − µ̂MLE)
n
n
T he family of distributions is well-behaved
n
T he samp le is larg e enoug h
then, the maximum likelihood estimator has a G aussian
distribution with the following g ood p rop erties:
T
k=1
n
More g enerally we can consider the estimate of the parameters θ
as a random variable. T he function that computes this estimate is
called an estimator.
n
n
S ince µ and Σ themselves are computed from the random
variables xk we can consider them to be random variables as well.
T he samp le is actually drawn from the assumed family of
distributions
EEC S E6 8 7 0 : A dvanced S peech R ecog nition
T he mean converg es to the true mean of the p arameters
(consistent)
T he variance has a p articular form and is just a function of
the true mean of the p arameters and the samp les (F isher
information)
N o other consistent estimator has a lower variance
1
Σ̂MLE =
n
k=1
n
X
Any estimator, maximum likelihood or other, since it is a random
variable, has a mean and a variance. It can be shown that if
24
E E C S E 6 8 7 0 : Advanced S p eech R ecog nition
25
This means the ML estimate on the average will produce the
closest estimate to the true parameters of the system.
Main Problem with Maximum Likelihood
Estimation
If we assume that the system has its best performance when the
parameters match the true parameters, then the ML estimate will,
on average, perform as good as or better than any other estimator.
The true distribution of speech is (probably) not generated by an
HMM, at least not of the type we are currently using. (How might
we demonstrate this?)
Therefore, the optimality of the ML estimate is not guaranteed and
the parameters estimated may not result in the lowest error rates.
A reasonable criterion is rather than maximizing the likelihood of
the data given the model, we try to maximize the a posteriori
probability of the model given the data (Why?):
θMAP = arg max pθ (S|O)
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
26
θ
E E C S E 6 8 7 0 : Advanced S peech R ecognition
27
MMI Estimation
Why is this Called MMI Estimation?
Let’s look at the previous equation in more detail. It is more
convenient to look at the problem as maximizing the logarithm
of the a posteriori probability across all the sentences:
There is a quantity in information theory called the Mutual
Information. It is defined as:
p(X, Y )
E log
p(X)p(Y )
θ
= arg max
θ
= arg max
θ
X
log pθ (Si|Oi)
i
X
i
pθ (Oi|Si)p(Si)
pθ (Oi)
i
pθ (Oi|Si)p(Si)
log P
j
j
j pθ (Oi|Si )p(Si )
X
log
S ince p(Si) does not dep end on θ, the term can b e drop p ed from
the p rev ious set of equations, in w hich case the estimation formula
look s lik e the ex p ression for mutual information, ab ov e.
w here Sij refers to the jth possible sentence hypothesis given a
set of acoustic observations Oi
W hen orig inally deriv ed b y B row n[1 ], the formulation w as actually
in terms of mutual information, hence the name. H ow ev er, it is
easier to quick ly motiv ate in terms of max imiz ing the a p osteriori
p rob ab ility of the answ ers.
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
θMMI = arg max
28
E E C S E 6 8 7 0 : A dv anced S p eech R ecog nition
Comparison to ML Estimation
MMI Training Algorithm
i
Therefore, in ML estimation, for each i we only need to make
computations over the correct sentence Si. In MMI estimation,
we need to worry about computing quanitities over all possibile
sentence hypotheses - a much more computationally intense
process.
The MMI objective function is
X
i
30
pθ (Oi|Si)p(Si)
log P
j
j
j pθ (Oi|Si )p(Si )
We can view this as comprising two terms, the numerator, and the
denominator. We can increase the objective function in two ways:
n
Increase the contribution from the numerator term
n
D ecrease the contribution from the denominator term
Another advantage of ML over MMI is that there exists a
relatively simple algorithm - the forward-backward, or BaumWelch, algorithm, for iteratively estimating θ that is guaranteed
to converge. When originally formulated, MMI training had to be
done by painful gradient search.
A big breakthrough in the MMI area occured when it was shown
that a forward-backward-like algorithm existed for MMI training [2].
The derivation is complex but the resulting esitmation formulas are
surprisingly simple. We will just give the results for the estimation
of the means in a Gaussian HMM framework.
In ordinary ML estimation, the objective is to find θ :
X
log pθ (Oi|Si)
θML = arg max
θ
E E C S E 6 8 7 0 : Advanced S peech R ecognition
29
E E C S E 6 8 7 0 : Advanced S peech R ecognition
31
Basic idea:
estimate for µmk is:
C o llect estim atio n co u n ts fro m
den o m in ato r term s
n
b o th
th e n u m erato r an d
µkm =
In crease th e o b jectiv e fu n ctio n b y su b tractin g th e den o m in ato r
co u n ts fro m th e n u m erato r co u n ts.
n
M o re sp ecifi cally , let:
num
=
θmk
X
den
=
θmk
X
num
Oi(t)γmki
(t)
num
den
θmk
− θmk
+ Dmk µ0mk
num − γ den + D
γmk
mk
mk
T h e fac tor Dmk is c h ose larg e en ou g h to av oid p rob lems w ith
n eg ativ e c ou n t d ifferen c es. N otic e th at ig n orin g th e d en omin ator
c ou n ts resu lts in th e n ormal mean estimate. A similar ex p ression
ex ists for v arian c e estimation .
i,t
den
Oi(t)γmki
(t)
i,t
32
Computing the Denominator Counts
The major component of the MMI calculation is the computation of
the denominator counts. Theoretically, we must compute counts
for every possible sentence hypotheis. How can we reduce the
amount of computation?
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
34
33
Counts can be collected on this HMM the same way counts are
collected on the HMM representing the sentence corresponding
to the correct path.
2. Use a ML decoder to generate a “reasonable” number of
sentence hypotheses and then use FST operations such as
determinization and minimization to compactify this into an HMM
graph (lattice).
3. Do not regenerate the lattice after every MMI iteration.
1. From the previous lectures, realize that the set of sentence
hypotheses are just captured by a large HMM for the entire
sentence:
E E C S E 6 8 7 0 : A d v an c ed S p eec h R ec og n ition
E E C S E 6 8 7 0 : A dv an ced S p eech R eco g n itio n
num
w h ere γmki
(t) are th e co u n ts fo r state k, m ix tu re co m p o n en t
m, co m p u ted fro m ru n n in g th e fo rw ard-b ack w ard alg o rith m o n
den
th e “co rrect” sen ten ce Si an d γmki
(t) are th e co u n ts co m p u ted
acro ss all th e sen ten ce h y p o th eses co rresp o n din g to Si T h e M M I
E E CS E 6 8 7 0 : A dvanced Speech R ecognition
35
Results
Other Computational Issues
Because we ignore correlation, the likelihood of the data tends
to be dominated by a very small number of lattice paths (Why?).
To increase the number of confusable paths, the likelihoods are
scaled with an exponential constant:
X
i
pθ (Oi|Si)κp(Si)κ
log P
j κ
j κ
j pθ (Oi|Si ) p(Si )
F or similar reasons, a weaker language model (unigram) is
used to generate the denominator lattice. This also simplifi es
denominator lattice generation.
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
Note that results hold up on a variety of other tasks as well.
36
E E C S E 6 8 7 0 : A dvanc ed S peec h R ec og nition
MPE
Variations and Embellishments
MPE - Minimum Phone Error
n
b MMI - B oos ted MMI
n
MC E - Minimum C la s s ifi c a tion Error
n
fMPE/fMMI - fea ture-b a s ed MPE a nd MMI
X
i
n
n
n
EEC S E6 8 7 0 : A d v a nc ed S p eec h R ec og nition
38
P
j
pθ (Oi|Sj )κp(Sj )κA(Sref , Sj )
P
j κ
j κ
j pθ (Oi|Si ) p(Si )
A(Sref , Sj is a phone-frame accuracy function. A measures the
number of correctly labeled frames in S
P ov ey [3 ] show ed this could be optimiz ed in a w ay similar to that
of M M I.
U sually w ork s somew hat better than M M I itself
n
37
E E C S E 6 8 7 0 : A dv anced S peech R ecog nition
39
bMMI
i
log P
j
pθ (Oi|Sij )κp(Sij )κ e x p (−bA(Sij , Sr e f ))
B oosts contrib ution of paths w ith low er phone error rates.
A is a phone-frame accuracy function as in MPE.
n
n
Language
A rab ic
D o m ain
T elep h o ny
H o urs
80
M L
4 3 .2
M PE
3 6 .8
bM M I
3 5 .9
pθ (Oi|Si)κp(Si)κ
EEC S E6 8 7 0 : A d v anced S peech R ecog nition
40
pθ (Oi|Si)κp(Si)κ
j
pθ (Oi|Sij )κp(Sij )κ e x p (−bA (Sij , Si)
y t = O t + M ht
)
1
1+ e2ρ x
T he s u m o v er c o m p etin g m o d els ex p lic itly ex c lu d es the c o rrec t
c la s s (u n lik e the o ther v a ria tio n s )
n
O rig in a lly d ev elo p ed fo r g ra m m a r-b a s ed a p p lic a tio n s
n
C o m p a ra b le to M P E , n ev er c o m p a red to b M M I
n
n
n
A p p ro x im a tes s en ten c e erro r ra te o n tra in in g d a ta
n
E E C S E 6 8 7 0 : A d v a n c ed S p eec h R ec o g n itio n
41
fMPE/fMMI
n
where f (x) =
E nglis h
P arliam ent
80
8 .8
7 .2
6 .8
42
ht are the set of Gaussian likelihoods for frame t. May be
clustered into a smaller number of Gaussians, may also be
combined across multiple frames.
T he training of M is ex ceeding ly complex inv olv ing both the
g radients of your fav orite objectiv e function w ith respect to M
as w ell as the model parameters θ w ith multiple passes throug h
the data.
R ather amaz ing ly g iv es sig nifi cant g ains both w ith and w ithout
MMI.
i
f (log P
E nglis h
T elep h o ny
175
3 1 .8
2 8 .6
2 8 .3
E E C S E 6 8 7 0 : A d v anc ed S p eec h R ec o gnitio n
MCE
X
E nglis h
N ew s
50
2 5 .3
1 9 .6
1 8 .1
X
Various Comparisons
E E C S E 6 8 7 0 : A dv anced S peech R ecog nition
43
References
fMPE/fMMI Results
English BN 50 Hours, SI models
[1] P. Brown (1987) “The Acoustic Modeling Problem in
Automatic Speech Recognition”, PhD Thesis, Dept. of
Computer Science, Carnegie-Mellon University.
D EV 04 f R T 04
2 8 .7
2 5.3
2 1 .8
1 9 .2
2 1 .1
1 8 .2
[2] P.S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo
(1991) “ An Inequality for Rational Functions with
Applications to Some Statistical Modeling Problems”,
IEEE Trans. on Acoustics, Speech and Signal Processing,
37(1) 107-113, January 1991
A ra b ic BN 1 4 00 Hours, SA T M odels
[3] D. Povey and P. Woodland (2002) “Minimum Phone Error
and i-smoothing for improved discriminative training”, Proc.
ICASSP vol. 1 pp 105-108.
EV A L 06
2 4 .9
2 2 .3
2 0.1
EV A L 07
1 9 .6
1 6 .8
1 4 .5
D EV 07
M L
1 7 .1
fM P E
1 4 .3
fM P E+ M P E
1 2 .6
EEC S E6 8 7 0: A dv a nc ed Sp eec h R ec ognition
R T 03
M L
1 7 .5
fBM M I
1 3 .2
fb M M I+ b M M I 1 2 .6
44
EECS E6 870: Advanced Speech Recognition
ROVER - Recognizer Output Voting Error
Reduction[1]
45
ROVER - Basic Architecture
ROVER is a technique for combining recognizers together to
improve recognition accuracy. The concept came from the
following set of observations about 11 years ago:
n
C ompare errors of recognizers from two d ifferent sites
n
Error rate performance similar - 4 4 .9 % vs 4 5 .1%
Out of 5 9 19 total errors, 7 3 8 are errors for only recognizer A
and 7 5 5 for only recognizer B
n
Systems may come from multiple sites
n
C an b e a sin g le site w ith d ifferen t processin g sch emes
S uggests that some sort of voting process across recognizers
might red uce the overall error rate
n
EEC S E6 8 7 0 : A d vanced S peech Recognition
46
n
E E C S E 6 8 7 0 : A d v an ced Speech R ecog n ition
47
EECS E6870: Advanced Speech Recognition
ROVER - Example
ROVER - Text String Alignment Process
48
EECS E6870: Advanced Speech Recognition
ROVER - Form Confusion Sets
49
ROVER - Aligning Strings Against a Network
EECS E6870: Advanced Speech Recognition
50
Solution: Alter cost function so that there is only a substitution
cost if no member of the reference network matches the target
symbol.
E E C S E 6 8 7 0 : Ad v anced Sp eech R ecognition
51
ROVER - Aligning Networks Against Networks
ROVER - Vote
n
Main Idea: for each confusion set, take word with highest
frequency
SYS1
4 4 .9
n
SYS2
4 5 .1
SYS3
4 8 .7
SYS4
4 8 .9
SYS5
5 0 .2
ROVER
3 9 .7
Im p rov em ent v ery im p ressiv e - as large as any signifi cant
algorithm adv ance.
EEC S E6 8 7 0 : A dvanced Speech Recog nition
No so much a ROVER issue but will be important for confusion
networks.
Problem: How to score relative probabilities and deletions?
Solution: cost subst(s1,s2)= (1 - p1(winner(s2)) + 1 - p2(winner(s1))/2
52
E E C S E 6 8 7 0 : A dv anced S p eech R ecognition
ROVER - Example
53
ROVER - As a Function of Number of Systems [2]
Error not guaranteed to be reduced.
n
Alphabetical: take systems in alphabetical order.
n
C u rv es ordered by error rate.
n
N ote error actu ally g oes u p slig htly w ith 9 systems
EEC S E6 8 7 0 : A dv anced S p eech R ecognition
54
n
S ens itiv e to initial ch oice of bas e s y s tem us ed for alignm ent ty p ically tak e th e bes t s y s tem .
n
E E C S E 6 8 7 0 : Adv anced S peech R ecog nition
55
ROVER - Types of Systems to Combine
References
ML and MMI
n
[1] J. Fiscus (1997) “A Post-Processing System to Yield
Reduced Error Rates”, IEEE Workshop on Automatic Speech
Recognition and Understanding, Santa Barbara, CA
V ary ing am o u nt o f ac o u s tic c o nte x t in p ro nu nc iatio n m o de ls
(T rip h o ne , Q u inp h o ne )
n
n
D iffe re nt le x ic o ns
n
D iffe re nt s ig nal p ro c e s s ing s c h e m e s (MF C C , P LP )
n
A ny th ing e ls e y o u c an th ink o f!
[2] H. Schwenk and J.L. Gauvain (2000) “Combining Multiple
Speech Recognizers using Voting and Language Model
Information” ICSLP 2000, Beijing II pp. 915-918
E E C S E 6 8 7 0 : A dv anc e d S p e e c h R e c o g nitio n
R o v e r p ro v ide s an e x c e lle nt w ay to ac h ie v e c ro s s -s ite c o llab o ratio n
and s y ne rg y in a re lativ e ly p ainle s s fas h io n.
56
EECS E6 870: Advanced Speech Recognition
Consensus Decoding[1] - Introduction
57
Consensus Decoding - Motivation
P rob lem
n
Standard SR evaluation procedure is word-based
n
Standard h y poth esis scoring functions are sentence-based
G oa l
E x plicitly m inim iz e word error m etric:
n
c = arg min EP (R|A)[WE (W, R)] = arg min
W
W
W
X
P (R|A)WE (W, R)
R
Original work was done off N-best lists
n
L attic es m u c h m ore c om p ac t and h av e lower orac le error rates
E E C S E 6 8 7 0 : A dvanced Speech Recog nition
58
n
F or each candidate word, sum th e word posteriors and pick th e
word with th e h ig h est posterior probability .
n
E E C S E 6 8 7 0 : A dv anc ed S p eec h R ec ognition
59
Consensus Decoding - Approach
Consensus Decoding Approach - cont’d
Find a multiple alignment of all the lattice paths
Input L attice:
HAVE
n
Compute the word error between two hypotheses according to
the multiple alignment:
VERY
OFTEN
I
I
W E(W, R) ≈ M W E(W, R)
VERY
MOVE
OFTEN
I
VEAL
HAVE
FINE
SIL
SIL
HAVE
SIL
FAST
VERY
SIL
VERY
MOVE
HAVE
SIL
FINE
VERY
IT
n
F ind the consensus hypothesis:
SIL
IT
WC = a rg m in
M ultiple Alignm ent:
W ∈W
I
HAVE
IT
VEAL
FINE
-
MOVE
-
VERY
OFTEN
X
P (R|A) ∗ M W E(W, R)
R∈W
E E C S E 6 8 7 0 : A dv anced S peech R ecognition
FAST
60
E E CS E 6 8 7 0 : A dv anced S peech R ecognition
Consensus Decoding Approach - cont’d
Consensus Decoding Approach - Multiple
Alignment
Input:
H AVE
VERY
O F T EN
I
I
M O VE
n
Equivalence relation over word hypotheses (links)
n
T otal ordering of the equivalence classes
VERY
O F T EN
I
VEAL
H AVE
F IN E
SIL
SIL
H AVE
F IN E
VERY
IT
H AVE
Mathematical prob lem formulation:
F AST
VERY
M O VE
SIL
SIL
VERY
SIL
61
SIL
IT
n
O utput:
I
HAVE
IT
VEAL
F IN E
-
M O VE
-
VERY
O FTEN
n
D efi ne a partial order on sets of links which is consistent with
the precedence order in the lattice
C luster sets of links in the partial order to derive a total order
EECS E6870: Advanced Speech Recognition
62
FA ST
EEC S E6 8 7 0 : A dvanced S peech R ecog nition
63
Consensus Decoding Approach - Clustering
Algorithm
Obtaining the Consensus Hypothesis
Input:
Initializ e Clusters: a cluster consists of all the links having the
same starting time, ending time and word label
H AVE
VERY
O F T EN
I
M O VE
I
VERY
O F T EN
I
VEAL
H AVE
F IN E
SIL
SIL
H AVE
VERY
IT
SIL
F AST
VERY
SIL
Intra-w ord Clustering: merge only clusters which are not in
relation and correspond to the same word
SIL
IT
Output:
I
HAVE
(0.45)
IT (0.3 9 )
-
M O VE
(0.55)
-
(0.6 1 )
VEAL
F IN E
VERY
O FTEN
FA ST
VERY
M O VE
H AVE
Inter-w ord Clustering: merge heterogeneous clusters which are
not in relation
E E C S E 6 8 7 0 : A dvanced S peech R ecognition
SIL
F IN E
64
EECS E6870: Advanced Speech Recognition
Confusion Networks
65
Consensus Decoding on DARPA Communicator
24
I
HAVE
(0.45)
IT (0.3 9 )
VEAL
F IN E
-
M O VE
(0.55)
-
VERY
O FTEN
(0.6 1 )
LA R G E
S M A LL
LA R G E
S M A LL
LA R G E
S M A LL
22
n
Confidence Annotations and Word Spotting
n
Sy stem Com b ination
n
E rror Correction
W o rd E rro r R a te (% )
FA ST
sLM
sLM
sLM
sLM
sLM
sLM
2
2
2+C
2+C
2+C +M X
2+C +M X
20
18
16
14
40K
7 0K
2 8 0K
40K M L L R
E E CS E 6 8 7 0 : Adv anced Speech R ecognition
66
A c o u s tic M o d e l
EECS E6870: Advanced Speech Recognition
67
F0
8 .3
8 .5
C C +
A vg
1 4 .0
1 3 .6
F0
8 .6
8 .5
Word Error Rate (%)
F1
F2
F3
F4
1 5 .8 1 9 .4 1 5 .3 1 6 .0
1 5 .7 1 8 .6 1 4 .6 1 5 .3
F5
2 2 .4
1 8 .8
FX
2 3 .7
2 2 .5
F5
5 .7
5 .7
FX
4 4 .8
4 1 .1
EEC S E6 8 7 0 : A dv an c ed S p eec h Rec og n ition
A vg
1 6 .5
1 6 .0
Word Error Rate (%)
S y s tem
B as elin e C on s en s u s
S -V M 1
3 0 .2
2 8 .8
S -V M 2
3 3 .7
3 1 .2
S -V M 3
4 2 .4
4 1 .6
RO V ER
2 9 .2
2 8 .5
C C +
Word Error Rate (%)
F1
F2
F3
F4
1 8 .6 2 7 .9 2 6 .2 1 0 .7
1 8 .1 2 6 .1 2 5 .8 1 0 .5
Consensus Decoding on Voice Mail
Consensus Decoding on Broadcast News
68
EEC S E6 8 7 0 : A dv an c ed S p eec h Rec og n ition
System Combination Using Confusion Networks
69
System Combination Using Confusion Networks
If we have multiple systems, we can combine the concept of
ROVER with confusion networks as follows:
U se the same process as ROVER to alig n confusion networks
n
T ake the overall confusion network and ad d the posterior
probabilities for each word .
n
EEC S E6 8 7 0 : A d vanced S peech Recog nition
70
F or each confusion set, pick the word with the hig hest summed
posteriors.
n
EECS E6870: Advanced Speech Recognition
71
References
EECS E6870: Advanced Speech Recognition
72
COURSE FEEDBACK
Was this lecture mostly clear or unclear?
muddiest topic?
n
What was the
O ther feedb ack (pace, content, atmosphere)?
n
E E C S E 6 8 7 0 : A dv anced S peech R ecog nition
74
[1] L. Mangu, E. Brill and A. Stolcke (2000) “Finding consensus
in speech recognition: word error minimization and other
applications of confusion networks”, Computer Speech and
Language 14(4), 373-400.
Results of Confusion-Network-Based System
Combination
EECS E6 8 70: Adv anced Speech R ecognition
73