Outline of Today’s Lecture EECS E6870 - Speech Recognition Lecture 11 n Administrivia n L ine ar D isc riminant Analy sis n M ax imu m M u tu al Info rmatio n T raining Stanley F. Chen, Michael A. Picheny and Bhuvana Ramabhadran n R O VER IBM T.J. Watson Research Center n C o nse nsu s D e c o ding Yorktown Heights, NY, USA Columbia University [email protected], [email protected], [email protected] 24 November 2009 E E CS E 6 8 7 0: Advanced Sp eech Recognition E E C S E 6 8 7 0 : Advanc e d S p e e c h R e c o g nitio n 1 Administrivia Linear Discriminant Analysis http://www.ee.columbia.edu/ stanchen/fall09/e6870/readings/project f09.html A way to achieve robustness is to extract features that emphasize sound discriminability and ignore irrelevant sources of information. LDA tries to achieve this via a linear transform of the feature data. If the main sources of class variation lie along the coordinate axes there is no need to do anything even if assuming a diagonal covariance matrix (as in most HMM models): for suggested readings and presentation guidelines for fi nal project. E E C S E 6870: A dv anced Speech R ecognition 2 See E E C S E 6 8 7 0 : Advanced S peech R ecognition 3 E E C S E 6 8 7 0 : A dvanced S p eech R ecognition If the main sources of class variation do NOT lie along the main source of variation we need to find the best directions: If the main sources of class variation lie along the main source of variation we may want to rotate the coordinate axis (if using diagonal covariances): Linear Discriminant Analysis - Motivation Principle Component Analysis-Motivation 4 Eigenvectors and Eigenvalues A key concept in feature selection are the eigenvalues and eigenvectors of a matrix. The eigenvalues and eigenvectors of a matrix are defined by the following matrix equation: E E C S E 6 8 7 0 : A dvanced S p eech R ecognition 5 this matrix is a polynomial (called the characteristic polynomial) p(λ). The roots of this polynomial will be the eigenvalues of A. For example, let us say 2 −4 . A= −1 −1 In such a case, Ax = λx = (2 − λ)(−1 − λ) − (−4)(−1) = λ2 − λ − 6 = (λ − 3)(λ + 2) Therefore, λ1 = 3 and λ2 = −2 are the eigenvalues of A. To find the eigenvectors, we simply plug in the eigenvalues into If xis non-zero, the only way this equation can be solved is if the determinant of the matrix (A − λI) is zero. The determinant of EEC S E6 8 7 0 : Advanced S peech R ecognition 6 For a given matrix A the eigenvectors are defined as those vectors x for which the above statement is true. Each eigenvector has an associated eigenvalue, λ. To solve this equation, we can rewrite it as (A − λI)x = 0 2−λ −4 p(λ) = −1 −1 − λ E E C S E 6 8 7 0 : A dvanced S peech R ecognition 7 (A − λI)x = 0 and solve for x. For example, for λ1 = 3 we get 2−3 −4 −1 −1 − 3 x1 x2 = 0 0 Principle Component Analysis-Derivation Solving this, we find that x1 = −4x2, so all the eigenvector corresponding to λ1 = 3 is a multiple of [−4 1]T . Similarly, we find that the eigenvector corresponding to λ1 = −2 is a multiple of [1 1]T . First consider the problem of best representing a set of vectors x1, x2, . . . , xn by a single vector x0. More specifically let us try to minimize the sum of the squared distances from x0 J0(x0) = N X |xk − x0|2 k=1 It is easy to show that the sample mean, m, minimizes J0, where m is given by m = x0 = N 1 X xk N E E C S E 6 8 7 0 : A dvanced Speech R ecognition k=1 8 E E C S E 6 8 7 0 : A dvanced S peech R ecognition 9 mean square error: J1(a1, a2, . . . , aN , e) = N X |xk − (m + ak e)|2 k=1 If we differentiate the above with respect to ak we get ak = eT (xk − m) Now, let e be a unit vector in an arbitrary direction. In such a case, we can express a vector x as i.e. we project xk onto the line in the direction of e that passes through the sample mean m. How do we find the best direction e? If we substitute the above solution for ak into the formula for the overall mean square error we get after some manipulation: x = m + ae T J1(e) = −e Se + E E C S E 6 8 7 0 : A dvanced S peech R ecog nition 10 |xk − m|2 k=1 For the vectors xk we can find a set of ak s that minimizes the N X E E C S E 6 8 7 0 : A dvanced S peech R ecognition 11 So to maximize eT Se we want to select the eigenvector of S corresponding to the largest eigenvalue of S. where S is called the Scatter matrix and is given by: S= N X (xk − m)(xk − m)T k=1 Notice the scatter matrix just looks like N times the sample covariance matrix of the data. To minimize J1 we want to maximize eT Se subject to the constraint that |e| = eT e = 1. Using Lagrange multipliers we write u = eT Se − λeT e . Differentiating u w.r.t e and setting to zero we get: If we now want to fi nd the b est d directions, the prob lem is now to express x as 2Se − 2λe = 0 or x=m+ E E C S E 6 8 7 0 : A dvanced S peech R ecognition 12 E E C S E 6 8 7 0 : A dvanced Speech R ecognition In this case, we can write the mean square error as Jd = N X |(m + k=1 d X a i ei i=1 Se = λe d X 13 Linear Discriminant Analysis - Derivation akiei) − xk |2 Let us say we have vectors corresponding to c classes of data. We can define a set of scatter matrices as above as i=1 and it is not hard to show that Jd is minimized when the vectors e1, e2, . . . , ed correspond to the d largest eigenvectors of the scatter matrix S. Si = X (x − mi)(x − mi)T x∈Di where mi is the mean of class i. In this case we can define the within-class scatter (essentially the average scatter across the classes relative to the mean of each class) as just: SW = c X Si E E C S E 6 8 7 0 : A dvanced S peech R ecognition 14 i=1 E E C S E 6 8 7 0 : A dvanced S peech R ecognition 15 Another useful scatter matrix is the between class scatter matrix, defined as SB = c X We would like to determine a set of projection directions V such that the classes c are maximally discriminable in the new coordinate space given by (mi − m)(mi − m)T x̃ = Vx E E C S E 6 8 7 0 : Adv anced S p eech R ecog nition i=1 16 E E C S E 6 8 7 0 : A dvanced S peech R ecognition 17 as: S̃B = c X = c X (m̃i − m̃)(m̃i − m̃)T i=1 V(mi − m)(mi − m)T VT i=1 = VSB VT A reasonable measure of discriminability is the ratio of the volumes represented by the scatter matrices. Since the determinant of a matrix is a measure of the corresponding volume, we can use the ratio of determinants as a measure: J= |SB | |SW | and similarly for SW so the discriminability measure becomes J(V) = W ith a little bit of manip ulation similar to that in P C A , it turns out that the solution are the eig env ectors of the matrix 18 S−1 W SB So we want to fi nd a set of directions that maximiz e this expression. In the new space, we can write the above expression E E C S E 6 8 7 0 : Advanced Speech R ecognition |VSB VT | | |VSW VT E E C S E 6 8 7 0 : A dv anced S p eech R ecog nition 19 which can be generated by most common mathematical packages. Linear Discriminant Analysis in Speech Recognition The most successful uses of LDA in speech recognition are achieved in an interesting fashion. n n E E C S E 6 8 7 0 : A dv anced S peech R ecognition S peech recognition training data are aligned against the underly ing w ords using the V iterb i alignment algorithm describ ed in Lecture 4 . U sing this alignment, each cepstral vector is tagged w ith a different phone or sub -phone. F or E nglish this ty pically results in a set of 1 5 6 (5 2 x 3 ) classes. F or each time t the cepstral vector xt is spliced together w ith N/2 vectors on the left and right to form a “supervector” of N cepstral vectors. (N is ty pically 5 -9 frames.) C all this “supervector” yt. n 20 E E C S E 6 8 7 0 : Advanced S peech R ecognition 21 Training via Maximum Mutual Information The F und ame ntal E q uation of S p e e c h R e c ognition states that p(S|O) = p(O|S)p(S)/P (O) where S is the sentence and O are our observations. We model p(O|S) using Hidden Markov Models (HMMs). The HMMs themselves have a set of parameters θ that are estimated from a set of training data, so it is convenient to write this dependence explicitly: pθ (O|S). The LDA procedure is applied to the supervectors yt. n The top M direction s (usually 4 0 -6 0 ) are chosen an d the supervectors yt are projected in to this low er dim en sion al space. n The recog n ition sy stem is retrain ed on these low er dim en sion al vectors. n We estimate the parameters θ to maximize the likelihood of the training data. Although this seems to make some intuitive sense, is this what we are after? Not really! (Why?). So then, why is ML estimation a good thing? E E C S E 6 8 7 0 : Advan ced S peech R ecog n ition 22 P erform an ce im provem en ts of 1 0 % -1 5 % are ty pical. n E E C S E 6 8 7 0 : Advanced Speech R ecognition 23 Maximum Likelihood Estimation Redux ML estimation results in a function that allows us to estimate parameters of the desired distribution from observed samples of the distribution. For example, in the Gaussian case: 1X xk µ̂MLE = n n (xk − µ̂MLE)(xk − µ̂MLE) n n T he family of distributions is well-behaved n T he samp le is larg e enoug h then, the maximum likelihood estimator has a G aussian distribution with the following g ood p rop erties: T k=1 n More g enerally we can consider the estimate of the parameters θ as a random variable. T he function that computes this estimate is called an estimator. n n S ince µ and Σ themselves are computed from the random variables xk we can consider them to be random variables as well. T he samp le is actually drawn from the assumed family of distributions EEC S E6 8 7 0 : A dvanced S peech R ecog nition T he mean converg es to the true mean of the p arameters (consistent) T he variance has a p articular form and is just a function of the true mean of the p arameters and the samp les (F isher information) N o other consistent estimator has a lower variance 1 Σ̂MLE = n k=1 n X Any estimator, maximum likelihood or other, since it is a random variable, has a mean and a variance. It can be shown that if 24 E E C S E 6 8 7 0 : Advanced S p eech R ecog nition 25 This means the ML estimate on the average will produce the closest estimate to the true parameters of the system. Main Problem with Maximum Likelihood Estimation If we assume that the system has its best performance when the parameters match the true parameters, then the ML estimate will, on average, perform as good as or better than any other estimator. The true distribution of speech is (probably) not generated by an HMM, at least not of the type we are currently using. (How might we demonstrate this?) Therefore, the optimality of the ML estimate is not guaranteed and the parameters estimated may not result in the lowest error rates. A reasonable criterion is rather than maximizing the likelihood of the data given the model, we try to maximize the a posteriori probability of the model given the data (Why?): θMAP = arg max pθ (S|O) E E C S E 6 8 7 0 : A dvanced S peech R ecognition 26 θ E E C S E 6 8 7 0 : Advanced S peech R ecognition 27 MMI Estimation Why is this Called MMI Estimation? Let’s look at the previous equation in more detail. It is more convenient to look at the problem as maximizing the logarithm of the a posteriori probability across all the sentences: There is a quantity in information theory called the Mutual Information. It is defined as: p(X, Y ) E log p(X)p(Y ) θ = arg max θ = arg max θ X log pθ (Si|Oi) i X i pθ (Oi|Si)p(Si) pθ (Oi) i pθ (Oi|Si)p(Si) log P j j j pθ (Oi|Si )p(Si ) X log S ince p(Si) does not dep end on θ, the term can b e drop p ed from the p rev ious set of equations, in w hich case the estimation formula look s lik e the ex p ression for mutual information, ab ov e. w here Sij refers to the jth possible sentence hypothesis given a set of acoustic observations Oi W hen orig inally deriv ed b y B row n[1 ], the formulation w as actually in terms of mutual information, hence the name. H ow ev er, it is easier to quick ly motiv ate in terms of max imiz ing the a p osteriori p rob ab ility of the answ ers. E E C S E 6 8 7 0 : A dvanced S peech R ecognition θMMI = arg max 28 E E C S E 6 8 7 0 : A dv anced S p eech R ecog nition Comparison to ML Estimation MMI Training Algorithm i Therefore, in ML estimation, for each i we only need to make computations over the correct sentence Si. In MMI estimation, we need to worry about computing quanitities over all possibile sentence hypotheses - a much more computationally intense process. The MMI objective function is X i 30 pθ (Oi|Si)p(Si) log P j j j pθ (Oi|Si )p(Si ) We can view this as comprising two terms, the numerator, and the denominator. We can increase the objective function in two ways: n Increase the contribution from the numerator term n D ecrease the contribution from the denominator term Another advantage of ML over MMI is that there exists a relatively simple algorithm - the forward-backward, or BaumWelch, algorithm, for iteratively estimating θ that is guaranteed to converge. When originally formulated, MMI training had to be done by painful gradient search. A big breakthrough in the MMI area occured when it was shown that a forward-backward-like algorithm existed for MMI training [2]. The derivation is complex but the resulting esitmation formulas are surprisingly simple. We will just give the results for the estimation of the means in a Gaussian HMM framework. In ordinary ML estimation, the objective is to find θ : X log pθ (Oi|Si) θML = arg max θ E E C S E 6 8 7 0 : Advanced S peech R ecognition 29 E E C S E 6 8 7 0 : Advanced S peech R ecognition 31 Basic idea: estimate for µmk is: C o llect estim atio n co u n ts fro m den o m in ato r term s n b o th th e n u m erato r an d µkm = In crease th e o b jectiv e fu n ctio n b y su b tractin g th e den o m in ato r co u n ts fro m th e n u m erato r co u n ts. n M o re sp ecifi cally , let: num = θmk X den = θmk X num Oi(t)γmki (t) num den θmk − θmk + Dmk µ0mk num − γ den + D γmk mk mk T h e fac tor Dmk is c h ose larg e en ou g h to av oid p rob lems w ith n eg ativ e c ou n t d ifferen c es. N otic e th at ig n orin g th e d en omin ator c ou n ts resu lts in th e n ormal mean estimate. A similar ex p ression ex ists for v arian c e estimation . i,t den Oi(t)γmki (t) i,t 32 Computing the Denominator Counts The major component of the MMI calculation is the computation of the denominator counts. Theoretically, we must compute counts for every possible sentence hypotheis. How can we reduce the amount of computation? E E C S E 6 8 7 0 : A dvanced S peech R ecognition 34 33 Counts can be collected on this HMM the same way counts are collected on the HMM representing the sentence corresponding to the correct path. 2. Use a ML decoder to generate a “reasonable” number of sentence hypotheses and then use FST operations such as determinization and minimization to compactify this into an HMM graph (lattice). 3. Do not regenerate the lattice after every MMI iteration. 1. From the previous lectures, realize that the set of sentence hypotheses are just captured by a large HMM for the entire sentence: E E C S E 6 8 7 0 : A d v an c ed S p eec h R ec og n ition E E C S E 6 8 7 0 : A dv an ced S p eech R eco g n itio n num w h ere γmki (t) are th e co u n ts fo r state k, m ix tu re co m p o n en t m, co m p u ted fro m ru n n in g th e fo rw ard-b ack w ard alg o rith m o n den th e “co rrect” sen ten ce Si an d γmki (t) are th e co u n ts co m p u ted acro ss all th e sen ten ce h y p o th eses co rresp o n din g to Si T h e M M I E E CS E 6 8 7 0 : A dvanced Speech R ecognition 35 Results Other Computational Issues Because we ignore correlation, the likelihood of the data tends to be dominated by a very small number of lattice paths (Why?). To increase the number of confusable paths, the likelihoods are scaled with an exponential constant: X i pθ (Oi|Si)κp(Si)κ log P j κ j κ j pθ (Oi|Si ) p(Si ) F or similar reasons, a weaker language model (unigram) is used to generate the denominator lattice. This also simplifi es denominator lattice generation. E E C S E 6 8 7 0 : A dvanced S peech R ecognition Note that results hold up on a variety of other tasks as well. 36 E E C S E 6 8 7 0 : A dvanc ed S peec h R ec og nition MPE Variations and Embellishments MPE - Minimum Phone Error n b MMI - B oos ted MMI n MC E - Minimum C la s s ifi c a tion Error n fMPE/fMMI - fea ture-b a s ed MPE a nd MMI X i n n n EEC S E6 8 7 0 : A d v a nc ed S p eec h R ec og nition 38 P j pθ (Oi|Sj )κp(Sj )κA(Sref , Sj ) P j κ j κ j pθ (Oi|Si ) p(Si ) A(Sref , Sj is a phone-frame accuracy function. A measures the number of correctly labeled frames in S P ov ey [3 ] show ed this could be optimiz ed in a w ay similar to that of M M I. U sually w ork s somew hat better than M M I itself n 37 E E C S E 6 8 7 0 : A dv anced S peech R ecog nition 39 bMMI i log P j pθ (Oi|Sij )κp(Sij )κ e x p (−bA(Sij , Sr e f )) B oosts contrib ution of paths w ith low er phone error rates. A is a phone-frame accuracy function as in MPE. n n Language A rab ic D o m ain T elep h o ny H o urs 80 M L 4 3 .2 M PE 3 6 .8 bM M I 3 5 .9 pθ (Oi|Si)κp(Si)κ EEC S E6 8 7 0 : A d v anced S peech R ecog nition 40 pθ (Oi|Si)κp(Si)κ j pθ (Oi|Sij )κp(Sij )κ e x p (−bA (Sij , Si) y t = O t + M ht ) 1 1+ e2ρ x T he s u m o v er c o m p etin g m o d els ex p lic itly ex c lu d es the c o rrec t c la s s (u n lik e the o ther v a ria tio n s ) n O rig in a lly d ev elo p ed fo r g ra m m a r-b a s ed a p p lic a tio n s n C o m p a ra b le to M P E , n ev er c o m p a red to b M M I n n n A p p ro x im a tes s en ten c e erro r ra te o n tra in in g d a ta n E E C S E 6 8 7 0 : A d v a n c ed S p eec h R ec o g n itio n 41 fMPE/fMMI n where f (x) = E nglis h P arliam ent 80 8 .8 7 .2 6 .8 42 ht are the set of Gaussian likelihoods for frame t. May be clustered into a smaller number of Gaussians, may also be combined across multiple frames. T he training of M is ex ceeding ly complex inv olv ing both the g radients of your fav orite objectiv e function w ith respect to M as w ell as the model parameters θ w ith multiple passes throug h the data. R ather amaz ing ly g iv es sig nifi cant g ains both w ith and w ithout MMI. i f (log P E nglis h T elep h o ny 175 3 1 .8 2 8 .6 2 8 .3 E E C S E 6 8 7 0 : A d v anc ed S p eec h R ec o gnitio n MCE X E nglis h N ew s 50 2 5 .3 1 9 .6 1 8 .1 X Various Comparisons E E C S E 6 8 7 0 : A dv anced S peech R ecog nition 43 References fMPE/fMMI Results English BN 50 Hours, SI models [1] P. Brown (1987) “The Acoustic Modeling Problem in Automatic Speech Recognition”, PhD Thesis, Dept. of Computer Science, Carnegie-Mellon University. D EV 04 f R T 04 2 8 .7 2 5.3 2 1 .8 1 9 .2 2 1 .1 1 8 .2 [2] P.S. Gopalakrishnan, D. Kanevsky, A. Nadas, D. Nahamoo (1991) “ An Inequality for Rational Functions with Applications to Some Statistical Modeling Problems”, IEEE Trans. on Acoustics, Speech and Signal Processing, 37(1) 107-113, January 1991 A ra b ic BN 1 4 00 Hours, SA T M odels [3] D. Povey and P. Woodland (2002) “Minimum Phone Error and i-smoothing for improved discriminative training”, Proc. ICASSP vol. 1 pp 105-108. EV A L 06 2 4 .9 2 2 .3 2 0.1 EV A L 07 1 9 .6 1 6 .8 1 4 .5 D EV 07 M L 1 7 .1 fM P E 1 4 .3 fM P E+ M P E 1 2 .6 EEC S E6 8 7 0: A dv a nc ed Sp eec h R ec ognition R T 03 M L 1 7 .5 fBM M I 1 3 .2 fb M M I+ b M M I 1 2 .6 44 EECS E6 870: Advanced Speech Recognition ROVER - Recognizer Output Voting Error Reduction[1] 45 ROVER - Basic Architecture ROVER is a technique for combining recognizers together to improve recognition accuracy. The concept came from the following set of observations about 11 years ago: n C ompare errors of recognizers from two d ifferent sites n Error rate performance similar - 4 4 .9 % vs 4 5 .1% Out of 5 9 19 total errors, 7 3 8 are errors for only recognizer A and 7 5 5 for only recognizer B n Systems may come from multiple sites n C an b e a sin g le site w ith d ifferen t processin g sch emes S uggests that some sort of voting process across recognizers might red uce the overall error rate n EEC S E6 8 7 0 : A d vanced S peech Recognition 46 n E E C S E 6 8 7 0 : A d v an ced Speech R ecog n ition 47 EECS E6870: Advanced Speech Recognition ROVER - Example ROVER - Text String Alignment Process 48 EECS E6870: Advanced Speech Recognition ROVER - Form Confusion Sets 49 ROVER - Aligning Strings Against a Network EECS E6870: Advanced Speech Recognition 50 Solution: Alter cost function so that there is only a substitution cost if no member of the reference network matches the target symbol. E E C S E 6 8 7 0 : Ad v anced Sp eech R ecognition 51 ROVER - Aligning Networks Against Networks ROVER - Vote n Main Idea: for each confusion set, take word with highest frequency SYS1 4 4 .9 n SYS2 4 5 .1 SYS3 4 8 .7 SYS4 4 8 .9 SYS5 5 0 .2 ROVER 3 9 .7 Im p rov em ent v ery im p ressiv e - as large as any signifi cant algorithm adv ance. EEC S E6 8 7 0 : A dvanced Speech Recog nition No so much a ROVER issue but will be important for confusion networks. Problem: How to score relative probabilities and deletions? Solution: cost subst(s1,s2)= (1 - p1(winner(s2)) + 1 - p2(winner(s1))/2 52 E E C S E 6 8 7 0 : A dv anced S p eech R ecognition ROVER - Example 53 ROVER - As a Function of Number of Systems [2] Error not guaranteed to be reduced. n Alphabetical: take systems in alphabetical order. n C u rv es ordered by error rate. n N ote error actu ally g oes u p slig htly w ith 9 systems EEC S E6 8 7 0 : A dv anced S p eech R ecognition 54 n S ens itiv e to initial ch oice of bas e s y s tem us ed for alignm ent ty p ically tak e th e bes t s y s tem . n E E C S E 6 8 7 0 : Adv anced S peech R ecog nition 55 ROVER - Types of Systems to Combine References ML and MMI n [1] J. Fiscus (1997) “A Post-Processing System to Yield Reduced Error Rates”, IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA V ary ing am o u nt o f ac o u s tic c o nte x t in p ro nu nc iatio n m o de ls (T rip h o ne , Q u inp h o ne ) n n D iffe re nt le x ic o ns n D iffe re nt s ig nal p ro c e s s ing s c h e m e s (MF C C , P LP ) n A ny th ing e ls e y o u c an th ink o f! [2] H. Schwenk and J.L. Gauvain (2000) “Combining Multiple Speech Recognizers using Voting and Language Model Information” ICSLP 2000, Beijing II pp. 915-918 E E C S E 6 8 7 0 : A dv anc e d S p e e c h R e c o g nitio n R o v e r p ro v ide s an e x c e lle nt w ay to ac h ie v e c ro s s -s ite c o llab o ratio n and s y ne rg y in a re lativ e ly p ainle s s fas h io n. 56 EECS E6 870: Advanced Speech Recognition Consensus Decoding[1] - Introduction 57 Consensus Decoding - Motivation P rob lem n Standard SR evaluation procedure is word-based n Standard h y poth esis scoring functions are sentence-based G oa l E x plicitly m inim iz e word error m etric: n c = arg min EP (R|A)[WE (W, R)] = arg min W W W X P (R|A)WE (W, R) R Original work was done off N-best lists n L attic es m u c h m ore c om p ac t and h av e lower orac le error rates E E C S E 6 8 7 0 : A dvanced Speech Recog nition 58 n F or each candidate word, sum th e word posteriors and pick th e word with th e h ig h est posterior probability . n E E C S E 6 8 7 0 : A dv anc ed S p eec h R ec ognition 59 Consensus Decoding - Approach Consensus Decoding Approach - cont’d Find a multiple alignment of all the lattice paths Input L attice: HAVE n Compute the word error between two hypotheses according to the multiple alignment: VERY OFTEN I I W E(W, R) ≈ M W E(W, R) VERY MOVE OFTEN I VEAL HAVE FINE SIL SIL HAVE SIL FAST VERY SIL VERY MOVE HAVE SIL FINE VERY IT n F ind the consensus hypothesis: SIL IT WC = a rg m in M ultiple Alignm ent: W ∈W I HAVE IT VEAL FINE - MOVE - VERY OFTEN X P (R|A) ∗ M W E(W, R) R∈W E E C S E 6 8 7 0 : A dv anced S peech R ecognition FAST 60 E E CS E 6 8 7 0 : A dv anced S peech R ecognition Consensus Decoding Approach - cont’d Consensus Decoding Approach - Multiple Alignment Input: H AVE VERY O F T EN I I M O VE n Equivalence relation over word hypotheses (links) n T otal ordering of the equivalence classes VERY O F T EN I VEAL H AVE F IN E SIL SIL H AVE F IN E VERY IT H AVE Mathematical prob lem formulation: F AST VERY M O VE SIL SIL VERY SIL 61 SIL IT n O utput: I HAVE IT VEAL F IN E - M O VE - VERY O FTEN n D efi ne a partial order on sets of links which is consistent with the precedence order in the lattice C luster sets of links in the partial order to derive a total order EECS E6870: Advanced Speech Recognition 62 FA ST EEC S E6 8 7 0 : A dvanced S peech R ecog nition 63 Consensus Decoding Approach - Clustering Algorithm Obtaining the Consensus Hypothesis Input: Initializ e Clusters: a cluster consists of all the links having the same starting time, ending time and word label H AVE VERY O F T EN I M O VE I VERY O F T EN I VEAL H AVE F IN E SIL SIL H AVE VERY IT SIL F AST VERY SIL Intra-w ord Clustering: merge only clusters which are not in relation and correspond to the same word SIL IT Output: I HAVE (0.45) IT (0.3 9 ) - M O VE (0.55) - (0.6 1 ) VEAL F IN E VERY O FTEN FA ST VERY M O VE H AVE Inter-w ord Clustering: merge heterogeneous clusters which are not in relation E E C S E 6 8 7 0 : A dvanced S peech R ecognition SIL F IN E 64 EECS E6870: Advanced Speech Recognition Confusion Networks 65 Consensus Decoding on DARPA Communicator 24 I HAVE (0.45) IT (0.3 9 ) VEAL F IN E - M O VE (0.55) - VERY O FTEN (0.6 1 ) LA R G E S M A LL LA R G E S M A LL LA R G E S M A LL 22 n Confidence Annotations and Word Spotting n Sy stem Com b ination n E rror Correction W o rd E rro r R a te (% ) FA ST sLM sLM sLM sLM sLM sLM 2 2 2+C 2+C 2+C +M X 2+C +M X 20 18 16 14 40K 7 0K 2 8 0K 40K M L L R E E CS E 6 8 7 0 : Adv anced Speech R ecognition 66 A c o u s tic M o d e l EECS E6870: Advanced Speech Recognition 67 F0 8 .3 8 .5 C C + A vg 1 4 .0 1 3 .6 F0 8 .6 8 .5 Word Error Rate (%) F1 F2 F3 F4 1 5 .8 1 9 .4 1 5 .3 1 6 .0 1 5 .7 1 8 .6 1 4 .6 1 5 .3 F5 2 2 .4 1 8 .8 FX 2 3 .7 2 2 .5 F5 5 .7 5 .7 FX 4 4 .8 4 1 .1 EEC S E6 8 7 0 : A dv an c ed S p eec h Rec og n ition A vg 1 6 .5 1 6 .0 Word Error Rate (%) S y s tem B as elin e C on s en s u s S -V M 1 3 0 .2 2 8 .8 S -V M 2 3 3 .7 3 1 .2 S -V M 3 4 2 .4 4 1 .6 RO V ER 2 9 .2 2 8 .5 C C + Word Error Rate (%) F1 F2 F3 F4 1 8 .6 2 7 .9 2 6 .2 1 0 .7 1 8 .1 2 6 .1 2 5 .8 1 0 .5 Consensus Decoding on Voice Mail Consensus Decoding on Broadcast News 68 EEC S E6 8 7 0 : A dv an c ed S p eec h Rec og n ition System Combination Using Confusion Networks 69 System Combination Using Confusion Networks If we have multiple systems, we can combine the concept of ROVER with confusion networks as follows: U se the same process as ROVER to alig n confusion networks n T ake the overall confusion network and ad d the posterior probabilities for each word . n EEC S E6 8 7 0 : A d vanced S peech Recog nition 70 F or each confusion set, pick the word with the hig hest summed posteriors. n EECS E6870: Advanced Speech Recognition 71 References EECS E6870: Advanced Speech Recognition 72 COURSE FEEDBACK Was this lecture mostly clear or unclear? muddiest topic? n What was the O ther feedb ack (pace, content, atmosphere)? n E E C S E 6 8 7 0 : A dv anced S peech R ecog nition 74 [1] L. Mangu, E. Brill and A. Stolcke (2000) “Finding consensus in speech recognition: word error minimization and other applications of confusion networks”, Computer Speech and Language 14(4), 373-400. Results of Confusion-Network-Based System Combination EECS E6 8 70: Adv anced Speech R ecognition 73