Discriminative Keyword Spotting
Joseph Keshet, IDIAP
Joint work with: David Grangier, IDIAP; Samy Bengio, Google Inc.

Outline
• Problem Definition
• Keyword Spotting with HMMs
– current approaches
– HMM limitations
• Discriminative Keyword Spotting
– derivation
– analysis
– feature functions
– practicalities & algorithms
• Experimental Results

Problem Definition
Goal: find a keyword in a given speech signal.
[Figure: the utterance "he's bought it" with its phoneme transcription: he's = /h iy z/, bought = /bcl b ao t/, it = /ix tcl/.]

Notation:
• keyword k, e.g., "bought"
• keyword phoneme sequence p̄, e.g., /bcl b ao t/
• alignment sequence s̄ = (s1, s2, s3, s4, e4): the start time of each phoneme, plus the end time of the last one

A keyword spotter f(x̄, p̄) receives a speech signal x̄ and a keyword given as a phoneme sequence (e.g., p̄ = /b ao t/), and outputs a detection decision (yes/no) together with the predicted timing s̄.

Fat is Good
The performance of a keyword spotting system is measured by the Receiver Operating Characteristic (ROC) curve, which plots
true positive rate = (detected utterances with the keyword) / (total utterances with the keyword)
against
false positive rate = (detected utterances without the keyword) / (total utterances without the keyword).
[Figure: ROC curves and the area A under them; the "fatter" the curve, the larger A, and a perfect spotter attains A = 1.]

HMM-based Keyword Spotting

Whole-Word Modeling
The keyword ("bought") is modeled by a whole-word HMM and everything else by a garbage model; the hidden state sequence q̄ is aligned with the acoustic frames x̄, one frame every 10 ms.
[Rahim et al., 1997; Rohlicek et al., 1989]

Phoneme-Based
The keyword model is a concatenation of phoneme HMMs (/b ao t/ for "bought"), flanked by garbage models that are themselves built from phoneme models.
[Bourlard et al., 1994; Manos & Zue, 1997; Rohlicek et al., 1993]

Large-Vocabulary-Based
• Linguistic constraints on the garbage model
• Does a human listener need to know a large vocabulary in order to recognize one word?
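To make the ROC computation concrete, here is a minimal sketch (not from the talk; `scores_pos` and `scores_neg` are illustrative per-utterance detection scores) that sweeps a detection threshold and measures the area under the resulting curve:

```python
def roc_curve(scores_pos, scores_neg):
    # Sweep a detection threshold from high to low; at each threshold,
    # count the fraction of keyword / non-keyword utterances detected.
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for th in thresholds:
        tpr = sum(s >= th for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= th for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points

def auc(points):
    # Trapezoidal area under the (fpr, tpr) curve: "fatter" means larger A.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = roc_curve([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1])
print(round(auc(points), 4))  # 0.9375 on this toy data
```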
[Cardillo et al., 2002; Rose & Paul, 1990; Szoke et al., 2005; Weintraub, 1995]

HMM Limitations as a Speech Model
• Conditional independence of observations given the state sequence
• Feature extraction imposed by frame-based observations
• Weak duration modeling
• The likelihood is dominated by the output probabilities; the transition probabilities contribute little
[Ostendorf et al., 1996; Young, 1996]

HMM Limitations as a Learning Algorithm
• The EM procedure converges only to a local maximum
• Overfitting effects due to the large number of parameters
• Does not address the goal of maximizing the area under the ROC curve

Discriminative Approach

Learning Paradigm
Discriminative learning from examples. The training set is
S = {(p̄1, x̄1⁺, x̄1⁻, s̄1), ..., (p̄m, x̄m⁺, x̄m⁻, s̄m)}
where, for each example j:
• p̄j is the keyword, given as a phoneme sequence;
• x̄j⁺ is an utterance in which the keyword is uttered;
• x̄j⁻ is an utterance in which the keyword is not uttered;
• s̄j is the alignment of the keyword within x̄j⁺.
From S we learn a keyword spotter from the class of functions
f(x̄, p̄) = max_s̄ w · φ(x̄, p̄, s̄),  w ∈ Rⁿ.

Feature Functions
We define 7 feature functions of the form
φj : (x̄, p̄, s̄) → R
each measuring the confidence that the keyword phoneme sequence p̄ appears in the acoustic sequence x̄ at the suggested timing s̄.

Feature Functions I: cumulative spectral change around the boundaries
φj(x̄, p̄, s̄) = Σ_{i=2}^{|p̄|−1} d(x_{s_i − j}, x_{s_i + j}),  j ∈ {1, 2, 3, 4}
where d is a distance between the acoustic features j frames before and j frames after each boundary s_i.

Feature Functions II: cumulative confidence in the phoneme sequence
φ5(x̄, p̄, s̄) = Σ_{i=1}^{|p̄|} Σ_{t=s_i}^{s_{i+1}−1} g(x_t, p_i)
We build a static frame-based phoneme classifier g : X × Y → R, where g(x_t, p_i) is the confidence that phoneme p_i was uttered at frame x_t [Dekel, Keshet & Singer, 2004].

Feature Functions III: phoneme duration model
φ6(x̄, p̄, s̄) = Σ_{i=1}^{|p̄|} log N(s_{i+1} − s_i; μ̂_{p_i}, σ̂_{p_i})
where μ̂_{p_i} is the average duration of phoneme p_i and σ̂_{p_i} is the standard deviation of that duration (statistics of phoneme p_i, estimated on training data).
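These three feature families translate directly into code. The following sketch is illustrative only: `x` as a (T, d) array of acoustic frames, `s` as the list of phoneme start times plus the final end time, `g` as the frame-based phoneme classifier, and `mu`/`sigma` as per-phoneme duration statistics are all assumed names, not the talk's implementation.

```python
import numpy as np

def phi_spectral(x, s, j):
    # Feature I: cumulative spectral change d(x_{s_i - j}, x_{s_i + j})
    # around each internal boundary, with d taken here as Euclidean distance.
    return sum(np.linalg.norm(x[si - j] - x[si + j]) for si in s[1:-1])

def phi_confidence(x, p, s, g):
    # Feature II: sum of g(x_t, p_i) over the frames aligned to phoneme p_i.
    return sum(g(x[t], p[i])
               for i in range(len(p))
               for t in range(s[i], s[i + 1]))

def phi_duration(p, s, mu, sigma):
    # Feature III: log-Normal score of each phoneme duration s_{i+1} - s_i,
    # using the per-phoneme statistics mu[ph] and sigma[ph].
    total = 0.0
    for i, ph in enumerate(p):
        dur = s[i + 1] - s[i]
        total += (-0.5 * ((dur - mu[ph]) / sigma[ph]) ** 2
                  - np.log(sigma[ph] * np.sqrt(2.0 * np.pi)))
    return total
```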
Feature Functions IV: speaking-rate modeling ("dynamics")
φ7(x̄, p̄, s̄) = −Σ_{i=2}^{|p̄|−1} ((s_{i+1} − s_i)/μ̂_{p_i} − (s_i − s_{i−1})/μ̂_{p_{i−1}})²
penalizing abrupt changes in the rate of articulation between adjacent phonemes.
[Figure: spectrogram at different rates of articulation (after Pickett, 1980).]

Large-Margin Model
Project each utterance-timing pair into the feature space:
• φ(x̄⁺, p̄, s̄): the positive utterance with the correct timing;
• φ(x̄⁻, p̄, s̄′): the negative utterance with its best timing;
• φ(x̄⁻, p̄, s̄″): the negative utterance with any other timing.
We seek a weight vector w that separates the positive point from all negative points with a large margin.
[Figure: the same picture with noise; some points may violate the margin.]

Large-Margin Derivation
The margin is
d = (w · φ(x̄⁺, p̄, s̄) − w · φ(x̄⁻, p̄, s̄′)) / ‖w‖
and, after fixing the scale of w, maximizing it amounts to requiring
w · φ(x̄⁺, p̄, s̄) − w · φ(x̄⁻, p̄, s̄′) ≥ 1  ∀s̄′.
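The maximization over timings inside this constraint is the same search the spotter performs at test time, f(x̄, p̄) = max_s̄ w · φ(x̄, p̄, s̄). The toy sketch below makes that definition concrete by brute-force enumeration of monotone timings; the names are assumptions (`phi` is any feature map returning a NumPy vector), and a real system would find this maximum with dynamic programming instead.

```python
import itertools
import numpy as np

def spot(w, x, p, phi):
    # Brute-force f(x, p) = max over monotone timings s_1 < ... < s_{|p|+1};
    # exponential in |p|, for illustration on short utterances only.
    best_score, best_s = -np.inf, None
    for s in itertools.combinations(range(len(x) + 1), len(p) + 1):
        score = float(np.dot(w, phi(x, p, s)))
        if score > best_score:
            best_score, best_s = score, s
    return best_score, best_s  # detection score and predicted timing
```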
Learning Paradigm
Maximizing the margin over all training examples gives the problem
max_w d
s.t. w · φ(x̄j⁺, p̄j, s̄j) − w · φ(x̄j⁻, p̄j, s̄′) ≥ 1  ∀j, ∀s̄′
which is equivalent to the quadratic program
min_w ½‖w‖²
s.t. w · φ(x̄j⁺, p̄j, s̄j) − w · φ(x̄j⁻, p̄j, s̄′) ≥ 1  ∀j, ∀s̄′.
Note the exponential number of constraints: one per example per possible timing s̄′.

Iterative Algorithm
Instead of solving the quadratic program at once, process one example (p̄j, x̄j⁺, x̄j⁻, s̄j) at a time. Denote the current weight vector by w_{j−1} and solve
w_j = argmin_w ½‖w − w_{j−1}‖²
s.t. w · φ(x̄j⁺, p̄j, s̄j) − w · φ(x̄j⁻, p̄j, s̄′) ≥ 1  ∀s̄′.
This still has an exponential number of constraints, so we approximate: replace them with the single most violated constraint. Define
s̄′ = argmax_s̄ w_{j−1} · φ(x̄j⁻, p̄j, s̄)
and solve
w_j = argmin_w ½‖w − w_{j−1}‖²  s.t.  w · φ(x̄j⁺, p̄j, s̄j) − w · φ(x̄j⁻, p̄j, s̄′) ≥ 1,
whose solution has the closed form
w_j = w_{j−1} + ((1 − w_{j−1} · Δφ)/‖Δφ‖²) Δφ,  where Δφ = φ(x̄j⁺, p̄j, s̄j) − φ(x̄j⁻, p̄j, s̄′).

Iterative Algorithm (summary)
Input: training set S = {(p̄j, x̄j⁺, x̄j⁻, s̄j)}
Initialize: w0 = 0
For each example (p̄j, x̄j⁺, x̄j⁻, s̄j):
– Predict: s̄′ = argmax_s̄ w_{j−1} · φ(x̄j⁻, p̄j, s̄)
– Set: Δφ = φ(x̄j⁺, p̄j, s̄j) − φ(x̄j⁻, p̄j, s̄′)
– If w_{j−1} · Δφ ≤ 1, update: w_j = w_{j−1} + ((1 − w_{j−1} · Δφ)/‖Δφ‖²) Δφ
Output: the w_j that attains the lowest cost on a validation set.

Formal Properties
• The optimization problem is convex: a single minimum.
• Theorem I (worst-case analysis): the area under the ROC curve during the training phase is high. For any w⋆,
1 − Ã ≤ (1/m)‖w⋆‖² + (2C/m) Σ_{i=1}^{m} ℓ_i(w⋆)
where Ã is the average AUC over the training sequence and ℓ_i is the loss on example i.
• Theorem II (unseen test data): with probability at least 1 − δ, the complement of the expected AUC on unseen examples is small:
1 − A ≤ (1/m) Σ_{i=1}^{m} ℓ_i(w_i) + O(√(ln(m/δ)/m) + √(ln(m/δ)/m_val))
where m_val is the size of the validation set used to select the output weight vector.
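The theorems above analyze the weight vectors produced by the iterative procedure. Before the proof idea, here is a minimal sketch of one update step, under the same assumed `phi` as earlier and an illustrative `alignments` enumerator of candidate timings; it is a sketch of the slide's closed-form update, not the authors' code.

```python
import numpy as np

def pa_step(w, x_pos, x_neg, p, s_pos, phi, alignments):
    # Most violated constraint: the best-scoring timing on the negative
    # utterance under the current weights w.
    s_neg = max(alignments, key=lambda s: float(np.dot(w, phi(x_neg, p, s))))
    delta = phi(x_pos, p, s_pos) - phi(x_neg, p, s_neg)
    denom = float(np.dot(delta, delta))
    loss = 1.0 - float(np.dot(w, delta))
    if loss > 0.0 and denom > 0.0:  # margin constraint w . delta >= 1 violated
        w = w + (loss / denom) * delta  # closed-form update from the slides
    return w
```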
Formal Properties (proof idea)
• The proofs of both theorems are based on the Wilcoxon-Mann-Whitney statistic, which defines the AUC as
A = (1/|K|) Σ_{k∈K} A_k = (1/|K|) Σ_{k∈K} (1/(|X_k⁺||X_k⁻|)) Σ_{x̄⁺∈X_k⁺} Σ_{x̄⁻∈X_k⁻} 1_{f(x̄⁺, p̄_k) > f(x̄⁻, p̄_k)}
where K is the set of keywords and X_k⁺ (X_k⁻) is the set of utterances that do (do not) contain keyword k.
• Our algorithm maximizes the summand, hence maximizes the AUC.

Experimental Results

Training Setup
• TIMIT corpus
• Phoneme representation: 39 phonemes (Lee & Hon, 1989)
• Acoustic representation: MFCC + Δ + ΔΔ (ETSI standard)
• TIMIT training set:
– 500 utterances for training the feature functions
– 3116 utterances for training the keyword spotter
– 80 utterances for validation (40 keywords)

Results on TIMIT
[Figure: ROC curves, discriminative vs. HMM.]
Area under the ROC curve: 0.99 (discriminative), 0.94 (HMM).

Results on WSJ
[Figure: ROC curves, discriminative vs. HMM.]
Area under the ROC curve: 0.94 (discriminative), 0.87 (HMM).

Other Datasets (area under the ROC curve)
Corpus        Discriminative Algo.   HMM
TIMIT         0.996                  0.941
HTIMIT        0.949                  0.922
WSJ           0.942                  0.870
OGI Stories   0.769                  0.722
Keshet, Grangier and Bengio, "Discriminative Keyword Spotting", Speech Communication (in submission).

Practicalities & Algorithms
• The quadratic programming
– an algorithm for solving the quadratic program with an exponential number of constraints [Keshet, Grangier and Bengio, 2006]
• Training the feature-function classifiers
– hierarchical phoneme classifier [Dekel, Keshet and Singer, 2004]
• The non-separable case
– the common technique used in training soft-margin SVMs [Cristianini & Shawe-Taylor, 2000; Vapnik, 1998]

Thanks!
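Backup: a minimal sketch (illustrative inputs, not the talk's code) of the Wilcoxon-Mann-Whitney computation used in the analysis, averaging pairwise ranking accuracy over keywords:

```python
def wmw_auc(pos_scores_by_kw, neg_scores_by_kw):
    # A_k = fraction of (positive, negative) utterance pairs ranked
    # correctly for keyword k; A = mean of A_k over the keyword set K.
    aucs = []
    for kw, pos in pos_scores_by_kw.items():
        neg = neg_scores_by_kw[kw]
        correct = sum(fp > fn for fp in pos for fn in neg)
        aucs.append(correct / (len(pos) * len(neg)))
    return sum(aucs) / len(aucs)

print(wmw_auc({"bought": [0.9, 0.7]}, {"bought": [0.6, 0.8]}))  # 0.75
```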