
Discriminative Keyword Spotting
Joseph Keshet, IDIAP
Joint work with: David Grangier, IDIAP; Samy Bengio, Google Inc.
Outline
• Problem Definition
• Keyword Spotting with HMM
– current approaches
– HMM limitations
• Discriminative Keyword Spotting
– derivation
– analysis
– feature functions
– practicalities & algorithms
• Experimental Results
Problem Definition
Goal: find a keyword in a given speech signal.
[Figure: speech signal of the utterance "he's bought it", aligned with its phoneme sequence h iy z bcl b ao t ix tcl.]
Problem Definition
Notation:
• keyword k (e.g., "bought")
• keyword phoneme sequence p̄ (e.g., /bcl b ao t/)
• alignment sequence s̄ = (s₁, s₂, s₃, s₄, e₄): the start time of each phoneme and the end time of the last one
Problem Definition
Input: a speech signal x̄ and a keyword given as a phoneme sequence, p̄ = /b ao t/.
The keyword spotter f(x̄, p̄) predicts:
• a detection decision (yes/no)
• the timing s̄ of the keyword
Fat is Good
The performance of a keyword spotting system is measured by the Receiver Operating Characteristic (ROC) curve:

true positive rate = (detected utterances with the keyword) / (total utterances with the keyword)

false positive rate = (detected utterances without the keyword) / (total utterances without the keyword)

[Figure: ROC curve, true positive rate vs. false positive rate, with the area under the curve A shaded. The "fatter" the curve, the larger A; a perfect spotter attains A = 1.]
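To make the metric concrete, here is a minimal NumPy sketch (our illustration, not from the talk): sweep a detection threshold over keyword-spotter scores to trace the ROC curve, then integrate to get the area A under it.

```python
import numpy as np

def roc_curve(pos_scores, neg_scores):
    """pos_scores: f(x, p) on utterances containing the keyword;
    neg_scores: f(x, p) on utterances that do not contain it."""
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    thresholds = np.sort(np.concatenate([pos, neg]))[::-1]   # high to low
    tpr = np.array([(pos >= t).mean() for t in thresholds])  # true positive rate
    fpr = np.array([(neg >= t).mean() for t in thresholds])  # false positive rate
    return fpr, tpr

def auc(pos_scores, neg_scores):
    fpr, tpr = roc_curve(pos_scores, neg_scores)
    return np.trapz(tpr, fpr)  # area under the ROC curve

# toy usage: well-separated scores give an area close to 1
rng = np.random.default_rng(0)
print(auc(rng.normal(2.0, 1.0, 100), rng.normal(0.0, 1.0, 100)))
```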
HMM-based Keyword Spotting
Whole Word Modeling
[Figure: HMM with a whole-word model for "bought" in parallel with a garbage model; the hidden state sequence q̄ emits the acoustic frames x̄, one frame every 10 ms.]
[Rahim et al., 1997; Rohlicek et al., 1989]
HMM-based Keyword Spotting
Phoneme-Based
[Figure: phoneme-based HMM: a garbage model absorbs the non-keyword phonemes (h iy ... ih t) while the keyword model covers the keyword phonemes (b ao t); phoneme sequence p̄, hidden state sequence q̄, acoustic frames x̄ every 10 ms.]
[Bourlard et al., 1994; Manos & Zue, 1997; Rohlicek et al., 1993]
HMM-based Keyword Spotting
Large Vocabulary Based
• Linguistic constraints on the garbage model
• Does a human listener need to know a large vocabulary in order to recognize one word?
[Cardillo et al., 2002; Rose & Paul, 1990; Szoke et al., 2005; Weintraub, 1995]
HMM Limitations as a Speech Model
• Conditional independence of observations given the state sequence
• Feature extraction imposed by frame-based observations
• Weak duration modeling
• The likelihood is dominated by the output probabilities; the transition probabilities contribute little
[Ostendorf et al., 1996; Young, 1996]
HMM Limitations as a Learning Algorithm
• Convergence of the EM procedure to a local maximum
• Overfitting due to the large number of parameters
• Does not address the goal of maximizing the area under the ROC curve
Discriminative Approach

Learning Paradigm
Discriminative learning from examples:

S = {(p̄₁, x̄₁⁺, x̄₁⁻, s̄₁), …, (p̄ₘ, x̄ₘ⁺, x̄ₘ⁻, s̄ₘ)}

where, for each example j:
• p̄ⱼ is the keyword (phoneme sequence)
• x̄ⱼ⁺ is an utterance in which the keyword is uttered
• x̄ⱼ⁻ is an utterance in which the keyword is not uttered
• s̄ⱼ is the alignment of the keyword within the utterance x̄ⱼ⁺
Learning Paradigm
The learning algorithm receives the training set S and returns a keyword spotter f(x̄, p̄) from the class Fw of all keyword spotting functions of the form

f(x̄, p̄) = max_s̄ w · φ(x̄, p̄, s̄),  w ∈ ℝⁿ

where φ(x̄, p̄, s̄) is a vector of feature functions, described next.
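To make the decoding step concrete, here is a brute-force sketch (our illustration; `phi` is an assumed feature-map callable) of the maximization over timing sequences s̄. A real system computes this max efficiently with dynamic programming; enumeration is only feasible for short utterances.

```python
# A minimal sketch: f(x, p) = max over alignments s of w . phi(x, p, s).
import itertools
import numpy as np

def spot(w, x, p, phi, min_len=1):
    """x: utterance frames (T x d); p: keyword phoneme sequence;
    phi(x, p, s) -> feature vector for alignment s (start times + end time).
    Returns the best score f(x, p) and the best timing s."""
    T, n = len(x), len(p)
    best_score, best_s = -np.inf, None
    # an alignment: n start times plus one end time, strictly increasing
    for bounds in itertools.combinations(range(T + 1), n + 1):
        if any(b - a < min_len for a, b in zip(bounds, bounds[1:])):
            continue  # enforce a minimum phoneme duration
        score = float(w @ phi(x, p, bounds))
        if score > best_score:
            best_score, best_s = score, bounds
    return best_score, best_s
```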
Feature Functions
We define 7 feature functions of the form

φⱼ : (x̄, p̄, s̄) → ℝ

each mapping the sequence of acoustic features x̄, the keyword (phoneme sequence) p̄, and a suggested timing sequence s̄ to a confidence in the keyword and the suggested timing.
Feature Functions I
Cumulative spectral change around the boundaries:

φⱼ(x̄, p̄, s̄) = Σ_{i=2}^{|p̄|−1} d(x_{sᵢ−j}, x_{sᵢ+j}),  j ∈ {1, 2, 3, 4}

where d(·, ·) measures the spectral change between the frame j positions before and the frame j positions after each suggested boundary sᵢ.
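A possible reading of φ₁, …, φ₄ in code; this is a sketch, and the distance d is assumed to be the Euclidean distance between acoustic frames, which the slide leaves unspecified.

```python
import numpy as np

def phi_spectral_change(x, s, j):
    """x: acoustic frames (T x d); s: boundary sequence of the alignment;
    j: frame offset, j in {1, 2, 3, 4}."""
    total = 0.0
    for si in s[1:-1]:                      # interior boundaries i = 2 .. |p|-1
        left = x[max(si - j, 0)]            # frame j positions before boundary
        right = x[min(si + j, len(x) - 1)]  # frame j positions after boundary
        total += np.linalg.norm(left - right)  # assumed distance d
    return total
```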
Feature Functions II
Cumulative confidence in the phoneme sequence:

φ₅(x̄, p̄, s̄) = Σ_{i=1}^{|p̄|} Σ_{t=sᵢ}^{sᵢ₊₁−1} g(xₜ, pᵢ)

We build a static frame-based phoneme classifier g : X × Y → ℝ, where g(xₜ, pᵢ) is the confidence that phoneme pᵢ was uttered at frame xₜ [Dekel, Keshet & Singer, 2004].
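A sketch of φ₅; the frame-based classifier g is assumed to be available as a callable returning a confidence score.

```python
def phi_phoneme_confidence(x, p, s, g):
    """x: frames (T x d); p: phoneme sequence; s: start times plus final end
    time, len(s) == len(p) + 1; g(frame, phoneme) -> confidence in R."""
    total = 0.0
    for i, phoneme in enumerate(p):
        for t in range(s[i], s[i + 1]):  # frames assigned to phoneme i
            total += g(x[t], phoneme)
    return total
```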
Feature Functions III
Phoneme duration model:

φ₆(x̄, p̄, s̄) = Σ_{i=1}^{|p̄|} log N(sᵢ₊₁ − sᵢ; μ̂_{pᵢ}, σ̂_{pᵢ})

where μ̂_{pᵢ} and σ̂_{pᵢ} are the average and the standard deviation of the length of phoneme pᵢ (its duration statistics).
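A sketch of φ₆; the per-phoneme statistics mu_hat and sigma_hat are assumed to have been estimated from training data beforehand.

```python
import math

def phi_duration(p, s, mu_hat, sigma_hat):
    """p: phoneme sequence; s: start times plus final end time;
    mu_hat[q], sigma_hat[q]: duration statistics of phoneme q (in frames)."""
    total = 0.0
    for i, q in enumerate(p):
        dur = s[i + 1] - s[i]
        mu, sigma = mu_hat[q], sigma_hat[q]
        # log N(dur; mu, sigma)
        total += -0.5 * math.log(2 * math.pi * sigma**2) \
                 - (dur - mu) ** 2 / (2 * sigma**2)
    return total
```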
Feature Functions IV
Speaking-rate modeling ("dynamics"): penalize abrupt changes of the speaking rate between adjacent phonemes,

φ₇(x̄, p̄, s̄) = − Σ_{i=2}^{|p̄|−1} ( (sᵢ₊₁ − sᵢ)/μ̂_{pᵢ} − (sᵢ − sᵢ₋₁)/μ̂_{pᵢ₋₁} )²

[Figure: spectrograms at different rates of articulation (after Pickett, 1980).]
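A matching sketch of φ₇, reusing the same assumed per-phoneme duration statistics.

```python
def phi_speaking_rate(p, s, mu_hat):
    """p: phoneme sequence; s: start times plus final end time;
    mu_hat[q]: average duration of phoneme q (in frames)."""
    total = 0.0
    for i in range(1, len(p) - 1):  # i = 2 .. |p|-1 in the slide's notation
        rate_cur = (s[i + 1] - s[i]) / mu_hat[p[i]]
        rate_prev = (s[i] - s[i - 1]) / mu_hat[p[i - 1]]
        total -= (rate_cur - rate_prev) ** 2  # penalize rate changes
    return total
```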
Large-Margin Model
[Figure: the feature vectors in ℝⁿ: φ(x̄⁺, p̄, s̄), the positive utterance with the correct timing; φ(x̄⁻, p̄, s̄′), the negative utterance with its best timing; and φ(x̄⁻, p̄, s̄″), the negative utterance with other timings. The weight vector w separates the positive point from all the negative points with a large margin.]

Large-Margin and Noise
[Figure: the same configuration with perturbed feature vectors; the large margin keeps the positive point separated from the negative points under noise.]
Large-Margin Derivation
The margin between the positive point and a negative point is the distance

d = (w · φ(x̄⁺, p̄, s̄) − w · φ(x̄⁻, p̄, s̄′)) / ‖w‖

Fixing the scale of the numerator by requiring

w · φ(x̄⁺, p̄, s̄) − w · φ(x̄⁻, p̄, s̄′) ≥ 1  ∀s̄′

maximizing the margin amounts to minimizing ‖w‖.
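A tiny numeric illustration of the margin (all vectors made up): with the numerator constrained to be at least 1, the binding margin is 1/‖w‖, which is why minimizing ‖w‖ maximizes the margin.

```python
import numpy as np

w = np.array([2.0, 1.0])
phi_pos = np.array([1.0, 1.0])       # phi(x+, p, s), correct timing
phi_negs = [np.array([0.5, 0.2]),    # phi(x-, p, s') for two candidate timings
            np.array([0.1, 0.9])]

# margin d for each negative timing; the smallest is the binding constraint
margins = [(w @ phi_pos - w @ phi_neg) / np.linalg.norm(w)
           for phi_neg in phi_negs]
print(min(margins))
```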
Learning Paradigm
Find the weight vector that maximizes the margin:

max_w d
s.t. w · φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − w · φ(x̄ⱼ⁻, p̄ⱼ, s̄′) ≥ 1  ∀j, ∀s̄′

or, equivalently,

min_w ½‖w‖²
s.t. w · φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − w · φ(x̄ⱼ⁻, p̄ⱼ, s̄′) ≥ 1  ∀j, ∀s̄′

Note the exponential number of constraints: one for every possible timing s̄′ of every negative utterance.
Iterative Algorithm
Given a training set S = {(p̄ⱼ, x̄ⱼ⁺, x̄ⱼ⁻, s̄ⱼ)}, find

w = argmin_w ½‖w‖²  s.t.  w · φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − w · φ(x̄ⱼ⁻, p̄ⱼ, s̄′) ≥ 1  ∀j, ∀s̄′

again subject to an exponential number of constraints.
Iterative Algorithm
Denote the current hypothesis by wⱼ₋₁ and process one example (p̄ⱼ, x̄ⱼ⁺, x̄ⱼ⁻, s̄ⱼ) at a time:

wⱼ = argmin_w ½‖w − wⱼ₋₁‖²  s.t.  w · φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − w · φ(x̄ⱼ⁻, p̄ⱼ, s̄′) ≥ 1  ∀s̄′

This still involves an exponential number of constraints.
Iterative Algorithm
Approximation: replace the exponentially many constraints with a single one, the most violated constraint. Define

s̄′ = argmax_s̄ wⱼ₋₁ · φ(x̄ⱼ⁻, p̄ⱼ, s̄)

and solve

wⱼ = argmin_w ½‖w − wⱼ₋₁‖²  s.t.  w · φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − w · φ(x̄ⱼ⁻, p̄ⱼ, s̄′) ≥ 1

This projection has the closed-form solution

wⱼ = wⱼ₋₁ + ((1 − wⱼ₋₁ · Δφ) / ‖Δφ‖²) Δφ,  where Δφ = φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − φ(x̄ⱼ⁻, p̄ⱼ, s̄′)
Iterative Algorithm
Input: training set S = {(p̄ⱼ, x̄ⱼ⁺, x̄ⱼ⁻, s̄ⱼ)}
Initialize: w₀ = 0
For each example (p̄ⱼ, x̄ⱼ⁺, x̄ⱼ⁻, s̄ⱼ):
  Predict: s̄′ = argmax_s̄ wⱼ₋₁ · φ(x̄ⱼ⁻, p̄ⱼ, s̄)
  Set: Δφ = φ(x̄ⱼ⁺, p̄ⱼ, s̄ⱼ) − φ(x̄ⱼ⁻, p̄ⱼ, s̄′)
  If wⱼ₋₁ · Δφ ≤ 1:
    Update: wⱼ = wⱼ₋₁ + ((1 − wⱼ₋₁ · Δφ) / ‖Δφ‖²) Δφ
Output: choose the wⱼ that attains the lowest cost on a validation set.
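A minimal sketch of the whole loop (our illustration, not the authors' code); `phi` and `best_alignment` are assumed helpers, where `best_alignment(w, x, p)` returns argmax_s̄ w · φ(x, p, s̄), e.g., a wrapper around the search sketched earlier.

```python
import numpy as np

def train(examples, phi, best_alignment, n_features):
    """examples: iterable of (p, x_pos, x_neg, s) tuples, as in the set S."""
    w = np.zeros(n_features)
    snapshots = []
    for p, x_pos, x_neg, s in examples:
        s_hat = best_alignment(w, x_neg, p)      # most violated timing
        dphi = phi(x_pos, p, s) - phi(x_neg, p, s_hat)
        if w @ dphi <= 1.0 and dphi @ dphi > 0:  # margin constraint violated
            w = w + ((1.0 - w @ dphi) / (dphi @ dphi)) * dphi
        snapshots.append(w.copy())
    # as in the Output step above, the caller should keep the snapshot with
    # the lowest cost on a validation set
    return snapshots
```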
Formal Properties
• The optimization is a convex problem with a single minimum.
• Theorem I (worst-case analysis): the area under the ROC curve during the training phase is high,

  1 − Ã ≤ (1/m) ‖w⋆‖² + (2C/m) Σ_{i=1}^{m} ℓ(w⋆)

• Theorem II (unseen test data): with high probability, the expected area under the ROC curve on unseen examples is also high,

  1 − A ≤ (1/m) Σ_{i=1}^{m} ℓ(w⋆) + O( ‖w⋆‖² ln(m/δ)/m + 1/√m_val )
Formal Properties
• The proofs of both theorems are based on the Wilcoxon-Mann-Whitney statistic, which defines the AUC as

  A = (1/|K|) Σ_{k∈K} Aₖ = (1/|K|) Σ_{k∈K} (1/(|Xₖ⁺||Xₖ⁻|)) Σ_{x̄⁺∈Xₖ⁺} Σ_{x̄⁻∈Xₖ⁻} 𝟙{f(x̄⁺, p̄ₖ) > f(x̄⁻, p̄ₖ)}

• Our algorithm maximizes the summand, and hence maximizes the AUC.
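A sketch of this Wilcoxon-Mann-Whitney computation, with scores grouped per keyword (illustrative names and toy numbers).

```python
import numpy as np

def wmw_auc(scores_by_keyword):
    """scores_by_keyword: {keyword: (pos_scores, neg_scores)} where pos_scores
    are f(x, p_k) on utterances containing keyword k, neg_scores otherwise."""
    per_keyword = []
    for pos, neg in scores_by_keyword.values():
        pos, neg = np.asarray(pos), np.asarray(neg)
        pairs = pos[:, None] > neg[None, :]  # indicator over all (pos, neg) pairs
        per_keyword.append(pairs.mean())     # A_k for this keyword
    return float(np.mean(per_keyword))       # average over keywords

# toy usage
print(wmw_auc({"bought": ([2.1, 1.7, 0.9], [0.3, 1.0]),
               "water":  ([1.5, 2.2],      [0.1, 0.8, 1.9])}))
```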
Experimental Results

Training Setup
• TIMIT corpus
• Phoneme representation:
  – 39 phonemes (Lee & Hon, 1989)
• Acoustic representation:
  – MFCC+∆+∆∆ (ETSI standard)
• TIMIT training set:
  – 500 utterances for training the feature functions
  – 3116 utterances for the training set
  – 80 utterances for validation (40 keywords)
Results on TIMIT
[Figure: ROC curves, true positive rate vs. false positive rate, for the discriminative spotter and the HMM baseline.]
Area under the ROC curve: 0.99 discriminative, 0.94 HMM.
Results on WSJ
[Figure: ROC curves, true positive rate vs. false positive rate, for the discriminative spotter and the HMM baseline.]
Area under the ROC curve: 0.94 discriminative, 0.87 HMM.
Other Datasets
Area under the ROC curve:

Dataset        Discriminative Algo.   HMM
TIMIT          0.996                  0.941
HTIMIT         0.949                  0.922
WSJ            0.942                  0.870
OGI Stories    0.769                  0.722

Keshet, Grangier and Bengio, "Discriminative Keyword Spotting", Speech Communication (in submission)
Practicalities & Algorithms
• The quadratic program
  – an algorithm for solving the quadratic program with an exponential number of constraints [Keshet, Grangier and Bengio, 2006]
• Training the feature function classifiers
  – hierarchical phoneme classifier [Dekel, Keshet and Singer, 2004]
• The non-separable case
  – a common technique used in training soft-margin SVMs [Cristianini & Shawe-Taylor, 2000; Vapnik, 1998]
Thanks!