
An Analysis-by-Synthesis Approach to Vocal
Tract Modeling for Robust Speech Recognition
Ziad Al Bawab
([email protected])
Electrical and Computer Engineering
Carnegie Mellon University
Work in collaboration with:
Bhiksha Raj
Lorenzo Turicchia (MIT)
and Richard M. Stern
IBM Research
October 9, 2009
Talk Outline
I. Introduction
II. Deriving vocal tract shapes from EMA data using a physical model
III. Analysis-by-synthesis framework
IV. Dynamic articulatory model
V. Conclusion
Conventional Generative Model
SPEECH: /S/-/P/-/IY/-/CH/

[Figure: an HMM-based generative model. Each phone (/S/, /P/, /IY/, /CH/) is modeled by hidden states S1, S2, ..., Sn, and each state emits a 13-dimensional acoustic feature vector (F1 ... F13); recognition selects the maximum-likelihood phone sequence. Below: waveform (amplitude vs. time) and spectrogram (0-8000 Hz vs. time) of the word "speech" (image: Wikipedia).]
The Ultimate Generative Model
SPEECH: /S/-/P/-/IY/-/CH/

Speech is actually generated by the vocal tract!

[Figure: an articulatory generative model. Each phone maps to articulatory targets (e.g., lips separation, tongue tip position) realized by articulatory state sequences (S11 ... S1n, S21 ... S2n); a physical model of sound generation maps each articulatory configuration to a 13-dimensional acoustic feature vector (F1 ... F13). Below: waveform and spectrogram of the word "speech".]
The Missing Science
• We need a framework that explicitly models the articulatory space (configurations and dynamics) and can help alleviate problems such as coarticulation, articulatory target undershoot, asynchrony among articulators, and pronunciation variation
• Current approaches to articulatory modeling (Livescu, Deng, Erler, and others) attempt to learn and apply constraints inferred from surface-level acoustic observations or from linguistic sources
• We need to learn from real articulatory data
• We need a mapping from the articulatory space to the acoustic domain based on the physical generative process: such a mapping is more natural (i.e., accurate) and can generalize better than one learned statistically (i.e., from parallel articulatory and acoustic data)
MOCHA Database
[Figures: the MOCHA recording apparatus and the raw articulatory measurements.]
MOCHA EMA Data
[Figure: EMA sensor positions in the midsagittal plane (x and y in cm): velum (VL), upper lip (UL), lower lip (LL), upper incisor (UI), lower incisor (LI), tongue tip (TT), tongue body (TB), and tongue dorsum (TD).]
Maeda Parameters
[Figure: Maeda's geometric vocal tract model, showing the upper palate, lips, and glottis.]

The 7 Maeda parameters (P1 ... P7) drive Maeda's model, which outputs area functions (acoustic tubes): areas A1 ... A36 and lengths L1 ... L36 for 36 tube sections.
Articulatory Speech Synthesis
Each tube section's area and length from the area functions (acoustic tubes: A1 ... A36, L1 ... L36) is mapped to a per-section transfer function, and the Sondhi and Schroeter model combines these into the overall vocal tract (VT) transfer function.
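The Sondhi and Schroeter model includes effects such as yielding walls, losses, and radiation; purely as a rough illustration of how a cascade of tube sections yields a transfer function, here is a minimal lossless chain-matrix sketch in Python (not the authors' implementation; the constants, the small damping term, and the open-lip termination are simplifying assumptions):

```python
import numpy as np

RHO = 1.14e-3  # air density (g/cm^3); assumed constant
C = 3.5e4      # speed of sound (cm/s); assumed constant

def tract_transfer_function(areas, lengths, freqs):
    """Volume-velocity transfer function U_lips/U_glottis for a cascade of
    uniform tube sections (glottis first, lips last), lossless except for a
    small ad hoc damping term that keeps the resonance peaks finite."""
    H = np.zeros(len(freqs), dtype=complex)
    for i, f in enumerate(freqs):
        k = (2.0 * np.pi * f / C) * (1.0 - 0.005j)  # wavenumber + tiny loss
        K = np.eye(2, dtype=complex)  # chain (ABCD) matrix of the whole tract
        for a, l in zip(areas, lengths):
            Z = RHO * C / a  # characteristic impedance of this section
            K = K @ np.array([[np.cos(k * l), 1j * Z * np.sin(k * l)],
                              [1j * np.sin(k * l) / Z, np.cos(k * l)]])
        # Ideal open termination at the lips (P_lips = 0):
        # [P_glottis; U_glottis] = K @ [0; U_lips] => U_lips/U_glottis = 1/K[1,1]
        H[i] = 1.0 / K[1, 1]
    return H

# A uniform 17.5 cm tract (35 sections of 0.5 cm) shows formant-like peaks
# near 500, 1500, 2500, ... Hz, as expected for a quarter-wave resonator.
H = tract_transfer_function(np.full(35, 5.0), np.full(35, 0.5),
                            np.arange(50.0, 5000.0, 25.0))
```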
Deriving Realistic Vocal Tract Shapes from ElectroMagnetic
Articulograph Data via Geometric Adaptation and Profile Fitting
• Problem overview:
  – Speech synthesis solely from EMA data, using:
    • Knowledge of the geometry of the vocal tract
    • Knowledge of the physics of the speech generation process
• Approach followed:
  – Compute realistic vocal tract shapes from EMA data:
    1. Adapt Maeda's geometric vocal tract model to the EMA data
    2. Search for the best fit of the tongue and lip profile contours to the EMA data
  – Synthesize speech from the vocal tract shapes:
    3. Perform articulatory synthesis using the Sondhi and Schroeter model
1. Vocal Tract Adaptation
Adaptation parameters:
• Origin
• Upper wall shift
• θ
• d
• Lips separation

[Figure: Maeda's grid overlaid on the midsagittal plane (x and y in cm), showing the lips, upper wall, upper incisor, tongue, inner wall, and larynx edges.]
Adaptation Result [1]
[Figure: adaptation result. The Maeda upper wall is shifted by d toward the upper wall estimated from EMA, and the EMA sensor positions (UL, UI, VL, TT, TB, TD, LI, LL) are overlaid on the adapted model together with the inner wall and larynx (x and y in cm).]
[1] Z. Al Bawab, L. Turicchia, R. M. Stern, and B. Raj, “Deriving Vocal Tract Shapes From ElectroMagnetic
Articulograph Data Via Geometric Adaptation and Matching,” in Interspeech, Brighton, UK, September
2009.
2. Search Results
[Figures: search results. EMA points are shown in purple for phone 'II' as in "seesaw" (/S-II-S-OO/) and for phone '@@' as in "working" (/W-@@-K-I-NG/).]
3. Synthesis Results
[Figures: the resulting acoustic-tube models for phone 'II' as in "seesaw" (/S-II-S-OO/) and for phone '@@' as in "working" (/W-@@-K-I-NG/).]
Creating a Realistic Codebook and
Adapted Articulatory Transfer Functions
Each codeword consists of the 7 Maeda parameters plus the velum area: (p1, p2, p3, p4, p5, p6, p7, VA).
Projecting the 44 Phones' Codeword Means Using Multi-Dimensional Scaling (MDS)
[Figure: two-dimensional MDS projection (x, y) of the mean codeword of each of the 44 phones; phones with similar articulatory configurations (e.g., P/B, F/V, S/Z) project close together.]
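A sketch of how such a projection could be computed with scikit-learn (the `phone_means` array is a hypothetical stand-in for the 44 per-phone mean codewords):

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical stand-in for the 44 phones' mean Maeda-parameter codewords
phone_means = np.random.rand(44, 7)

# Embed the 7-D means into 2-D while preserving pairwise Euclidean distances
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
xy = mds.fit_transform(phone_means)  # (44, 2) coordinates for plotting labels
```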
Deriving Analysis-by-Synthesis Features [2]
Compare signals generated from a codebook of valid vocal tract configurations to the incoming signal to produce a "distortion" feature vector.

[Diagram: the incoming speech frame is analyzed for energy and pitch, which drive synthesis from each codeword in the articulatory space (Maeda parameters P1 ... P7). The MFCCs of each synthesized signal are compared to the MFCCs of the incoming speech via the Mel-cepstral distortion, yielding the distortion feature vector (d1, ..., dN) over the N articulatory configurations.]
[2] Z. Al Bawab, B. Raj, and R. M. Stern, "Analysis-by-synthesis features for speech recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2008, Las Vegas, Nevada.
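Given MFCCs precomputed from every codeword's synthesized signal, the distortion vector for one frame could be assembled as sketched below, using the Mel-cepstral distortion defined later in the experimental setup (array names are illustrative, not from the paper):

```python
import numpy as np

MCD_SCALE = (10.0 / np.log(10.0)) * np.sqrt(2.0)

def distortion_vector(frame_mfcc, codebook_mfcc):
    """Mel-cepstral distortion between one incoming frame and each codeword.

    frame_mfcc    : (13,) MFCCs of the incoming frame
    codebook_mfcc : (N, 13) MFCCs synthesized from the N codewords
    Returns (d_1, ..., d_N) using cepstral coefficients 1..12.
    """
    diff = codebook_mfcc[:, 1:13] - frame_mfcc[1:13]
    return MCD_SCALE * np.sqrt((diff ** 2).sum(axis=1))
```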
Mixture Probability Density Function
• For a given frame, the output probability of each state in the HMM is a mixture density over a set of M codewords, combining the weight of each codeword with the likelihood of the input given the codeword and state
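The equation itself appeared as a graphic on the original slide; a reconstruction consistent with the two labels (codeword weights and per-codeword likelihoods) is:

$$ b_s(x_u) \;=\; \sum_{j=1}^{M} \underbrace{w_{sj}}_{\substack{\text{weight of each}\\ \text{codeword}}} \; \underbrace{P(x_u \mid cd_j, s)}_{\substack{\text{likelihood of input given}\\ \text{the codeword and state}}} $$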
HMM Framework
Priors From EMA
Each frame of EMA measurements (trajectories of TT, TB, TD, etc. over time) is mapped to a codeword, yielding a codeword sequence (e.g., cd1, cd2, cd1, cd3, cd2) from which prior codeword weights are derived.
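One plausible reading of this slide, sketched in code: each EMA-derived parameter frame is assigned to its nearest codeword, and the normalized counts become prior weights (the nearest-neighbor rule and names here are assumptions, not the paper's exact procedure):

```python
import numpy as np

def ema_codeword_priors(ema_params, codebook):
    """Assign each EMA-derived parameter frame to its nearest codeword
    (Euclidean distance) and normalize the counts into prior weights.

    ema_params : (T, 7) Maeda parameters derived from EMA frames
    codebook   : (N, 7) codeword parameter vectors
    """
    d2 = ((ema_params[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                       # codeword index per frame
    counts = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return counts / counts.sum()
```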
Update Equations
• For each phone, we estimate the weights and the rates for each state, with the likelihood of input frame x_u given codeword cd_j modeled as an exponential density over the distortion:

$$ P(x_u \mid cd_j) \;=\; \lambda_j \exp\!\left(-\lambda_j\, d_{uj}^{2}\right) $$
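The update equations themselves were graphics on the slide; purely as an assumption-laden sketch, EM updates for a mixture of such exponential densities might look like this:

```python
import numpy as np

def em_updates(d, w, lam, n_iter=10):
    """EM sketch for a mixture of exponential densities over squared
    distortions:  P(x_u) = sum_j w_j * lam_j * exp(-lam_j * d_uj**2).

    d      : (U, M) distortions d_uj for U frames and M codewords
    w, lam : (M,) initial weights and rates (e.g., flat or EMA priors)
    """
    t = d ** 2
    for _ in range(n_iter):
        # E-step: responsibility of codeword j for frame u
        p = w * lam * np.exp(-lam * t)                      # (U, M)
        gamma = p / (p.sum(axis=1, keepdims=True) + 1e-30)
        # M-step: reweight codewords and refit the exponential rates
        w = gamma.mean(axis=0)
        lam = gamma.sum(axis=0) / ((gamma * t).sum(axis=0) + 1e-30)
    return w, lam
```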
Weights for Phone ‘OU’ Projected on Codewords-MDS Space
[Figure: four panels in the codewords' MDS space (x, y): priors from EMA; weights with flat initialization; weights initialized from EMA; and weights initialized from EMA plus adaptation.]
Experimental Setup
• Segmented phone recognition on the MOCHA database (9 speakers, 460 TIMIT British English utterances per speaker, 44 phones)
• Articulatory codebook composed of 1024 different Maeda configurations derived from MOCHA EMA data
• LDA dimensionality reduction of the distortion vector to 20 features per frame, with the phones as the transformation classes
Experimental Setup Cont’d
• The distortion measure used is the Mel-cepstral distortion:

$$ \mathrm{MCD}(C^{\mathrm{incoming}}, C^{\mathrm{synth}}) \;=\; \frac{10}{\ln 10}\,\sqrt{\,2\sum_{k=1}^{12}\bigl(C^{\mathrm{incoming}}(k) - C^{\mathrm{synth}}(k)\bigr)^{2}} $$
• Classify each phone c according to:

$$ \hat{c} \;=\; \arg\max_{c}\; P(c)\, P(\mathrm{MFCC} \mid c)^{\alpha}\, P(\mathrm{DF} \mid c)^{(1-\alpha)} $$
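In the log domain this stream combination is a one-liner; a minimal sketch (array names assumed):

```python
import numpy as np

def classify(log_prior, log_p_mfcc, log_p_df, alpha=0.2):
    """argmax over phones of log P(c) + a*log P(MFCC|c) + (1-a)*log P(DF|c)."""
    score = log_prior + alpha * log_p_mfcc + (1.0 - alpha) * log_p_df
    return int(np.argmax(score))
```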
Summary of Phone Error Rate Results [3]

| Features (dimension) | Topology | Obs. Prob. / Init | fsew0 (14,352) | msak0 (14,302) | Both (28,654) | Improvement |
|---|---|---|---|---|---|---|
| MFCC + CMN (13) | 3S-128M-HMM | Gaussian / VQ | 61.6% | 55.9% | 58.8% | (baseline) |
| Dist Feat (1024), prob. combination α = 0.2 | 3S-1024M-HMM | Exponential / Flat (sparsity = 21%) | 57.6% | 53.7% | 55.7% | 5.3% |
| Dist Feat (1024), prob. combination α = 0.2 | 3S-1024M-HMM | Exponential / EMA (sparsity = 51%) | 58.3% | 53.9% | 56.1% | 4.6% |
| Adapted Dist Feat (1024), prob. combination α = 0.25 | 3S-1024M-HMM | Exponential / EMA (sparsity = 51%) | 58.4% | 53.1% | 55.7% | 5.3% |
| Dist Feat + LDA + CMN (20), prob. combination α = 0.6 | 3S-128M-HMM | Gaussian / VQ (sparsity = 0%) | 54.9% | 49.8% | 52.4% | 10.9% |

"Improvement" is the relative error-rate reduction on both speakers versus the MFCC + CMN baseline.
[3] Z. Al Bawab, B. Raj, and R. M. Stern, "A Hybrid Physical and Statistical Dynamic Articulatory Framework Incorporating Analysis-by-Synthesis for Improved Phone Classification," submitted to ICASSP 2010, Dallas, Texas.
Summary of our Contribution
| | Conventional HMM | Production-Based HMM |
|---|---|---|
| States | Abstract, no physical meaning | Real articulatory configurations |
| Output observation probability | Gaussian probability using acoustic features | Exponential probability based on the analysis-by-synthesis distortion features |
| Adaptation | VTLN, MLLR, MAP | Vocal tract geometric model adaptation |
| Transition probability | Based on acoustic observation | Can be learned from articulatory dynamics |
Conclusion
• A model that mimics the actual physics of the vocal tract yields better classification performance
• We developed a hybrid physical and statistical dynamic articulatory framework that incorporates analysis-by-synthesis for improved phone classification
• Recent databases open new horizons for better understanding articulatory phenomena
• Current advances in computation and machine learning algorithms facilitate the integration of physical models into large-scale systems
THANK YOU