An Analysis-by-Synthesis Approach to Vocal Tract Modeling for Robust Speech Recognition

Ziad Al Bawab ([email protected])
Electrical and Computer Engineering, Carnegie Mellon University
Work in collaboration with: Bhiksha Raj, Lorenzo Turicchia (MIT), and Richard M. Stern
IBM Research, October 9, 2009

Talk Outline
I. Introduction
II. Deriving vocal tract shapes from EMA data using a physical model
III. Analysis-by-synthesis framework
IV. Dynamic articulatory model
V. Conclusion

Conventional Generative Model

[Figure: an HMM generative model for the word "speech" (/S/-/P/-/IY/-/CH/). Hidden states S1, S2, ..., Sn emit 13-dimensional acoustic feature vectors F1, F2, ..., F13 per frame, and the model is trained by maximum likelihood. Below: waveform and spectrogram of the word "speech" (image: Wikipedia).]

The Ultimate Generative Model

Speech is actually generated by the vocal tract!

[Figure: an articulatory generative model for /S/-/P/-/IY/-/CH/. Per-articulator state chains (e.g., lips separation: S11, S12, ..., S1n; tongue tip: S21, S22, ..., S2n) move toward articulatory targets, and a physical model of sound generation maps the articulatory configurations to acoustic features F1, F2, ..., F13. Below: waveform and spectrogram of the word "speech".]

The Missing Science
• We need a framework that explicitly models the articulatory space (its configurations and dynamics) and can thereby help alleviate problems such as coarticulation, articulatory target undershoot, asynchrony among articulators, and pronunciation variation
• Current approaches to articulatory modeling (Livescu, Deng, Erler, and others) attempt to learn and apply constraints inferred from surface-level acoustic observations or from linguistic sources
• We need to learn from real articulatory data
• We need a mapping from the articulatory space to the acoustic domain based on the physical generative process: such a mapping is more natural (i.e., accurate) and can generalize better than a mapping learned statistically (i.e., from parallel articulatory and acoustic data)

MOCHA Database

[Figure: the MOCHA recording apparatus and examples of the raw articulatory measurements.]

MOCHA EMA Data

[Figure: EMA coil positions in the midsagittal plane (x and y in cm): upper incisor (UI), upper lip (UL), lower incisor (LI), lower lip (LL), tongue tip (TT), tongue body (TB), tongue dorsum (TD), and velum (VL).]

Maeda Parameters

[Figure: Maeda's geometric model of the vocal tract (lips, upper palate, glottis). Seven parameters P1, ..., P7 are mapped through the model to area functions, i.e., 36 acoustic-tube sections with areas A1, ..., A36 and lengths L1, ..., L36.]

Articulatory Speech Synthesis

[Figure: each tube section's area and length (A1, L1, ..., A36, L36) is converted to a per-section transfer function, and the Sondhi and Schroeter model combines these into the overall vocal tract (VT) transfer function.]

Deriving Realistic Vocal Tract Shapes from ElectroMagnetic Articulograph Data via Geometric Adaptation and Profile Fitting
• Problem overview:
  – Speech synthesis solely from EMA data, using:
    • Knowledge of the geometry of the vocal tract
    • Knowledge of the physics of the speech generation process
• Approach followed:
  – Compute realistic vocal tract shapes from EMA data:
    1. Adapt Maeda's geometric vocal tract model to the EMA data
    2. Search for the best fit of the tongue and lip profile contours to the EMA data
  – Synthesize speech from the vocal tract shapes:
    3. Articulatory synthesis using the Sondhi and Schroeter model

1. Vocal Tract Adaptation Parameters

[Figure: the adaptation parameters overlaid on the vocal tract geometry: origin, upper wall shift, rotation angle θ, distance d to the upper incisor, lips separation, inner wall, and larynx edges.]

Adaptation Result [1]

[Figure: the Maeda upper wall and the estimated upper wall after adaptation, shown with the EMA positions (UI, UL, LI, LL, TT, TB, TD, VL) in the midsagittal plane.]

[1] Z. Al Bawab, L. Turicchia, R. M. Stern, and B. Raj, "Deriving vocal tract shapes from electromagnetic articulograph data via geometric adaptation and matching," in Proc. Interspeech, Brighton, UK, September 2009.

2. Search Results

[Figure: fitted tongue and lip profile contours, with the EMA points in purple, for the phone 'II' as in "seesaw" (/S-II-S-OO/) and the phone '@@' as in "working" (/W-@@-K-I-NG/).]

3. Synthesis Results

[Figure: the resulting acoustic-tube models for the phone 'II' as in "seesaw" (/S-II-S-OO/) and the phone '@@' as in "working" (/W-@@-K-I-NG/).]

Creating a Realistic Codebook and Adapted Articulatory Transfer Functions

[Figure: each codeword in the articulatory codebook consists of the seven Maeda parameters p1, ..., p7 plus the velum area VA.]

Projecting the 44 Phones' Codeword Means Using Multi-Dimensional Scaling (MDS)

[Figure: two-dimensional MDS projection of the codeword means for the 44 MOCHA phones.]

Deriving Analysis-by-Synthesis Features [2]

Compare signals generated from a codebook of valid vocal tract configurations to the incoming signal to produce a "distortion" feature vector: the energy and pitch of the incoming speech drive synthesis from each of the N articulatory configurations (codewords 1, ..., N, each a set of Maeda parameters P1, ..., P7), the MFCCs of each synthesized signal are compared against the MFCCs of the incoming signal, and the resulting mel-cepstral distortions d1, ..., dN form the distortion feature vector.

[2] Z. Al Bawab, B. Raj, and R. M. Stern, "Analysis-by-synthesis features for speech recognition," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, NV, April 2008.
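The frame-level feature extraction just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `codebook_mfccs` stands for MFCC vectors precomputed from the signal synthesized for each codeword, and the mel-cepstral distortion is taken over cepstral coefficients 1 to 12, following the definition used in the experimental setup later in the talk.

```python
import numpy as np

def mel_cepstral_distortion(c_in, c_syn):
    """Mel-cepstral distortion in dB over cepstral coefficients 1..12
    (c0, the energy term, is excluded)."""
    diff = c_in[1:13] - c_syn[1:13]
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))

def distortion_feature_vector(c_in, codebook_mfccs):
    """One analysis-by-synthesis feature vector for a single frame:
    the MCD between the incoming frame's MFCCs and the MFCCs
    synthesized from each articulatory codeword."""
    return np.array([mel_cepstral_distortion(c_in, c_syn)
                     for c_syn in codebook_mfccs])

# Toy usage: a 1024-codeword codebook of 13-dimensional MFCC vectors
# (random placeholders standing in for real synthesized MFCCs).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 13))
frame = rng.normal(size=13)
d = distortion_feature_vector(frame, codebook)
print(d.shape)  # (1024,)
```

The 1024-dimensional distortion vector produced here is what the later slides feed to the HMM (directly, or after LDA reduction to 20 dimensions).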
Mixture Probability Density Function
• For a given frame, the output probability of each state in the HMM is a mixture density over a set of M codewords, combining the weight of each codeword with the likelihood of the input given the codeword and state:

  P(x_u | s) = Σ_{j=1}^{M} w_{s,j} · P(x_u | cd_j, s)

HMM Framework

[Figure: overview of the HMM framework built on the articulatory codebook and the distortion features.]

Priors From EMA

[Figure: EMA measurements (TT, TB, TD trajectories over time) are quantized into a codeword sequence (cd1, cd2, cd1, cd3, cd2, ...), which provides priors over the codewords.]

Update Equations
• For each phone, we estimate the codeword weights and the variances σ_j for each state, with the likelihood of input frame x_u given codeword cd_j modeled as an exponential function of the distortion d_uj:

  P(x_u | cd_j) ∝ exp( −d_uj / (2 σ_j²) )

Weights for Phone 'OU' Projected on Codewords-MDS Space

[Figure: four panels showing the codeword weights for the phone 'OU' in the MDS space: flat initialization of the weights, priors from EMA, weights initialized from EMA, and weights initialized from EMA plus adaptation.]

Experimental Setup
• Segmented phone recognition on the MOCHA database (9 speakers, 460 TIMIT British English utterances per speaker, 44 phones)
• Articulatory codebook composed of 1024 distinct Maeda configurations derived from MOCHA EMA data
• LDA dimensionality reduction of the distortion vector to 20 features per frame, with the phones as the classes of the transformation

Experimental Setup (Cont'd)
• The distortion measure used is the mel-cepstral distortion:

  MCD(C_incoming, C_synth) = (10 / ln 10) · sqrt( 2 · Σ_{k=1}^{12} ( C_incoming(k) − C_synth(k) )² )

• Classify each phone c according to:

  ĉ = argmax_c P(c) · P(MFCC | c)^α · P(DF | c)^(1−α)

Summary of Phone Error Rate Results [3]

Features (dimension) | Topology | Obs. prob. / init | fsew0 (14,352) | msak0 (14,302) | Both (28,654) | Improvement
MFCC + CMN (13) | 3S-128M-HMM | Gaussian / VQ | 61.6% | 55.9% | 58.8% | --
Dist. Feat. (1024), prob. combination α = 0.2 | 3S-1024M-HMM | Exponential / flat init, sparsity = 21% | 57.6% | 53.7% | 55.7% | 5.3%
Dist. Feat. (1024), prob. combination α = 0.2 | 3S-1024M-HMM | Exponential / EMA init, sparsity = 51% | 58.3% | 53.9% | 56.1% | 4.6%
Adapted Dist. Feat. (1024), prob. combination α = 0.25 | 3S-1024M-HMM | Exponential / EMA init, sparsity = 51% | 58.4% | 53.1% | 55.7% | 5.3%
Dist. Feat. + LDA + CMN (20), prob. combination α = 0.6 | 3S-128M-HMM | Gaussian / VQ, sparsity = 0% | 54.9% | 49.8% | 52.4% | 10.9%

[3] Z. Al Bawab, B. Raj, and R. M. Stern, "A hybrid physical and statistical dynamic articulatory framework incorporating analysis-by-synthesis for improved phone classification," submitted to ICASSP 2010, Dallas, TX.

Summary of our Contribution

Aspect | Conventional HMM | Production-based HMM
States | Abstract, no physical meaning | Real articulatory configurations
Output observation probability | Gaussian probability using acoustic features | Exponential probability based on the analysis-by-synthesis distortion features
Adaptation | VTLN, MLLR, MAP | Vocal tract geometric model adaptation
Transition probability | Based on acoustic observations | Can be learned from articulatory dynamics

Conclusion
• A model that mimics the actual physics of the vocal tract results in better classification performance
• We developed a hybrid physical and statistical dynamic articulatory framework that incorporates analysis-by-synthesis for improved phone classification
• Recent databases open new horizons for better understanding articulatory phenomena
• Current advances in computation and in machine learning algorithms facilitate the integration of physical models into large-scale systems

THANK YOU
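As a closing sketch: the classification rule used in the experiments, ĉ = argmax_c P(c) · P(MFCC | c)^α · P(DF | c)^(1−α), is naturally evaluated in the log domain. This is a hypothetical illustration, not the authors' code; the uniform prior and random likelihood values are placeholders, while the 44 phone classes and α = 0.6 follow the experimental-setup slides.

```python
import numpy as np

def classify_phone(log_prior, log_p_mfcc, log_p_df, alpha):
    """Alpha-weighted probability combination, in the log domain:
    score(c) = log P(c) + alpha * log P(MFCC|c) + (1 - alpha) * log P(DF|c)."""
    score = log_prior + alpha * log_p_mfcc + (1.0 - alpha) * log_p_df
    return int(np.argmax(score))

# Toy usage: 44 phone classes, alpha = 0.6 as for the LDA-reduced features.
rng = np.random.default_rng(1)
log_prior = np.log(np.full(44, 1.0 / 44))   # uniform phone prior (placeholder)
log_p_mfcc = rng.normal(size=44)            # placeholder log-likelihoods
log_p_df = rng.normal(size=44)
c_hat = classify_phone(log_prior, log_p_mfcc, log_p_df, alpha=0.6)
```

Setting α = 1 recovers MFCC-only classification and α = 0 uses only the distortion features, which is how the interpolation weight was tuned per feature set in the results table.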