Distributed Speech Recognition – Standardization Activity
Alex Sorin, Ron Hoory, Dan Chazan
Telecom and Media Systems Group, IBM Research Lab in Haifa
June 30, 2003

Slide 2 – Advanced Speech Enabled Services
- Example dialogue: "Flight Booking Center! How can I help you?" – "I'd like to fly from NY to Chicago"
- The client is a low-resource mobile device; the advanced ASR task, where accuracy is crucial, runs on a server together with the application, database and TTS (server-based ASR).

Slide 3 – Speech Recognition over Mobile Networks: NSR vs. DSR
- Network Speech Recognition (NSR): speech is encoded on the device, transmitted as coded voice, then decoded at the speech recognition server and passed through the ASR front-end and back-end to produce text. Low-bitrate coding and channel errors degrade speech recognition accuracy.
- Distributed Speech Recognition (DSR): the ASR front-end runs on the device; the recognition features (MFCC) are compressed and transmitted as a low-bitrate stream to the DSR decoder and ASR back-end on the server.
- "To do" list: proof of concept, standardization, speech reconstruction (for playback).

Slide 4 – DSR in ETSI/Aurora
- 2000: standard ETSI ES 201 108, DSR Front-end (FE)
- 2002: standard ETSI ES 202 050, DSR Advanced Front-end (AFE)
- Sept 2003: extended standards developed by IBM HRL & Motorola, ETSI ES 202 211 (XFE) and ETSI ES 202 212 (XAFE), adding speech reconstruction and robust speech feature extraction
- [Diagram: the client compresses the features and, in the extension, pitch & voicing class (VC); after the wireless channel and decompression, the server feeds the ASR back-end, a tonal-language ASR back-end, and speech reconstruction for playback.]

Slide 5 – ASR Accuracy Comparison: DSR vs. GSM AMR
- In-car recordings, 5 languages, connected digits task
- AMR at 4.75 kbps is 55% worse than DSR AFE
- [Bar chart: WER (%) for AMR 4.75 kbps, AMR 12.2 kbps, and DSR AFE 4.8 kbps]

Slide 6 – Channel Errors
- DSR error mitigation: interpolation in feature space
- Simulation by BT: 3% word-accuracy degradation at 50% packet loss, vs. 63% degradation for coded speech transmission
- [Plot: word accuracy (%) vs. packet loss (%) from 0 to 50, for G.723.1-coded speech and DSR front-end with interpolation]
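The interpolation-based error mitigation on Slide 6 can be sketched in a few lines: feature vectors carried by lost packets are filled in from the nearest correctly received frames. This is a minimal illustration under our own assumptions (linear interpolation per coefficient, edge gaps filled by repeating the nearest good frame), not the error-mitigation procedure specified in the ETSI standards.

```python
import numpy as np

def interpolate_lost_frames(features, lost):
    """Replace feature vectors of lost frames by linear interpolation
    between the nearest correctly received frames.

    features : (num_frames, num_coeffs) array of MFCC vectors
    lost     : boolean array of length num_frames, True where the packet
               carrying that frame was lost
    """
    feats = np.asarray(features, dtype=float).copy()
    good = np.flatnonzero(~lost)
    if good.size == 0:
        return feats                        # no anchor frames at all
    bad = np.flatnonzero(lost)
    for c in range(feats.shape[1]):         # interpolate each coefficient track
        feats[bad, c] = np.interp(bad, good, feats[good, c])
    return feats

# Toy usage: 10 frames of 13 MFCCs, frames 3-5 lost with one packet
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mfcc = rng.standard_normal((10, 13))
    lost = np.zeros(10, dtype=bool)
    lost[3:6] = True
    repaired = interpolate_lost_frames(mfcc, lost)
```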
Slide 7 – DSR XFE/XAFE Requirements
- [Requirements diagram: keep state-of-the-art ASR accuracy, improve TLR accuracy, and provide speech intelligibility (reference vocoders: LPC-10, MELP) at a bitrate and complexity comparable to GSM AMR 4.75.]

Slide 8 – Extended DSR Client Diagram
- Feature extraction path: speech → spectrum → high pass → cepstra, with VAD and car-noise detection → compression
- Extension path: downsampled speech → pitch estimation and voicing classification → xCompression
- Both streams are packed for transmission.

Slide 9 – Robust Low Complexity Pitch Estimator
- Spectral peaks: estimate the locations and amplitudes of spectral peaks from the spectrum
- Preliminary candidates: use a few major peaks to find preliminary pitch candidates (guided by the car-noise flag)
- Candidates: use all peaks to determine a few best candidates and their spectral scores
- Correlation scores: compute correlation scores of the candidates on the downsampled speech
- Decision logic: select the final pitch candidate using the spectral scores, the correlation scores and history

Slide 10 – Pitch Contours Example: Clean vs. Babble Noise, 10 dB
- [Plot of pitch contours for clean speech and for babble noise at 10 dB SNR]

Slide 11 – xDSR Encoder Parameters
- Bitrate: 4.8 + 0.8 = 5.6 kbps
- [Bar charts of ROM (kWords), CPU (wMOPS) and RAM (kWords) for XFE and XAFE, split into basis and extension, compared with AMR]

Slide 12 – Server Side Speech Reconstruction
- Inputs: raw pitch, voicing class, cepstra & energy
- Pitch tracking and control / harmonic structure initialization
- Harmonic magnitudes reconstruction – the heart of the reconstruction process
- All-pole modelling and postfilter
- Voiced-phase and unvoiced-phase generation, then voiced & unvoiced combination
- Line spectrum → time domain, overlap-add (OLA) → synthesized speech
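To make the last stages of the reconstruction diagram on Slide 12 concrete, here is a minimal sketch of harmonic (sinusoidal) synthesis of a voiced frame followed by overlap-add. The cosine-sum synthesis, the Hann window, and the frame/hop sizes are simplifying assumptions of our own; the standard's voiced/unvoiced phase generation and combination are considerably more elaborate.

```python
import numpy as np

def synthesize_voiced_frame(pitch_hz, magnitudes, phases, fs=8000, frame_len=256):
    """Sum-of-harmonics synthesis of a single voiced frame.

    pitch_hz   : fundamental frequency of the frame
    magnitudes : reconstructed harmonic magnitudes A_k, k = 1..N_harm
    phases     : harmonic phases (in the standard these come from the
                 voiced-phase model; here they are simply given)
    """
    n = np.arange(frame_len)
    frame = np.zeros(frame_len)
    for k, (a_k, phi_k) in enumerate(zip(magnitudes, phases), start=1):
        frame += a_k * np.cos(2 * np.pi * k * pitch_hz / fs * n + phi_k)
    return frame

def overlap_add(frames, hop):
    """Overlap-add windowed frames into a continuous waveform."""
    frame_len = len(frames[0])
    window = np.hanning(frame_len)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += window * frame
    return out

# Toy usage: two identical voiced frames at 100 Hz with 50% overlap
if __name__ == "__main__":
    mags = np.array([1.0, 0.5, 0.25])
    frames = [synthesize_voiced_frame(100.0, mags, np.zeros(3)) for _ in range(2)]
    speech = overlap_add(frames, hop=128)
```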
Slide 13 – Magnitudes Reconstruction Problem
- Front-end chain: speech → power spectrum (|STFT|²) → 23 Mel-scale triangular filter bins → log → DCT → quantization → 13 MFCC transmitted
- Harmonic speech model: S(f) = \sum_i C_i W(f - f_i), with C_i = A_i e^{j\varphi_i}
- Problem: recover the harmonic magnitudes {A_i} at the pitch harmonics f_0, f_1, f_2, … from only the 13 low-order cepstra (LOC)

Slide 14 – Magnitudes Reconstruction by IBM and by Motorola
- IBM approach:
  - Convert the cepstra to spectral bins (IDCT → exp)
  - Describe the front-end processing by a linear equation linking the bins with the harmonic magnitudes
  - Represent the magnitudes by a linear combination of 23 basis functions
  - Rewrite and solve the equation in the basis-function weights
  - Compute the magnitudes
- Motorola approach:
  - Find the (non-integer) index α_k of each harmonic frequency location on the Mel-channels grid, 0.5 < α_k < 23.5
  - Extended IDCT: LA_k = \frac{2}{23} \sum_{n=0}^{12} Cep_n \cos\left(\frac{\pi n (\alpha_k - 0.5)}{23}\right), k = 1, …, N_{harm}
  - Take the exponent
  - Normalize to compensate for the variable width of the Mel triangles
- Results:
  - With a 23-dimensional cepstrum, IBM outperforms Motorola
  - With quantized 13 cepstra, IBM and Motorola perform equally
  - Cepstra truncation significantly degrades reconstruction accuracy

Slide 15 – Combined Magnitudes Reconstruction
- LOC – low-order cepstra (C0…C12); HOC – high-order cepstra (C13…C23)
- [Diagram: pitch and LOC drive an HOC synthesis block (HOC1, HOC2, …, HOCN) and feed the Motorola and IBM reconstruction algorithms, which output A^{Mot} and A^{IBM}.]
- The two reconstructions are blended per harmonic:
  A_k = \mu(k, pitch) \cdot A_k^{IBM} + (1 - \mu(k, pitch)) \cdot A_k^{Mot}
  (a code sketch of the extended IDCT and this blend follows Slide 19)

Slide 16 – Magnitudes Reconstruction Accuracy Evaluation
- [Plot: relative reconstruction error vs. pitch (ms) for the IBM, Motorola and combined algorithms]

Slide 17 – Intelligibility of Reconstructed Speech
- Averaged over background noise conditions: clean, car, street, babble
- Metrics: DRT WER% and 10 × TT WER%
- [Bar chart of intelligibility testing results – Diagnostic Rhyme Test (DRT) and Transcription Test (TT) – for PCM, XAFE, XFE, MELP and LPC-10]

Slide 18 – Decoded Speech Examples
- [Audio examples: female and male voice for Original, LPC10, MELP, XFE and XAFE]

Slide 19 – Tonal Language Recognition Evaluation
- The standardized pitch keeps state-of-the-art TLR performance intact
- [Bar charts: TLR evaluation by Motorola (WER % on Mandarin digits, Mandarin commands, Cantonese digits) and by IBM (WER % on Mandarin digits, Cantonese digits), comparing proprietary pitch vs. standard pitch]
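As referenced on Slide 15, the sketch below illustrates the Motorola-style extended IDCT from Slide 14 together with the per-harmonic blend from Slide 15. Only those two displayed formulas come from the deck; the Hz→Mel mapping, the harmonic_mel_index placeholder for locating α_k on the 23-channel grid, the omission of the Mel-triangle width normalization, and the choice of the weighting μ are assumptions of our own, not the ETSI ES 202 211/212 algorithms.

```python
import numpy as np

def mel(f_hz):
    """Hz -> Mel mapping (a common formula, assumed here; the deck does not give one)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def harmonic_mel_index(pitch_hz, n_harm, f_low=64.0, f_high=4000.0):
    """Placeholder for the non-integer Mel-channel index alpha_k of each
    harmonic k * pitch on the 23-channel grid (0.5 < alpha_k < 23.5).
    f_low / f_high and the linear mapping are illustrative assumptions."""
    freqs = pitch_hz * np.arange(1, n_harm + 1)
    alpha = 0.5 + 23.0 * (mel(freqs) - mel(f_low)) / (mel(f_high) - mel(f_low))
    return np.clip(alpha, 0.5, 23.5)

def extended_idct_magnitudes(cep, alpha):
    """Extended IDCT from Slide 14:
        LA_k = (2/23) * sum_{n=0..12} Cep_n * cos(pi * n * (alpha_k - 0.5) / 23),
    followed by exponentiation. The Mel-triangle width normalization
    mentioned on the slide is omitted in this sketch."""
    n = np.arange(13)                                   # Cep_0 .. Cep_12 (LOC)
    cosines = np.cos(np.pi * np.outer(alpha - 0.5, n) / 23.0)
    log_amp = (2.0 / 23.0) * (cosines @ np.asarray(cep)[:13])
    return np.exp(log_amp)

def combine_magnitudes(a_ibm, a_mot, mu):
    """Per-harmonic blend from Slide 15:
        A_k = mu_k * A_k^IBM + (1 - mu_k) * A_k^Mot,
    where mu_k = mu(k, pitch) is supplied by the caller (a placeholder here)."""
    mu = np.asarray(mu)
    return mu * np.asarray(a_ibm) + (1.0 - mu) * np.asarray(a_mot)

# Toy usage with made-up cepstra; a constant mu stands in for mu(k, pitch).
if __name__ == "__main__":
    cep = np.zeros(13)
    cep[0], cep[1] = 1.0, -0.3
    alpha = harmonic_mel_index(pitch_hz=200.0, n_harm=15)
    a_mot = extended_idct_magnitudes(cep, alpha)
    a_k = combine_magnitudes(a_ibm=a_mot, a_mot=a_mot, mu=0.5)
```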