
Distributed Speech Recognition –
Standardization Activity

Alex Sorin, Ron Hoory, Dan Chazan
Telecom and Media Systems Group
IBM Haifa Research Lab
June 30, 2003
Advanced Speech Enabled Services

[Diagram: a user on a mobile device talks to a flight-booking center. The spoken request ("I'd like to fly from NY to Chicago") is recognized by a server-side ASR engine connected to the application database, and the reply ("How can I help you?") is produced by TTS.]

• Low-resource mobile device
• Advanced ASR task where accuracy is crucial
• Server-based ASR
Speech Recognition Over Mobile Networks – NSR vs DSR

Network Speech Recognition (NSR)
• Low-bitrate coding and channel errors degrade speech recognition accuracy

[Diagram: the device encodes the voice signal; the encoded voice is sent to a speech recognition server, where it is decoded (and can be played back), then passed through the ASR front-end and back-end to produce text.]

Distributed Speech Recognition (DSR)
• Compress & transmit recognition features (MFCC)
• "To do" list:
  – Proof of concept
  – Standardization
  – Speech reconstruction

[Diagram: the ASR front-end runs on the device; a DSR encoder sends the recognition features as a low-bitrate stream; on the server, a DSR decoder feeds the ASR back-end to produce text, with playback also shown on the server side.]
DSR in ETSI/Aurora

• 2000 – standard ETSI ES 201 108, DSR Front-end (FE)
• 2002 – standard ETSI ES 202 050, DSR Advanced Front-end (AFE)
• Sept 2003 – extended standards developed by IBM HRL & Motorola:
  ETSI ES 202 211 (XFE) and ETSI ES 202 212 (XAFE)

[Diagram: on the client, robust speech feature extraction is followed by compression, and pitch & voicing-class (VC) estimation by its own compression; both streams cross the wireless channel. On the server, after decompression, the features feed the ASR back-end and a tonal-language ASR back-end to produce text, while speech reconstruction enables playback.]
ASR Accuracy Comparison – DSR vs GSM AMR

• In-car recordings, 5 languages, connected-digits task
• AMR at 4.75 kbps yields 55% higher WER than DSR AFE

[Bar chart: WER (%) for AMR 4.75 kbps, AMR 12.2 kbps, and DSR AFE 4.8 kbps.]
Channel Errors

• DSR error mitigation – interpolation in feature space
• Simulation by BT – 3% accuracy degradation at 50% packet loss, vs. 63% degradation for coded speech transmission

[Plot: word accuracy (%) vs. packet loss (0–50%) for G.723.1 coded speech and for the DSR front-end with interpolation.]
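The feature-space interpolation idea above can be sketched as follows. This is an illustrative assumption, not the normative mitigation of the ETSI standard: lost feature frames are filled by per-dimension linear interpolation between the nearest correctly received frames.

```python
import numpy as np

def interpolate_lost_frames(frames, received):
    """Error-mitigation sketch: replace feature vectors of lost frames by
    linear interpolation between the nearest correctly received frames.

    frames:   (T, D) array of feature vectors (lost rows may hold garbage)
    received: length-T boolean mask, True where the frame arrived intact
    """
    frames = np.asarray(frames, dtype=float).copy()
    good = np.flatnonzero(received)          # indices of intact frames
    t = np.arange(frames.shape[0])
    for d in range(frames.shape[1]):
        # np.interp clamps to the edge values outside the good range,
        # which degenerates to frame repetition at the stream boundaries
        frames[:, d] = np.interp(t, good, frames[good, d])
    return frames
```

Because cepstral trajectories vary slowly, this simple scheme keeps the back-end usable even at high loss rates, consistent with the BT simulation result cited above.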
DSR XFE/XAFE Requirements

• State-of-the-art ASR accuracy
• Tonal-language ASR (TLR) improvement
• Speech intelligibility
• Bitrate & complexity comparable to low-rate speech coders (LPC-10, MELP, GSM AMR 4.75)
Extended DSR Client Diagram

[Block diagram: input speech is down-sampled and high-pass filtered. Feature extraction produces the spectrum and cepstra, with VAD and car-noise detection; the cepstra go through the standard compression. In parallel, pitch estimation and voicing classification feed the extension compression (xCompression). Both bitstreams are packed for transmission.]
Robust Low-Complexity Pitch Estimator

[Block diagram: the spectrum yields spectral peaks and preliminary candidates; refined candidates are checked against the down-sampled speech by correlation; decision logic, with a car-noise flag input, outputs the pitch.]

• Estimate locations and amplitudes of spectral peaks
• Use a few major peaks to find preliminary pitch candidates
• Use all peaks to determine a few best candidates and their spectral scores
• Compute correlation scores of the candidates on the down-sampled speech
• Select the final pitch candidate using spectral scores, correlation scores & history
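The correlation-score step above can be illustrated with a short sketch; the function name and normalization are assumptions for illustration, not the standardized algorithm. Each candidate pitch period is scored by the normalized autocorrelation of the down-sampled speech at the corresponding lag.

```python
import numpy as np

def correlation_scores(x, candidate_periods, fs=8000):
    """Sketch of the correlation-score stage: for each candidate pitch
    period (in seconds), compute the normalized autocorrelation of the
    speech frame x at the corresponding lag. Higher score = more periodic."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove DC before correlating
    scores = []
    for period in candidate_periods:
        lag = int(round(period * fs))
        if lag <= 0 or lag >= len(x):
            scores.append(0.0)
            continue
        a, b = x[:-lag], x[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        scores.append(float(np.dot(a, b) / denom) if denom > 0 else 0.0)
    return scores
```

The decision logic would then weigh these scores against the spectral scores and the pitch history to pick the final candidate.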
Pitch Contours Example – Clean vs Babble Noise 10 dB

[Plot: pitch contours for the same utterance under clean conditions and in babble noise at 10 dB SNR.]
xDSR Encoder Parameters

• Bitrate: 4.8 + 0.8 = 5.6 kbps

[Bar charts: ROM (kWords), CPU (wMOPS), and RAM (kWords) for XFE and XAFE, each split into the basis front-end and the extension, compared with GSM AMR.]
Server-Side Speech Reconstruction

[Block diagram: raw pitch and voicing class enter pitch tracking; together with the cepstra & energy they drive the control / harmonic structure initialization – the heart of the reconstruction process. Harmonic magnitudes are refined by all-pole modelling and a postfilter; voiced and unvoiced phases are generated and combined; the line spectrum is converted to the time domain and overlap-added (OLA) to give the synthesized speech.]
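The final synthesis stages above (line spectrum → time domain, then OLA) can be sketched in a simplified form; function names, the Hann window, and the fixed 50%-style hop are illustrative assumptions, not the standardized procedure.

```python
import numpy as np

def synthesize_frame(f0, magnitudes, phases, frame_len, fs=8000):
    """Line spectrum -> time domain sketch: build one frame as a sum of
    sinusoids at multiples of the pitch f0."""
    t = np.arange(frame_len) / fs
    frame = np.zeros(frame_len)
    for k, (a, ph) in enumerate(zip(magnitudes, phases), start=1):
        frame += a * np.cos(2 * np.pi * k * f0 * t + ph)
    return frame

def overlap_add(frames, hop):
    """OLA sketch: windowed frames are summed at the given hop size."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    win = np.hanning(frame_len)
    for i, fr in enumerate(frames):
        out[i * hop:i * hop + frame_len] += win * fr
    return out
```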
Magnitudes Reconstruction Problem

Front-end chain: Speech → STFT (128 bins) → Abs/Power → 23 triangular Mel-scale filters → LOG → DCT → 13 MFCC → Quantization

Problem: given only the 13 low-order cepstra (LOC), recover the harmonic magnitudes {Aᵢ} of the line spectrum

    S(f) = Σᵢ Cᵢ · W(f − fᵢ),   Cᵢ = Aᵢ · e^(jφᵢ)

where the fᵢ (f₀, f₁, f₂, …) are the harmonic frequency locations.
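The front-end chain above can be sketched for a single frame as follows. The standard fixes its own filterbank edges and constants; the Mel formula, window, and sizes here are generic illustrative assumptions.

```python
import numpy as np

def mel(f):
    """Hz -> Mel (a common formula; the standard defines its own filterbank)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_frame(x, fs=8000, n_filters=23, n_ceps=13, nfft=256):
    """Sketch of the chain: STFT -> magnitude -> 23 triangular Mel filters
    -> log -> DCT -> 13 MFCC, for one speech frame x."""
    spec = np.abs(np.fft.rfft(x * np.hamming(len(x)), nfft))
    # triangular filters equally spaced on the Mel scale
    edges = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    bin_mel = mel(np.fft.rfftfreq(nfft, 1.0 / fs))
    fbank = np.zeros(n_filters)
    for m in range(n_filters):
        lo, c, hi = edges[m], edges[m + 1], edges[m + 2]
        w = np.clip(np.minimum((bin_mel - lo) / (c - lo),
                               (hi - bin_mel) / (hi - c)), 0.0, None)
        fbank[m] = np.log(np.maximum(w @ spec, 1e-10))
    # DCT-II of the 23 log filterbank energies, keeping 13 coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(fbank * np.cos(np.pi * k * (n + 0.5) / n_filters))
                     for k in range(n_ceps)])
```

The reconstruction problem is exactly the inverse: the log-filterbank outputs are smoothed, truncated, and quantized, so the harmonic magnitudes must be inferred rather than read off.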
Magnitudes Reconstruction by IBM and by Motorola

IBM algorithm:
• Convert cepstra to spectral bins (IDCT → exp)
• Describe the front-end processing by a linear equation linking the bins with the harmonic magnitudes
• Represent the magnitudes by a linear combination of 23 basis functions
• Rewrite and solve the equation in the basis-function weights
• Compute the magnitudes

Motorola algorithm:
• Find the (non-integer) index αₖ of each harmonic frequency location on the Mel-channel grid: 0.5 < αₖ < 23.5
• Extended IDCT:

    LAₖ = (2/23) · Σₙ₌₀¹² Cepₙ · cos((π/23) · n · (αₖ − 0.5)),   k = 1, …, N_harm

• Take the exponent
• Normalize to compensate for the variable width of the Mel triangles

Results:
• 23-dimensional cepstrum – IBM outperforms Motorola
• Quantized 13 cepstra – IBM and Motorola perform equally
• Cepstra truncation significantly degrades reconstruction accuracy
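The extended IDCT of the Motorola algorithm can be written directly from the formula above; the function name is an assumption for illustration, and the Mel-triangle width normalization is left out.

```python
import numpy as np

def extended_idct_log_magnitudes(cep, alpha):
    """Evaluate the extended IDCT at non-integer Mel-channel positions:
    LA_k = (2/23) * sum_{n=0}^{12} Cep_n * cos((pi/23) * n * (alpha_k - 0.5)).

    cep:   the 13 low-order cepstra C0..C12
    alpha: array of harmonic positions on the Mel-channel grid, 0.5 < alpha_k < 23.5
    """
    n = np.arange(len(cep))                              # n = 0..12
    basis = np.cos(np.pi / 23.0 * np.outer(np.asarray(alpha) - 0.5, n))
    return (2.0 / 23.0) * (basis @ np.asarray(cep, dtype=float))
```

The harmonic magnitudes then follow by exponentiation, `np.exp(LA)`, before the width normalization mentioned in the last Motorola step.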
Combined Magnitudes Reconstruction

LOC – Low-Order Cepstra (C0…C12)
HOC – High-Order Cepstra (C13…C23)

[Block diagram: pitch and LOC feed both the Motorola and the IBM algorithm, producing A^Mot and A^IBM; a HOC codebook (HOC1 … HOCN) supplies the high-order cepstra used by the HOC synthesis. The two magnitude estimates are blended per harmonic:]

    Aₖ = μ(k, pitch) · Aₖ^IBM + (1 − μ(k, pitch)) · Aₖ^Mot
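The blending rule above is a per-harmonic convex combination; a minimal sketch, with the pitch-dependent weight function μ(k, pitch) abstracted into a precomputed weight array (an assumption, since the slide does not give μ's form):

```python
import numpy as np

def combine_magnitudes(a_ibm, a_mot, mu):
    """Blend the two magnitude estimates per harmonic:
    A_k = mu_k * A_k^IBM + (1 - mu_k) * A_k^Mot, with 0 <= mu_k <= 1."""
    mu = np.asarray(mu, dtype=float)
    return mu * np.asarray(a_ibm, dtype=float) + (1.0 - mu) * np.asarray(a_mot, dtype=float)
```

With μₖ = 1 the combined estimate falls back to the IBM value, with μₖ = 0 to the Motorola value, so the weights can favor whichever method is more accurate at each harmonic index and pitch range.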
Magnitudes Reconstruction Accuracy Evaluation

[Plot: relative reconstruction error vs. pitch period (2–18 ms) for the IBM, Motorola, and combined methods.]
Intelligibility of Reconstructed Speech

• Averaged over background-noise conditions: clean, car, street, babble

[Bar chart: Diagnostic Rhyme Test (DRT) WER % and 10 × Transcription Test (TT) WER % for PCM, XAFE, XFE, MELP, and LPC-10.]
Decoded Speech Examples

[Audio examples: female and male voices for the original signal and for LPC-10, MELP, XFE, and XAFE decoding.]
Tonal Language Recognition Evaluation

• The standardized pitch estimator keeps state-of-the-art TLR performance intact

[Bar charts: WER % with the proprietary vs. the standard pitch. TLR evaluation by IBM: Mandarin digits, Mandarin commands, Cantonese digits. TLR evaluation by Motorola: Mandarin digits, Cantonese digits.]