Presentation

Technion - IIT
Dept. of Electrical Engineering
Signal and Image Processing lab
Speech Coding at Very Low
Bit Rates
David Malah
Ronen Mayrench
June 30, 2003
Orit Lev
Slava Shechtman
IBM Speech Technology Seminar
Outline
• Introduction
• Coder – I (Long term model - LTM)
• Coder – II (Trellis-based joint segmentation-quantization - TSQ)
• Coder – III (Temporal Decomposition - TD)
• Summary
Introduction
• Existing medium delay (~ 50 msec) low bit rate standard
coders (vocoders) operate at 2400 bps (e.g., LPC-10,
MELP-2400).
• Interest continues in further rate reduction for applications
that allow long delays (more than 200 msec), e.g., half-duplex
(military) communication and voice mail.
• Speech quasi-stationarity interval is 20–40 msec long,
so to utilize longer allowed delays new models and
approaches are needed, as addressed in this presentation.
• Recent developments: MELP-1200 and efforts to develop
a 600 bps coder (by NATO).
Coder – I (LTM)
Low Bit Rate Speech Coding
Based on a Long Term Model
Orit Lev (Fellah)
David Malah
Long Term Model (LTM)
For Voiced Speech [Stettiner et al., 1994]
[Block diagram: Sp(u) → Time-Varying Spectral Shaping → Spp(u) → gain G(u) → S(u) → Warping Function u = Φ(t) → Voiced Speech S(t)]

$$
S(t) = G(\Phi(t)) \cdot \underbrace{\sum_{k=0}^{K(t)} C_k(\Phi(t)) \, e^{jk\Phi(t)}}_{S_{pp}(u = \Phi(t))}
$$

Sp(u) - Periodic impulse train
Spp(u), S(u) - Pseudo-periodic signals
S(t) - Voiced speech
t - Time variable
u - Warped time variable
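The long term model above can be illustrated numerically. The following is a minimal sketch, not the coder's implementation: it assumes constant harmonic coefficients C_k and a constant gain G (the model lets both vary with u), takes the real part of the complex sum, and uses a hypothetical warping function with a rising pitch. The name `ltm_voiced` and all numeric values are illustrative only.

```python
import numpy as np

def ltm_voiced(phi, gain, coeffs):
    """Real part of G * sum_k C_k * exp(j*k*phi) for a sampled warping function phi."""
    k = np.arange(len(coeffs))                  # harmonic indices 0..K
    harmonics = np.exp(1j * np.outer(k, phi))   # shape (K+1, T)
    return gain * np.real(coeffs @ harmonics)

fs = 8000.0                                     # assumed sampling rate
t = np.arange(256) / fs
# Hypothetical warping function: pitch glides upward from 100 Hz.
phi = 2 * np.pi * (100.0 * t + 0.5 * 2000.0 * t**2)
coeffs = np.array([0.0, 1.0, 0.5])              # hypothetical C_0, C_1, C_2
s = ltm_voiced(phi, gain=1.0, coeffs=coeffs)
```

With these coefficients the output is cos(Φ(t)) + 0.5·cos(2Φ(t)): periodic in the warped variable u = Φ(t), but only pseudo-periodic in t.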
LTM-based Encoder
[Block diagram. Blocks: speech segment; pitch detector (average pitch); warping function calculation by DP, giving Φ(t); inverse warping to a pseudo-periodic signal; comb filter (periodic approximation); gain G(u); quantizers Q and Q^-1 for the prototypes data; warping of the reconstructed pseudo-periodic signal by Φ(t), giving the reconstructed speech.]
Warping function
[Figure: DTW path aligning a speech section (axis t) with an offset section of the periodic signal (axis u); the path is the warping function Φ.]
• Speech Section: 128 samples.
• Periodic signal section: 15 possible values: 121 … 135.
Coding
• Offset: 7 bits.
• Each segment slope: 4 bits.
Typical Waveforms
[Figure panels: pseudo-periodic signal Sw(u) and its approximation; LP-residual prototypes; PES (Prototype Evolution Shapes). Horizontal axis in samples.]
Encoding Pseudo-periodic Signal
[Block diagram. Blocks: pseudo-periodic signal; LPC analysis; LPC quantization (coded LPC coeffs); inverse filtering (residual); partitioning into prototypes; PES generation; PES encoding (coded PES).]
Voiced Frame Decoding
[Block diagram. Blocks: coded PES; IDFTs (prototypes); concatenation of prototypes (coded residual); LPC synthesis (LPC coeffs); gain (gain, av. pitch); reconstructed pseudo-periodic signal; warping by Φ(t); reconstructed speech.]
Speech Segmentation - Example
LTM-Based Speech Encoder
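[Block diagram. Speech enters a V/UV/Silence decision. Voiced frames: frame concatenation (64-160 msec), then LTM analysis, giving pitch, s(u) and Φ(t). Unvoiced frames (32 msec): LPC-10 analysis, then LPC quantization. Silence frames: energy estimation, giving the energy. The V/UV/Silence decision is also an output.]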
LTM-based Speech Decoder
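[Block diagram. Inputs: pitch, s(u), Φ(t), LPC coeffs, energy and the V/UV/Silence decision. LTM synthesis gives the decoded voiced segment; dequantization and LPC-10 synthesis give the decoded unvoiced frame; a white Gaussian noise generator gives the decoded silence frame. Overlap-and-add of the decoded parts gives the decoded speech.]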
Bit Allocation - Average Voiced Segment
Parameter                                   Bits
Warping function Φ (6*4 + 7)                31
LPC coefficients (VQ Tree)                  24
RES / PES (MPM; Phase, VQ; VQ Tree)         60, 10, 40
Gain, Average pitch (VQ)                    10, 7
V/UV/Silence decision                       2
Total                                       184
Average voiced segment: 6 x 128 - 64 = 704 samples, 88 ms
Rate: 184 [bits] / 88 [ms] ≈ 2.1 [kbps]
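The rate figure can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the 8 kHz sampling rate is an assumption inferred from 704 samples lasting 88 ms, and the variable names are illustrative.

```python
# Bit counts listed on the slide for an average voiced segment.
SEGMENT_BITS = [31, 24, 60, 10, 40, 10, 7, 2]
SAMPLES = 6 * 128 - 64            # 704 samples per average voiced segment
SAMPLE_RATE_HZ = 8000             # assumption: 704 samples <-> 88 ms

total_bits = sum(SEGMENT_BITS)
duration_ms = 1000 * SAMPLES / SAMPLE_RATE_HZ
rate_kbps = total_bits / duration_ms   # bits per ms equals kbps

print(total_bits, duration_ms, round(rate_kbps, 2))
```

The exact quotient is slightly below the rounded 2.1 kbps quoted on the slide.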
Demonstration
Original
Male
Female
Quantized
Coder – II (TSQ)
Low Bit-Rate Speech Coding Using
Joint Segmentation-Quantization
Ronen Mayrench
David Malah
Introduction
• Standard LBR coders (2400 – 4800 bps) invest a significant
part of the bit budget in coding the spectral envelope.
MELP encoder (Speech → analysis → quantization), bits per frame:
- LPC analysis: 25 bits
- Voicing analysis: 4+1 bits
- Gain analysis: 8 bits
- Pitch analysis: 7 bits
- Fourier magnitude analysis: 8 bits
• This motivates addressing the efficient representation
of spectral parameters over longer segments.
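To put a number on "a significant part of the bit budget", the per-frame MELP figures shown on this slide can be tallied. This is a rough sketch that counts only the fields listed on the slide; the dictionary keys are illustrative labels.

```python
# Per-frame bit counts as listed for the MELP encoder on the slide.
melp_bits = {
    "lpc": 25,
    "voicing": 4 + 1,
    "gain": 8,
    "pitch": 7,
    "fourier_magnitudes": 8,
}

listed_total = sum(melp_bits.values())
lpc_share = melp_bits["lpc"] / listed_total

print(listed_total, f"{100 * lpc_share:.0f}%")
```

Under this tally the spectral envelope (LPC) takes close to half of the listed bits, which is what motivates coding it over longer segments.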
Approaches to bit-rate reduction
Selective Frame Transmission
- Alternate frame transmission - AF [Roucos, 1983].
- Adaptive Trellis-based Frame Selection
(Trellis Quantization - TQ) [George, 1996]
Segment Quantization
- Fixed length segments
(Matrix Quantization - MQ) [Gray, 1985]
- Variable length segments [Shiraki, 1983].
[Figures: each approach illustrated on a block of frames numbered 1 to 6; TQ shown across Block i and Block i+1.]
Proposed Scheme
• Combining Selective Frame Transmission (frame
skipping) and Segment Quantization (frame merging).
- Provides a richer partition set
- MQ and TQ are specific cases
• Fixed rate is obtained by selecting M segments
from a block of N frames.
Example for N=6 and M=2: frames 1-3 are merged into segment X13,
frame 4 is skipped, and frames 5-6 are merged into segment X56.
Note: Skipped frames are linearly interpolated.
Trellis-based Joint Segmentation-Quantization (TSQ)
• A Trellis is used to optimally (min. quantization error) select
the M segments from the N frames
Example for M=2, N=4
[Trellis figure spanning Block n-1, Block n and Block n+1. Stage 0 nodes: X11, X12, X13, X22, X23, X33. Stage 1 nodes: X22, X23, X24, X33, X34, X44.]
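The trellis search can be sketched as a Viterbi pass over candidate segments. The code below is an illustrative simplification, not the coder described here: the segment mean stands in for the quantized segment, error is plain squared error, skipped frames are linearly interpolated between neighbouring segment representatives, and (unlike the chained blocks in the figure) the first segment is forced to start at the first frame and the last to end at the last frame. The function name `tsq_search` is hypothetical.

```python
import numpy as np

def tsq_search(frames, M):
    """Pick M segments (i, j), 0-based inclusive, minimizing total squared error."""
    N = len(frames)
    segs = [(i, j) for i in range(N) for j in range(i, N)]
    # Assumption: the segment mean stands in for the quantized segment.
    rep = {(i, j): frames[i:j + 1].mean(axis=0) for i, j in segs}
    node = {(i, j): float(((frames[i:j + 1] - rep[i, j]) ** 2).sum())
            for i, j in segs}

    def skip_cost(a, b):
        # Error of linearly interpolating the frames skipped between a and b.
        err = 0.0
        for n in range(a[1] + 1, b[0]):
            w = (n - a[1]) / (b[0] - a[1])
            est = (1 - w) * rep[a] + w * rep[b]
            err += float(((frames[n] - est) ** 2).sum())
        return err

    # Stage 0 (simplification): the first segment starts at frame 0.
    stage = {s: (node[s], (s,)) for s in segs if s[0] == 0}
    for _ in range(M - 1):
        nxt = {}
        for b in segs:
            cands = [(c + skip_cost(a, b) + node[b], path + (b,))
                     for a, (c, path) in stage.items() if a[1] < b[0]]
            if cands:
                nxt[b] = min(cands, key=lambda t: t[0])  # keep the survivor path
        stage = nxt
    # Simplification: the last segment ends at the last frame.
    finals = [v for s, v in stage.items() if s[1] == N - 1]
    return min(finals, key=lambda t: t[0])

frames = np.array([[0.0], [0.0], [0.0], [1.0], [2.0], [2.0]])
cost, path = tsq_search(frames, M=2)
print(path, cost)
```

For this toy block the search merges the first three frames, skips the fourth, and merges the last two, which in the slide's 1-based notation corresponds to X13 and X56.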
Melp-based 1200 bps TSQ Coder
[Block diagram. Blocks: Speech; MELP analysis; LSF buffer; TSQ; voicing quantization; gain and pitch quantization.]
Bits per block of 6 frames:
LSF      22*3 = 66
Gain     5*6 = 30
Pitch    7*6 = 42
V/UV     2*6 = 12
Path     9
Total    159 bits per 6 frames (26.5 bits per frame)
N=6 frames, M=3 segments, frame rate = 44.44 frames/s (frame size 22.5 msec)
Total rate: 1178 bps (unused bits can synchronize frames)
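The bit budget above can be verified with simple arithmetic. The sketch below uses only the numbers stated on the slide; the variable names are illustrative.

```python
# Bits per block of N = 6 frames, as stated on the slide.
block_bits = {"LSF": 22 * 3, "Gain": 5 * 6, "Pitch": 7 * 6, "V/UV": 2 * 6, "Path": 9}
N_FRAMES = 6
FRAME_MS = 22.5

total_bits = sum(block_bits.values())
bits_per_frame = total_bits / N_FRAMES
rate_bps = bits_per_frame * 1000 / FRAME_MS
lsf_bits_per_frame = block_bits["LSF"] / N_FRAMES

print(total_bits, bits_per_frame, round(rate_bps), lsf_bits_per_frame)
```

The LSF share works out to 11 bits per frame, which is the "TSQ-11 bit" figure used in the simulation results.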
Simulation Results (cont’d)
[Figure: Original LSF (blue), VQ-22 bit (green) and TSQ-11 bit (red). Vertical axis: LSF, 0 to 3; horizontal axis: 0 to 60.]
Simulation Results
Log Spectral Distortion (LSD) with Gardner’s
Weighting Matrix and Split-VQ
$$
d_{WMSE}(\mathbf{a}, \hat{\mathbf{a}}) = (\mathbf{a} - \hat{\mathbf{a}})^T W_a (\mathbf{a} - \hat{\mathbf{a}})
$$

Scheme           LSD [dB]
Full (22 bit)    1.67
AF (11 bit)      2.43
MQ (11 bit)      2.41
TQ (11 bit)      2.23
TSQ (11 bit)     2.01
A-B Comparison Tests
[Figure: two charts of % votes versus Bit-Rate [Kbps]; bit-rate axis from 1 to 3 Kbps, vote axis from 10 to 70 %.]
Demonstration
Male
Original
LPC10 (2400 bps)
MELP (2400 bps)
MELP-AF (1200 bps)
MELP-TQ (1200 bps)
MELP-TSQ (1200 bps)
Female
Coder – III (TD)
Very Low Bit Rate Coding using
Temporal Decomposition
Slava Shechtman
David Malah
Temporal Decomposition
[Atal, 1982]
Representing a sequence of N input vectors by a smaller set of
M representative vectors (event or target vectors) and locally centered
interpolation functions (event functions).
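$$
\underbrace{\begin{bmatrix} \mathbf{y}_1 & \mathbf{y}_2 & \cdots & \mathbf{y}_N \end{bmatrix}}_{Y,\ P \times N}
\ \overset{\text{WMSE}}{\approx}\
\underbrace{\begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_M \end{bmatrix}}_{A,\ P \times M}
\times
\underbrace{\begin{bmatrix} \cdots\ \boldsymbol{\varphi}_1\ \cdots \\ \cdots\ \boldsymbol{\varphi}_2\ \cdots \\ \vdots \\ \cdots\ \boldsymbol{\varphi}_M\ \cdots \end{bmatrix}}_{\Phi,\ M \times N}
$$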
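Reduced TD - only adjacent event functions may overlap:

$$
\hat{\mathbf{y}}_n = \mathbf{a}_m \varphi_{m,n} + \mathbf{a}_{m+1} \varphi_{m+1,n}, \quad n_m \le n < n_{m+1}
$$

[Figure: overlapping event functions, amplitude 0 to 1, horizontal axis 0 to 45.]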
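The reduced TD reconstruction is a matrix product Ŷ = AΦ in which every column of Φ has at most two adjacent non-zero entries. The sketch below illustrates this under a strong assumption that is not from the slides: the event functions are simple linear ramps between event locations (the actual event functions are determined by the algorithm on the following slides). The function name and test values are hypothetical.

```python
import numpy as np

def reduced_td_reconstruct(targets, locations, n_frames):
    """targets: (M, P) array; locations: increasing event frames n_m.
    Returns (Y_hat of shape (P, N), Phi of shape (M, N))."""
    M = len(locations)
    phi = np.zeros((M, n_frames))
    for m in range(M - 1):
        n0, n1 = locations[m], locations[m + 1]
        n = np.arange(n0, n1)
        ramp = (n - n0) / (n1 - n0)      # assumed linear transition
        phi[m, n] = 1.0 - ramp           # event m decays
        phi[m + 1, n] = ramp             # event m+1 rises
    phi[0, :locations[0]] = 1.0          # hold the first target before its event
    phi[-1, locations[-1]:] = 1.0        # hold the last target after its event
    return targets.T @ phi, phi          # Y_hat = A * Phi

targets = np.array([[0.0, 0.0], [2.0, 4.0]])   # M = 2 targets, P = 2
y_hat, phi = reduced_td_reconstruct(targets, locations=[0, 4], n_frames=5)
```

Here the ramps are 1's complementary by construction, so each reconstructed frame is a convex combination of the two neighbouring targets.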
Temporal Decomposition Solution
• A simple, sub-optimal (in MSE sense), iterative algorithm for solving
the reduced TD problem is used [Athaudage, 1999].
Extensions
- WMSE criterion, with time-dependent weights
- Added constraints (1’s complementary, non-negative, monotonic)
Training:
1. Initial event locations n_m, with initial targets a_m = y_{n_m}
2. Sub-optimal event function determination (DP); search range for φ_m: n_{m-1} to n_{m+1}
3. Target refinement
Steps 2-3 are repeated, max. 2 iterations.
TD Solution with Quantization
Quantization:
1. Initial event locations n_m, with initial targets a_m = Q(y_{n_m})
2. Sub-optimal event-function determination (DP)
3. Target refinement
Steps 2-3 are repeated, max. 2 iterations.
4. Target quantization
Codebooks: target-vectors codebook, event-functions codebook.
Excitation Representation using TD
• Excitation vector: Pitch, MVF, Energy
(3-component vector).
• Weighting of approximation error (input
dependent using v/uv information).
• Scalar quantization of excitation parameters.
MBE-based 600–800 bps TD Coder
MBE – Multiband Excitation
[Block diagram. Excitation branch: energy estimation, pitch estimation, voicing decisions (band separation via MVF), then joint excitation TD & quantization. Spectral branch: magnitude estimation via all-pole modeling, AR to LSF, then joint LSF TD & quantization. Rates marked along the chain: ~ 4 kbps (MBE), ~ 2 kbps (all-pole modeling), ~ 0.6 - 0.8 kbps (after TD & quantization).]
Bit Assignment
(current status)
• Frame update: 10 - 20 msec (computational issue)
• Spectral and Excitation Events: 4 per block of 300 msec (13.333 events/sec)
Spectral magnitudes                  bits
LSF Vect. (Split-VQ)                 22 - 24
Ev. Func. Shape                      4
Ev. Func. Length                     3
Total bits per spectral event        29 - 31

Excitation                           bits
Energy                               5
Pitch                                6
Voicing (MVF)                        3
Event Funct. Shape                   2
Event Funct. Length                  2
Total bits per excitation event      18

Total rate: 676 - 753 bps (including 50 bps of V/UV information)
Demonstration
(partial)
Original (4 sentences – F & M)
Spectral Envelope Coded at 306 bps
(original excitation)
TD Algorithm formulae
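• Optimal 1's complementary event functions (given targets):

$$
\varphi^*_{m,n} = \frac{(\mathbf{y}_n - \mathbf{a}_{m+1})^T W_n (\mathbf{a}_m - \mathbf{a}_{m+1})}{(\mathbf{a}_m - \mathbf{a}_{m+1})^T W_n (\mathbf{a}_m - \mathbf{a}_{m+1})},
\qquad \varphi^*_{m+1,n} = 1 - \varphi^*_{m,n}
$$

• Target refinement:

$$
\begin{bmatrix}
Z_1 & X_1 & 0 & 0 \\
X_1 & \ddots & \ddots & 0 \\
0 & \ddots & Z_{M-1} & X_{M-1} \\
0 & 0 & X_{M-1} & Z_M
\end{bmatrix}
\begin{bmatrix} \mathbf{a}_1 \\ \vdots \\ \mathbf{a}_{M-1} \\ \mathbf{a}_M \end{bmatrix}
=
\begin{bmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_{M-1} \\ \mathbf{b}_M \end{bmatrix},
$$

$$
Z_m = \sum_{n=n_{m-1}}^{n_{m+1}-1} \varphi_{m,n}^2 W_n, \qquad
X_m = \sum_{n=n_m}^{n_{m+1}-1} \varphi_{m,n} \varphi_{m+1,n} W_n, \qquad
\mathbf{b}_m = \sum_{n=n_{m-1}}^{n_{m+1}-1} \varphi_{m,n} W_n \mathbf{y}_n .
$$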
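The closed-form event function for one frame can be sketched as below. This is a minimal illustration that assumes a diagonal weighting matrix W_n, passed as a vector of its diagonal entries; it omits the non-negativity and monotonicity constraints mentioned earlier, and the function name is hypothetical.

```python
import numpy as np

def event_function_pair(y, a_m, a_next, w):
    """Return (phi_m, phi_{m+1}) minimizing the weighted error for one frame y.
    w holds the diagonal of W_n (assumption: diagonal weighting)."""
    d = a_m - a_next
    phi_m = float((y - a_next) @ (w * d)) / float(d @ (w * d))
    return phi_m, 1.0 - phi_m            # 1's complementary pair

a_m = np.array([1.0, 0.0])
a_next = np.array([0.0, 1.0])
y = 0.25 * a_m + 0.75 * a_next           # frame lying between the two targets
phi_m, phi_next = event_function_pair(y, a_m, a_next, w=np.array([1.0, 2.0]))
print(phi_m, phi_next)
```

When the frame lies exactly on the line between the two targets, the formula recovers its mixing weights regardless of the weighting.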
WMSE Criterion
• WMSE Criterion:

$$
E(n) = (\mathbf{y}_n - \hat{\mathbf{y}}_n)^T W_n (\mathbf{y}_n - \hat{\mathbf{y}}_n), \qquad
W_n = \begin{bmatrix} w_{n,1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & w_{n,P} \end{bmatrix}
$$

• Paliwal-Atal weights for LSF vectors:

$$
w_{n,i} = c_i^2 \left( \frac{1}{|A(y_{n,i})|^2} \right)^{0.3}, \qquad
c_i = \begin{cases} 1, & 1 \le i \le 8 \\ 0.8, & i = 9 \\ 0.4, & i = 10 \end{cases}
$$

where A(y) is the LPC inverse filter frequency response.
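A possible way to evaluate these weights and the WMSE is sketched below. Several points are assumptions rather than statements from the slides: the LPC convention A(z) = 1 + Σ a_k z^{-k}, LSFs given in radians, the squared magnitude |A|² in the denominator, and the function names.

```python
import numpy as np

C = np.array([1.0] * 8 + [0.8, 0.4])          # c_i for i = 1..10

def paliwal_atal_weights(lpc, lsf):
    """lpc: a_1..a_10 with A(z) = 1 + sum_k a_k z^-k (assumed convention);
    lsf: the 10 LSFs in radians. Returns the diagonal of W_n."""
    k = np.arange(1, len(lpc) + 1)
    A = 1.0 + np.exp(-1j * np.outer(lsf, k)) @ lpc   # A evaluated at each LSF
    return C**2 * (1.0 / np.abs(A) ** 2) ** 0.3

def wmse(y, y_hat, w):
    """(y - y_hat)^T W (y - y_hat) for diagonal W with diagonal w."""
    e = y - y_hat
    return float(e @ (w * e))

lsf = np.linspace(0.3, 2.8, 10)               # hypothetical LSF vector
w = paliwal_atal_weights(np.zeros(10), lsf)   # flat spectrum: A = 1
err = wmse(np.ones(10), np.zeros(10), w)
```

With a flat spectrum the weights reduce to c_i², so only the fixed de-emphasis of the two highest LSFs remains.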
Summary
• Three coders have been developed, demonstrating
different models and approaches for utilizing long
analysis segments:
- A long term model (LTM) – 2 kbps (variant of WI coder)
- Trellis-based joint segmentation and quantization (TSQ)
for efficient spectral envelope representation (MELP-based 1.2 kbps coder).
- Temporal decomposition (TD) for efficient representation of
both the spectral envelope and excitation (MBE-based
600 – 800 bps coder – expected).
• In the future, the presented models and approaches may be
combined to reduce the rate even further or to improve quality.
End