Technion - IIT
Dept. of Electrical Engineering
Signal and Image Processing Lab

Speech Coding at Very Low Bit Rates
David Malah, Ronen Mayrench, Orit Lev, Slava Shechtman
June 30, 2003
IBM Speech Technology Seminar

Outline
• Introduction
• Coder – I (Long term model - LTM)
• Coder – II (Trellis-based joint segmentation-quantization - TSQ)
• Coder – III (Temporal Decomposition - TD)
• Summary

Introduction
• Existing medium-delay (~50 msec) low-bit-rate standard coders (vocoders) operate at 2400 bps (e.g., LPC-10, MELP-2400).
• Interest continues in further rate reduction for applications that allow long delays (more than 200 msec), e.g., half-duplex (military) communication and voice mail.
• The speech quasi-stationarity interval is 20–40 msec long, so new models and approaches are needed to exploit the longer allowed delays. These are addressed in this presentation.
• Recent developments: MELP-1200 and efforts to develop a 600 bps coder (by NATO).

Coder – I (LTM)
Low Bit Rate Speech Coding Based on a Long Term Model
Orit Lev (Fellah), David Malah

Long Term Model (LTM) for Voiced Speech [Stettiner et al., 1994]
[Block diagram: Sp(u) → Time Varying Spectral Shaping → Spp(u) → gain G(u) → S(u) → Warping Function u=Φ(t) → Voiced Speech S(t)]

$S(t) = G(\Phi(t)) \cdot \underbrace{\sum_{k=0}^{K(t)} C_k(\Phi(t)) \exp(jk\Phi(t))}_{S_{pp}(u=\Phi(t))}$

Sp(u) - Periodic impulse train
Spp(u), S(u) - Pseudo-periodic signals
S(t) - Voiced speech
t - Time variable
u - Warped time variable

LTM-based Encoder
[Block diagram. Labels: Speech Segment; Pitch detector (Ave. pitch); Comb Filter (Periodic Approx.); Warping Function Calculation – DP (Φ(t)); Inverse Warping (Pseudo-periodic Signal); Gain G(u); Q and Q^-1; Prototypes Data; Reconst. pseudo-periodic signal; Warping Φ(t); Reconst. Speech]

Warping function
[Figure: DTW between a Speech Section (axis t) and a Section of the Periodic signal (axis u, with offset), giving the Warping Function Φ]
• Speech Section: 128 samples.
• Periodic signal section: 15 possible values: 121 … 135.
Coding
• Offset: 7 bits.
• Each segment slope: 4 bits.

Typical Waveforms
[Figure: panels Sw(u) and Approx., x-axis [Samples]; Pseudo-periodic Signal, LP-Residual, Prototypes]

PES – Prototype Evolution Shapes

Encoding
[Block diagram. Labels: Pseudo-periodic Signal; LPC Analysis; LPC Coeffs; LPC Quant; Coded LPC Coeffs; LPC Inverse Filtering; Residual; Partitioning into Prototypes; Prototypes; PES Generation; PES Encoding; Coded PES]

Voiced Frame Decoding
[Block diagram. Labels: Coded Residual; PES; IDFTs; Prototypes; Concat. Prototypes; LPC Synthesis (LPC Coeffs); Reconstructed Pseudo-periodic signal; Gain; Av. Pitch; Warping Φ(t); Reconst. Speech]

Speech Segmentation - Example
[Figure]

LTM-Based Speech Encoder
[Block diagram. Labels: Speech; V/UV/Silence Decision; Voiced Frame: Frame Concat. (64-160 msec.), LTM Analysis, outputs Pitch, s(u), Φ(t); Unvoiced Frame (32 msec): LPC 10 Analysis, LPC Quant.; Silence Frame: Energy Estimation, Energy; V/UV/Silence Decision]

LTM-based Speech Decoder
[Block diagram. Labels: Pitch, s(u), Φ(t) → LTM Syn. → Decoded voiced segment; LPC Coeffs, De Quant, White Gaussian noise generator → LPC 10 Syn. → Decoded unvoiced frame; Energy → Decoded silence frame; V/UV/Silence decision; Overlap and add → Decoded Speech]

Bit Allocation - Average Voiced Segment
Parameter: Warping function Φ (6*4 + 7); LPC Coefficients (VQ Tree); RES: PES, MPM, Phase (VQ) (VQ Tree); Gain; Average Pitch (VQ); V/UV/Silence Decision
Bits: 31, 24, 60, 10, 40, 10, 7, 2
Total Bits: 184
Average voiced segment: 6 x 128 - 64 = 704 samples, 88 ms
Rate: 184 [bits] / 88 [ms] → 2.1 [kbps]

Demonstration
[Audio demo: Original and Quantized, Male and Female]

Coder – II (TSQ)
Low Bit-Rate Speech Coding Using Joint Segmentation-Quantization
Ronen Mayrench, David Malah

Introduction
• Standard LBR coders (2400 – 4800 bps) invest a significant part of the bit budget in coding the spectral envelope.
[MELP Encoder diagram: Speech feeds LPC Analysis (25 bit), Voicing Analysis (4+1 bit), Gain Analysis (8 bit), Pitch Analysis (7 bit) and Fourier Magnitude Analysis (8 bit), followed by Quantization]
• This motivates addressing the efficient representation of spectral parameters over longer segments.

Approaches to bit-rate reduction
Selective Frame Transmission
- Alternate frame transmission - AF [Roucos, 1983].
- Adaptive Trellis-based Frame Selection (Trellis Quantization - TQ) [George, 1996]
Segment Quantization
- Fixed length segments (Matrix Quantization - MQ) [Gray, 1985]
- Variable length segments [Shiraki, 1983].
[Figure: frame diagrams (frames 1 to 6, blocks i and i+1) illustrating each approach]

Proposed Scheme
• Combining Selective Frame Transmission (frame skipping) and Segment Quantization (frame merging).
- Provides a richer partition set
- MQ and TQ are specific cases
• Fixed rate is obtained by selecting M segments from a block of N frames.
Example for N=6 and M=2: segments X13 and X56 selected from frames 1 2 3 4 5 6.
Note: Skipped frames are linearly interpolated.

Trellis-based Joint Segmentation-Quantization (TSQ)
• A trellis is used to optimally (min. quantization error) select the M segments from the N frames.
Example for M=2, N=4
[Trellis figure: Stage 0 and Stage 1 nodes labeled with segments X11, X12, X13, X22, X23, X24, X33, X34, X44, spanning Block n-1, Block n, Block n+1]

MELP-based 1200 bps TSQ Coder
[Block diagram. Labels: Speech; MELP Analysis; LSF Buffer; TSQ; Voicing Quantization; Gain, Pitch Quantization]
Parameter   Bits
LSF         22*3
Gain        5*6
Pitch       7*6
V/UV        2*6
Path        9
Total       159 bits per 6 frames (26.5 bits per frame)
N=6 frames, M=3 segments, frame rate = 44.44 f/s (frame size 22.5 msec).
Total rate: 1178 bps (unused bits can synchronize frames).

Simulation Results
Log Spectral Distortion (LSD) with Gardner's Weighting Matrix and Split-VQ
$d_{WMSE}(a, \hat{a}) = (a - \hat{a})^T W_a (a - \hat{a})$
Method          LSD [dB]
Full (22 bit)   1.67
AF (11 bit)     2.43
MQ (11 bit)     2.41
TQ (11 bit)     2.23
TSQ (11 bit)    2.01

Simulation Results (cont'd)
[Figure: Quantization. Original LSF (blue), VQ-22 bit (green) and TSQ-11 bit (red); y-axis: LSF]

A-B Comparison Tests
[Figure: two panels, % Votes versus Bit-Rate [Kbps]]

Demonstration
[Audio demo, Male and Female: Original; LPC10 (2400 bps); MELP (2400 bps); MELP-AF (1200 bps); MELP-TQ (1200 bps); MELP-TSQ (1200 bps)]
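The bit budget quoted for the MELP-based 1200 bps TSQ coder can be checked with a few lines of arithmetic. The sketch below is only a sanity check of the slide's numbers. It assumes that "22*3" means 22 LSF bits for each of the M=3 segments and that the other products are bits per frame times N=6 frames. The variable and function names are illustrative and not taken from the coder itself.

```python
# Sanity check of the TSQ coder bit budget (numbers taken from the slide).
# Assumption: "22*3" is 22 LSF bits per segment times M = 3 segments;
# the other entries are bits per frame times N = 6 frames.
FRAMES_PER_BLOCK = 6       # N
SEGMENTS_PER_BLOCK = 3     # M
FRAME_SIZE_SEC = 0.0225    # 22.5 msec frames, i.e. 44.44 frames/sec

bits_per_block = {
    "LSF": 22 * SEGMENTS_PER_BLOCK,
    "Gain": 5 * FRAMES_PER_BLOCK,
    "Pitch": 7 * FRAMES_PER_BLOCK,
    "V/UV": 2 * FRAMES_PER_BLOCK,
    "Path": 9,
}


def rate_bps(total_bits, frames_per_block, frame_size_sec):
    """Bit rate in bits per second for a block of fixed-size frames."""
    return total_bits / (frames_per_block * frame_size_sec)


total_bits = sum(bits_per_block.values())
bits_per_frame = total_bits / FRAMES_PER_BLOCK
total_rate = rate_bps(total_bits, FRAMES_PER_BLOCK, FRAME_SIZE_SEC)

print(total_bits, bits_per_frame, round(total_rate))  # 159 26.5 1178
```

The result falls about 22 bps short of 1200 bps, which matches the slide's remark that unused bits can be used for frame synchronization.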
Coder – III (TD)
Very Low Bit Rate Coding using Temporal Decomposition
Slava Shechtman, David Malah

Temporal Decomposition [Atal, 1982]
Representing a sequence of N input vectors by a smaller set of M representative vectors (event or target vectors) and locally centered interpolation functions (event functions).

$\underbrace{[\,y_1 \;\; y_2 \;\; \cdots \;\; y_N\,]}_{Y:\; P \times N} \;\overset{\mathrm{WMSE}}{\approx}\; \underbrace{[\,a_1 \;\; a_2 \;\; \cdots \;\; a_M\,]}_{A:\; P \times M} \; \underbrace{\begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_M \end{bmatrix}}_{\Phi:\; M \times N}$

Reduced TD - only adjacent event functions may overlap:
$\hat{y}_n = a_m \phi_{m,n} + a_{m+1} \phi_{m+1,n}, \quad n_m \le n < n_{m+1}$
[Figure: event functions plot]

Temporal Decomposition Solution
• A simple, sub-optimal (in the MSE sense), iterative algorithm for solving the reduced TD problem is used [Athaudage, 1999].
Extensions
- WMSE criterion, with time-dependent weights
- Added constraints (1's complementary, non-negative, monotonic)
[Flowchart (Training): Initial Event locations, $a_m = y_{n_m}$; Sub-optimal event function determination (DP); Target refinement; Max. 2 iterations. Inset: Search Range for $\phi_m$ over $n_{m-1}$, $n_m$, $n_{m+1}$]

TD Solution with Quantization
[Flowchart (Quantization): Initial Event locations, $a_m = Q(y_{n_m})$; Sub-optimal event-function determination (DP); Target refinement; Max. 2 iterations; Target quantization. Codebooks: Target-vectors Codebook, Event-functions Codebook]

Excitation Representation using TD
• Excitation vector: Pitch, MVF, Energy (3-component vector).
• Weighting of approximation error (input dependent, using V/UV information).
• Scalar quantization of excitation parameters.

MBE-based 600–800 bps TD Coder
[Block diagram. Labels: Speech; Energy Estimation; Pitch Estimation; Voicing Decisions, Band separation via MVF; Joint Excitation TD & Quantization; Magnitude Estimation via All-pole modeling; AR to LSF; Joint LSF TD & Quantization. Rates: MBE (Multiband Excitation) ~4 kbps; after all-pole modeling ~2 kbps; after TD & Quantization ~0.6-0.8 kbps]

Bit Assignment (current status)
• Frame update: 10 - 20 msec (computational issue)
• Spectral and Excitation Events: 4 per block of 300 msec (13.333 events/sec)

Excitation                            bits
Energy                                5
Pitch                                 6
Voicing (MVF)                         3
Event Funct. Shape                    2
Event Funct. Length                   2
Total bits per Excitation event       18

Spectral Magnitudes                   bits
LSF Vect. (Split-VQ)                  22 - 24
Ev. Func. Shape                       4
Ev. Func. Length                      3
Total bits per Spectral event         29 - 31

Total rate: 676 - 753 bps (including 50 bps of V/UV information)

Demonstration (partial)
[Audio demo: Original (4 sentences – F & M); Spectral Envelope Coded at 306 bps (original excitation)]

TD Algorithm formulae
• Optimal 1's complementary event functions (given targets):
$\phi^*_{m,n} = \dfrac{(y_n - a_{m+1})^T W_n (a_m - a_{m+1})}{(a_m - a_{m+1})^T W_n (a_m - a_{m+1})}, \qquad \phi^*_{m+1,n} = 1 - \phi^*_{m,n}$
• Target refinement:
$\begin{bmatrix} Z_1 & X_1 & 0 & 0 \\ X_1 & \ddots & \ddots & 0 \\ 0 & \ddots & Z_{M-1} & X_{M-1} \\ 0 & 0 & X_{M-1} & Z_M \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_{M-1} \\ a_M \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_{M-1} \\ b_M \end{bmatrix}$
$Z_m = \sum_{n=n_{m-1}}^{n_{m+1}-1} \phi_{m,n}^2 W_n, \qquad X_m = \sum_{n=n_m}^{n_{m+1}-1} \phi_{m,n}\,\phi_{m+1,n} W_n, \qquad b_m = \sum_{n=n_{m-1}}^{n_{m+1}-1} \phi_{m,n} W_n y_n$

WMSE Criterion
• WMSE Criterion:
$E(n) = (y_n - \hat{y}_n)^T W_n (y_n - \hat{y}_n), \qquad W_n = \mathrm{diag}(w_{n,1}, \ldots, w_{n,P})$
• Paliwal-Atal weights for LSF vectors:
$w_{n,i} = c_i^2 \left( \dfrac{1}{|A(y_{n,i})|^2} \right)^{0.3}, \qquad c_i = \begin{cases} 1, & 1 \le i \le 8 \\ 0.8, & i = 9 \\ 0.4, & i = 10 \end{cases}$
where A(y) is the LPC inverse filter frequency response.

Summary
• Three coders have been developed, demonstrating different models and approaches for utilizing long analysis segments:
- A long term model (LTM) – 2 kbps (variant of WI coder)
- Trellis-based joint segmentation and quantization (TSQ) for efficient spectral envelope representation (MELP-based 1.2 kbps coder).
- Temporal decomposition (TD) for efficient representation of both the spectral envelope and excitation (MBE-based 600 – 800 bps coder – expected).
• In the future, the presented models and approaches may be combined to reduce the rate even further or to improve quality.

End
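As a supplement to the "TD Algorithm formulae" slide, the closed-form 1's complementary event function for a single frame can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the coder's implementation. It treats W_n as a diagonal matrix passed as a list of weights, and it evaluates only the unconstrained formula. The non-negativity and monotonicity constraints, the DP search and the codebooks are not modeled. The function names are hypothetical.

```python
def optimal_event_pair(y, a_m, a_next, w):
    """Closed-form event-function values for one frame.

    y, a_m, a_next: length-P lists (frame vector and two adjacent targets).
    w: length-P list holding the diagonal of W_n (assumed diagonal).
    Returns (phi_m, phi_next) with phi_next = 1 - phi_m.
    """
    diff = [am - an for am, an in zip(a_m, a_next)]
    den = sum(wi * d * d for wi, d in zip(w, diff))
    if den == 0.0:
        raise ValueError("targets coincide under W_n; phi is undefined")
    num = sum(wi * (yi - an) * d for wi, yi, an, d in zip(w, y, a_next, diff))
    phi_m = num / den
    return phi_m, 1.0 - phi_m


def reconstruct(a_m, a_next, phi_m, phi_next):
    """Reduced-TD approximation: y_hat = a_m * phi_m + a_next * phi_next."""
    return [am * phi_m + an * phi_next for am, an in zip(a_m, a_next)]


def wmse(y, y_hat, w):
    """Weighted squared error (y - y_hat)^T W (y - y_hat) for diagonal W."""
    return sum(wi * (yi - yh) ** 2 for wi, yi, yh in zip(w, y, y_hat))


# Toy example: y lies a quarter of the way from a_m to a_next.
a_m, a_next = [0.0, 0.0], [2.0, 4.0]
y, w = [0.5, 1.0], [1.0, 0.5]
phi_m, phi_next = optimal_event_pair(y, a_m, a_next, w)
err = wmse(y, reconstruct(a_m, a_next, phi_m, phi_next), w)
```

In this toy case the frame lies on the line between the two targets, so the formula returns phi_m = 0.75 and the weighted error is zero. Perturbing phi_m in either direction increases the error, which is a quick way to confirm that the closed form is the minimizer for fixed targets.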