Speech Enhancement Based on the General Transfer Function GSC and Postfiltering

Sharon Gannot and Israel Cohen
Technion - Israel Institute of Technology
Faculty of Electrical Engineering
Signal and Image Processing Laboratory
June, 2003

Speech Enhancement: TF-GSC and Postfiltering

Problem Formulation

[Diagram: the source signal $s(t)$ reaches microphone $m$ through the transfer function $A_m(e^{j\omega})$; the noise components $n^s_m(t)$ and $n^t_m(t)$ are added to give the microphone signal $z_m(t)$, $m = 1, \ldots, M$.]

\[ \mathbf{Z}(t, e^{j\omega}) = \mathbf{A}(e^{j\omega}) S(t, e^{j\omega}) + \mathbf{N}_s(t, e^{j\omega}) + \mathbf{N}_t(t, e^{j\omega}) \]

where

\[ \mathbf{Z}^T(t, e^{j\omega}) = \left[ Z_1(t, e^{j\omega}) \;\; Z_2(t, e^{j\omega}) \;\; \cdots \;\; Z_M(t, e^{j\omega}) \right] \]
\[ \mathbf{A}^T(e^{j\omega}) = \left[ A_1(e^{j\omega}) \;\; A_2(e^{j\omega}) \;\; \cdots \;\; A_M(e^{j\omega}) \right] \]
\[ \mathbf{N}_s^T(t, e^{j\omega}) = \left[ N_1^s(t, e^{j\omega}) \;\; N_2^s(t, e^{j\omega}) \;\; \cdots \;\; N_M^s(t, e^{j\omega}) \right] \]
\[ \mathbf{N}_t^T(t, e^{j\omega}) = \left[ N_1^t(t, e^{j\omega}) \;\; N_2^t(t, e^{j\omega}) \;\; \cdots \;\; N_M^t(t, e^{j\omega}) \right] \]

Outline

Goal: find $\hat{S}(t, e^{j\omega})$ from $\mathbf{Z}(t, e^{j\omega})$ using beamforming and postfiltering.

• The general transfer function GSC (TF-GSC).
• Postfiltering:
  – Single-microphone: OM-LSA, MixMax.
  – Multi-microphone.
• Experimental study.

Wideband Beamformer

[Diagram: each input $Z_m(t, e^{j\omega})$ is filtered by $W_m^*(t, e^{j\omega})$, $m = 1, \ldots, M$, and the filtered signals are summed to give $Y(t, e^{j\omega})$.]

Beamformer output:

\[ Y(t, e^{j\omega}) = \mathbf{W}^\dagger(t, e^{j\omega}) \mathbf{Z}(t, e^{j\omega}) \]

where

\[ \mathbf{W}^T(t, e^{j\omega}) = \left[ W_1(t, e^{j\omega}), \ldots, W_M(t, e^{j\omega}) \right] \]

Minimum Variance Distortionless Beamformer:

\[ \min_{\mathbf{W}} \; \mathbf{W}^\dagger(t, e^{j\omega}) \mathbf{S}_{zz}(t, e^{j\omega}) \mathbf{W}(t, e^{j\omega}) \quad \text{subject to} \quad \mathbf{W}^\dagger(t, e^{j\omega}) \mathbf{A}(e^{j\omega}) = F^*(t, e^{j\omega}) \]

where $\mathbf{S}_{zz}(t, e^{j\omega}) = E\{\mathbf{Z}(t, e^{j\omega}) \mathbf{Z}^\dagger(t, e^{j\omega})\}$.

The General Transfer Function GSC
Gannot, Burshtein and Weinstein, 2001

[Diagram: the inputs $Z_1, \ldots, Z_M$ feed the fixed beamformer $\mathbf{W}_0^\dagger$, which produces $Y_{FBF}(t, e^{j\omega})$, and the blocking matrix $\mathcal{H}^\dagger$, which produces the reference signals $U_2, \ldots, U_M$. The references are filtered by $G_2, \ldots, G_M$ and summed into $Y_{NC}(t, e^{j\omega})$, which is subtracted from $Y_{FBF}$ to give the output $Y(t, e^{j\omega})$.]

• Fixed beamformer (FBF): a matched filter built from the acoustic transfer function (ATF) ratios.
• Blocking matrix (BM): built from the ATF ratios.
• Noise canceller (NC): adapted with multi-channel LMS.
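The three blocks listed above can be sketched for a single frequency bin as follows. This is a minimal illustration, not the authors' implementation: the ATF values are made up, the function names `herm_dot`, `tf_gsc_blocks` and `nc_step` are hypothetical, and the blocking-matrix columns are one assumed construction that satisfies the blocking condition. The FIR constraint and the overlap & save bookkeeping are left out.

```python
def herm_dot(w, z):
    """Return w^dagger z for two equal-length lists of complex numbers."""
    return sum(wi.conjugate() * zi for wi, zi in zip(w, z))


def tf_gsc_blocks(A, F=1.0 + 0j):
    """Build the FBF weights and a blocking matrix for one frequency bin.

    A is a list of M complex ATF values (A[0] must be nonzero).
    Returns (H, W0, B), where B is a list of M-1 blocking columns.
    """
    H = [a / A[0] for a in A]                  # ATF ratios, H = A / A_1
    norm2 = sum(abs(h) ** 2 for h in H)        # ||H||^2
    W0 = [h * F / norm2 for h in H]            # W0 = H F / ||H||^2
    M = len(A)
    B = []
    for m in range(1, M):
        # Assumed construction: column m is [-conj(H_m), 0, ..., 1, ..., 0]^T,
        # chosen so that its Hermitian inner product with A is zero.
        col = [0j] * M
        col[0] = -H[m].conjugate()
        col[m] = 1 + 0j
        B.append(col)
    return H, W0, B


def nc_step(Z, W0, B, G, p_est, mu=0.1, rho=0.9):
    """One normalized-LMS iteration of the noise canceller.

    Returns (Y, G_new, p_est_new). mu and rho are illustrative values.
    """
    y_fbf = herm_dot(W0, Z)                    # fixed beamformer output
    U = [herm_dot(col, Z) for col in B]        # noise reference signals
    Y = y_fbf - herm_dot(G, U)                 # Y = Y_FBF - G^dagger U
    p_est = rho * p_est + (1 - rho) * sum(abs(z) ** 2 for z in Z)
    G_new = [g + mu * u * Y.conjugate() / p_est for g, u in zip(G, U)]
    return Y, G_new, p_est


# Made-up ATFs for M = 4 microphones and a speech-only snapshot.
A = [1 + 0.5j, 0.8 - 0.3j, -0.4 + 0.9j, 0.2 + 0.1j]
H, W0, B = tf_gsc_blocks(A)
leak = max(abs(herm_dot(col, A)) for col in B)  # speech leakage into references
fbf_gain = herm_dot(W0, A)                      # response of the FBF to the source
s = 0.7 - 0.2j
Y, G, p_est = nc_step([a * s for a in A], W0, B, [0j] * 3, 0.0)
```

With `F = 1` the fixed beamformer passes the source as it appears at the first microphone, and a speech-only input leaves the reference signals at numerical zero, so the adaptive filters are not driven by the desired signal.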
TF-GSC - Main Formulas

1. TF ratios:
\[ \mathbf{H}(e^{j\omega}) = \frac{\mathbf{A}(e^{j\omega})}{A_1(e^{j\omega})} \]
2. Fixed beamformer:
\[ \mathbf{W}_0(t, e^{j\omega}) = \frac{\mathbf{H}(e^{j\omega})}{\|\mathbf{H}(e^{j\omega})\|^2} F(e^{j\omega}) \]
   FBF output:
\[ Y_{FBF}(t, e^{j\omega}) = \mathbf{W}_0^\dagger(e^{j\omega}) \mathbf{Z}(t, e^{j\omega}) \]
3. Blocking matrix:
\[ \mathcal{H}^\dagger(e^{j\omega}) \mathbf{A}(e^{j\omega}) = \mathbf{0} \]
4. Noise reference signals:
\[ \mathbf{U}(t, e^{j\omega}) = \mathcal{H}^\dagger(e^{j\omega}) \mathbf{Z}(t, e^{j\omega}) = \mathcal{H}^\dagger(e^{j\omega}) \mathbf{N}(t, e^{j\omega}) \]
5. Output signal:
\[ Y(t, e^{j\omega}) = Y_{FBF}(t, e^{j\omega}) - \mathbf{G}^\dagger(t, e^{j\omega}) \mathbf{U}(t, e^{j\omega}) \]
6. Filters update, for $m = 1, \ldots, M-1$:
\[ \tilde{G}_m(t+1, e^{j\omega}) = G_m(t, e^{j\omega}) + \mu \frac{U_m(t, e^{j\omega}) Y^*(t, e^{j\omega})}{P_{est}(t, e^{j\omega})} \]
\[ G_m(t+1, e^{j\omega}) \xleftarrow{\text{FIR}} \tilde{G}_m(t+1, e^{j\omega}) \]
   where
\[ P_{est}(t, e^{j\omega}) = \rho P_{est}(t-1, e^{j\omega}) + (1-\rho) \sum_m |Z_m(t, e^{j\omega})|^2 \]
7. Keep only non-aliased samples, according to the overlap & save method.

Postfiltering

Problem
• Residual noise.
• Nonstationary noise.

Solution
• Single-mic. postfilter:
  – OM-LSA (Cohen and Berdugo, 2001).
  – MixMax (Burshtein and Gannot, 2002).
• Multi-mic. postfilter.

Multi-Microphone Postfilter: Main Concept

Use both main output and reference signals to obtain:
• Speech and transient noise distinction.
• Faster noise PSD adaptation.
• Speech presence probability modification.

Multi-Microphone Postfiltering (I)

[Block diagram: the microphone signals $Z_1, \ldots, Z_M$ enter the adaptive beamformer, which outputs $Y$ and the reference signals $U_2, \ldots, U_M$. These feed four stages in sequence: detection of source signals (giving $\hat{q}$), signal presence probability estimation (giving $p$), noise PSD estimation (giving $\hat{\lambda}$), and spectral enhancement (OM-LSA), which outputs $\hat{S}$.]
Detection of Source Signals:

Recursively averaged PSD $S_Y(t, e^{j\omega})$, $S_{U_m}(t, e^{j\omega})$:

\[ S_Y(t, e^{j\omega}) = \alpha_s \cdot S_Y(t-1, e^{j\omega}) + (1-\alpha_s) \sum_{\omega'=-\Omega}^{\Omega} b(e^{j\omega'}) \, |Y(t, e^{j(\omega-\omega')})|^2 \]

Pseudo-stationary noise (MCRA): $M_Y(t, e^{j\omega})$, $M_{U_m}(t, e^{j\omega})$

Transient beam-to-reference ratio (TBRR):

\[ \psi(t, e^{j\omega}) = \frac{\max\{S_Y - M_Y, \, 0\}}{\max\left\{ \{S_{U_m} - M_{U_m}\}_{m=2}^{M}, \; \varepsilon M_Y \right\}} \]

A posteriori SNR:

\[ \gamma_s(t, e^{j\omega}) \triangleq |Y(t, e^{j\omega})|^2 / M_Y(t, e^{j\omega}) \]

A priori speech absence probability:

\[ \hat{q}(t, e^{j\omega}) = \begin{cases} 1, & \text{if } \gamma_s(t, e^{j\omega}) \le \gamma_{low} \text{ or } \psi(t, e^{j\omega}) \le \psi_{low} \\ \max\left\{ \dfrac{\gamma_{high} - \gamma_s(t, e^{j\omega})}{\gamma_{high} - \gamma_{low}}, \; \dfrac{\psi_{high} - \psi(t, e^{j\omega})}{\psi_{high} - \psi_{low}}, \; 0 \right\}, & \text{otherwise} \end{cases} \]

Multi-Microphone Postfiltering (II)

Signal Presence Probability Estimation:

A priori SNR ("decision-directed" method):

\[ \hat{\xi}(t, e^{j\omega}) = \alpha \, G_{H_1}^2(t-1, e^{j\omega}) \, \gamma(t-1, e^{j\omega}) + (1-\alpha) \max\left\{ \gamma(t, e^{j\omega}) - 1, \, 0 \right\} \]

Conditional gain:

\[ G_{H_1}(t, e^{j\omega}) \triangleq \frac{\xi(t, e^{j\omega})}{1 + \xi(t, e^{j\omega})} \exp\left( \frac{1}{2} \int_{\upsilon(t, e^{j\omega})}^{\infty} \frac{e^{-x}}{x} \, dx \right) \]

Noise PSD: $\lambda(t, e^{j\omega})$

A posteriori total SNR:

\[ \gamma(t, e^{j\omega}) \triangleq |Y(t, e^{j\omega})|^2 / \lambda(t, e^{j\omega}) \]
\[ \upsilon(t, e^{j\omega}) \triangleq \gamma(t, e^{j\omega}) \, \xi(t, e^{j\omega}) / (1 + \xi(t, e^{j\omega})) \]

Speech presence probability:

\[ p(t, e^{j\omega}) = \left\{ 1 + \frac{q(t, e^{j\omega})}{1 - q(t, e^{j\omega})} \left( 1 + \xi(t, e^{j\omega}) \right) \exp\left( -\upsilon(t, e^{j\omega}) \right) \right\}^{-1} \]

Multi-Microphone Postfiltering (III)
Noise Power Spectral Density Estimate:

Time-varying frequency-dependent smoothing parameter:

\[ \tilde{\alpha}_\lambda(t, e^{j\omega}) \triangleq \alpha_\lambda + (1 - \alpha_\lambda) \, p(t, e^{j\omega}) \]

Noise PSD estimate:

\[ \hat{\lambda}(t+1, e^{j\omega}) = \tilde{\alpha}_\lambda(t, e^{j\omega}) \, \hat{\lambda}(t, e^{j\omega}) + \beta \cdot [1 - \tilde{\alpha}_\lambda(t, e^{j\omega})] \, |Y(t, e^{j\omega})|^2 \]

Multi-Microphone Postfiltering (IV)

Spectral Enhancement (OM-LSA Estimator):

\[ G(t, e^{j\omega}) = \left\{ G_{H_1}(t, e^{j\omega}) \right\}^{p(t, e^{j\omega})} \cdot G_{min}^{\,1 - p(t, e^{j\omega})} \]

where $G_{min}$ is the gain lower bound when speech is absent.

Clean signal estimate:

\[ \hat{S}(t, e^{j\omega}) = G(t, e^{j\omega}) \, Y(t, e^{j\omega}) \]

Experimental Study

Test Scenario:

Speech Signal:
• 4 TIMIT sentences.
• 10 English digits.

Noise Field:
• Directional.
• Nonstationary Diffused.
• Stationary Diffused.
• Car noise.

Environment:
• 5 × 4 × 2.8 [m³] conference room.
• Car.

Algorithms:
• TF-GSC.
• TF-GSC+OM-LSA.
• TF-GSC+MIXMAX.
• Multi-microphone postfilter.

Noise Level

\[ \mathrm{NL} = \mathrm{Mean}_{t \in \text{speech nonactive}} \left\{ 10 \log_{10} E(t) \right\}, \qquad E(t) = \sum_{\tau \in T_t} y^2(\tau) \]

[Figure: noise level [dB] versus SNR [dB], from −12 to 9 dB, in four panels: Directional Noise Field; Diffused and Stationary Noise Field; Diffused and Nonstationary Noise Field; Car Environment. Curves in each panel: Mic #1, GSC, GSC+MX, GSC+LSA, Multi.]

Log Spectral Distance (LSD)

\[ \mathrm{LSD} = \mathrm{Mean}_{t \in \text{speech active}} \left\{ \sqrt{ \mathrm{Mean}_{\omega} \left\{ \left[ 20 \log_{10} |S(t, e^{j\omega})| - 20 \log_{10} |Y(t, e^{j\omega})| \right]^2 \right\} } \right\} \]
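The two quality measures defined above, NL and LSD, can be computed as in the following sketch. It assumes the caller has already selected the speech-inactive frames (for NL) and the speech-active frames (for LSD). The function names and the small constant `eps`, which guards against the logarithm of zero, are assumptions made for illustration and are not part of the slides.

```python
import math


def noise_level(frames):
    """NL: mean over speech-inactive frames of 10*log10 of the frame energy.

    frames is a list of frames, each a list of real time-domain samples y(tau).
    """
    levels = [10 * math.log10(sum(v * v for v in frame)) for frame in frames]
    return sum(levels) / len(levels)


def log_spectral_distance(S_frames, Y_frames, eps=1e-12):
    """LSD: mean over speech-active frames of the RMS log-spectral difference.

    S_frames and Y_frames hold the clean and processed spectra, one list of
    complex frequency bins per frame. eps is an assumed guard against log(0).
    """
    per_frame = []
    for S, Y in zip(S_frames, Y_frames):
        sq = [
            (20 * math.log10(abs(s) + eps) - 20 * math.log10(abs(y) + eps)) ** 2
            for s, y in zip(S, Y)
        ]
        per_frame.append(math.sqrt(sum(sq) / len(sq)))
    return sum(per_frame) / len(per_frame)


# Toy data: an output scaled by 10 in amplitude is 20 dB away in every bin.
S_demo = [[1 + 0j, 2j], [0.5 + 0j, 1 + 1j]]
Y_demo = [[10 * s for s in frame] for frame in S_demo]
lsd_same = log_spectral_distance(S_demo, S_demo)
lsd_scaled = log_spectral_distance(S_demo, Y_demo)
nl_demo = noise_level([[1.0, 0.0], [10.0, 0.0]])
```

Both measures are in dB. An output identical to the clean signal gives an LSD of zero, so lower LSD means less distortion, while lower NL means less residual noise during speech pauses.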
[Figure: LSD versus SNR [dB], from −12 to 9 dB, in four panels: Directional Noise Field; Diffused and Stationary Noise Field; Diffused and Nonstationary Noise Field; Car Environment. Curves in each panel: Mic #1, GSC, GSC+MX, GSC+LSA, Multi.]

Sonograms

[Figure: six sonograms, frequency [Hz] (0 to 4000) versus time [Sec] (2.5 to 6).]
(a) Clean car signal.
(b) Noisy signal at Microphone #1.
(c) TF-GSC.
(d) TF-GSC+MIXMAX.
(e) TF-GSC+OM-LSA.
(f) Multi-microphone postfilter.

Conclusions

• Diffused noise field ⇒ Postfiltering.
• Nonstationary noise ⇒ Multi-microphone postfiltering.
• Multi-microphone postfilter vs. single-microphone postfilter:
  – More noise reduction.
  – Less speech distortion.