Presentation

Speech Enhancement Based on the General Transfer Function GSC and Postfiltering
Sharon Gannot and Israel Cohen
Technion - Israel Institute of Technology
Faculty of Electrical Engineering
Signal and Image Processing Laboratory
June, 2003
Speech Enhancement: TF-GSC and Postfiltering
Problem
Problem Formulation
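[Figure: the source $s(t)$ reaches microphone $m$ through the acoustic transfer function $A_m(e^{j\omega})$; the noise signals $n^s_m(t)$ and $n^t_m(t)$ are added to give the observation $z_m(t)$, $m = 1, \ldots, M$.]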
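\[
\mathbf{Z}(t, e^{j\omega}) = \mathbf{A}(e^{j\omega})\, S(t, e^{j\omega}) + \mathbf{N}^s(t, e^{j\omega}) + \mathbf{N}^t(t, e^{j\omega})
\]
where
\[
\begin{aligned}
\mathbf{Z}^T(t, e^{j\omega}) &= \left[ Z_1(t, e^{j\omega}) \;\; Z_2(t, e^{j\omega}) \;\; \cdots \;\; Z_M(t, e^{j\omega}) \right] \\
\mathbf{A}^T(e^{j\omega}) &= \left[ A_1(e^{j\omega}) \;\; A_2(e^{j\omega}) \;\; \cdots \;\; A_M(e^{j\omega}) \right] \\
\mathbf{N}^{s\,T}(t, e^{j\omega}) &= \left[ N_1^s(t, e^{j\omega}) \;\; N_2^s(t, e^{j\omega}) \;\; \cdots \;\; N_M^s(t, e^{j\omega}) \right] \\
\mathbf{N}^{t\,T}(t, e^{j\omega}) &= \left[ N_1^t(t, e^{j\omega}) \;\; N_2^t(t, e^{j\omega}) \;\; \cdots \;\; N_M^t(t, e^{j\omega}) \right].
\end{aligned}
\]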
Outline
Goal
Find $\hat{S}(t, e^{j\omega})$ from $\mathbf{Z}(t, e^{j\omega})$ using beamforming and postfiltering:
• The general transfer function GSC (TF-GSC).
• Postfiltering:
– Single-microphone: OM-LSA, MixMax.
– Multi-microphone.
• Experimental study.
Beamforming
Wideband Beamformer
[Figure: wideband beamformer. Each input $Z_m(t, e^{j\omega})$ is filtered by $W_m^*(t, e^{j\omega})$, $m = 1, \ldots, M$, and the filtered signals are summed to give $Y(t, e^{j\omega})$.]
Beamformer output:
\[
Y(t, e^{j\omega}) = \mathbf{W}^{\dagger}(t, e^{j\omega})\, \mathbf{Z}(t, e^{j\omega})
\]
where
\[
\mathbf{W}^T(t, e^{j\omega}) = \left[ W_1(t, e^{j\omega}), \ldots, W_M(t, e^{j\omega}) \right].
\]
Minimum Variance Distortionless Beamformer:
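\[
\min_{\mathbf{W}} \; \mathbf{W}^{\dagger}(t, e^{j\omega})\, \mathbf{S}_{zz}(t, e^{j\omega})\, \mathbf{W}(t, e^{j\omega})
\quad \text{subject to} \quad
\mathbf{W}^{\dagger}(t, e^{j\omega})\, \mathbf{A}(e^{j\omega}) = F^*(t, e^{j\omega})
\]
where $\mathbf{S}_{zz}(t, e^{j\omega}) = E\{\mathbf{Z}(t, e^{j\omega})\, \mathbf{Z}^{\dagger}(t, e^{j\omega})\}$.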
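The constrained minimization above has the standard closed-form solution $\mathbf{W} = F\, \mathbf{S}_{zz}^{-1}\mathbf{A} / (\mathbf{A}^{\dagger}\mathbf{S}_{zz}^{-1}\mathbf{A})$. This solution is a textbook result and is not stated on the slide. The sketch below evaluates it at a single frequency bin and compares it with a plain matched filter that meets the same constraint. The function names, the array size and the random test data are hypothetical, chosen only for illustration.

```python
import numpy as np

def mvdr_weights(Szz, A, F=1.0):
    """Minimise W^H Szz W subject to W^H A = conj(F) at one frequency bin."""
    Sinv_A = np.linalg.solve(Szz, A)            # Szz^{-1} A without forming the inverse
    return F * Sinv_A / np.vdot(A, Sinv_A)      # np.vdot conjugates its first argument

def output_power(W, Szz):
    """Output variance W^H Szz W (real for Hermitian Szz)."""
    return np.vdot(W, Szz @ W).real

rng = np.random.default_rng(0)
M = 4                                           # hypothetical number of microphones
A = rng.standard_normal(M) + 1j * rng.standard_normal(M)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Szz = B @ B.conj().T + np.eye(M)                # Hermitian, positive definite test matrix

W = mvdr_weights(Szz, A)
W_matched = A / np.vdot(A, A)                   # also satisfies W^H A = 1

print("constraint W^H A:", np.vdot(W, A))
print("MVDR power:", output_power(W, Szz))
print("matched-filter power:", output_power(W_matched, Szz))
```

Both weight vectors pass the desired signal with unit gain; the MVDR weights should give the smaller output power for any positive definite $\mathbf{S}_{zz}$.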
The General Transfer Function GSC
Gannot, Burshtein and Weinstein, 2001
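[Figure: TF-GSC block diagram. The inputs $Z_1(t, e^{j\omega}), \ldots, Z_M(t, e^{j\omega})$ feed the fixed beamformer $\mathbf{W}_0^{\dagger}$, giving $Y_{FBF}(t, e^{j\omega})$, and the blocking matrix $\mathcal{H}^{\dagger}$, giving the noise references $U_2(t, e^{j\omega}), \ldots, U_M(t, e^{j\omega})$. The references are filtered by $G_2(t, e^{j\omega}), \ldots, G_M(t, e^{j\omega})$ and summed to $Y_{NC}(t, e^{j\omega})$, which is subtracted from $Y_{FBF}(t, e^{j\omega})$ to give $Y(t, e^{j\omega})$.]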
• Fixed beamformer (FBF): a filter matched to the ATF ratios.
• Blocking matrix (BM): built from the ATF ratios.
• Noise canceller (NC): adapted with multichannel LMS.
TF-GSC - Main Formulas
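1. TF ratios:
\[
\mathbf{H}(e^{j\omega}) = \frac{\mathbf{A}(e^{j\omega})}{A_1(e^{j\omega})}.
\]
2. Fixed beamformer:
\[
\mathbf{W}_0(e^{j\omega}) = \frac{\mathbf{H}(e^{j\omega})}{\|\mathbf{H}(e^{j\omega})\|^2}\, F(e^{j\omega}).
\]
FBF output:
\[
Y_{FBF}(t, e^{j\omega}) = \mathbf{W}_0^{\dagger}(e^{j\omega})\, \mathbf{Z}(t, e^{j\omega}).
\]
3. Blocking matrix $\mathcal{H}$ (a matrix, distinct from the ratio vector $\mathbf{H}$):
\[
\mathcal{H}^{\dagger}(e^{j\omega})\, \mathbf{A}(e^{j\omega}) = \mathbf{0}.
\]
4. Noise reference signals, with $\mathbf{N} = \mathbf{N}^s + \mathbf{N}^t$:
\[
\mathbf{U}(t, e^{j\omega}) = \mathcal{H}^{\dagger}(e^{j\omega})\, \mathbf{Z}(t, e^{j\omega}) = \mathcal{H}^{\dagger}(e^{j\omega})\, \mathbf{N}(t, e^{j\omega}).
\]
5. Output signal:
\[
Y(t, e^{j\omega}) = Y_{FBF}(t, e^{j\omega}) - \mathbf{G}^{\dagger}(t, e^{j\omega})\, \mathbf{U}(t, e^{j\omega}).
\]
6. Filters update. For $m = 1, \ldots, M-1$:
\[
\begin{aligned}
\tilde{G}_m(t+1, e^{j\omega}) &= G_m(t, e^{j\omega}) + \mu\, \frac{U_m(t, e^{j\omega})\, Y^*(t, e^{j\omega})}{P_{est}(t, e^{j\omega})} \\
G_m(t+1, e^{j\omega}) &\xleftarrow{\text{FIR}} \tilde{G}_m(t+1, e^{j\omega})
\end{aligned}
\]
where
\[
P_{est}(t, e^{j\omega}) = \rho\, P_{est}(t-1, e^{j\omega}) + (1-\rho) \sum_m |Z_m(t, e^{j\omega})|^2.
\]
7. Keep only non-aliased samples, according to the overlap & save method.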
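The per-bin arithmetic of steps 1 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the blocking matrix is taken as the matrix whose $m$-th column is $[-H_m^*, 0, \ldots, 1, \ldots, 0]^T$ (one choice that satisfies the constraint in step 3), $F = 1$, the values of `mu` and `rho` are placeholders, and the FIR constraint of step 6 and the overlap & save bookkeeping of step 7 are omitted. The function name `tf_gsc_step` is hypothetical.

```python
import numpy as np

def tf_gsc_step(Z, H, G, P_est, F=1.0, mu=0.1, rho=0.9):
    """One frame of TF-GSC at a single frequency bin.

    Z: (M,) microphone spectra; H: (M,) ATF ratios with H[0] = 1;
    G: (M-1,) noise-canceller filters; P_est: running power estimate.
    """
    W0 = H * F / np.vdot(H, H).real              # step 2: fixed beamformer
    Y_fbf = np.vdot(W0, Z)                       # W0^H Z
    # Steps 3-4 with the assumed blocking matrix: U_m = Z_m - H_m Z_1,
    # which vanishes whenever Z = A S.
    U = Z[1:] - H[1:] * Z[0]
    Y = Y_fbf - np.vdot(G, U)                    # step 5: G^H U subtracted
    P_est = rho * P_est + (1.0 - rho) * np.sum(np.abs(Z) ** 2)
    G_new = G + mu * U * np.conj(Y) / P_est      # step 6, FIR constraint omitted
    return Y, U, G_new, P_est

rng = np.random.default_rng(1)
M = 4                                            # hypothetical array size
A = rng.standard_normal(M) + 1j * rng.standard_normal(M)
H = A / A[0]                                     # step 1: ATF ratios
S = 0.7 - 0.2j                                   # one speech coefficient
Z = A * S                                        # noise-free observation
G = np.zeros(M - 1, dtype=complex)

Y, U, G_new, P = tf_gsc_step(Z, H, G, P_est=1.0)
print("noise references:", np.abs(U))
print("output:", Y, "expected A_1 S:", A[0] * S)
```

With a noise-free input the references are zero, the filters are left unchanged, and the output equals $A_1 S$, which is the distortionless response implied by the fixed beamformer.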
Postfiltering
Problem:
• Residual noise.
• Nonstationary noise.
Solution:
• Single-microphone postfilter:
– OM-LSA (Cohen and Berdugo, 2001).
– MixMax (Burshtein and Gannot, 2002).
• Multi-microphone postfilter.
Multi-Microphone Postfilter
Main Concept
Use both main output and reference signals to obtain:
• Speech and transient noise distinction.
• Faster noise PSD adaptation.
• Speech presence probability modification.
Multi-Microphone Postfiltering (I)
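[Figure: postfilter block diagram. The adaptive beamformer maps $Z_1, \ldots, Z_M$ to the output $Y$ and the references $U_2, \ldots, U_M$. Detection of source signals yields $\hat{q}$; signal presence probability estimation yields $p$; noise PSD estimation yields $\hat{\lambda}$; spectral enhancement (OM-LSA) yields $\hat{S}$.]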
Detection of Source Signals:
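Recursively averaged PSDs $S_Y(t, e^{j\omega})$ and $S_{U_m}(t, e^{j\omega})$:
\[
S_Y(t, e^{j\omega}) = \alpha_s\, S_Y(t-1, e^{j\omega}) + (1-\alpha_s) \sum_{\omega'=-\Omega}^{\Omega} b(e^{j\omega'})\, |Y(t, e^{j(\omega-\omega')})|^2
\]
Pseudo-stationary noise (MCRA): $M_Y(t, e^{j\omega})$, $M_{U_m}(t, e^{j\omega})$
Transient beam-to-reference ratio (TBRR):
\[
\psi(t, e^{j\omega}) = \frac{\max\{S_Y - M_Y,\, 0\}}{\max\left\{ \{S_{U_m} - M_{U_m}\}_{m=2}^{M},\; \varepsilon M_Y \right\}}
\]
A posteriori SNR:
\[
\gamma_s(t, e^{j\omega}) \triangleq |Y(t, e^{j\omega})|^2 / M_Y(t, e^{j\omega})
\]
A priori speech absence probability:
\[
\hat{q}(t, e^{j\omega}) =
\begin{cases}
1, & \text{if } \gamma_s(t, e^{j\omega}) \le \gamma_{low} \text{ or } \psi(t, e^{j\omega}) \le \psi_{low} \\[4pt]
\max\left\{ \dfrac{\gamma_{high} - \gamma_s(t, e^{j\omega})}{\gamma_{high} - \gamma_{low}},\; \dfrac{\psi_{high} - \psi(t, e^{j\omega})}{\psi_{high} - \psi_{low}},\; 0 \right\}, & \text{otherwise.}
\end{cases}
\]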
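The TBRR and the a priori speech absence probability can be sketched per frequency bin as below. The smoothed PSDs and the MCRA noise estimates are taken as given inputs. The threshold values and $\varepsilon$ are placeholders chosen for the example, not values from the slides, and the function names are hypothetical.

```python
import numpy as np

def tbrr(S_Y, M_Y, S_U, M_U, eps=0.01):
    """Transient beam-to-reference ratio.

    S_Y, M_Y: (K,) beamformer-output PSD and its pseudo-stationary part.
    S_U, M_U: (M-1, K) the same quantities for the reference signals.
    """
    num = np.maximum(S_Y - M_Y, 0.0)
    den = np.maximum(np.max(S_U - M_U, axis=0), eps * M_Y)
    return num / den

def speech_absence_prior(gamma_s, psi, gamma_low=1.0, gamma_high=4.6,
                         psi_low=1.0, psi_high=3.0):
    """A priori speech absence probability q_hat (thresholds are assumed values)."""
    gamma_s = np.asarray(gamma_s, dtype=float)
    psi = np.asarray(psi, dtype=float)
    q = np.maximum.reduce([
        (gamma_high - gamma_s) / (gamma_high - gamma_low),
        (psi_high - psi) / (psi_high - psi_low),
        np.zeros_like(psi),
    ])
    absent = (gamma_s <= gamma_low) | (psi <= psi_low)
    return np.where(absent, 1.0, q)

# Two bins: a transient stronger in the beam than in the references, then no transient.
psi = tbrr(S_Y=np.array([2.0, 1.0]), M_Y=np.array([1.0, 1.0]),
           S_U=np.array([[1.5, 1.0], [1.2, 1.0]]),
           M_U=np.array([[1.0, 1.0], [1.0, 1.0]]))
q_hat = speech_absence_prior(gamma_s=[0.5, 10.0, 2.8], psi=[5.0, 5.0, 5.0])
print(psi, q_hat)
```

A large $\psi$ indicates a transient that is stronger at the beamformer output than at the references, which points to the desired source rather than to interfering noise.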
Multi-Microphone Postfiltering (II)
[Figure: postfilter block diagram, repeated.]
Signal Presence Probability Estimation:
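A priori SNR (“decision-directed” method):
\[
\hat{\xi}(t, e^{j\omega}) = \alpha\, G_{H_1}^2(t-1, e^{j\omega})\, \gamma(t-1, e^{j\omega}) + (1-\alpha) \max\left\{ \gamma(t, e^{j\omega}) - 1,\; 0 \right\}
\]
Conditional gain:
\[
G_{H_1}(t, e^{j\omega}) \triangleq \frac{\xi(t, e^{j\omega})}{1 + \xi(t, e^{j\omega})} \exp\left( \frac{1}{2} \int_{\upsilon(t, e^{j\omega})}^{\infty} \frac{e^{-x}}{x}\, dx \right)
\]
Noise PSD: $\lambda(t, e^{j\omega})$
A posteriori total SNR:
\[
\gamma(t, e^{j\omega}) \triangleq \left| Y(t, e^{j\omega}) \right|^2 / \lambda(t, e^{j\omega})
\]
\[
\upsilon(t, e^{j\omega}) \triangleq \gamma(t, e^{j\omega})\, \xi(t, e^{j\omega}) / (1 + \xi(t, e^{j\omega}))
\]
Speech presence probability:
\[
p(t, e^{j\omega}) = \left\{ 1 + \frac{q(t, e^{j\omega})}{1 - q(t, e^{j\omega})} \left( 1 + \xi(t, e^{j\omega}) \right) \exp\left( -\upsilon(t, e^{j\omega}) \right) \right\}^{-1}
\]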
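The three formulas above can be evaluated for a single time-frequency bin as sketched below. The integral in the conditional gain is the exponential integral $E_1(\upsilon)$; the helper `exp1` is a hand-rolled approximation (power series for small arguments, truncated continued fraction otherwise) written for this example, and `scipy.special.exp1` could be used in its place. The value of `alpha`, the small floor on $\upsilon$ and all function names are assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp1(x, terms=60):
    """E1(x) = integral from x to infinity of exp(-u)/u du, for x > 0."""
    if x <= 1.0:
        total, term = -EULER_GAMMA - math.log(x), 1.0
        for k in range(1, terms):
            term *= -x / k                       # (-x)^k / k!
            total -= term / k
        return total
    f = x + 1.0 + 2.0 * terms                    # continued fraction, bottom-up
    for i in range(terms, 0, -1):
        f = (x + 2.0 * i - 1.0) - i * i / f
    return math.exp(-x) / f

def a_priori_snr(g_prev, gamma_prev, gamma, alpha=0.92):
    """Decision-directed estimate; alpha is an assumed smoothing value."""
    return alpha * g_prev ** 2 * gamma_prev + (1.0 - alpha) * max(gamma - 1.0, 0.0)

def conditional_gain(xi, gamma):
    """G_H1: gain under the hypothesis that speech is present."""
    upsilon = max(gamma * xi / (1.0 + xi), 1e-10)   # floor avoids log(0)
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(upsilon))

def speech_presence_probability(q, xi, gamma):
    """p, given the a priori speech absence probability q."""
    if q >= 1.0:
        return 0.0
    upsilon = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * math.exp(-upsilon))

xi = a_priori_snr(g_prev=1.0, gamma_prev=2.0, gamma=3.0, alpha=0.9)
g_h1 = conditional_gain(xi=1.0, gamma=2.0)
p = speech_presence_probability(q=0.5, xi=1.0, gamma=2.0)
print(xi, g_h1, p)
```

At high SNR the conditional gain approaches the Wiener gain $\xi/(1+\xi)$, because $E_1(\upsilon)$ decays to zero for large $\upsilon$.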
Multi-Microphone Postfiltering (III)
[Figure: postfilter block diagram, repeated.]
Noise Power Spectral Density Estimate:
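Time-varying frequency-dependent smoothing parameter:
\[
\tilde{\alpha}_\lambda(t, e^{j\omega}) \triangleq \alpha_\lambda + (1 - \alpha_\lambda)\, p(t, e^{j\omega})
\]
Noise PSD estimate:
\[
\hat{\lambda}(t+1, e^{j\omega}) = \tilde{\alpha}_\lambda(t, e^{j\omega})\, \hat{\lambda}(t, e^{j\omega}) + \beta \cdot \left[ 1 - \tilde{\alpha}_\lambda(t, e^{j\omega}) \right] |Y(t, e^{j\omega})|^2
\]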
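A short sketch of this recursion, vectorized over frequency bins, follows. The default values of `alpha_lambda` and `beta` are placeholders for the example rather than values given on the slide, and the function name is hypothetical.

```python
import numpy as np

def update_noise_psd(lam, p, Y_abs2, alpha_lambda=0.85, beta=1.47):
    """One step of the noise PSD recursion.

    lam: current estimate; p: speech presence probability;
    Y_abs2: |Y|^2 of the current frame. All arrays share one shape.
    """
    alpha_t = alpha_lambda + (1.0 - alpha_lambda) * p   # time-varying smoothing
    return alpha_t * lam + beta * (1.0 - alpha_t) * Y_abs2

lam = np.array([1.0, 1.0])
p = np.array([1.0, 0.0])              # bin 0: speech surely present; bin 1: surely absent
Y_abs2 = np.array([100.0, 3.0])
lam_next = update_noise_psd(lam, p, Y_abs2, alpha_lambda=0.85, beta=1.0)
print(lam_next)
```

When $p = 1$ the smoothing parameter equals one and the estimate is frozen, so strong speech does not leak into the noise PSD; when $p = 0$ the update reduces to ordinary recursive averaging.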
Multi-Microphone Postfiltering (IV)
[Figure: postfilter block diagram, repeated.]
Spectral Enhancement (OM-LSA Estimator):
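\[
G(t, e^{j\omega}) = \left\{ G_{H_1}(t, e^{j\omega}) \right\}^{p(t, e^{j\omega})} \cdot G_{min}^{\,1 - p(t, e^{j\omega})}
\]
where $G_{min}$ is the gain lower bound when speech is absent.
Clean signal estimate:
\[
\hat{S}(t, e^{j\omega}) = G(t, e^{j\omega})\, Y(t, e^{j\omega})
\]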
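The OM-LSA gain is a geometric interpolation between the conditional gain and the floor $G_{min}$, weighted by the speech presence probability. A minimal sketch, with `g_min` set to an arbitrary illustrative value and a hypothetical function name:

```python
import numpy as np

def om_lsa_gain(g_h1, p, g_min=0.1):
    """G = G_H1^p * G_min^(1-p), evaluated per time-frequency bin."""
    return g_h1 ** p * g_min ** (1.0 - p)

g_h1 = np.array([0.8, 0.8, 0.4])
p = np.array([1.0, 0.0, 0.5])
G = om_lsa_gain(g_h1, p, g_min=0.1)

Y = np.array([1.0 + 1.0j, 2.0 + 0.0j, 0.0 - 3.0j])   # beamformer output spectrum
S_hat = G * Y                                         # clean signal estimate
print(G, S_hat)
```

The gain equals $G_{H_1}$ where speech is surely present, falls to $G_{min}$ where it is surely absent, and takes the geometric mean of the two at $p = 0.5$.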
Experiment
Experimental Study
Test Scenario:
Speech Signal:
• 4 TIMIT sentences.
• 10 English digits.
Noise Field:
• Directional.
• Nonstationary diffused.
• Stationary diffused.
• Car noise.
Environment:
• 5 × 4 × 2.8 [m³] conference room.
• Car.
Algorithms:
• TF-GSC.
• TF-GSC+OM-LSA.
• TF-GSC+MIXMAX.
• Multi-microphone postfilter.
Noise Level
\[
NL = \operatorname{Mean}_{t \,\in\, \text{speech nonactive}} \left\{ 10 \log_{10}(E(t)) \right\},
\qquad
E(t) = \sum_{\tau \in T_t} y^2(\tau)
\]
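The noise level measure can be computed as sketched below. The segmentation into non-overlapping frames of `frame_len` samples stands in for the index sets $T_t$ and is an assumption, as is the availability of a per-frame speech activity mask; the function name is hypothetical.

```python
import numpy as np

def noise_level_db(y, speech_active, frame_len=256):
    """Mean frame energy in dB over the frames where speech is not active.

    y: 1-D time-domain signal; speech_active: one boolean per frame.
    """
    n_frames = len(y) // frame_len
    frames = np.reshape(y[:n_frames * frame_len], (n_frames, frame_len))
    E = np.sum(frames ** 2, axis=1)                       # E(t) per frame
    idle = ~np.asarray(speech_active[:n_frames], dtype=bool)
    return np.mean(10.0 * np.log10(E[idle]))

y = 0.5 * np.ones(1024)                                   # constant test signal
nl = noise_level_db(y, speech_active=[False, True, False, False])
print(nl)
```

For the constant test signal every frame has energy $256 \cdot 0.25 = 64$, so the result is $10 \log_{10} 64$ dB regardless of which frames are excluded.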
[Figure: noise level [dB] versus SNR [dB] (−12 to 9 dB) for Mic #1, GSC, GSC+MX, GSC+LSA and Multi, in four panels: Directional Noise Field; Diffused and Stationary Noise Field; Diffused and Nonstationary Noise Field; Car Environment.]
Log Spectral Distance (LSD)
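\[
LSD = \operatorname{Mean}_{t \,\in\, \text{speech active}} \left\{ \sqrt{ \operatorname{Mean}_{\omega} \left\{ \left[ 20 \log_{10} |S(t, e^{j\omega})| - 20 \log_{10} |Y(t, e^{j\omega})| \right]^2 \right\} } \right\}.
\]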
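A sketch of the LSD computation on two STFT arrays follows. The magnitude floor that guards the logarithm is an assumption added for numerical safety and is not part of the definition above; the per-frame speech activity mask is taken as given, and the function name is hypothetical.

```python
import numpy as np

def log_spectral_distance(S, Y, speech_active, floor=1e-10):
    """LSD in dB between clean S and enhanced Y, both (frames, bins) STFTs."""
    log_s = 20.0 * np.log10(np.maximum(np.abs(S), floor))
    log_y = 20.0 * np.log10(np.maximum(np.abs(Y), floor))
    per_frame = np.sqrt(np.mean((log_s - log_y) ** 2, axis=1))   # RMS over frequency
    return np.mean(per_frame[np.asarray(speech_active, dtype=bool)])

S = np.ones((3, 4), dtype=complex)
active = [True, False, True]
lsd_same = log_spectral_distance(S, S, active)           # identical spectra
lsd_scaled = log_spectral_distance(S, 10.0 * S, active)  # uniform 20 dB offset
print(lsd_same, lsd_scaled)
```

Identical spectra give zero distance, and a uniform tenfold magnitude error gives 20 dB.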
[Figure: LSD versus SNR [dB] (−12 to 9 dB) for Mic #1, GSC, GSC+MX, GSC+LSA and Multi, in four panels: Directional Noise Field; Diffused and Stationary Noise Field; Diffused and Nonstationary Noise Field; Car Environment.]
Sonograms
[Figure: six sonograms, frequency 0 to 4000 Hz versus time 2.5 to 6 s.]
(a) Clean car signal.
(b) Noisy signal at Microphone #1.
(c) TF-GSC.
(d) TF-GSC+MIXMAX.
(e) TF-GSC+OM-LSA.
(f) Multi-microphone postfilter.
Conclusions
• Diffused noise field ⇒ Postfiltering.
• Nonstationary noise ⇒ Multi-microphone postfiltering.
• Multi-microphone postfilter vs. single-microphone postfilter:
– More noise reduction.
– Less speech distortion.