
Feature Selection in Speaker
Verification Systems
Arnon Cohen and Yaniv Zigel
Electrical and Computer Eng. Dept.
Ben-Gurion University
Beer-Sheva, Israel
Speaker Recognition
• Speaker Identification (closed and open sets)
• Speaker Verification
• Speaker Spotting
• Group Recognition: Gender, Accent, Age, …
• Speaker Tracking
• Text Dependent / Text Independent
• Supervised / Unsupervised
Applications
• Commercial
  – Access Control (supervised verification, text dependent/independent)
  – Segmentation (unsupervised, text independent)
  – …
• Military
  – Spotting
  – Tracking
• Forensic
  – Identification, Verification, Supervised, Unsupervised
Speaker Recognition - Constraints
• Natural constraints:
  – Relatively small amount of information on the speaker's identity in the speech signal (as compared to the message).
  – Features change over time (sensitivity to the train-test time difference).
  – The human voice is highly sensitive to:
    • the speaker's physiological and psychological states,
    • environmental conditions (heat, cold).
Speaker Recognition - Constraints
• Environmental constraints:
  – different SNRs,
  – channel transmission (telephone, internet),
  – microphone transmission (different handsets, distances, angles).
• System and application constraints:
  – Size of training database
  – Duration of test utterance
  – Time between training and testing
  – Memory, MIPS.
Supervised Speaker Recognition
[Block diagram]
Training (Enrollment), on targets & impostors:
  Utterance → Data Acquisition → Feature Extraction → Model Parameter Estimation → Model of the i-th person and "World" Model
  (a Cohort Selection Policy picks the cohort models)
Recognition:
  Utterance → Data Acquisition → Feature Extraction → Matching Strategy → Rejection Strategy
Speaker Verification
Notation:
  T – Target; I – Impostor; X – Tested speaker; C – Cohort; O – Observation
  λT – model of the target speaker
  λI – model of an impostor speaker
  λC – model of a cohort speaker
  Ox – sequence of feature vectors corresponding to the utterance
Score:
  Score = log P(Ox | λT) − f [ P(Ox | λC) ]
Decision:
  Score ≥ ψ → accept
  ξ ≤ Score < ψ → repeat
  Score < ξ → reject
Normalization (Dynamic Threshold)
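The three-way decision rule can be sketched as follows. This is a minimal sketch, not the authors' implementation: the threshold names xi and psi mirror ξ and ψ, and the convention that higher scores favour the target is an assumption.

```python
def verify(score, xi, psi):
    """Three-way verification decision with two thresholds (xi < psi).

    Assumes higher scores favour the target model: a score above the
    upper threshold psi is accepted, one below the lower threshold xi
    is rejected, and anything in between asks for a repeat utterance.
    """
    if score >= psi:
        return "accept"
    if score >= xi:
        return "repeat"
    return "reject"
```

The "repeat" band is what makes the threshold effectively dynamic: borderline utterances trigger another trial instead of a hard decision.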
Open Questions
• What (and how many) features to use?
• What recognition engine to use? (type, architecture, order)
• Rejection policy – cohorts? How many, and how to choose them?
• What normalization scheme should be used?
• How to deal with the channel problem? (robust features, normalization, adaptation, …)
• How to deal with the noise problem?
• How to deal with variations in speaker characteristics (time, physiology, psychology)?
• How to deal with mimicry and falsification?
Features for Speaker Recognition
• Efficient in representing the speaker-dependent information
• Easy to measure
• Stable over time
• Occur naturally and frequently in speech
• Change little from one speaking environment to another
• Insensitive to mimicry and falsification
• Insensitive to noise and bandwidth limitations
• Insensitive to speaking state
Such features do not exist!
Features used in Speaker Recognition Systems
• Vocal tract model features
  – Autocorrelation coefficients (COR)
  – Linear Prediction Coefficients (LPC)
  – Partial Correlation coefficients (PARCOR)
  – Log Area Ratio coefficients (LAR)
  – Perceptual Linear Prediction (PLP)
• Spectral and cepstral features
  – Line Spectrum Pairs (LSP)
  – Bank of filters (linear)
  – Bank of filters (Mel)
  – Mel Frequency Cepstral Coefficients (MFCC)
• Static features
• Dynamic features (∆, ∆∆)
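In the simplest case the dynamic (∆) features are frame-to-frame first differences of the static coefficients, which is exactly how the ∆ features in the experimental setup below are defined. A minimal sketch:

```python
def delta_features(frames):
    """First difference of a per-frame coefficient sequence.

    `frames` is a list of frames, each a list of static coefficients
    (e.g. 12 MFCCs); returns one delta frame per consecutive pair.
    """
    return [[c - p for p, c in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]
```

∆∆ features are obtained the same way, by applying the difference once more to the ∆ sequence.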
Other features:
• Prosodic
– Pitch contours
– Intonation
– Stress
• Phonetic (pronunciation)
[Andrews, Kohler & Campbell]
– Phone features
(requires large databases and Phone recognition system)
• Idiolectal Features
[Doddington]
– Unigrams – probability of word occurrence
– Bigrams – probability of pairs of words
(requires large databases and speech recognition systems)
The “Curse” of Dimensionality
[Figure: recognition error vs. number of features. The error first drops as features are added (expected behavior), then rises again (the curse of dimensionality); the minimum marks the best dimension region.]
Recognition Engines
[Figure: timeline of recognition engines, 1930–2000 – aural & spectrogram matching, then template matching, then DTW and VQ, then HMM, GMM and ANN, and onward to SVM?, hybrid?, ????. Over the same period, databases evolved from small, clean, controlled speech to large, realistic, unconstrained speech.]
Rejection Policy - Cohorts
[Figure: feature space with the claimed speaker's model surrounded by cohort models #1, #2, #3, #4, …, #C.]
Score Normalization
s1(O) = log p(O|λT) / max_{c∈C(T)} [ log p(O|λc) ]

s2(O) = log p(O|λT) − max_{c∈C(T)} [ log p(O|λc) ]

s3(O) = log p(O|λT) − log[ (1/C) Σ_{c=1..C} p(O|λc) ]

s4(O) = log p(O|λT) − (1/C) Σ_{c=1..C} log p(O|λc)

s5(O) = log p(O|λT) / [ (1/C) Σ_{c=1..C} log p(O|λc) ]

s6(O) = log p(O|λT) / log[ (1/C) Σ_{c=1..C} p(O|λc) ]
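Given the per-utterance log-likelihoods, the subtraction-based normalizations s2, s3 and s4 can be sketched directly. This is a minimal sketch; s3 averages the cohort likelihoods in the probability domain, so it uses a log-sum-exp for numerical stability:

```python
import math

def s2(logp_target, logp_cohorts):
    """Target log-likelihood minus the best cohort log-likelihood."""
    return logp_target - max(logp_cohorts)

def s3(logp_target, logp_cohorts):
    """Target log-likelihood minus the log of the *mean cohort
    likelihood*, computed stably as log-sum-exp minus log C."""
    m = max(logp_cohorts)
    lse = m + math.log(sum(math.exp(l - m) for l in logp_cohorts))
    return logp_target - (lse - math.log(len(logp_cohorts)))

def s4(logp_target, logp_cohorts):
    """Target log-likelihood minus the mean cohort log-likelihood."""
    return logp_target - sum(logp_cohorts) / len(logp_cohorts)
```

The ratio-based forms s1, s5 and s6 follow the same pattern with division instead of subtraction.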
Detection Error Tradeoff (DET) curve
[Figure: DET curve, FR vs. FA, with the EER line; operating points range from high security (low FA) through balance to high convenience (low FR).]
State of the Art
[Figure: DET curves (FR vs. FA); the error grows as the constraints increase:]
• ~25% EER – Text-independent (read sentences); military radio, multiple radios & microphones; moderate amount of training
• ~10% EER – Text-independent (conversation); telephone data, multiple microphones; moderate amount of training
• ~1% EER – Text-dependent (digit strings); telephone data, multiple microphones; small amount of training data
• ~0.1% EER – Text-dependent; clean data, single microphone; large amount of train/test data
Feature Selection
[Figure: a common feature space spanned by features β1 … βK, and individual feature spaces for speaker 1 and speaker 2 (features βi, βj, βa, βb, βM, βN, …); each speaker is best separated in a different feature subspace.]
Feature Selection Problem
The method for feature selection can be specified in terms of two components:
I. A performance criterion (effective criterion, discriminant function) for the selection of features from the input feature set.
II. A selection procedure.
The problem of feature selection can be described as follows:
Given a set y of K features
  y = { yi | i = 1, 2, …, K },
select a subset x of k features (k < K)
  x = { xi | i = 1, 2, …, k ; xi ∈ y },
such that a criterion J(·) is optimized.
Performance Criteria
F-ratio:
  F = (variance of inter-speaker feature means) / (mean of intra-speaker feature variances)

Scatter Matrices and Separability Criteria:
  J1 = tr( S2⁻¹ S1 )
  J2 = ln | S2⁻¹ S1 |
  J3 = tr S1 / tr S2

Bhattacharyya distance:
  dB = (1/8)·(mi − mj)ᵀ·[ (Wi + Wj)/2 ]⁻¹·(mi − mj) + (1/2)·ln( |(Wi + Wj)/2| / √(|Wi|·|Wj|) )

Bhattacharyya shape:
  dBs = (1/2)·ln( |(Wi + Wj)/2| / √(|Wi|·|Wj|) )

Divergence distance:
  dD = E[ ln( p(β|ωi)/p(β|ωj) ) | ωi ] − E[ ln( p(β|ωi)/p(β|ωj) ) | ωj ]

Divergence shape:
  dDs = (1/2)·tr[ (Wi − Wj)·(Wj⁻¹ − Wi⁻¹) ]

(S1, S2 – scatter matrices; mi, mj – class means; Wi, Wj – within-class covariance matrices.)
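The Bhattacharyya distance can be computed directly from the class means and covariance matrices. A minimal NumPy sketch (the two summands correspond to the mean term and the "shape" term above):

```python
import numpy as np

def bhattacharyya(mi, Wi, mj, Wj):
    """Bhattacharyya distance between two Gaussian classes with
    means mi, mj and covariance matrices Wi, Wj."""
    W = (Wi + Wj) / 2.0                       # averaged covariance
    d = mi - mj
    mean_term = 0.125 * d @ np.linalg.solve(W, d)
    shape_term = 0.5 * np.log(np.linalg.det(W)
                              / np.sqrt(np.linalg.det(Wi) * np.linalg.det(Wj)))
    return mean_term + shape_term
```

Dropping the mean term gives the Bhattacharyya shape dBs, which measures only the covariance mismatch.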
Performance Criteria
• EER(O) – Equal Error Rate: FA = FR
• EGM(O) – Geometric-mean error: EGM = √( EFR · EFA )
• DCF(O) – Decision Cost Function: DCF(O) = ρ·Pr(FR|O) + (1 − ρ)·Pr(FA|O),
  where ρ is the desired weight of FR, 0 ≤ ρ ≤ 1
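A minimal sketch of the last two criteria (the square root in EGM is the geometric mean of the two error rates):

```python
import math

def egm(e_fr, e_fa):
    """Geometric-mean error: sqrt(E_FR * E_FA)."""
    return math.sqrt(e_fr * e_fa)

def dcf(p_fr, p_fa, rho):
    """Decision cost function: rho*Pr(FR) + (1 - rho)*Pr(FA), 0 <= rho <= 1."""
    return rho * p_fr + (1.0 - rho) * p_fa
```

Choosing ρ close to 1 weights false rejections more heavily (high convenience); ρ close to 0 weights false acceptances (high security).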
9. Feature Selection
[Figure: DET curve (FR vs. FA) with operating lines FA = 5·FR, FA = FR (EER), and FA = 0.2·FR.]
Feature optimization by a cost function
Feature Selection Problem
Feature Subset Selection Methods
Problem: an exhaustive search requires
  C(K, k) = K! / ( k!·(K − k)! )
searches.
Methods:
• Exhaustive Search
• K-best Method
• Forward Selection
• Backward Selection
• Sequential Floating Forward Search (SFFS)
• Sequential Floating Backwards Search (SFBS)
• The l-r Algorithm
• Random Walk
• Genetic Algorithms (GA)
• Branch-and-Bound (BB)
• Dynamic Programming (DP)
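The subset count explains why exhaustive search is hopeless at K = 120, and greedy forward selection is the simplest alternative on the list. A minimal sketch of plain (non-floating) forward selection, where J is a caller-supplied criterion such as one of the separability measures above:

```python
import math

def n_subsets(K, k):
    """Number of k-out-of-K subsets an exhaustive search must score."""
    return math.comb(K, k)

def forward_selection(features, J, k):
    """Greedy (non-floating) forward selection: repeatedly add the
    single feature whose inclusion maximizes the criterion J(subset)."""
    selected, remaining = [], list(features)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

SFFS extends this by also trying to drop previously chosen features after each addition, escaping some of the nesting errors of the plain greedy scheme.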
Performance criterion for Speaker Verification
In verification systems, the decision to accept or reject an identity claim is based on the comparison of a score with a threshold.
The score: s(O) = log p(O | λT)
  O – observations (from utterances)
  λT – target's model
  τ – the threshold
  OT – target's observations
  OI – impostors' observations
[Figure: the score densities f[s(O) | O ∈ OT] (mean µT, spread σT) and f[s(O) | O ∈ OI] (mean µI, spread σI), separated by the threshold τ; the impostor mass above τ is PFA and the target mass below τ is Pmiss.]
Performance criterion for Speaker Verification
Evaluation of speaker verification systems: Equal Error Rate (EER).
EER as a criterion for feature selection is impractical:
  – cost of computation,
  – low resolution, due to rough histograms.
Solution – Gaussian assumption:

  f[ s(O) | O ∈ OT ] = 1/(√(2π)·σT) · exp( −( s(O) − µT )² / ( 2·σT² ) )

[Figure: histogram of target/impostor score probabilities (speaker 1, full 120-feature space) with Gaussian fits of both histograms.]
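Under this assumption each score density is a one-dimensional normal, so evaluating it needs only the fitted mean and standard deviation. A minimal sketch:

```python
import math

def gaussian_density(s, mu, sigma):
    """Gaussian model of a score density f[s(O)] with mean mu, std sigma."""
    return (math.exp(-((s - mu) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma))
```

Fitting µ and σ to the observed target and impostor scores replaces the rough histograms with two smooth curves, from which the EER can be obtained in closed form (next slide).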
Criterion for minimizing EER
Under the Gaussian assumption:

  Pmiss = ∫_{−∞}^{τ} f[ s(O) | O ∈ OT ] ds
        = ∫_{−∞}^{τ} 1/(√(2π)·σT) · exp( −(1/2)·((s − µT)/σT)² ) ds
        = 1/2 + (1/2)·erf( (τ − µT)/(√2·σT) )

  PFA = ∫_{τ}^{∞} f[ s(O) | O ∈ OI ] ds
      = ∫_{τ}^{∞} 1/(√(2π)·σI) · exp( −(1/2)·((s − µI)/σI)² ) ds
      = 1/2 − (1/2)·erf( (τ − µI)/(√2·σI) )

At the equal-error point:
  EER = Pmiss = PFA ⇒ … ⇒ τ = ( µI·σT + µT·σI )/( σI + σT )

so that
  EER = 1/2 + (1/2)·erf( (µI − µT)/( √2·(σI + σT) ) )

A monotone surrogate criterion (maximize it to minimize the EER):
  EER′ = ( µT − µI )/( σI + σT )

[Figure: the two Gaussian score densities f[s(O)|O∈OT] and f[s(O)|O∈OI] with µT, µI, σT, σI, the threshold τ, and the areas Pmiss and PFA marked.]
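The closed-form threshold and EER can be evaluated directly with math.erf. A minimal sketch whose symbols mirror the derivation above:

```python
import math

def eer_threshold(mu_t, sigma_t, mu_i, sigma_i):
    """Equal-error threshold: tau = (mu_I*sigma_T + mu_T*sigma_I)/(sigma_I + sigma_T)."""
    return (mu_i * sigma_t + mu_t * sigma_i) / (sigma_i + sigma_t)

def eer(mu_t, sigma_t, mu_i, sigma_i):
    """EER under the Gaussian assumption:
    1/2 + 1/2 * erf((mu_I - mu_T) / (sqrt(2)*(sigma_I + sigma_T)))."""
    return 0.5 + 0.5 * math.erf((mu_i - mu_t)
                                / (math.sqrt(2.0) * (sigma_i + sigma_t)))

def eer_criterion(mu_t, sigma_t, mu_i, sigma_i):
    """Monotone surrogate EER' = (mu_T - mu_I)/(sigma_I + sigma_T);
    maximizing it minimizes the EER."""
    return (mu_t - mu_i) / (sigma_i + sigma_t)
```

Because erf is monotone, ranking feature subsets by EER′ gives the same ordering as ranking them by the Gaussian-model EER, at a fraction of the cost.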
Experimental Setup
The experiment was set for:
  – Text-dependent speaker verification
  – CD-HMM: 5 states; 2 Gaussians per state
The database:
  – Hebrew word /hamesh/ ("five") – HID database
  – High-quality speech, sampled at 16 kHz with 12-bit resolution
  – Three target speakers
  – 19 impostors for each target
  – Number of repetitions: training: 20; testing: 25–79
Experimental Setup
The features and their symbols:

#   Feature name                              Order  Symbols
1   Mel Frequency Cepstral Coef. (MFCC)       12     m1 ÷ m12
2   Linear Prediction Cepstral Coef. (LPCC)   12     c1 ÷ c12
3   Log Area Ratio (LAR)                      12     a1 ÷ a12
4   Linear Prediction Coef. (LPC)             12     l1 ÷ l12
5   Partial Correlation (PARCOR)              12     p1 ÷ p12
6   First diff. of MFCC (∆-MFCC)              12     ∆m1 ÷ ∆m12
7   First diff. of LPCC (∆-LPCC)              12     ∆c1 ÷ ∆c12
8   First diff. of LAR (∆-LAR)                12     ∆a1 ÷ ∆a12
9   First diff. of LPC (∆-LPC)                12     ∆l1 ÷ ∆l12
10  First diff. of PARCOR (∆-PARCOR)          12     ∆p1 ÷ ∆p12

Total number of features: 120
Results
EER test results of the different selection methods in different feature-space dimensions (for speaker #1).
[Figure: EER [%] (0–50) vs. subset size k (0–120) for the forward, DP, and k-best feature spaces.]
Results
Feature selection methods:
  – Dynamic Programming (DP)
  – Forward
  – k-best
A different feature set is selected for each speaker.
The selected features (example for DP, 24-dimensional space):

Sp# 1 (Dynamic Programming):
  m2 m3 m4 m8 m9 m11 m12
  a5 a11 l11 l12 p7 p8 p11
  ∆m1 ∆m5 ∆m8 ∆m10
  ∆a1 ∆a3 ∆l10 ∆p3 ∆p6 ∆p11
Sp# 2 (Dynamic Programming):
  m8 m11 m12
  a8 a11 l11 l12 p11
  ∆m2 ∆m5 ∆m6 ∆m8 ∆m11 ∆m12
  ∆a5 ∆a9 ∆a11 ∆l11 ∆p3 ∆p4 ∆p5 ∆p8 ∆p9 ∆p11
Sp# 3 (Dynamic Programming):
  m4 m7 m9 m11
  a9 l9 p9
  ∆m1 ∆m4 ∆m5 ∆m6 ∆m9 ∆m11 ∆m12
  ∆c1 ∆a1 ∆a2 ∆a5 ∆l1 ∆l3 ∆l4 ∆l8 ∆p1 ∆p5
Results
Average DET curves of the speaker verification results.
[Figure: average DET curves (miss probability vs. false-alarm probability, both in %) for the DP, k-best, MFCC, all-features, and forward spaces, with the EER line marked. MFCC: EER = 4.78%; DP: EER = 2.71% – an improvement of 43.3%.]
Conclusions – Where do we go from here?
• Current technology achieves accuracies on the order of 10% (EER) for realistic telephone quality, with HMM/GMM.
• Feature selection has the potential to increase accuracy.
• Efficient cohort selection also has the potential to increase accuracy.
• A breakthrough is needed – SVM? Hybrid? A new, improved speech model? New features?
12. Unsupervised Recognition (Segmentation)
Example: Telephone conversation (two speakers)
[Block diagram:]
  Initial conditions are obtained with the help of Simultaneous-Speech Detection (models λsin, λmul) and Silence Detection (model λs).
  Iterative loop: segmented data A(i), B(i), S(i) → train new models λA(i), λB(i), λS(i) → closed-set identification → convergence test → final models λA; λB; λS.
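The retrain-and-reclassify loop in the diagram can be sketched as follows. This is a minimal sketch, not the system above: the `train` and `classify` callbacks are hypothetical stand-ins for HMM training and closed-set identification, and the silence class is omitted for brevity.

```python
def segment_conversation(frames, train, classify, max_iters=10):
    """Iterative two-speaker segmentation sketch.

    Starting from naive initial labels, repeatedly retrain the speaker
    models on the current segmentation, reclassify every frame, and stop
    when the labels converge (or after max_iters passes).
    `train(frames, labels)` and `classify(models, frame)` are
    caller-supplied (hypothetical) callbacks.
    """
    # Initial conditions: a naive alternating A/B labeling.
    labels = ["A" if i % 2 == 0 else "B" for i in range(len(frames))]
    for _ in range(max_iters):
        models = train(frames, labels)                 # new models
        new_labels = [classify(models, f) for f in frames]  # closed-set ID
        if new_labels == labels:                       # convergence test
            break
        labels = new_labels
    return labels
```

In the full system the initial labels would come from the silence and simultaneous-speech detectors rather than from an alternating guess.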