Feature Selection in Speaker Verification Systems
Arnon Cohen and Yaniv Zigel
Electrical and Computer Eng. Dept., Ben-Gurion University, Beer-Sheva, Israel

Speaker Recognition
• Speaker Identification (closed and open sets)
• Speaker Verification
• Speaker Spotting
• Group Recognition (gender, accent, age, ...)
• Speaker Tracking
• Text-Dependent / Text-Independent
• Supervised / Unsupervised

Applications
• Commercial
– Access control (supervised verification, text-dependent/independent)
– Segmentation (unsupervised, text-independent)
– ...
• Military
– Spotting
– Tracking
• Forensic
– Identification, verification, supervised, unsupervised

Speaker Recognition - Constraints
• Natural constraints:
– The speech signal carries relatively little information about the speaker's identity (compared to the message).
– Features change over time (sensitivity to the train-test time difference).
– The human voice is highly sensitive to the speaker's physiological and psychological states and to environmental conditions (heat, cold).
• Environmental constraints:
– different SNRs,
– channel transmission (telephone, internet),
– microphone transmission (different handsets, distances, angles).
• System and application constraints:
– size of the training database,
– duration of the test utterance,
– time between training and testing,
– memory, MIPS.

Supervised Speaker Recognition
Training (enrollment, on targets and impostors): Utterance → Data Acquisition → Feature Extraction → Model Parameter Estimation → model of the i-th person. A cohort selection policy and a "world" model accompany the speaker models.
Recognition: Utterance → Data Acquisition → Feature Extraction → Matching Strategy → Rejection Strategy.

Speaker Verification
Notation: T - target; I - impostor; X - tested speaker; C - cohort; O - observation.
$\lambda_T$ - model of the target speaker; $\lambda_I$ - model of an impostor speaker; $\lambda_C$ - model of a cohort speaker; $O_x$ - the sequence of feature vectors corresponding to the utterance.
Score, with normalization acting as a dynamic threshold:
$$s(O_x) = \log P(O_x \mid \lambda_T) - f\left(P(O_x \mid \lambda_C)\right)$$
Comparing the score against thresholds gives a three-way decision: accept, request a repeated utterance, or reject (a code sketch of this decision rule follows the feature-properties list below).

Open Questions
• What (and how many) features to use?
• What recognition engine to use? (type, architecture, order)
• Rejection policy - cohorts? How many, and how to choose them?
• What normalization scheme should be used?
• How to deal with the channel problem? (robust features, normalization, adaptation, ...)
• How to deal with the noise problem?
• How to deal with variations in speaker characteristics (time, physiology, psychology)?
• How to deal with mimicry and falsification?

Features for Speaker Recognition
Ideal features would be:
• efficient in representing the speaker-dependent information,
• easy to measure,
• stable over time,
• naturally and frequently occurring in speech,
• little changed from one speaking environment to another,
• insensitive to mimicry and falsification,
• insensitive to noise and bandwidth limitations,
• insensitive to speaking state.
Such features do not exist!
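Returning to the verification score defined earlier: below is a minimal sketch of the three-way decision rule, assuming for illustration that the normalization f is the average cohort log-likelihood (one of the schemes listed in the next part); the function names and the two-threshold interface are hypothetical, not the authors' implementation.

```python
import numpy as np

def verification_score(log_p_target, log_p_cohorts):
    # s(O) = log P(O|lambda_T) - f(P(O|lambda_C)); here f is taken,
    # for illustration only, as the average cohort log-likelihood.
    return log_p_target - np.mean(log_p_cohorts)

def decide(score, reject_thr, accept_thr):
    # Three-way decision: accept / ask for a repeated utterance / reject.
    # Threshold values are application-dependent (security vs. convenience).
    if score >= accept_thr:
        return "accept"
    if score >= reject_thr:
        return "repeat"
    return "reject"
```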
Features Used in Speaker Recognition Systems
• Vocal tract model features:
– Autocorrelation coefficients (COR)
– Linear Prediction Coefficients (LPC)
– Partial Correlation coefficients (PARCOR)
– Log Area Ratio coefficients (LAR)
– Perceptual Linear Prediction (PLP)
• Spectral and cepstral features:
– Line Spectrum Pairs (LSP)
– Bank of filters (linear)
– Bank of filters (Mel)
– Mel Frequency Cepstral Coefficients (MFCC)
• The above are static features; their differences (∆, ∆∆) serve as dynamic features.
• Other features:
– Prosodic: pitch contours, intonation, stress
– Phonetic (pronunciation) [Andrews, Kohler & Campbell]: phone features (requires large databases and a phone recognition system)
– Idiolectal features [Doddington]: unigrams (probability of word occurrence) and bigrams (probability of pairs of words); requires large databases and speech recognition systems

The "Curse" of Dimensionality
[Figure: recognition error vs. number of features. The error first falls, as expected, then rises again as the dimension grows (the curse of dimensionality); the minimum marks the best dimension region.]

Recognition Engines
• 1930s: aural and spectrogram matching
• 1960s: template matching
• 1970s: DTW, VQ
• 1980s-1990s: HMM, GMM, ANN
• 2000s: SVM? Hybrids? ...
Early work used small databases of clean, controlled speech; current systems face large databases of realistic, unconstrained speech.

Rejection Policy - Cohorts
[Figure: feature space with the claimed speaker's model surrounded by cohort models #1, #2, ..., #C.]

Score Normalization
(the six normalizations are implemented in the code sketch that follows the feature-selection problem statement at the end of this part)
$$s_1(O) = \frac{\log p(O \mid \lambda_T)}{\max_{c \in C(T)} \log p(O \mid \lambda_c)}
\qquad
s_2(O) = \log p(O \mid \lambda_T) - \max_{c \in C(T)} \log p(O \mid \lambda_c)$$
$$s_3(O) = \log p(O \mid \lambda_T) - \log\!\left[\frac{1}{C}\sum_{c=1}^{C} p(O \mid \lambda_c)\right]
\qquad
s_4(O) = \log p(O \mid \lambda_T) - \frac{1}{C}\sum_{c=1}^{C} \log p(O \mid \lambda_c)$$
$$s_5(O) = \frac{\log p(O \mid \lambda_T)}{\frac{1}{C}\sum_{c=1}^{C} \log p(O \mid \lambda_c)}
\qquad
s_6(O) = \frac{\log p(O \mid \lambda_T)}{\log\!\left[\frac{1}{C}\sum_{c=1}^{C} p(O \mid \lambda_c)\right]}$$

Detection Error Tradeoff (DET) Curve
[Figure: DET curve, false rejection (FR) vs. false acceptance (FA). The EER line crosses the curve at the equal-error operating point; low-FA operation gives high security, low-FR operation gives high convenience, with a balance point in between.]

State of the Art
Error rates grow as the constraints are relaxed (approximate EERs):
• 0.1%: text-dependent, clean data, single microphone, large amount of train/test data
• 1%: text-dependent (digit strings), telephone data, multiple microphones, small amount of training data
• 10%: text-independent (conversation), telephone data, multiple microphones, moderate amount of training
• 25%: text-independent (read sentences), military radio, multiple radios and microphones, moderate amount of training

Feature Selection
[Figure: a common feature space (axes β1, β2, ..., βK) holding speakers 1-4, contrasted with per-speaker feature spaces: the feature subset that best separates speaker 1 (βi, βj, ...) differs from the subset that best separates speaker 2 (βa, βb, ...).]

Feature Selection Problem
The method for feature selection can be specified in terms of two components:
I. A performance criterion (effectiveness criterion, discriminant function) for the selection of features from the input feature set.
II. A selection procedure.
The problem of feature selection can be described as follows: given a set $y$ of $K$ features, select a subset $x$ of $k$ features ($k < K$) such that a criterion $J(\cdot)$ is optimized.
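The six score normalizations above map directly to code. A minimal numpy sketch with illustrative names; the only subtlety is evaluating $\log[(1/C)\sum_c p(O|\lambda_c)]$ stably from log-likelihoods, which np.logaddexp.reduce handles.

```python
import numpy as np

def cohort_normalizations(log_pT, log_pc):
    # log_pT: scalar log p(O|lambda_T); log_pc: array of log p(O|lambda_c).
    # Note: log-likelihoods are typically negative; the ratio forms below
    # follow the slide's definitions literally.
    log_pc = np.asarray(log_pc, dtype=float)
    C = log_pc.size
    # log[(1/C) * sum_c p(O|lambda_c)], computed stably in the log domain
    log_mean_p = np.logaddexp.reduce(log_pc) - np.log(C)
    return {
        "s1": log_pT / log_pc.max(),   # ratio to the best cohort
        "s2": log_pT - log_pc.max(),   # difference to the best cohort
        "s3": log_pT - log_mean_p,     # difference to the log of the mean likelihood
        "s4": log_pT - log_pc.mean(),  # difference to the mean log-likelihood
        "s5": log_pT / log_pc.mean(),  # ratio to the mean log-likelihood
        "s6": log_pT / log_mean_p,     # ratio to the log of the mean likelihood
    }
```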
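To make the two-component framing concrete (a criterion J plus a selection procedure), here is a minimal sketch of plain sequential forward selection, one of the procedures listed in the next part. J stands for any of the criteria discussed below, passed in as a callable; all names are illustrative.

```python
def forward_select(K, J, k):
    """Greedy forward selection: grow the subset one feature at a time,
    always adding the feature that maximizes the criterion J on the
    enlarged subset. J maps a list of feature indices to a score
    (higher is better); K is the total number of features."""
    selected, remaining = [], set(range(K))
    while len(selected) < k:
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```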
Formally, the input set is $y = \{ y_i \mid i = 1, 2, \ldots, K \}$ and the selected subset is $x = \{ x_i \mid i = 1, 2, \ldots, k,\; x_i \in y \}$.

Performance Criteria
F-ratio:
$$F = \frac{\text{variance of the inter-speaker feature means}}{\text{mean of the intra-speaker feature variances}}$$

Scatter Matrices and Separability Criteria
$$J_1 = \operatorname{tr}\left( S_2^{-1} S_1 \right) \qquad J_2 = \ln \left| S_2^{-1} S_1 \right| \qquad J_3 = \frac{\operatorname{tr} S_1}{\operatorname{tr} S_2}$$

Bhattacharyya Distance
$$d_B = \frac{1}{8} (m_i - m_j)^T \left[ \frac{W_i + W_j}{2} \right]^{-1} (m_i - m_j) + \frac{1}{2} \ln \frac{\left| \frac{W_i + W_j}{2} \right|}{\sqrt{|W_i| \, |W_j|}}$$
Bhattacharyya shape (the covariance-only term):
$$d_{Bs} = \frac{1}{2} \ln \frac{\left| \frac{W_i + W_j}{2} \right|}{\sqrt{|W_i| \, |W_j|}}$$

Divergence Distance
$$d_D = E \left\{ -\ln \frac{p(\beta \mid \omega_i)}{p(\beta \mid \omega_j)} \,\middle|\, \omega_j \right\} - E \left\{ -\ln \frac{p(\beta \mid \omega_i)}{p(\beta \mid \omega_j)} \,\middle|\, \omega_i \right\}$$
Divergence shape:
$$d_{Ds} = \operatorname{tr} \left[ (W_i - W_j)\left(W_j^{-1} - W_i^{-1}\right) \right]$$
(a code sketch of the Bhattacharyya computation follows the feature table below)

Performance Criteria
• EER(O) - Equal Error Rate: the operating point where FA = FR.
• EGM(O) - Geometric-Mean Error: $E_{GM} = \sqrt{E_{FR} \, E_{FA}}$.
• DCF(O) - Decision Cost Function: $DCF(O) = \rho \Pr(FR \mid O) + (1 - \rho) \Pr(FA \mid O)$, where $\rho$ is the desired weight of FR, $0 \le \rho \le 1$.

Feature Optimization by a Cost Function
[Figure: DET plane with the operating lines FA = 5·FR, FA = FR (the EER point), and FA = 0.2·FR.]

Feature Subset Selection Methods
Problem: exhaustive search over all subsets of size $k$ requires
$$\binom{K}{k} = \frac{K!}{k! \, (K - k)!}$$
searches. Alternative procedures:
• Exhaustive Search
• K-best Method
• Forward Selection
• Backward Selection
• Sequential Floating Forward Search (SFFS)
• Sequential Floating Backward Search (SFBS)
• The l-r Algorithm
• Random Walk
• Genetic Algorithms (GA)
• Branch-and-Bound (BB)
• Dynamic Programming (DP)

Performance Criterion for Speaker Verification
In verification systems, the decision to accept or reject an identity claim is based on comparing a score with a threshold.
The score: $s(O) = \log p(O \mid \lambda_T)$.
Notation: $O$ - observations (from utterances); $\lambda_T$ - the target's model; $\tau$ - the threshold; $O_T$ - target observations; $O_I$ - impostor observations.
[Figure: the target score density $f[s(O) \mid O \in O_T]$ (mean $\mu_T$, std $\sigma_T$) and the impostor density $f[s(O) \mid O \in O_I]$ (mean $\mu_I$, std $\sigma_I$), with the threshold $\tau$ between $\mu_I$ and $\mu_T$; the tail areas beyond $\tau$ are $P_{FA}$ and $P_{miss}$.]

Speaker verification systems are evaluated by the Equal Error Rate (EER). The empirical EER is impractical as a feature selection criterion: it is costly to compute and has low resolution due to rough histograms. The solution is a Gaussian assumption on the score distributions:
$$f[s(O) \mid O \in O_T] = \frac{1}{\sqrt{2\pi} \, \sigma_T} \exp \left[ -\frac{(s(O) - \mu_T)^2}{2 \sigma_T^2} \right]$$
[Figure: histogram of target and impostor score probabilities (speaker 1, full 120-feature space) with their Gaussian fits.]

Criterion for Minimizing EER
$$P_{miss} = \int_{-\infty}^{\tau} f[s(O) \mid O \in O_T] \, ds = \frac{1}{2} \left[ 1 + \operatorname{erf} \left( \frac{\tau - \mu_T}{\sigma_T \sqrt{2}} \right) \right]$$
$$P_{FA} = \int_{\tau}^{\infty} f[s(O) \mid O \in O_I] \, ds = \frac{1}{2} \left[ 1 - \operatorname{erf} \left( \frac{\tau - \mu_I}{\sigma_I \sqrt{2}} \right) \right]$$
Setting $P_{miss} = P_{FA}$ gives
$$\tau = \frac{\mu_I \sigma_T + \mu_T \sigma_I}{\sigma_I + \sigma_T}
\qquad
EER = \frac{1}{2} \left[ 1 + \operatorname{erf} \left( \frac{\mu_I - \mu_T}{\sqrt{2} \, (\sigma_I + \sigma_T)} \right) \right]$$
Since erf is monotonic, minimizing the EER is equivalent to maximizing
$$EER' = \frac{\mu_T - \mu_I}{\sigma_I + \sigma_T}$$
(a code sketch of this closed form also follows the feature table below).

Experimental Setup
The experiment: text-dependent speaker verification with a CD-HMM (5 states, 2 Gaussians per state).
The database: the Hebrew word /hamesh/ ("five") from the HID database; high-quality speech sampled at 16 kHz with 12-bit resolution; three target speakers with 19 impostors per target; number of repetitions: 20 for training, 25-79 for testing.

The features and their symbols:
#  | Feature name                             | Order | Symbols
1  | Mel Frequency Cepstral Coef. (MFCC)      | 12    | m1 ÷ m12
2  | Linear Prediction Cepstral Coef. (LPCC)  | 12    | c1 ÷ c12
3  | Log Area Ratio (LAR)                     | 12    | a1 ÷ a12
4  | Linear Prediction Coef. (LPC)            | 12    | l1 ÷ l12
5  | Partial Correlation (PARCOR)             | 12    | p1 ÷ p12
6  | First diff. of MFCC (∆-MFCC)             | 12    | ∆m1 ÷ ∆m12
7  | First diff. of LPCC (∆-LPCC)             | 12    | ∆c1 ÷ ∆c12
8  | First diff. of LAR (∆-LAR)               | 12    | ∆a1 ÷ ∆a12
9  | First diff. of LPC (∆-LPC)               | 12    | ∆l1 ÷ ∆l12
10 | First diff. of PARCOR (∆-PARCOR)         | 12    | ∆p1 ÷ ∆p12
Total number of features: 120
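As noted above, a minimal numpy sketch of the Bhattacharyya distance between two Gaussian classes $N(m_i, W_i)$ and $N(m_j, W_j)$; the names are illustrative.

```python
import numpy as np

def bhattacharyya(mi, Wi, mj, Wj):
    # d_B = (1/8)(mi-mj)^T [(Wi+Wj)/2]^{-1} (mi-mj)
    #     + (1/2) ln( |(Wi+Wj)/2| / sqrt(|Wi||Wj|) )
    W = (Wi + Wj) / 2.0
    dm = mi - mj
    d_mean = dm @ np.linalg.solve(W, dm) / 8.0   # mean-separation term
    d_shape = 0.5 * np.log(np.linalg.det(W)
                           / np.sqrt(np.linalg.det(Wi) * np.linalg.det(Wj)))
    return d_mean + d_shape                      # d_shape alone is d_Bs
```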
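Likewise, a sketch of the Gaussian-assumption EER criterion derived above, returning the equal-error threshold, the EER, and the cheap monotonic surrogate EER'. It assumes $\mu_I < \mu_T$, and the names are illustrative.

```python
from math import erf, sqrt

def gaussian_eer(mu_T, sigma_T, mu_I, sigma_I):
    # Under the Gaussian assumption (and mu_I < mu_T), P_miss = P_FA at
    # tau = (mu_I*sigma_T + mu_T*sigma_I) / (sigma_I + sigma_T).
    tau = (mu_I * sigma_T + mu_T * sigma_I) / (sigma_I + sigma_T)
    eer = 0.5 * (1.0 + erf((mu_I - mu_T) / (sqrt(2.0) * (sigma_I + sigma_T))))
    eer_prime = (mu_T - mu_I) / (sigma_I + sigma_T)  # maximize this instead
    return tau, eer, eer_prime
```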
Results
[Figure: EER test results of the different selection methods (forward, DP, and k-best feature spaces) as a function of the feature-space dimension k, 0-120, for speaker #1.]
Feature selection methods compared: Dynamic Programming, Forward, and k-best. A different feature set is selected for each speaker.

The Selected Features (example for DP, 24-dimensional space)
• Speaker 1: m2 m3 m4 m8 m9 m11 m12 a5 a11 l11 l12 p7 p8 p11 ∆m1 ∆m5 ∆m8 ∆m10 ∆a1 ∆a3 ∆l10 ∆p3 ∆p6 ∆p11
• Speaker 2: m8 m11 m12 a8 a11 l11 l12 p11 ∆m2 ∆m5 ∆m6 ∆m8 ∆m11 ∆m12 ∆a5 ∆a9 ∆a11 ∆l11 ∆p3 ∆p4 ∆p5 ∆p8 ∆p9 ∆p11
• Speaker 3: m4 m7 m9 m11 a9 l9 p9 ∆m1 ∆m4 ∆m5 ∆m6 ∆m9 ∆m11 ∆m12 ∆c1 ∆a1 ∆a2 ∆a5 ∆l1 ∆l3 ∆l4 ∆l8 ∆p1 ∆p5

Results - Average DET Curves
[Figure: average DET curves of the speaker verification results for the DP, k-best, MFCC, all-features, and forward feature spaces.]
MFCC space: EER = 4.78%. DP space: EER = 2.71%, an improvement of 43.3%.

Conclusions - Where Do We Go from Here?
• Current technology achieves accuracies on the order of 10% EER for realistic telephone-quality speech with HMM/GMM.
• Feature selection has the potential to increase accuracy.
• Efficient cohort selection is another potential source of accuracy gains.
• A breakthrough is needed: SVM? Hybrids? A new, improved speech model? New features?

Unsupervised Recognition (Segmentation)
Example: a telephone conversation between two speakers.
[Block diagram: simultaneous-speech detection (models λ_sin, λ_mul) and silence detection (λ_s) set the initial conditions; the data segmented into A(i), B(i), and S(i) trains new models λ_A(i), λ_B(i), λ_S(i); closed-set identification re-segments the data; a convergence test yields the final models λ_A, λ_B, λ_S.]
(a code sketch of this iterative loop follows)
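Finally, a hedged sketch of the iterative segmentation loop in the diagram, under stated assumptions: the bootstrap labeling, the per-class model training, and the closed-set classifier are passed in as callables because the slides do not specify them, and every name here is illustrative rather than the authors' implementation.

```python
def segment_conversation(frames, init_labels, train_model, classify,
                         max_iter=20):
    # Iterative two-speaker segmentation: starting from an initial
    # labeling (speaker A, speaker B, silence S), retrain one model per
    # class, relabel every frame by closed-set identification, and
    # repeat until the labeling stops changing.
    labels = list(init_labels)
    models = {}
    for _ in range(max_iter):
        # train new models lambda_A(i), lambda_B(i), lambda_S(i)
        models = {c: train_model([f for f, l in zip(frames, labels) if l == c])
                  for c in ("A", "B", "S")}
        # closed-set identification over the current models
        new_labels = [classify(f, models) for f in frames]
        if new_labels == labels:   # convergence test
            break
        labels = new_labels
    return labels, models
```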