The IBM Semantic Concept Detection Framework
Arnon Amir, Giri Iyengar, Ching-Yung Lin, Chitra Dorai, Milind Naphade, Apostol Natsev, Chalapathy Neti, Harriet Nock, Ishan Sachdev, John Smith, Yi Wu, Belle Tseng, Dongqing Zhang
IBM Research
11/17/2003 | TRECVID Workshop 2003
© 2003 IBM Corporation

Outline
• Concept Detection as a Machine Learning Problem
• The IBM TREC 2003 Concept Detection Framework
  – Modeling in low-level features
  – Multi-classifier decision fusion
  – Modeling in high-level (semantic) features
• Putting It All Together: TREC 2003 Concept Detection
• Observations

Multimedia Analytics by Supervised Learning
[Diagram: user annotation of a training video repository produces MPEG-7 annotations; feature extraction over the training videos feeds training of semantic concept models; feature extraction over test videos feeds detection and analysis with those models.]

Multi-layered Concept Detection: Working in Increasingly (Semantically) Meaningful Feature Spaces
• Improving detection
• Building complex concepts (e.g. News Subject Monologue)
[Diagram: videos undergo low-level feature extraction (e.g. color, texture, shape, MFCC, motion); models built in low-level feature spaces (e.g. SVM, GMM, HMM, TF-IDF) perform detection; their outputs are mapped into a high-level feature space (e.g. Face, People, Cityscape), where high-level feature-space models (e.g. Multinet, DMF (SVM, NN), propagation rules) perform detection and manipulation.]
The Evolving IBM Concept Detection System

IBM TREC'01, '02:
• SVM, GMM and HMM classifiers for modeling low-level features.
• Ensemble and discriminant fusion (TREC'02) of multiple models of the same concept: improved performance over single models.
• Rule-based preprocessing (e.g. Non-Studio Setting = (NOT(Studio_Indoor_Setting)) OR (Outdoors)).

Post-TREC'02 experiments:
• SVM, GMM and HMM classifiers for modeling low-level and high-level features.
• Ensemble and discriminant fusion of multiple models of the same concept: improved performance over single models.
• Validity-weighted similarity improves robustness.
• Semantic-feature-based models (Multinet, DMF) improve performance over single-concept models.

IBM TREC'03:
• SVM, GMM and HMM classifiers for low-level and high-level features.
• Ensemble and discriminant fusion of multiple models of the same concept: improved performance over single models.
• Validity-weighted similarity improves robustness.
• Semantic-feature-based models (Multinet, DMF-SVMs, NN, boosting) and an ontology improve performance over single-concept models.
• Post-filtering improves precision.

Video Concept Detection Pipeline
[Pipeline diagram: videos pass through annotation and data preparation (region extraction, annotation) and feature extraction (CC, CH, CLG, CT, WT, TAM, EH, MI, MV, AUD) into low-level feature-based models (SD/A, V1, V2) covering the 17 TREC benchmark concepts and 47 other concepts (BOU: best uni-model ensemble run); models of each concept are then fused across the low-level feature-based techniques (VW, EF, EF2, MLP; BOF: best multi-modal ensemble run); high-level (semantic) context-based methods follow (MN, DMF17, DMF64, MLP, ONT), then post-processing/filtering (BOBO: best-of-the-best ensemble run). The legend distinguishes training-only paths from training-and-testing paths.]

Corpus Issues
• The multi-layered detection approach needs multiple sets for cross-validation.
• The Feature Development Set is partitioned so that each level of processing has a training partition and a test partition unadulterated by the processing at the previous level:
  – Low-level feature-based concept models are built on the Training Set, with performance optimized over the Validation Set.
  – Single-concept, multi-model fusion uses the Validation Set for training and Fusion Validation Set 1 for testing.
  – Semantic-level fusion uses Fusion Validation Set 1 for training and Fusion Validation Set 2 for testing.
  – The runs submitted to NIST are finally chosen on the performance of all systems and algorithms on Fusion Validation Set 2.
• Partitioning procedure: all videos are aligned by their temporal order and, for each set of 10 videos,
  – the first 6 go to the Training Set (60% overall),
  – the 7th to the Validation Set (10%),
  – the 8th to Fusion Validation Set 1 (10%),
  – the last 2 to Fusion Validation Set 2 (20%).
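The partitioning procedure above can be sketched in a few lines; the function and key names below are illustrative, not from the actual IBM system.

```python
def partition_videos(video_ids):
    """Split temporally ordered videos into the four sets used for
    multi-layer training: per block of 10 videos, the first 6 go to
    the Training Set, the 7th to the Validation Set, the 8th to
    Fusion Validation Set 1, and the last 2 to Fusion Validation Set 2."""
    sets = {"train": [], "valid": [], "fusion1": [], "fusion2": []}
    for i, vid in enumerate(video_ids):
        pos = i % 10  # position within the current block of 10
        if pos < 6:
            sets["train"].append(vid)
        elif pos == 6:
            sets["valid"].append(vid)
        elif pos == 7:
            sets["fusion1"].append(vid)
        else:
            sets["fusion2"].append(vid)
    return sets
```

Applied to the whole corpus, this yields the 60 / 10 / 10 / 20 percent split shown on the slide.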
Video Concept Detection Pipeline: Features
[Pipeline diagram: the highlighted stage is feature extraction (CC, CH, CLG, CT, WT, TAM, EH, MI, MV, AUD), following region extraction and annotation.]

Feature Extraction
Features are extracted both globally and regionally (shot segmentation, annotation and region segmentation precede feature extraction).
• Color: color histograms (512 dim), auto-correlograms (166 dim)
• Structure & shape: edge orientation histogram (64 dim), Dudani moment invariants (6 dim)
• Texture: co-occurrence texture (96 dim), coarseness (1 dim), contrast (1 dim), directionality (1 dim), wavelet (12 dim)
• Motion: motion vector histogram (6 dim)
• Audio: MFCC
• Text: ASR transcripts
• Regions: object (motion, camera registration); background (5 regions / shot)
References: Lin (ICME 2003)

Video Concept Detection Pipeline: Low-level Feature Modeling
[Pipeline diagram: the highlighted stage is the low-level feature-based models (SD/A, V1, V2) over the 17 TREC benchmark concepts and 47 other concepts, leading to BOU, the best uni-model ensemble run.]

Low-level Feature-based Concept Models: Statistical Learning for Concept Building (SVM)
[Diagram: features f1…fK are used to train SVM models m1…mP on the Training Set; a grid search over the Validation Set selects parameters; normalization and aggregation fuse the models into a final model.]
• SVM models were used for two sets of visual features:
  – combined color correlogram, edge histogram, co-occurrence features and moment invariants;
  – color histogram, motion and Tamura texture features.
• For each concept, multiple models were built for each feature set by varying kernels and parameters: up to 27 models per concept for each feature type.
• A total of 64 concepts from the TREC 2003 lexicon were covered through SVM-based models.
• The Validation Set is then used to search for the best model parameters and feature set.
• Identical approach to that of the IBM system for TREC 2002.
• Fusion Validation Set II MAP: 0.22
References: IBM TREC 2002; Naphade et al. (ICME 2003, ICIP 2003)

Low-level Feature-based Concept Models: Statistical Learning Based on ASR Transcripts
• Training: manually examine examples to find frequently co-occurring relevant words (e.g., for Weather News: weather, news, low pressure, storm, cloudy, mild, windy, etc.).
• Detection: the resulting query word set is run against the text ASR transcripts with the Okapi search system to produce ranked shots.
• Fusion Validation Set II MAP: 0.19
References: Nock et al. (SIGIR 2003)

Video Concept Detection Pipeline: Fusion I
[Pipeline diagram: the highlighted stage fuses the models of each concept across the low-level feature-based techniques (VW, EF, EF2, MLP), leading to BOF, the best multi-modal ensemble run.]

Multi-Modality / Multi-Concept Fusion Methods: Ensemble Fusion
• Normalization: rank, Gaussian, linear.
• Combination: average, product, min, max.
• Works well for uni-modal concepts with few training examples.
• A computationally low-cost method of combining multiple classifiers.
• Fusion Validation Set II MAP: 0.254
• SearchTest MAP: 0.26
References: Tseng et al. (ICME 2003, ICIP 2003)

Multi-Modality / Multi-Concept Fusion Methods: Validity Weighting
• Works in the high-level feature space generated by classifier confidences for all concepts.
• Basic idea: give more importance to reliable classifiers.
• Revises the distance metric to include a measure of the goodness of the classifier.
• Many fitness or goodness measures are possible:
  – average precision
  – 10-point AP
  – equal error rate
  – number of training samples in the Training Set
• A computationally efficient, low-cost option for merit/performance-based combination of multiple classifiers.
• Improves robustness by relying more heavily on high-performance classifiers.
• Fusion Validation Set II MAP: 0.255
References: Smith et al. (ICME 2003, ICIP 2003)

Video Concept Detection Pipeline: Semantic-Feature Based Models
[Pipeline diagram: the highlighted stage is the high-level (semantic) context-based methods (MN, DMF17, DMF64, MLP, ONT), leading to BOBO, the best-of-the-best ensemble run.]

Semantic Feature Based Models Incorporating Context
• Multinet: a probabilistic graphical context-modeling framework that uses loopy probability propagation in undirected graphs. It learns conceptual relationships automatically and uses these learned relationships to modify detection (e.g., it uses Outdoors detection to influence Non-Studio Setting in the right proportion).
• Discriminant Model Fusion (DMF) using SVMs: uses a training set of semantic feature vectors with ground truth to learn the dependence of model outputs across concepts.
• DMF and regression using neural networks and boosting: uses a training set of semantic feature vectors with ground truth to learn the dependence of model outputs across concepts; boosting helps especially with rare concepts.
• Ontology-based processing: uses the manually constructed annotation hierarchy (ontology) to modify the detection of child nodes based on robust detection of their ancestors (e.g., uses "Outdoors" detection to influence the detection of concepts beneath it).

Semantic Context Learning and Exploitation: Multinet
• Problem: building each concept model independently fails to utilize spatial, temporal and conceptual context, and makes sub-optimal use of the available information.
• Approach: Multinet, a network of concept models represented as a graph with undirected edges; probabilistic graphical models encode and enforce context (factor-graph loopy-propagation implementation, CIVR '03).
[Diagram: concepts such as Sky, Landscape, Person, Face, Urban Setting, Indoors, Outdoors, Greenery, Tree, People, Road and Transportation linked by positive and negative conceptual edges over multimedia features.]
• Results:
  – A factor-graph multinet with Markov-chain temporal models improves mean average precision by more than 27% over the best IBM run for TREC 2002, and by 36% in conjunction with SVM-DMF.
  – Highest MAP for TREC '03.
  – Low training cost; no extra training data needed; high inference cost.
  – Fusion Validation Set II MAP: 0.268; SearchTest MAP: 0.263
References: Naphade et al. (CIVR 2003, TCSVT 2002)

Multi-Modality / Multi-Concept Fusion Methods: DMF using SVM
• An SVM/NN re-classifies the output results of classifiers 1..N.
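The ensemble-fusion recipe described earlier (normalize each classifier's scores onto a common scale, then combine them per shot) can be sketched as below. The function names are illustrative, not IBM's code, and only rank normalization with average/min/max/product combination is shown.

```python
def rank_normalize(scores):
    """Replace each raw score by its normalized rank in (0, 1],
    making scores from differently scaled classifiers comparable."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    norm = [0.0] * len(scores)
    for r, i in enumerate(ranked, start=1):
        norm[i] = r / len(scores)
    return norm

def ensemble_fuse(score_lists, combine=None):
    """Fuse several classifiers' per-shot confidence lists:
    rank-normalize each list, then combine per shot
    (average by default; min, max or product also fit)."""
    if combine is None:
        combine = lambda vals: sum(vals) / len(vals)
    normalized = [rank_normalize(s) for s in score_lists]
    n_shots = len(score_lists[0])
    return [combine([norm[i] for norm in normalized]) for i in range(n_shots)]
```

Because rank normalization discards the raw scales, a classifier emitting scores in [0, 1] and one emitting scores in [0, 100] contribute on equal footing, which is one reason this works well with few training examples.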
• No normalization required.
• Uses the Validation Set for training and Fusion Validation Set 1 for optimization and parameter selection.
• Training cost is low when the number of classifiers being fused is small (a few tens).
• Classification cost is low.
• Used for fusing together multiple concepts in the semantic feature-space methods.
• Fusion Validation Set II MAP: 0.273; SearchTest MAP: 0.247
References: Iyengar et al. (ICME 2002, ACM '03)
[Diagram: concept models M1…M6 each emit a confidence per shot, forming a "model vector" in model-vector space; a classifier for concept X (e.g. People) is trained in this space against concept-X annotation ground truth.]

Multi-Concept Fusion: Semantic Space Modeling Through Regression
[Diagram: a rare concept (e.g. Animal) modeled as a weighted combination of anchor concepts, with weights such as Building -0.19, Face -0.27, Greenery 0.07, Indoors -0.02, Landscape -0.1, Outdoors 0.34, People -0.25, Person 0.48, Road 0.0, Sky -0.29, Transportation 0.01, Tree 0.17.]
• Problem: given a (small) set of related concept exemplars, learn a concept representation.
• Approach: learn and exploit semantic correlations and class co-dependencies.
  – Build (robust) classifiers for a set of basis concepts (e.g., SVM models).
  – Model (rare) concepts in terms of known (frequent) concepts, or anchors:
    · represent images as semantic model vectors, i.e. vectors of confidences w.r.t. the known models;
    · model new concepts as a sub-space of the semantic model-vector space.
  – Learn the weights of the separating hyper-plane through regression:
    · optimal linear regression (least-squares fit);
    · non-linear MLP regression (multi-layer perceptron neural networks).
  – Can be used to boost the performance of basis models or to build additional models.
• Fusion Validation Set II MAP: 0.274; SearchTest MAP: 0.252
References: Natsev et al. (ICIP 2003)

Multi-Concept Fusion: Ontology-based Boosting
• Basic idea:
  – The concept hierarchy is created manually based on a semantic ontology.
  – Classifiers influence each other through this ontology structure.
  – Information from reliable classifiers is exploited as much as possible.
• Influence within the ontology structure:
  – Boosting factor: boost the precision of children from more reliable ancestors (shrinkage theory: parameter estimates in data-sparse children are shrunk toward the estimates of data-rich ancestors in ways that are provably optimal under appropriate conditions).
  – Confusion factor: the probability of misclassifying Cj as Ci when Cj and Ci cannot coexist.
• Fusion Validation Set II MAP: 0.266; SearchTest MAP: 0.261
References: Wu et al. (ICME 2004, submitted)
[Diagram: ontology fragment with boosting and confusion factors along the edges: Outdoors splits into Natural-vegetation (Greenery, Tree) and Natural-non-vegetation (Sky, Cloud, Smoke); Indoors splits into Studio-setting and Non-Studio-setting (House-setting, Meeting-setting).]

Video Concept Detection Pipeline: Post-Filtering
[Pipeline diagram: the highlighted stage is the final post-processing/filtering step after the high-level (semantic) context-based methods, leading to BOBO, the best-of-the-best ensemble run.]

Post Filtering: News/Commercial Detector
• Keyframes of a test video are matched against CNN and ABC filter templates; median filters smooth the per-shot decisions into a binary news/non-news result.
• Match filter, for each template:

  S = δ(S_C > τ′_C) & δ(S_E > τ′_E),
  where S_C = (1/N) Σ_n δ(d(P_C, P_MC) > τ_C) and S_E = (1/N) Σ_n δ(d(P_E, P_ME) > τ_E),
  with C denoting color and E denoting edge.

• Thresholds τ_C, τ_E, τ′_C, τ′_E were decided from two training videos; all templates use the same thresholds. Templates were arbitrarily chosen from 3 training videos.
• Performance, measured as misclassification (miss + false alarm) on the Validation Set:
  – CNN: 8 out of 1790 shots (accuracy = 99.6%)
  – ABC: 60 out of 2111 shots (accuracy = 97.2%)
• Our definition of news: news program shots (non-commercial, non-miscellaneous shots).

P@100 vs. Number of Examples
[Scatter plot: P@100 (%) against the number of training examples (log scale, 1 to 10000) per concept; Sport Event, Nature, NS-Face, Car and Weather sit high, while Zoom-In, Albright, Physical Violence and NS-Monologue sit low.]
• Performance is roughly log-linear in the number of examples, yet there are deviations.
• Can log-linear be considered the default against which to evaluate concept complexity?
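The match-filter decision used by the news/commercial detector above can be sketched as follows. The list arguments hold the per-block scores d(P, P_M) between a keyframe and a channel template; the names and the convention that a score above τ counts as a block match are assumptions for illustration, not the exact IBM implementation.

```python
def match_filter(color_sims, edge_sims, tau_c, tau_e, tau_c2, tau_e2):
    """Template match decision: S_C and S_E are the fractions of
    color/edge blocks whose score d(P, P_M) exceeds the per-block
    threshold; the template matches only when BOTH fractions exceed
    their global thresholds tau_c2 / tau_e2."""
    s_c = sum(1 for d in color_sims if d > tau_c) / len(color_sims)
    s_e = sum(1 for d in edge_sims if d > tau_e) / len(edge_sims)
    return s_c > tau_c2 and s_e > tau_e2
```

A keyframe matching any CNN or ABC template would then be flagged, with a median filter over consecutive shots smoothing the final news/non-news decision.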
TRECVID 2003: Average Precision Values
[Bar chart: best IBM vs. best non-IBM average precision for each of the 17 concepts (Sporting Event, Weather, Physical Violence, Zoom-In, Madeleine Albright, Female Speech, Animal, Car/Truck/Bus, Aircraft, NS-Monologue, Non-Studio Setting, Road, Vegetation, Outdoors, NS-Face, People, Building) and for the mean.]
• IBM has the best average precision on 14 out of the 17 concepts.
• The best mean average precision of the IBM system (0.263) is 34 percent better than the second best.
• Pooling skews some AP numbers for high-frequency concepts, which makes judgement difficult, but the numbers can be considered a loose lower bound on performance.
• A bug in the Female_Speech model affected second-level fusion of Female_Speech, News_Subject_Monologue and Madeleine_Albright, among others; this especially hurt the model-vector-based techniques (DMF, NN, Multinet, Ontology).

TRECVID 2003: Precision at Top 100 Returns
[Bar chart: best IBM vs. best non-IBM precision at 100 for each of the 17 concepts and for the mean.]
• IBM has the highest precision @ 100 on 13 out of the 17 concepts.
• Mean precision @ 100 of the best IBM system: 0.6671.
• The best mean precision of the IBM system is 28 percent better than the other systems.
• Different model-vector-based fusion techniques improve performance for different classes of concepts.

Precision of the 10 IBM Runs Submitted (per concept, %):

Run      Outd NSFc Peop Bldg Road Vege Anim FSpc Vehi Airc Monl NStu Sprt Wthr Zoom Viol Albr  Mean
BOU        81   80   90   53   46   96   10   46   68   38   24   97   81   79   44   33   32  58.706
EF         67   77   95   60   33   97   47   69   80   63   25   96   99   98   44   28   28  65.059
BOF        71   77   97   71   52   93   47   69   80   47   25   96   98  100   44   35   32  66.706
DMF17      82   93   90   54   49   97   45   35   76   70    1   99   98   99   44    9   28  62.882
DMF64      82   73   79   53   41   96   33   79   56   67    0   93   98   99   44   34    4  60.647
MLP_BOR    78   75   97   61   53   94   47   38   70   65    1   95  100   97   44   27   30  63.059
MLP_EFC    73   67   97   41   33   96   48   19   49   60    3   97   99   99   44   27   27  57.588
MN         85   55   99   52   45   97   47   66   81   63   25   96   99   98   44   22   28  64.824
ONT        67   77   95   56   42   97   47   69   83   69    6   94   99   98   44   28   28  64.647
BOBO       85   73   99   56   52   93   10   66   56   63    0   97   98   99   44   22   32  61.471
Maximum    85   93   99   71   53   97   48   79   83   70   25   99  100  100   44   35   32  66.706
Average  76.9 73.9 93.4 55.4 45.0 95.7 44.9 53.6 70.7 63.0  8.7 95.7 98.7 98.6 44.0 26.0 25.3 62.908

(Columns: Outdoors, NS-Face, People, Building, Road, Vegetation, Animal, Female Speech, Vehicle, Aircraft, Monologue, Non-Studio, Sports, Weather, Zoom_In, Violence, Albright, and the per-run mean.)

• Processing beyond a single classifier per concept improves performance.
• Dividing the TREC benchmark concepts into 3 types by frequency of occurrence:
  – Performance on highly frequent concepts (>80/100) is further enhanced by Multinet (e.g. Outdoors, Nature_Vegetation, People).
  – Performance on moderately frequent concepts (>50 and <80) is usually improved by discriminant re-classification techniques such as SVMs (DMF17/64) or NNs (MLP_BOR, MLP_EFC).
  – Performance on very rare concepts needs to be boosted through better feature extraction and processing in the initial stages.
• Based on the Fusion Validation Set 2 evaluation, visual models outperform audio/ASR models for 9 concepts, while the reverse is true for 6 concepts.
• Semantic-feature-based techniques improve MAP by 20% over visual models alone.
• Fusion of multiple modalities (audio, visual) improves MAP by 20% over the best unimodal (visual) run (using Fusion Validation Set II for comparison).

Observations and Future Directions
• Generic trainable methods for concept detection demonstrate impressive performance.
• Need to increase the vocabulary of concepts modeled.
• Need to improve the modeling of rare concepts.
• Need multimodality at an earlier level of analysis (e.g., the multimodal model of Monologue in TREC '02 was better than the fusion of multiple unimodal classifiers in TREC '03).
• Multi-classifier, multi-concept and multi-modal fusion offer promising improvements in detection (as measured on the TREC '02 and TREC '03 Fusion Validation Set 2, and in part also by TREC SearchTest '03).

Acknowledgements
• Thanks for additional contributions from:
  – Chitra Dorai (IBM) for the Zoom-In detector,
  – Javier Ruiz-del-Solar (Univ. of Chile) for the face detector,
  – Ishan Sachdev (summer intern, MIT) for helping with the visual uni-models.
• For collaborative annotation:
  – IBM: Ying Li, Christian Lang, Ishan Sachdev, Larry Sansone, Matthew Hill
  – Columbia U.: Winston Hsu
  – Univ. of Chile: Alex Jaimes, Dinko Yaksic, Rodrigo Verschae

Concept Detection Example: Cars
• "Car/truck/bus: segment contains at least one automobile, truck, or bus exterior."
• The concept was trained on the annotated training set; results are shown on the test set.
• Run: Best IBM; Precision @ 100: 0.83
[Keyframes from the BOF run, shown at ranks 1, 4, 32, 36, 68 and 100.]

Concept Detection Example: Ms. Albright
• "Person X: segment contains video of person x (x = Madeleine Albright)."
• Contributions of the audio-based and visual-based models; results on the CF2 (validation) set:
  – Best IBM audio models: average precision 0.30
  – Best IBM visual models: average precision 0.29
  – Best of fusion: average precision 0.47
• Results on the test set (TREC evaluation by NIST): Best IBM, precision 0.32.
[Keyframes shown at ranks 1, 4, 21 and 24.]
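As a closing reference, the (non-interpolated) average precision behind the AP and MAP figures quoted throughout these slides can be computed as below. This is the generic textbook definition, not NIST's exact trec_eval implementation, and the function name is illustrative.

```python
def average_precision(ranked_shots, relevant):
    """AP over a ranked list: the mean of precision@k taken at each
    rank k where a relevant shot appears, divided by the total
    number of relevant shots (missed shots thus lower the score)."""
    hits = 0
    precisions = []
    for k, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0
```

Mean average precision (MAP) is then simply this value averaged over the 17 benchmark concepts.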