Dynamic Multimodal Fusion in Video Search
Lexing Xie, IBM T. J. Watson Research Center
Joint work with Apostol Natsev, Jelena Tesic, Rong Yan, John R. Smith
WOCC-2007, April 28th, 2007

The Multimedia Search Problem
Media repositories: produced video corpora, online media shares, personal media, digital life records.
Multi-modal query, e.g., "Find shots of flying through snow capped mountains."
Impact
– Widely applicable: consumer, media, enterprise, web, science, ...
– Bridges traditional search and multimedia knowledge extraction
Multiple tipping points: multimedia semantics learning, multimedia ontology, training and evaluation, ...
NIST TRECVID benchmark
– Validation for emerging technologies
– Scaling video retrieval technologies to large-scale applications

Multimedia Search: Query Topics Overview
Query types, by query specificity (generic vs. specific/named):
– Find objects. Generic: palm trees, tanks, ship/boat.
– Find people. Generic: people shaking hands, people with banners, people in a meeting, people entering or leaving a building. Specific/named: George W. Bush, Condoleezza Rice, Tony Blair, Iyad Allawi, Omar Karami, Hu Jintao, Mahmoud Abbas.
– Find events. Generic: soccer score, basketball, tennis, airplane take-off, helicopter in flight.
– Find sites. Generic: tall buildings, office setting, road with cars, fire. Specific/named: map of Iraq showing Baghdad.
Example topics: Topic 13 (speaker talking in front of the US flag), Topic 4 (scenes of snow capped mountains), Topic 48 (other examples of overhead zooming-in views of canyons in the Western United States).

Outline
– The multimedia search challenge
– A bird's-eye view of the IBM Multimedia Search System
– Query-dependent multimodal fusion
– Evaluation on the TRECVID benchmark
– Summary

IBM Multimedia Search System Overview
Query inputs: textual query formulation (e.g., "Find shots of an airplane taking off"), query topic examples, LSCOM-lite semantic models, text, and visual features.
Approaches:
1. Text-based: story-based retrieval with automatic query refinement/reranking
2. Model-based: automatic query-to-model mapping based on query topic text
3. Semantic-based: cluster-based semantic space modeling and data selection
4. Visual-based: light-weight learning (discriminative and nearest-neighbor modeling) with smart sampling
5. Fusion of multi-modal results (text + model + semantic + visual):
– Query-independent, or
– Query-class-dependent, with soft, hard, or dynamic class membership

1. IBM Text Retrieval System
Corpus indexing: shot-level and story-level ASR/MT documents, aligned at the phrase level.
Query analysis: tokenization, phrasing, stemming; part-of-speech tagging and filtering (e.g., "airplane", "take off").
Query refinement: pseudo-relevance feedback, both shot-based (e.g., adding "pilot") and story-based (e.g., adding "crash").
Query execution and fusion:
– IBM Semantic Search engine (Juru) using TF*IDF-based retrieval
– The refined shot-based query yields ranked shots; the refined story-based query yields ranked stories, whose scores are propagated to their constituent shots
– Shot-level fusion and re-ranking produce the final text-based ranking of shots
Performance (MAP): 0.09873 (2005), 0.05169 (2006). [T. Volkmer et al., ICME 2006]
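As a rough illustration of the score-propagation and fusion step above, the Python sketch below inherits story-level scores down to member shots and linearly combines them with shot-level scores. The max-based propagation rule, the fusion weight, and all identifiers are illustrative assumptions, not the system's actual implementation.

```python
# A minimal sketch of shot-level score propagation and shot/story fusion,
# assuming each story lists its member shots. The linear fusion weight and
# the max-based inheritance rule are illustrative assumptions.
def fuse_text_scores(shot_scores, story_scores, story_to_shots, story_weight=0.4):
    """Combine shot-level scores with scores propagated from ranked stories.

    shot_scores:    {shot_id: score} from the shot-based refined query
    story_scores:   {story_id: score} from the story-based refined query
    story_to_shots: {story_id: [shot_id, ...]} story-to-shot alignment
    """
    propagated = {}
    for story, score in story_scores.items():
        for shot in story_to_shots.get(story, []):
            # A shot inherits the best score among its enclosing stories.
            propagated[shot] = max(propagated.get(shot, 0.0), score)

    all_shots = set(shot_scores) | set(propagated)
    fused = {
        shot: (1 - story_weight) * shot_scores.get(shot, 0.0)
              + story_weight * propagated.get(shot, 0.0)
        for shot in all_shots
    }
    # Final text-based ranking: shots sorted by fused score, best first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```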
2. IBM Model-based Retrieval System
Lexical (WordNet) approach.
Query analysis: tokens, stems, phrases; stop-word removal (e.g., "airplane", "take off").
Query refinement (query-to-concept mapping and weighting):
– Automatic mapping of query text to concept models and weights, e.g., Outdoors (1.0), Sky (0.8), Military (0.6)
– Lexical approach based on WordNet Lesk similarity
Query execution (concept-based retrieval):
– Concept-based retrieval using statistical concept models from the concept model repository
– Weighted averaging model fusion produces the model-based ranking
Performance (MAP): 0.029 (2006). [Haubold et al., ICME 2006]
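To make the query-to-concept mapping concrete, here is a minimal gloss-overlap sketch in the spirit of the Lesk similarity mentioned above, using WordNet glosses via NLTK. The stop-word list, the overlap scoring and normalization, and the toy concept lexicon are all illustrative assumptions; the deployed system maps queries onto its LSCOM-lite concept models with its own weighting scheme.

```python
# A minimal Lesk-style gloss-overlap sketch for query-to-concept mapping.
# Requires nltk with the WordNet corpus installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

STOPWORDS = {"a", "an", "the", "of", "in", "to", "and", "or", "is"}

def gloss_words(term):
    """Content words from the WordNet glosses of all senses of a term."""
    words = set()
    for syn in wn.synsets(term):
        words.update(w.lower().strip(".,;()") for w in syn.definition().split())
    return words - STOPWORDS

def map_query_to_concepts(query_terms, concept_lexicon, top_k=3):
    """Rank lexicon concepts by total gloss overlap with the query terms."""
    scored = []
    for concept in concept_lexicon:
        concept_gloss = gloss_words(concept)
        overlap = sum(len(gloss_words(t) & concept_gloss) for t in query_terms)
        if overlap > 0:
            scored.append((concept, overlap))
    if not scored:
        return []
    scored.sort(key=lambda cs: cs[1], reverse=True)
    best = scored[0][1]
    # Normalize overlap counts into [0, 1] concept weights.
    return [(concept, count / best) for concept, count in scored[:top_k]]

# Example with a toy lexicon (not the actual LSCOM-lite hierarchy):
# map_query_to_concepts(["airplane", "takeoff"],
#                       ["Sky", "Outdoors", "Military", "Maps"])
```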
4. IBM Visual Retrieval System
Input: visual query examples (e.g., airplane take-off); visual concepts can help refine query topics.
Data modeling for discriminative learning:
– Extract visual features from the query examples and the test set
– Content-based over-sampling of the visual query examples to create positive bags
– Content-based pseudo-negative sampling (down-sampling) of the development and test sets
Learning and fusion:
– SVM learning, with an atomic content-based retrieval (CBR) SVM run for each bag
– OR and AND fusion of the atomic runs, combined with MECBR and SVM fusion, yields the fused visual run

3. IBM Visual-Semantic Retrieval
Idea: leverage automatic visual concept detectors for semantic model-based query refinement; visual context can disambiguate word senses (e.g., an airplane co-occurs with Sky, not Road).
Semantic concept lexicon:
– Hierarchy of 39 LSCOM-lite concepts (e.g., Sky, Road, Maps)
– Statistical models based on visual features and machine learning
Approach:
– Map visual query examples into a semantic space whose dimensions correspond to LSCOM-lite concept scores
– Cluster-based data modeling and sample selection from the test set
– Primitive SVM run for each bag; AND fusion of the primitive runs produces the semantic-based confidence list
Observation: we model query topic examples in semantic space, but the approach applies in any descriptor space.

5. Multi-modal Fusion
Training queries undergo semantic query analysis; test queries are matched against them to search for fusion weights over the text, model, semantic, and visual result lists, which are then combined by weighted fusion.
Each modality is good at finding certain things:
– Text: named people and other named entities
– Visual: semantic scenes consistent in color or layout, e.g., sports, weather
– Concept models: non-specific queries, e.g., protest, boat, fire
Averaging the retrieval models helps in any case; query-class or query-cluster dependent fusion helps more [CMU, NUS, Columbia].
Query soft/hard/dynamic class approach:
– Semantic query analysis for query matching
– Matching based on PIQUANT-II Q&A text features
– Weighted linear combination for fusion
– Training queries: TRECVID 2005
Performance (MAP), 2006: 0.0756 -> 0.087 -> 0.0937

Extract Semantic Query Features
Input query text (e.g., "people with computer display") undergoes semantic tagging into semantic categories, yielding a query feature vector, e.g., Person:CATEGORY, BodyPart:UNKNOWN, Furniture:UNKNOWN, ... [IBM UIMA and PIQUANT analysis engines]

Three Query-dependent Fusion Strategies
Soft, hard, or dynamic query-class membership (see the sketch below).
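The sketch below illustrates the weighted linear combination with hard and soft query-class membership, assuming per-modality scores are already normalized to [0, 1]. The class names and weight values are placeholders, not the weights learned from the TRECVID 2005 training queries.

```python
MODALITIES = ("text", "model", "semantic", "visual")

# Illustrative per-class weight vectors; the real system learns these from
# training queries, so these numbers are placeholders only.
CLASS_WEIGHTS = {
    "named_person": {"text": 0.6, "model": 0.1, "semantic": 0.1, "visual": 0.2},
    "scene":        {"text": 0.2, "model": 0.2, "semantic": 0.2, "visual": 0.4},
    "generic":      {"text": 0.25, "model": 0.25, "semantic": 0.25, "visual": 0.25},
}

def fuse(shot_scores, query_class="generic", soft_membership=None):
    """Weighted linear fusion of normalized per-modality shot scores.

    shot_scores: {shot_id: {modality: score in [0, 1]}}
    query_class: class used for hard fusion when no soft membership is given
    soft_membership: optional {class_name: probability}; soft fusion blends
        the per-class weight vectors by these probabilities
    """
    if soft_membership:
        # Soft class membership: blend the weight vectors of all classes.
        weights = {m: sum(p * CLASS_WEIGHTS[c][m]
                          for c, p in soft_membership.items())
                   for m in MODALITIES}
    else:
        # Hard class membership: use a single class's weight vector.
        weights = CLASS_WEIGHTS[query_class]
    fused = {shot: sum(weights[m] * scores.get(m, 0.0) for m in MODALITIES)
             for shot, scores in shot_scores.items()}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hard: fuse(scores, query_class="named_person")
# Soft: fuse(scores, soft_membership={"named_person": 0.7, "generic": 0.3})
```

In this scheme, dynamic class membership would derive the soft_membership vector per query, e.g., from the PIQUANT-II query-feature matching described above.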
Outline
– The multimedia search challenge
– A bird's-eye view of the IBM Multimedia Search System
– Query-dependent multimodal fusion
– Evaluation and demo on the TRECVID benchmark
– Summary

NIST TRECVID Benchmark at a Glance
NIST benchmark for evaluating the state of the art in video retrieval.
Benchmark tasks:
– Semantic concept detection; search (2003-); rushes (2005-)
– Shot detection and story segmentation
Growing participation and growing data sets:
– TRECVID 2001: 12* participants; 11 hours of NIST video
– TRECVID 2002: 17 participants; 73 hours of video from the Prelinger archives
– TRECVID 2003: 24 participants; 133 hours of 1998 ABC, CNN news and C-SPAN
– TRECVID 2004: 38 participants; 173 hours of 1998 ABC, CNN news and C-SPAN
– TRECVID 2005: 42* participants; 220 hours of 2004 news from U.S., Arabic, and Chinese sources; 50 hours of BBC stock shots (vacation spots)
– TRECVID 2006: 54* participants; 380 hours of 2004 and 2005 news from U.S., Arabic, and Chinese sources; 100 hours of BBC rushes (interviews)
* Number of participants that completed at least one task.

Automatic/Manual Search Overall Performance (Mean AP)
[Chart: TRECVID'06 automatic and manual search, mean average precision over all submitted runs, with IBM automatic runs highlighted against other automatic and manual runs.]
IBM official runs (MAP):
– Text (baseline): 0.041; text (story-based): 0.052
– Multimodal fusion: query-independent 0.076; query classes (soft) 0.086; query classes (hard) 0.087
IBM modal runs (MAP):
– Semantic-based run: 0.037; visual-based run: 0.0696
Multi-modal fusion doubles baseline performance! Visual and semantic retrieval have a significant impact over a text baseline.

TRECVID'06 Fusion Results (Relative Performance)
[Bar chart: relative improvement (%) per query topic, query-class fusion vs. query-independent fusion.]
Observations:
– Concept-related queries improved the most: "tall building", "prisoner", "helicopters in flight", "soccer"
– Named-entity queries improved slightly: "Dick Cheney", "Saddam Hussein", "Bush Jr."
– The generic people category deteriorated the most: "people in formation", "at computer display", "w/ newspaper", "w/ books"

Improvement over TRECVID'06 and TRECVID'05
Qdyn and Qcomp > Qclass >> Qind.
The new improvements come from the increased weights on concepts for object queries (next slide).
Improved queries …

Results
Improved upon TRECVID'06 on generic people queries.

Demo
[Demo screenshots]

Thank You
For more information:
– IBM Multimedia Retrieval Demo: http://mp7.watson.ibm.com/
– "Dynamic Multimodal Fusion for Video Search", L. Xie, A. Natsev, J. Tesic, ICME 2007, to appear.
– "IBM Research TRECVID-2006 Video Retrieval System", M. Campbell, S. Ebadollahi, M. Naphade, A. P. Natsev, J. R. Smith, J. Tesic, L. Xie, K. Scheinberg, J. Seidl, and A. Haubold, NIST TRECVID Workshop, November 2006.