IBM Research TRECVID-2007 Video Retrieval System
Apostol (Paul) Natsev
IBM T. J. Watson Research Center, Hawthorne, NY 10532 USA
On behalf of: Murray Campbell, Alexander Haubold, John R. Smith, Jelena Tesic, Lexing Xie, Rong Yan

Acknowledgement
– IBM Research: Murray Campbell, Matthew Hill, Apostol (Paul) Natsev, Quoc-Bao Nguyen, Christian Penz, John R. Smith, Jelena Tesic, Lexing Xie, Rong Yan
– UIUC: Ming Liu, Xu Han, Xun Xu, Thomas Huang
– Columbia Univ.: Alexander Haubold
– Carnegie Mellon Univ.: Jun Yang

Part 1 of 2: Automatic Search

Outline
– Overall System Summary
– Review of Baseline Retrieval Experts
– Performance Analysis
– Summary

System Overview
[Diagram: pipeline from query/video inputs through features and retrieval experts to fusion and submitted runs]
– Inputs: queries (text blurb, query examples) and videos (keyframes, ASR/MT transcripts)
– Features: speech (ASR/MT), grid and global visual features (color correlogram, color histogram, edge histogram, co-occurrence texture, color moments, wavelet texture), and concept model vectors (TV07-36, LSCOM-155, MediaNet-50, Text-LR5, Text-LR10)
– Retrieval experts: text search (T); visual search with bagged SVMs (V); concept search with bagged SVMs (S); concept mapping via a statistical co-occurrence map (M1), a WordNet map with Brown Corpus stats (M2), and a WordNet map with web stats (M3)
– Fusion runs: query-independent averaging (Text, MSV.avg, TMSV.avg) and query-dynamic weighted averaging (TMSV.qdyn, TMSVC.qdyn)
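As a concrete illustration of the global visual features named in the system overview, the sketch below computes a normalized RGB color histogram for a keyframe. It is a minimal example assuming PIL and NumPy; the bin count, color space, normalization, and file name are placeholders, not the settings of the IBM feature extractors.

```python
# Minimal sketch of one of the global features listed above (a color histogram).
# Bin counts, color space, and normalization are illustrative assumptions.
import numpy as np
from PIL import Image

def color_histogram(keyframe_path, bins_per_channel=8):
    """Return an L1-normalized RGB color histogram for one keyframe."""
    img = np.asarray(Image.open(keyframe_path).convert("RGB"))
    pixels = img.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()  # normalize so keyframes of any size are comparable

# Example with a hypothetical keyframe file (8*8*8 = 512 dimensions):
# vec = color_histogram("shot123_keyframe.jpg")
```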
Summary: What Worked and What Didn't?
Baseline retrieval experts (grades were shown graphically on the slide)
– Speech/text-based
– Concept-based (statistical mapping)
– Concept-based (WordNet mapping, Brown corpus stats)
– Concept-based (WordNet mapping, web stats)
– Concept-based (query-example modeling)
– Visual-based
Experiments
– Improving query-to-concept mapping: web-based stats
– Leveraging external resources: type C runs
– Query-dynamic multimodal fusion

Baseline Retrieval Experts: Review of Approaches
– Speech/text-based retrieval: automatic query expansion with the JuruXML search engine (Volkmer et al., ICME'06)
– Visual-based retrieval: bagged SVM modeling of query topic examples (Natsev et al., MM'05)
– Concept-based retrieval (G2 statistical map): based on the co-occurrence of ASR terms and concepts (Natsev et al., MM'07)
– Concept-based retrieval (WordNet map, Brown stats): based on JCN similarity, with IC estimated from the Brown Corpus (Haubold et al., ICME'06)
– Concept-based retrieval (WordNet map, web stats): based on JCN similarity, with IC estimated from a WWW sample
– Concept-based retrieval (content-based modeling): bagged SVM modeling of query topic examples (Tesic et al., CIVR'07)

Improving Query-to-Concept Mapping
WordNet similarity measures
– Frequently used for query-to-concept mapping
– Frequently based on Information Content (IC) to model term saliency
  • Resnik: evaluates the IC of the common root
  • Jiang-Conrath (JCN): evaluates the IC of the common root and the ICs of the two terms
– IC is typically estimated from the 1961 Brown Corpus
  • IC from the Brown Corpus is outdated by more than 40 years
  • 76% of the words in WordNet do not occur in the Brown Corpus (so their IC = 0)
Idea: create an approximation using the WWW
– Perform frequency analysis over a large sample of web pages
– Use the Google page count as a frequency indicator

IC from Large-Scale Web Sampling
Sample of web pages
– 1,231,929 documents (~1M)
– 1,169,368,161 WordNet terms (~1B occurrences)
– Distribution similar to the Brown Corpus
Google page count
– Used as a proxy for term frequency counts (term frequency ≈ document frequency?)
– Experiments show a linear relationship
– Therefore: potentially useful as a source of IC
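To make the mapping step concrete, here is a hedged sketch of JCN similarity driven by corpus-derived IC. The frequency table, total count, and example terms are hypothetical; the actual system maps query terms onto its concept vocabularies (LSCOM, MediaNet, etc.), and the only point illustrated is that web-based counts can be dropped in where Brown Corpus counts are missing.

```python
# Sketch of JCN similarity with Information Content (IC) from frequency counts.
# Function and variable names are illustrative, not the IBM implementation.
import math

def information_content(term, freq, total):
    """IC(term) = -log p(term). Terms unseen in the corpus get IC = 0 (the
    Brown Corpus problem noted above for 76% of WordNet words)."""
    count = freq.get(term, 0)
    if count == 0:
        return 0.0
    return -math.log(count / total)

def jcn_similarity(term1, term2, common_root, freq, total):
    """JCN distance = IC(t1) + IC(t2) - 2*IC(lowest common subsumer);
    similarity is taken as the inverse distance."""
    dist = (information_content(term1, freq, total)
            + information_content(term2, freq, total)
            - 2.0 * information_content(common_root, freq, total))
    return 1.0 / dist if dist > 0 else float("inf")

# Web-based IC simply swaps the frequency table, e.g. page counts over the
# ~1M-document web sample instead of Brown Corpus counts (values hypothetical):
# web_freq = {"sheep": 120000, "goat": 95000, "animal": 2500000}
# sim = jcn_similarity("sheep", "goat", "animal", web_freq, total=1_231_929)
```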
Outline (repeated)
– Overall System Summary
– Review of Baseline Retrieval Experts
– Performance Analysis
– Summary

Evaluation Results: Baseline Retrieval Experts
[Bar chart: Mean Average Precision of the unimodal retrieval experts, over all topics and over all topics except topic 219. Over all topics: visual content-based 0.0302, concepts (content-based) 0.0228, concepts (WordNet web map) 0.0177, speech/text-based 0.0149, concepts (WordNet Brown map) 0.0146, concepts (co-occurrence map) ~0.005]
– The statistical concept-based run did not generalize
– Web-based IC led to a 20% improvement in the WordNet runs
– Concept-based runs performed better than speech/text-based runs
– Content-based runs performed best

Baseline Retrieval Experts: JAWS Analysis
[Stacked bar chart: relative contribution of each unimodal expert per topic]
– For the adventurous: stacked chart showing the contribution of each expert per topic

Summary of Submitted and Unsubmitted IBM Runs (the AVG and query-dynamic fusion steps are sketched in code after the performance comparison below)
Description                                      | Code        | Run ID      | Type | MAP
Text baseline                                    | T           | Text        | A    | 0.0149
Concept baseline (stat)                          | MS          | -           | A    | 0.0045
Concept baseline (WordNet, Brown stats)          | MB          | -           | A    | 0.0146
Concept baseline (WordNet, Web stats)            | MW          | -           | C    | 0.0177
Concept baseline (content-based, type A)         | S           | -           | A    | 0.0228
Concept baseline (content-based, type C)         | SC          | -           | C    | 0.0249
Visual baseline                                  | V           | -           | A    | 0.0302
Non-text baseline (MS + S + V)                   | MSSVavg     | MSV         | A    | 0.0213
Non-text baseline (MB + S + V)                   | MBSVavg     | -           | A    | 0.0301
Query-dynamic fusion, type A (T + MS + S + V)    | TMSSVqdyn   | TMSV.qdyn   | A    | 0.0210
Query-dynamic fusion, type C (T + MSW + SC + V)  | TMSWSCVqdyn | TMSV-C.qdyn | C    | 0.0272
Multimodal AVG fusion, type A (T + MS + S + V)   | TMSSVavg    | TMSV        | A    | 0.0231
Multimodal AVG fusion, type C (T + MSW + SC + V) | TMSWSCVavg  | -           | C    | 0.0303
Multimodal AVG fusion, type C (T + MW + SC + V)  | TMWSCVavg   | -           | C    | 0.0372
Oracle fusion (pick best run for each topic)     | Oracle      | -           | C    | 0.0638

Overall Performance Comparison
[Bar chart: Automatic Search, best-of team performance: Mean Average Precision of the best run from each organization, with the best IBM submitted run, best IBM unsubmitted run, and the IBM oracle fusion run highlighted]
– Best run from each organization shown
– Submitted IBM runs fall in the third tier; they improve by dropping the failed run
– IBM runs achieve the highest AP scores on 5 of 24 topics
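The fused runs in the table combine the per-shot scores of the individual experts. The sketch below shows the generic pattern (normalize each expert's scores, then take a plain or weighted average); the normalization scheme, weights, and example scores are illustrative assumptions, not the exact IBM fusion settings.

```python
# Hedged sketch of score fusion across retrieval experts. With uniform weights
# this corresponds to the query-independent AVG runs; per-query weights give
# the query-dynamic variant. All constants here are placeholders.
from collections import defaultdict

def normalize(scores):
    """Min-max normalize one expert's shot -> score map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {shot: (s - lo) / span for shot, s in scores.items()}

def fuse(expert_scores, weights=None):
    """Weighted-average fusion of several experts' score lists."""
    fused = defaultdict(float)
    for name, scores in expert_scores.items():
        w = 1.0 if weights is None else weights.get(name, 0.0)
        for shot, s in normalize(scores).items():
            fused[shot] += w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example with hypothetical scores from three experts (T, S, V):
# ranked = fuse({"T": {"shot1": 2.1, "shot2": 0.4},
#                "S": {"shot1": 0.7, "shot2": 0.9},
#                "V": {"shot2": 0.8, "shot3": 0.6}},
#               weights={"T": 0.5, "S": 0.2, "V": 0.3})  # query-dynamic weights
```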
IBM Automatic Search
[Bar chart: IBM performance relative to the best per-topic AP (%), one bar per topic]
– Good (>90%): street night, street market, sheep/goat, road/vehicle, people telephone, canal/river
– So-so (40-60%): mountains, waterfront, boat, train, crowd, street protest, street buildings
– Bad (<25%): people table, people stairs, people dogs, people keyboard

Summary: What Worked and What Didn't?
Baseline retrieval experts (grades were shown graphically on the slide)
– Speech/text-based
– Concept-based (statistical mapping)
– Concept-based (WordNet mapping, Brown corpus stats)
– Concept-based (WordNet mapping, web stats)
– Concept-based (query-example modeling)
– Visual-based
Experiments
– Improving query-to-concept mapping: web-based stats
– Leveraging external resources: type C runs
– Query-dynamic multimodal fusion

Part 2 of 2: Interactive Search

Annotation-based Interactive Retrieval
Model interactive search as a video annotation task
– Consider each query topic as a keyword and annotate video keyframes with it
– Extends CMU's Extreme Video Retrieval system [Hauptmann et al., MM'06]
Hybrid annotation system
– Minimize annotation time by leveraging two annotation interfaces
– Tagging: Flickr, ESP game [von Ahn et al., CHI'04]
– Browsing: IBM's EVA [Volkmer et al., MM'05], CMU's EVR [Hauptmann et al., MM'06]
Formal analysis by modeling the interactive retrieval process
– Tagging-based annotation time per shot
– Browsing-based annotation time per shot

Manual Annotation (I): Tagging
Users associate a single image at a time with one or more keywords (the most widely used manual annotation approach)
Advantages
– Users can freely choose arbitrary keywords
– Only relevant keywords need to be annotated
Disadvantages
– "Vocabulary mismatch" problem
– Inefficient to devise and type keywords
Suitable for annotating rare keywords

Formal Time Model for Tagging
Key factors for the tagging time model
– Number of keywords K_l for image l
– Annotation time t'_fk for the k-th word
– Initial setup time t'_s for a new image
– Noise term ε (zero-mean distribution)
Annotation time for one image:

    T = t'_f1 + ... + t'_fK + t'_s + ε = Σ_k t'_fk + t'_s + ε

Total expected annotation time for a collection of L images, assuming the expected time to annotate the k-th word is a constant t_f:

    E(T_total) = Σ_l [ Σ_k E(t'_fkl) + E(t'_s) ] = Σ_l ( K_l · t_f + t_s )
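The expected-time formula above is easy to state in code. The sketch below evaluates it for a hypothetical session; the timing constants are placeholders, not the measured values from the user study described on the next slide.

```python
# Minimal sketch of the tagging time model: expected total tagging time is the
# sum over images of (number of keywords * time per word + setup time).
# The timing constants are hypothetical, not measured values.
def expected_tagging_time(keywords_per_image, t_f, t_s):
    """E(T_total) = sum_l (K_l * t_f + t_s)."""
    return sum(k * t_f + t_s for k in keywords_per_image)

# Example: 100 images with 3 keywords each, 2.0 s per typed word, 1.5 s setup
# per image -> 100 * (3 * 2.0 + 1.5) = 750 s. The TRECVID'05 user study on the
# next slide checks that measured times grow linearly in this way.
# print(expected_tagging_time([3] * 100, t_f=2.0, t_s=1.5))
```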
Validation of Tagging Time Model
User study on TRECVID'05 development data
– One user manually tagged 100 images using 303 keywords
– If the model is correct, a linear fit should be found in the results
– The annotation results fit the model very well

Manual Annotation (II): Browsing
Users associate multiple images with a single keyword at the same time
Advantages
– Efficient to annotate each image-keyword pair
– No "vocabulary mismatch"
Disadvantages
– Time is spent judging both relevant and irrelevant pairs
– Must start from a controlled vocabulary
– Only one keyword is annotated at a time
Suitable for annotating frequent keywords

Formal Time Model for Browsing
Key factors for the browsing time model
– Number of relevant images L_k for a keyword k
– Annotation time t'_p for a relevant image
– Annotation time t'_n for an irrelevant image
– Noise term ε (zero-mean distribution)
Annotation time for all images with respect to a keyword k:

    T = Σ_{l=1..L_k} t'_pl + Σ_{l=1..L-L_k} t'_nl + ε

Total expected annotation time for a collection of L images, assuming the expected times to annotate a relevant and an irrelevant image are constants t_p and t_n:

    E(T_total) = Σ_k [ Σ_l E(t'_plk) + Σ_l E(t'_nlk) ] = Σ_k ( L_k · t_p + (L - L_k) · t_n )

Validation of Browsing Time Model
User study on TRECVID'05 development data
– Three users manually browsed images for 15 minutes (covering 25 keywords)
– If the model is correct, a linear fit should be found in the results
– The annotation results fit the model for all users

Video Retrieval as Hybrid Annotation
– Jointly annotates all topics at the same time
– Switches between the tagging and browsing interfaces to minimize the overall annotation time (a cost-comparison sketch follows the performance analysis below)
  • Formally model the annotation time as a function of word frequency, time per word, and annotation interface
  • Online machine learning selects images, topics, and interfaces based on the annotation models
– More details will be released in the final notebook paper
  • See the related analysis in R. Yan et al. [ACM MM 2007 Workshop on Many Faces of Multimedia]

TRECVID-2007 Performance Analysis
– The proposed approach lets users annotate 60% of the image-topic pairs, compared with ~10% for simple browsing
– Balance between tagging and browsing: 1529 retrieved shots from tagging, 797 retrieved shots from browsing
– Simple temporal expansion improves MAP from 0.35 to 0.43
[Bar chart: Mean Average Precision of interactive search runs per team, with the IBM run and its temporally expanded variant highlighted]
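The cost-comparison sketch referenced in the hybrid-annotation slide is below: for each topic keyword, compare the expected cost of the two interfaces under the time models and pick the cheaper one. The cost expressions come from the slides; the constants, the rough per-keyword tagging cost, and the frequency estimates are illustrative assumptions, not IBM's online learning procedure, which adapts such estimates during the session.

```python
# Hedged sketch of interface selection in hybrid annotation. Timing constants
# and keyword-frequency estimates are hypothetical placeholders.
def browsing_cost(L, L_k, t_p, t_n):
    """E(T) = L_k * t_p + (L - L_k) * t_n for one keyword over L images."""
    return L_k * t_p + (L - L_k) * t_n

def tagging_cost(L_k, t_f, t_s):
    """Rough per-keyword tagging cost: each of the L_k relevant images must be
    opened (setup) and the word typed."""
    return L_k * (t_f + t_s)

def choose_interface(L, est_L_k, t_p=0.5, t_n=0.2, t_f=2.0, t_s=1.5):
    """Pick 'browsing' for frequent keywords, 'tagging' for rare ones."""
    if browsing_cost(L, est_L_k, t_p, t_n) < tagging_cost(est_L_k, t_f, t_s):
        return "browsing"
    return "tagging"

# Example: in a 10,000-image collection a keyword estimated at 900 occurrences
# is cheaper to browse, while one estimated at 20 occurrences is cheaper to tag.
# print(choose_interface(10_000, 900), choose_interface(10_000, 20))
```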
TRECVID-2007 Per-Query Performance Analysis
[Bar chart: Average Precision per topic, best IBM run vs. best non-IBM run]
– Good: "telephone", "interview", "mountains"
– Not as good: "canal", "bicycle", "classroom", "dog"
– Highest AP scores on 10 out of 24 topics

Potential Improvement
– Search beyond the keyframe (look at the I-frames), e.g., for finding sheep/goats
– Search beyond the I-frames (leverage text information or temporal context), e.g., for finding sheep/goats
– Better online learning algorithms to search for more shots in the same amount of time (include temporal information)
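The performance analysis notes that simple temporal expansion lifted MAP from 0.35 to 0.43, and the improvement ideas above also point to temporal context. Below is a minimal sketch of one such expansion, assuming shots are identified by sequential indices within a video; the window size, discount, and shot IDs are illustrative assumptions rather than the method used in the IBM runs.

```python
# Hedged sketch of simple temporal expansion: shots adjacent in time to a
# retrieved shot are added with a discounted score.
def temporal_expansion(ranked_shots, window=1, discount=0.5):
    """ranked_shots: list of ((video_id, shot_index), score). Returns a new
    ranked list that also includes temporal neighbors of each retrieved shot."""
    expanded = {}
    for (video, idx), score in ranked_shots:
        for offset in range(-window, window + 1):
            neighbor = (video, idx + offset)
            s = score if offset == 0 else score * discount
            expanded[neighbor] = max(expanded.get(neighbor, 0.0), s)
    return sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)

# Example: a retrieved shot pulls in its immediate neighbors at half its score.
# print(temporal_expansion([(("vid7", 42), 1.0), (("vid7", 44), 0.8)]))
```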