IBM Research TRECVID-2007 Video Retrieval System
Apostol (Paul) Natsev
IBM T. J. Watson Research Center
Hawthorne, NY 10532 USA
On Behalf Of:
Murray Campbell, Alexander Haubold, John R. Smith,
Jelena Tesic, Lexing Xie, Rong Yan
Acknowledgement
IBM Research
– Murray Campbell
– Matthew Hill
– Apostol (Paul) Natsev
– Quoc-Bao Nguyen
– Christian Penz
– John R. Smith
– Jelena Tesic
– Lexing Xie
– Rong Yan
UIUC
– Ming Liu
– Xu Han
– Xun Xu
– Thomas Huang
Columbia Univ.
– Alexander Haubold
Carnegie Mellon Univ.
– Jun Yang
Part 1 of 2:
Automatic Search
Outline
Overall System
Summary
Review of Baseline Retrieval Experts
Performance Analysis
Summary
System Overview
Inputs
– Queries: text blurb + visual examples
– Videos: keyframes + ASR/MT transcripts
Features
– Speech: ASR/MT text
– Grid and global visual features: color correlogram, color histogram, edge histogram, co-occurrence texture, color moments, wavelet texture
– Concepts: model vectors over the TV07-36, LSCOM-155, MediaNet-50, Text-LR5, and Text-LR10 lexicons
Retrieval Experts
– Text Search (T)
– Visual Search, bagged SVM on query examples (V)
– Concept Search, bagged SVM on model vectors (S)
– Query-to-concept mapping: Statistical Map, co-occurrence (M1); WordNet Map, Brown stats (M2); WordNet Map, web stats (M3)
Fusion and Runs
– Query-independent fusion (average): MSV.avg, TMSV.avg
– Query-dynamic fusion (weighted average): TMSV.qdyn, TMSVC.qdyn
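The fusion stage combines the ranked lists produced by the individual retrieval experts. Below is a minimal Python sketch of the two fusion flavors named in the diagram, query-independent averaging and query-dynamic weighted averaging; the min-max normalization and the example per-query weights are illustrative assumptions, not the settings of the submitted runs.

```python
# Hedged sketch of the two fusion schemes in the diagram: query-independent
# averaging vs. query-dynamic weighted averaging of expert score lists.
# The min-max normalization and the example weights are assumptions.

def normalize(scores):
    """Min-max normalize a {shot_id: score} map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {sid: (s - lo) / span for sid, s in scores.items()}

def average_fusion(expert_scores):
    """Query-independent fusion: plain average of normalized expert scores."""
    experts = [normalize(s) for s in expert_scores.values()]
    shots = set().union(*experts)
    return {sid: sum(e.get(sid, 0.0) for e in experts) / len(experts)
            for sid in shots}

def weighted_fusion(expert_scores, weights):
    """Query-dynamic fusion: per-query weights on each expert."""
    experts = {name: normalize(s) for name, s in expert_scores.items()}
    shots = set().union(*experts.values())
    total = sum(weights.get(name, 0.0) for name in experts) or 1.0
    return {sid: sum(weights.get(name, 0.0) * e.get(sid, 0.0)
                     for name, e in experts.items()) / total
            for sid in shots}

experts = {"text":    {"shot1": 2.0, "shot2": 0.5},
           "visual":  {"shot1": 0.1, "shot3": 0.9},
           "concept": {"shot2": 0.7, "shot3": 0.3}}
print(average_fusion(experts))
print(weighted_fusion(experts, {"text": 0.5, "visual": 0.3, "concept": 0.2}))
```

In the query-dynamic case the weights would be chosen per topic (for instance, favoring the text expert for named-entity topics), which is what distinguishes the *.qdyn runs from the *.avg runs in the diagram.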
Summary: What Worked and What Didn't?
Baseline retrieval experts
– Speech/text-based
– Concept-based (statistical mapping)
– Concept-based (WordNet mapping, Brown corpus stats)
– Concept-based (WordNet mapping, web stats)
– Concept-based (query examples modeling)
– Visual-based
Experiments
– Improving query-to-concept mapping: web-based stats
– Leveraging external resources: type C runs
– Query-dynamic multimodal fusion
Baseline Retrieval Experts: Review of Approaches
Speech/Text-Based Retrieval
– Auto-query expansion with JuruXML search engine (Volkmer et al., ICME’06)
Visual-Based Retrieval
– Bagged SVM modeling of query topic examples (Natsev et al., MM’05)
Concept-Based Retrieval (G2 Statistical Map)
– Based on co-occurrence of ASR terms and concepts (Natsev et al., MM’07)
Concept-Based Retrieval (WordNet Map, Brown Stats)
– Based on JCN similarity, IC from Brown Corpus (Haubold et al., ICME’06)
Concept-Based Retrieval (WordNet Map, Web Stats)
– Based on JCN similarity, IC from WWW sample
Concept-Based Retrieval (Content-Based Modeling)
– Bagged SVM modeling of query topic examples (Tesic et al., CIVR’07)
Improving Query-to-Concept Mapping
WordNet similarity measures
– Frequently used for query-to-concept mapping
– Frequently based on Information Content (IC) to model term saliency
• Resnik: evaluates information content (IC) of common root
• Jiang-Conrath: evaluates IC of common root and ICs of terms
– IC typically estimated from 1961 Brown Corpus
• IC from Brown Corpus outdated by >40 years
• 76% of words in WordNet not in Brown Corpus (so IC = 0)
Idea: create approximation using WWW
– Perform frequency analysis over large sample of web pages
– Google page count as indicator of frequency
IC from Large-Scale Web Sampling
Sample of web pages:
– 1,231,929 documents (~1M)
– 1,169,368,161 WordNet terms (~1B)
– Distribution similar to Brown Corpus
– Therefore: potentially useful as IC
Google page count:
– As a proxy to term frequency counts
– Term frequency ≈ Doc. Frequency?
– Experiments show linear relationship
– Therefore: potentially useful as IC
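As a concrete illustration of how web counts can substitute for Brown Corpus counts, the sketch below computes Information Content from page counts and plugs it into a Jiang-Conrath-style similarity. The example counts, the example terms, and the choice of least common subsumer are hypothetical placeholders; the real system looks the subsumer up in WordNet's hypernym hierarchy.

```python
import math

# Hedged sketch: web page counts as a proxy for term frequency when estimating
# Information Content (IC) for WordNet-based query-to-concept mapping.
# The per-term counts below are made-up placeholders.

TOTAL_TERM_COUNT = 1_169_368_161  # total WordNet term occurrences in the web sample (from the slide)

def information_content(page_count, total=TOTAL_TERM_COUNT):
    """IC(term) = -log P(term), with P estimated from document frequency."""
    return -math.log(max(page_count, 1) / total)

def jcn_similarity(ic_term1, ic_term2, ic_lcs):
    """Jiang-Conrath similarity from the ICs of two terms and of their least
    common subsumer (LCS): 1 / (IC(t1) + IC(t2) - 2 * IC(LCS))."""
    dist = ic_term1 + ic_term2 - 2.0 * ic_lcs
    return 1.0 / dist if dist > 0 else float("inf")

# Illustrative (hypothetical) counts for a query term, a concept, and their LCS.
ic_query = information_content(35_000_000)     # e.g. a query term
ic_concept = information_content(120_000_000)  # e.g. a lexicon concept
ic_lcs = information_content(400_000_000)      # a shared, more general hypernym
print(jcn_similarity(ic_query, ic_concept, ic_lcs))
```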
Outline
Overall System
Summary
Review of Baseline Retrieval Experts
Performance Analysis
Summary
Evaluation Results: Baseline Retrieval Experts
[Bar chart: Comparison of Unimodal Retrieval Experts. Mean Average Precision of the Speech/Text-Based, Concepts (Co-occurrence Map), Concepts (WordNet Brown Map), Concepts (WordNet Web Map), Concepts (Content-Based), and Visual Content-Based experts, over "All topics" and "All topics but 219"]
– Statistical concept-based run did not generalize
– Web-based IC led to 20% improvement in WordNet runs
– Concept-based runs performed better than speech/text-based runs
– Content-based runs performed best
Baseline Retrieval Experts: JAWS Analysis
[Stacked bar chart: relative contribution of each unimodal expert (Speech/Text-Based, Concepts (Co-occurrence Map), Concepts (WordNet Brown Map), Concepts (WordNet Web Map), Concepts (Content-Based), Visual Content-Based) to each query topic]
For the adventurous:
– Stacked chart showing contribution of each expert per topic
Summary of Submitted and Unsubmitted IBM Runs

Description                                        Code         Run ID       Type  MAP
Text baseline                                      T            Text         A     0.0149
Concept baseline (stat)                            MS           -            A     0.0045
Concept baseline (WordNet, Brown stats)            MB           -            A     0.0146
Concept baseline (WordNet, Web stats)              MW           -            C     0.0177
Concept baseline (content-based, type A)           S            -            A     0.0228
Concept baseline (content-based, type C)           SC           -            C     0.0249
Visual baseline                                    V            -            A     0.0302
Non-text baseline (MS + S + V)                     MSSVavg      MSV          A     0.0213
Non-text baseline (MB + S + V)                     MBSVavg      -            A     0.0301
Query-dynamic fusion, type A (T + MS + S + V)      TMSSVqdyn    TMSV.qdyn    A     0.0210
Query-dynamic fusion, type C (T + MSW + SC + V)    TMSWSCVqdyn  TMSV-C.qdyn  C     0.0272
Multimodal AVG fusion, type A (T + MS + S + V)     TMSSVavg     TMSV         A     0.0231
Multimodal AVG fusion, type C (T + MSW + SC + V)   TMSWSCVavg   -            C     0.0303
Multimodal AVG fusion, type C (T + MW + SC + V)    TMWSCVavg    -            C     0.0372
Oracle fusion (pick best run for each topic)       Oracle       -            C     0.0638
Overall Performance Comparison
[Bar chart: Automatic Search (Best-of Team Performance). Mean Average Precision of the best run from each organization, with the best IBM submitted run (F_C_2_IBM-TMSVC.qdyn_3), the best IBM unsubmitted run (F_A_2_IBM-Unofficial), and the IBM oracle fusion run (F_C_IBM_Oracle) highlighted]
– Best run from each organization shown
– Submitted IBM runs fall in the third tier; performance improves by dropping the failed run
– IBM runs achieve the highest AP scores on 5 of 24 topics
IBM Automatic Search
[Bar chart: IBM performance relative to the best per-topic AP run (%), for each of the 24 query topics]
– Good (>90%): street night, street market, sheep/goat, road/vehicle, people telephone, canal/river
– So-so (40-60%): mountains, waterfront, boat, train, crowd, street protest, street buildings
– Bad (<25%): people table, people stairs, people dogs, people keyboard
Summary: What Worked and What Didn't?
Baseline retrieval experts
– Speech/text-based
– Concept-based (statistical mapping)
– Concept-based (WordNet mapping, Brown corpus stats)
– Concept-based (WordNet mapping, web stats)
– Concept-based (query examples modeling)
– Visual-based
Experiments
– Improving query-to-concept mapping: web-based stats
– Leveraging external resources: type C runs
– Query-dynamic multimodal fusion
Part 2 of 2:
Interactive Search
Annotation-based Interactive Retrieval
Model interactive search as a video annotation task
– Consider each query topic as a keyword, annotate video keyframes
– Extend from CMU’s Extreme Video Retrieval system [Hauptmann et al., MM’06]
Hybrid annotation system
– Minimize annotation time by leveraging two annotation interfaces
– Tagging: Flickr, ESP game [von Ahn et al., CHI’04]
– Browsing: IBM’s EVA [Volkmer et al., MM’05], CMU’s EVR [Hauptmann et al., MM’06]
Formal analysis by modeling the interactive retrieval process
– Tagging-based annotation time per shot
– Browsing-based annotation time per shot
Manual Annotation (I): Tagging
Allow users to associate a single image at a time with one or more keywords
(the most widely used manual annotation approach)
Advantages
– Freely choose arbitrary keywords to annotate
– Only need to annotate relevant keywords
Disadvantages
– “Vocabulary mismatch” problem
– Inefficient to design and type keywords
Suitable for annotating rare keywords
Formal Time Model for Tagging
Key factors for tagging time model
– Number of keywords K_l for image l
– Annotation time for the k-th word t'_{fk}
– Initial setup time for a new image t'_s
– Noise term ε (zero-mean distribution)
Annotation time for one image:
T = t'_{f1} + \dots + t'_{fK_l} + t'_s + \varepsilon = \sum_k t'_{fk} + t'_s + \varepsilon
Total expected annotation time for the entire collection containing L images
– Assumption: the expected time to annotate the k-th word is a constant t_f
E(T_{total}) = \sum_l \left( \sum_{k_l} E(t'_{f k_l}) + E(t'_s) \right) = \sum_l \left( K_l t_f + t_s \right)
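A small numeric sketch of the tagging model above; the per-word time t_f and setup time t_s below are assumed values for illustration, not the ones measured in the user study.

```python
# Hedged sketch of the tagging time model: E(T_total) = sum_l (K_l * t_f + t_s).
# t_f and t_s are assumed values, not the measured ones.

def expected_tagging_time(keywords_per_image, t_f=2.5, t_s=1.0):
    """Expected total tagging time (seconds) given the number of keywords K_l
    assigned to each image l."""
    return sum(k * t_f + t_s for k in keywords_per_image)

# The model predicts that total time grows linearly with the total number of
# annotated keywords, which is exactly what the user study's linear fit checks.
keywords_per_image = [3, 1, 5, 2, 4]              # hypothetical K_l values
print(expected_tagging_time(keywords_per_image))  # 37.5 + 5.0 = 42.5 under the assumed rates
```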
Validation of Tagging Time Model
User study on TRECVID’05 development data
– One user manually tagged 100 images using 303 keywords
– If the model is correct, a linear fit should be found in the results
– The annotation results fit the model very well
Manual Annotation (II): Browsing
Allow users to associate multiple images with a single word at the same time
Advantages
– Efficient to annotate each pair of images and words
– No “vocabulary mismatch”
Disadvantages
– Need to spend time judging both relevant and irrelevant pairs
– Must start with a controlled vocabulary
– Annotate one keyword at a time
Suitable for annotating frequent keywords
Formal Time Model for Browsing
Key factors for browsing time model
– Number of relevant images L_k for a word k
– Annotation time for a relevant image t'_p
– Annotation time for an irrelevant image t'_n
– Noise term ε (zero-mean distribution)
Annotation time for all images w.r.t. a keyword:
T = \sum_{l=1}^{L_k} t'_{pl} + \sum_{l=1}^{L - L_k} t'_{nl} + \varepsilon
Total expected annotation time for the entire collection containing L images
– Assumption: the expected time to annotate a relevant (irrelevant) image is a constant t_p (t_n)
E(T_{total}) = \sum_k \left( \sum_{l_k} E(t'_{p l_k}) + \sum_{l_k} E(t'_{n l_k}) \right) = \sum_k \left( L_k t_p + (L - L_k) t_n \right)
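The analogous per-keyword cost under the browsing model, again with assumed time constants; it makes explicit that the cost is dominated by the L - L_k irrelevant judgments when a keyword is rare.

```python
# Hedged sketch of the browsing time model: for a keyword k with L_k relevant
# images out of L, E(T_k) = L_k * t_p + (L - L_k) * t_n.
# t_p (relevant) and t_n (irrelevant) are assumed values for illustration.

def expected_browsing_time(L, L_k, t_p=1.5, t_n=0.3):
    """Expected time (seconds) to browse the whole collection for one keyword."""
    return L_k * t_p + (L - L_k) * t_n

# For a rare keyword the cost is dominated by the L - L_k irrelevant judgments,
# which is why browsing pays off mainly for frequent keywords.
print(expected_browsing_time(L=10_000, L_k=50))     # mostly irrelevant-image time
print(expected_browsing_time(L=10_000, L_k=3_000))  # frequent keyword
```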
Validation of Browsing Time Model
User study on TRECVID’05 development data
– Three users manually browsed images for 15 minutes each (covering 25 keywords)
– If the model is correct, a linear fit should be found in the results
– The annotation results fit the model for all users
Video Retrieval as Hybrid Annotation
Jointly annotates all topics at the same time
Switches between tagging and browsing annotation interfaces in order to minimize the overall annotation time
– Formally model the annotation time as functions of word frequency, time per word, and annotation interface
– Online machine learning to select images, topics and interfaces based on the annotation models
More details will be released in the final notebook paper
– See related analysis in R. Yan et al. [ACM MM 2007 Workshop on Many Faces of Multimedia]
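One way to read the hybrid strategy is as a per-keyword comparison of the two expected costs, choosing whichever interface is cheaper. The sketch below does this statically, with the same illustrative time constants as in the earlier sketches and a simplified tagging cost that charges the per-image setup to the current keyword; the actual system makes the choice online while selecting images and topics.

```python
# Hedged sketch of the hybrid idea: per keyword/topic, compare the expected
# tagging cost (tag only the L_k relevant images) with the expected browsing
# cost (judge all L images), and use the cheaper interface.
# Time constants and the static per-keyword decision are assumptions.

def tagging_cost(L_k, t_f=2.5, t_s=1.0):
    # Simplification: per-image setup time is charged entirely to this keyword.
    return L_k * (t_f + t_s)

def browsing_cost(L, L_k, t_p=1.5, t_n=0.3):
    # Every image in the collection must be judged for this keyword.
    return L_k * t_p + (L - L_k) * t_n

def choose_interface(L, L_k):
    return "browsing" if browsing_cost(L, L_k) < tagging_cost(L_k) else "tagging"

L = 10_000
for L_k in (10, 500, 5_000):
    print(L_k, choose_interface(L, L_k))  # rare keywords -> tagging, frequent -> browsing
```

Under these assumed rates the rule reproduces the intuition from the two annotation slides: tagging suits rare keywords, browsing suits frequent ones.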
TRECVID-2007 Performance Analysis
The proposed approach allows users to annotate 60% of the image-topic pairs, as compared with ~10% allowed by simple browsing
Balance between tagging & browsing: 1529 retrieved shots from tagging, 797 retrieved shots from browsing
Simple temporal expansion can improve MAP from 0.35 to 0.43
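The slide does not spell out the temporal expansion step. A plausible minimal version, assuming that shots adjacent to a retrieved relevant shot receive a decayed copy of its score, could look like the following; the neighborhood size and decay factor are assumptions.

```python
# Hedged sketch of simple temporal expansion: shots next to a retrieved
# relevant shot are added to the ranked list with a decayed score.
# Shot indices are assumed to be sequential within one video; the window and
# decay values are illustrative, not the settings behind the reported MAP gain.

def temporal_expansion(ranked_shots, window=1, decay=0.5):
    """ranked_shots: {shot_index: score}; returns an expanded {shot_index: score}."""
    expanded = dict(ranked_shots)
    for shot, score in ranked_shots.items():
        for offset in range(1, window + 1):
            for neighbor in (shot - offset, shot + offset):
                bonus = score * (decay ** offset)
                if bonus > expanded.get(neighbor, 0.0):
                    expanded[neighbor] = bonus
    return expanded

print(temporal_expansion({100: 1.0, 205: 0.8}))
# {100: 1.0, 205: 0.8, 99: 0.5, 101: 0.5, 204: 0.4, 206: 0.4}
```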
[Bar chart: Mean Average Precision of the submitted interactive search runs across teams, with the IBM runs among them]
TRECVID-2007 Per-Query Performance Analysis
[Bar chart: Average Precision per query topic, best IBM run vs. best non-IBM run]
– Good: “telephone”, “interview”, “mountains”
– Not as good: “canal”, “bicycle”, “classroom”, “dog”
– Highest AP scores on 10 out of 24 topics
Potential Improvement
Search beyond the keyframe (look at the I-frames)
– Example: finding Sheep/Goats
Search beyond the I-frame (leverage text info or temporal context)
– Example: finding Sheep/Goats
Better online learning algorithms to search for more shots in the same amount of time (include temporal information)