Dynamic Multimodal Fusion in
Video Search
Lexing Xie
IBM T J Watson Research Center
joint work with Apostol Natsev, Jelena Tesic,
Rong Yan, John R Smith
WOCC-2007, April 28th, 2007
© 2007 IBM
The Multimedia Search Problem
A multi-modal query (e.g., “Find shots of flying through snow-capped mountains”) issued against media repositories: produced video corpora, online media shares, personal media, and digital life records.
Impact
– Widely applicable: consumer, media, enterprise, web, science …
– Bridging traditional search and multimedia knowledge extraction
– Multiple tipping points: multimedia semantics learning, multimedia ontology, training and evaluation, …
NIST TRECVID benchmark
– Validation for emerging technologies
– Scaling video retrieval technologies to large-scale applications
Multimedia Search: Query Topics overview
Query types vs. query specificity (generic vs. specific/named):

– Find objects. Generic: palm trees, tanks, ship/boat.
– Find people. Generic: people shaking hands, people with banners, people in a meeting, people entering or leaving a building (e.g., Topic 13: speaker talking in front of the US flag). Specific/named: George W. Bush, Condoleezza Rice, Tony Blair, Iyad Allawi, Omar Karami, Hu Jintao, Mahmoud Abbas.
– Find events. Generic: soccer score, basketball, tennis, airplane take-off, helicopter in flight.
– Find sites. Generic: tall buildings, office setting, road with cars, fire (e.g., Topic 4: scenes of snow-capped mountains; Topic 48: other examples of overhead zooming-in views of canyons in the Western United States). Specific/named: map of Iraq showing Baghdad.
Outline
The multimedia search challenge
A bird's-eye view of the IBM Multimedia Search System
Query-dependent multimodal fusion
Evaluation on TRECVID benchmark
Summary
IBM Multimedia Search System Overview
Query topic example: “Find shots of an airplane taking off” (textual query formulation plus visual query topic examples), matched against text, visual features, and LSCOM-lite semantic models.

Approaches:
1. Text-based: story-based retrieval with automatic query refinement/reranking
2. Model-based: automatic query-to-model mapping based on query topic text
3. Semantic-based: cluster-based semantic space modeling and data selection
4. Visual-based: light-weight learning (discriminative and nearest-neighbor modeling) with smart sampling
5. Fusion of the multi-modal results (text + model + semantic + visual):
   – Query-independent
   – Query-class-dependent, with soft, hard, or dynamic class membership
1. IBM Text Retrieval System
Corpus indexing:
– Shot-level ASR/MT documents
– Story-level ASR/MT documents
– Both aligned at the phrase level

Query analysis (e.g., “Find shots of an airplane taking off” → “airplane”, “take off”):
– Tokenization, phrasing, stemming
– Part-of-speech tagging & filtering

Query refinement (pseudo-relevance feedback):
– Shot-based refinement (e.g., adds “pilot”) and story-based refinement (e.g., adds “crash”)

Query execution and fusion:
– IBM Semantic Search Engine (Juru) using TF*IDF-based retrieval
– Shot-based refined query execution → ranked shots; story-based refined query execution → ranked stories → shot-level score propagation & ranking
– Shot-level fusion and re-ranking of the shot- and story-based results → final text-based ranking of shots

Performance (MAP):
– 2005: 0.09873
– 2006: 0.05169

T. Volkmer et al., ICME 2006
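The shot-level score propagation and fusion step can be sketched as below; the function name, the equal-weight linear average, and the toy scores are illustrative assumptions, not the actual Juru pipeline:

```python
def propagate_and_fuse(shot_scores, story_scores, shot_to_story, w_shot=0.5):
    """Fuse shot-level and story-level retrieval scores.

    shot_scores:   {shot_id: score} from the shot-based refined query
    story_scores:  {story_id: score} from the story-based refined query
    shot_to_story: {shot_id: story_id} alignment (phrase-level in the real system)
    """
    fused = {}
    for shot, s_shot in shot_scores.items():
        # Propagate the parent story's score down to the shot level.
        s_story = story_scores.get(shot_to_story.get(shot), 0.0)
        fused[shot] = w_shot * s_shot + (1 - w_shot) * s_story
    return sorted(fused, key=fused.get, reverse=True)

# Toy example: shot3 has a weak shot score but a strong parent story.
ranking = propagate_and_fuse(
    {"shot1": 0.9, "shot2": 0.2, "shot3": 0.4},
    {"storyA": 0.1, "storyB": 0.8},
    {"shot1": "storyA", "shot2": "storyB", "shot3": "storyB"},
)
```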
2. IBM Model-based Retrieval System
Textual query topic: “Find shots of an airplane taking off”

Query analysis (lexical WordNet approach):
– Tokens, stems, phrases; stop-word removal → “airplane”, “take off”

Query refinement (query-to-concept mapping & weighting):
– Automatic mapping of the query text to concept models & weights in the concept lexicon
– Lexical approach based on WordNet Lesk similarity
– Example mapping: Outdoors (1.0), Sky (0.8), Military (0.6)

Query execution (concept-based retrieval):
– Concept-based retrieval using statistical concept models
– Weighted-averaging model fusion → model-based ranking

Performance (MAP):
– 2006: 0.029
Haubold et al. (ICME 2006)
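Concept-based query execution with weighted-averaging fusion can be sketched as follows, using the slide's example concept weights; the detector scores and function names are hypothetical:

```python
def concept_based_retrieval(query_concepts, detector_scores):
    """Rank shots by a weighted average of concept-detector confidences.

    query_concepts:  {concept: weight} from query-to-concept mapping
    detector_scores: {shot_id: {concept: confidence}} from the concept models
    """
    total_w = sum(query_concepts.values())
    ranked = {}
    for shot, scores in detector_scores.items():
        ranked[shot] = sum(w * scores.get(c, 0.0)
                           for c, w in query_concepts.items()) / total_w
    return sorted(ranked.items(), key=lambda kv: -kv[1])

# The slide's mapping for the airplane query, with made-up detector outputs.
results = concept_based_retrieval(
    {"Outdoors": 1.0, "Sky": 0.8, "Military": 0.6},
    {"shotA": {"Outdoors": 0.9, "Sky": 0.7, "Military": 0.1},
     "shotB": {"Outdoors": 0.2, "Sky": 0.1}},
)
```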
4. IBM Visual Retrieval System
Visual query examples (e.g., airplane take-off); visual concepts can help refine query topics.

Pipeline: extract visual features from the query examples and the test set → data selection → MECBR and SVM learning → fusion → visual run.

Data modeling for discriminative learning:
– Content-based over-sampling of the visual query examples to create positive bags
– Content-based pseudo-negative sampling in the development and test set, with content-based down-sampling of the test set
– Atomic CBR/SVM run for each bag, combined with OR and AND fusion
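A minimal sketch of the bag-wise retrieval with OR/AND fusion, using a nearest-neighbor atomic CBR scorer in place of the SVM runs; the feature vectors and function names are invented for illustration:

```python
import math

def cbr_score(example, shot):
    """Atomic content-based score: negative Euclidean distance
    between feature vectors (higher = more similar)."""
    return -math.dist(example, shot)

def bagged_retrieval(query_bags, test_shots, fuse="or"):
    """One atomic CBR pass per bag of (over-sampled) query examples,
    then OR (max) or AND (min) fusion of the per-bag scores."""
    combine = max if fuse == "or" else min
    scores = []
    for shot in test_shots:
        per_bag = [max(cbr_score(ex, shot) for ex in bag) for bag in query_bags]
        scores.append(combine(per_bag))
    return scores

scores = bagged_retrieval(
    [[(0.0, 0.0)], [(1.0, 1.0)]],   # two bags of query-example features
    [(0.1, 0.0), (5.0, 5.0)],       # two test-shot features
)
```

OR fusion rewards a shot that matches any bag well; AND fusion demands agreement across all bags.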
3. IBM Visual-Semantic Retrieval
Idea:
– Leverage automatic visual concept detectors for semantic model-based query refinement
– Visual context can disambiguate word senses (e.g., Sky vs. Road for an airplane query)

Semantic concept lexicon:
– Hierarchy of 39 LSCOM-lite concepts (e.g., Sky, Road, Maps)
– Statistical models based on visual features and machine learning

Approach:
– Map visual query examples to concept scores; the semantic-space dimensions correspond to the LSCOM-lite concepts
– Cluster-based data modeling and sample selection from the test set (clusters of positives vs. sampled pseudo-negatives)
– Primitive SVM run for each bag; AND fusion of the primitive SVM runs yields the semantic-based confidence list

Observations:
– We model query topic examples in semantic space, but the approach applies in any descriptor space
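Mapping query examples into the concept-score space and ranking test shots by closeness can be sketched as below; this single-centroid version is a deliberate simplification of the cluster-based modeling, and the lexicon and scores are toy values:

```python
import math

def to_semantic_vector(detector_scores, lexicon):
    """Project one shot (or query example) into semantic space:
    one dimension per lexicon concept, valued by detector confidence."""
    return [detector_scores.get(c, 0.0) for c in lexicon]

def semantic_retrieval(query_examples, test_shots, lexicon):
    """Rank test shots by closeness (in semantic space) to the
    centroid of the query examples."""
    vecs = [to_semantic_vector(e, lexicon) for e in query_examples]
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
    scored = [(-math.dist(to_semantic_vector(s, lexicon), centroid), i)
              for i, s in enumerate(test_shots)]
    return [i for _, i in sorted(scored, reverse=True)]

lexicon = ["Sky", "Road", "Maps"]
ranking = semantic_retrieval(
    [{"Sky": 0.9}, {"Sky": 0.8, "Road": 0.1}],   # query examples
    [{"Road": 0.9}, {"Sky": 0.85}],              # test shots
    lexicon,
)
```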
5. Multi-modal Fusion
Pipeline: semantic query analysis on training queries → query matching for test queries → search for per-modality weights → weighted fusion of the text, model, semantic, and visual results.

Each modality is good at finding certain things:
– Text: named people and other named entities
– Visual: semantic scenes consistent in color or layout, e.g., sports, weather
– Concept models: non-specific queries, e.g., protest, boat, fire

Averaging the retrieval models helps in every case; query-class or query-cluster dependent fusion helps more [CMU, NUS, Columbia].

Query soft/hard/dynamic class approach:
– Semantic query analysis for query matching
– Matching based on PIQUANT-II Q&A text features
– Weighted linear combination for fusion
– Training queries: TRECVID 2005

Performance (MAP):
– 2006: 0.0756 → 0.087 → 0.0937
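The query-class-dependent weighted linear combination can be sketched as below; the class names, weights, and soft memberships are invented for illustration (the real weights are learned on the TRECVID 2005 training queries):

```python
def fused_score(query_class, unimodal_scores, class_weights):
    """Weighted linear combination of per-modality scores for one shot,
    with weights selected by the query's class membership (a hard class
    is the special case of membership 1.0 for a single class)."""
    score = 0.0
    for cls, membership in query_class.items():
        for modality, w in class_weights[cls].items():
            score += membership * w * unimodal_scores.get(modality, 0.0)
    return score

# Hypothetical per-class modality weights and a soft class membership.
weights = {"named_person": {"text": 0.8, "visual": 0.2},
           "sports":       {"text": 0.2, "visual": 0.8}}
s = fused_score({"named_person": 0.7, "sports": 0.3},
                {"text": 0.5, "visual": 0.9}, weights)
```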
Extract Semantic Query Features
Input query text (e.g., “people with computer display”) → semantic tagging → semantic categories → query feature vector, e.g.:
– Person:CATEGORY
– BodyPart:UNKNOWN
– Furniture:UNKNOWN
– …
[IBM UIMA and PIQUANT Analysis Engine]
Three Query-dependent Fusion Strategies
[Diagram: soft, hard, and dynamic query-class membership]
Outline
The multimedia search challenge
A bird's-eye view of the IBM Multimedia Search System
Query-dependent multimodal fusion
Evaluation and demo on TRECVID
benchmark
Summary
NIST TRECVID Benchmark at a Glance
NIST benchmark for evaluating the state of the art in video retrieval.

Benchmark tasks:
– Semantic concept detection; search (2003–); rushes (2005–)
– Shot detection and story segmentation

Growing participation and growing data sets:
– TRECVID 2001: 12* participants; 11 hours of NIST video
– TRECVID 2002: 17 participants; 73 hours of video from the Prelinger archives
– TRECVID 2003: 24 participants; 133 hours of 1998 ABC and CNN news & C-SPAN
– TRECVID 2004: 38 participants; 173 hours of 1998 ABC and CNN news & C-SPAN
– TRECVID 2005: 42* participants; 220 hours of 2004 news from U.S., Arabic, and Chinese sources, plus 50 hours of BBC stock shots (vacation spots)
– TRECVID 2006: 54* participants; 380 hours of 2004 and 2005 news from U.S., Arabic, and Chinese sources, plus 100 hours of BBC rushes (interviews)

* Number of participants that completed at least one task
Automatic/Manual Search Overall Performance (Mean AP)
[Chart: TRECVID'06 automatic and manual search, aggregated performance of IBM vs. others; mean average precision (0.00–0.10) of the 87 submitted runs, distinguishing IBM automatic search from other automatic and manual search]

IBM official runs:
– Text (baseline): 0.041
– Text (story-based): 0.052
– Multimodal fusion, query-independent: 0.076
– Multimodal fusion, query classes (soft): 0.086
– Multimodal fusion, query classes (hard): 0.087

IBM modal runs:
– Semantic-based run: 0.037
– Visual-based run: 0.0696

Multi-modal fusion doubles baseline performance!
Visual and semantic retrieval have a significant impact over a text baseline
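For reference, the mean average precision numbers above are non-interpolated average precision averaged over query topics; a minimal sketch (TRECVID evaluates a truncated ranking, e.g. to depth 1000, which this toy version omits):

```python
def average_precision(ranked_ids, relevant):
    """Non-interpolated average precision for one query topic."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over topics: runs is a list of (ranked_ids, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["a", "b", "c", "d"], {"a", "c"})  # (1/1 + 2/3) / 2
```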
TRECVID’06 Fusion Results (Relative Performance)
Relative improvement (%)
– query-class fusion vs. query-independent fusion
Observations (query-independent → query-class fusion):
– Concept-related queries improved the most:
“tall building”, “prisoner”, “helicopters in flight”, “soccer”
– Named-entity queries improved slightly:
“Dick Cheney”, “Saddam Hussein”, “Bush Jr.”
– Generic people category deteriorated the most:
“people in formation”, “at computer display”, “w/ newspaper”, “w/ books”
[Bar chart: per-topic relative improvement (%) of query-class fusion over query-independent fusion, ranging from roughly −150% to +200% across the query topics, with the average near zero]
Improvement over TRECVID'06 and TRECVID'05
Qdyn and Qcomp > Qclass >> Qind
New improvements come from the increased weights on
concepts for object queries (next slide).
Improved queries …
Results: Improved upon TRECVID’06 on generic people queries.
Demo
Thank You
For more information:
– IBM Multimedia Retrieval Demo: http://mp7.watson.ibm.com/
– L. Xie, A. Natsev, and J. Tesic, “Dynamic Multimodal Fusion for Video Search,” ICME 2007, to appear.
– M. Campbell, S. Ebadollahi, M. Naphade, A. P. Natsev, J. R. Smith, J. Tesic, L. Xie, K. Scheinberg, J. Seidl, and A. Haubold, “IBM research TRECVID-2006 video retrieval system,” NIST TRECVID Workshop, November 2006.