Presentation

IBM Haifa Research Lab
Social Search and Discovery
using a Unified Approach
Einat Amitay
David Carmel
Nadav Golbandi
Nadav Har'El
Shila Ofek-Koifman
Sivan Yogev
IBM Haifa Research Lab
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Web 2.0 data gives us
●
●
1
New wealth of information (produced by
ordinary users).
New types of information – social information:
–
User-supplied metadata for documents
(bookmarks, tags, ratings, comments).
–
Relationships between people and documents
(who wrote a document, who tagged it, etc.).
–
Relationships between people and other people.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Outline of this talk
●
Enterprise search.
●
Unified search: document & person.
●
1
Implementation of the unified search using
faceted search.
●
Adding tags to the equation(s).
●
The system and its evaluation.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Enterprise search
First goal: use social information to improve
search in an enterprise intranet (IBM).
●
1
Improve the relevance of document results:
–
Add tags and comments to the searched text of the
documents.
–
Important documents can be recognized by user
activity around them (bookmarking, comments, etc.)
–
Allow search outside the enterprise boundaries.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Enterprise search
First goal: use social information to improve
search in an enterprise intranet (IBM).
●
●
●
1
Implementation: indexing only user-generated
data from enterprise sources.
Without using popularity – mediocre quality.
When using popularity, precision is vastly
improved over standard full-text search (P@10
between 0.7-0.8).
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Unified search
●
1
When in need of information,
–
Some people like to find a written document.
–
Some people like to find a person to ask.
–
Most people are between these extremes.
–
And each source is better in different situations.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Unified search
●
●
●
1
Second goal: return more types of relevant
information sources.
Given a query, we want the search engine to
return:
–
A ranked list of documents relevant to the query
–
A ranked list of people with interest in the query
topic
We also want to use people in queries:
–
“John Smith”
–
information retrieval “John Smith”
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
1
A person is relevant to a query if he or she are
related to documents relevant to the query.
Therefore, given a query:
–
Find all documents relevant to this query,
–
Find people relevant to these documents.
●
[McDonald & Ounis, Balog & de Rijke, 2006]
●
But how to score?
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
Returning to the Vector Space Model:
–
In VSM, documents define relevance matrix D,
between documents and terms.
Can be viewed as a bipartite graph:
term1
doc1
1
term2
doc2
term3
doc3
term4
doc4
term5
doc5
term6
doc6
term7
doc7
doc8
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
Returning to the Vector Space Model:
–
In VSM, documents define relevance matrix D,
between documents and terms.
A query is a vector of terms q
query
term1
doc1
term2
doc2
term3
doc3
term4
doc4
term5
doc5
term6
doc6
term7
doc7
doc8
Search results: D×q
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
Returning to the Vector Space Model:
–
In VSM, documents define relevance matrix D,
between documents and terms.
A query is a vector of terms q
query
term1
doc1
term2
doc2
term3
doc3
term4
doc4
term5
doc5
term6
doc6
term7
doc7
doc8
Search results: D×q
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
Returning to the Vector Space Model:
–
Document-person relationships define relevance
matrix P between documents and people.
Yet another bipartite graph:
And we get a multi-layer graph:
term1
doc1
term2
doc2
person1
1
term3
doc3
person2
term4
doc4
term5
doc5
person3
term7
term6
doc6
person4
doc7
doc8
person5
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
Returning to the Vector Space Model:
–
PT×D is the relevance matrix between terms and
people.
PT×D×q are (scored) people search results.
query
term1
doc1
term2
doc2
person1
1
term3
doc3
person2
term4
doc4
term5
doc5
person3
term7
term6
doc6
person4
doc7
doc8
person5
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
Returning to the Vector Space Model:
–
PT×D is the relevance matrix between terms and
people.
PT×D×q are (scored) people search results.
query
term1
doc1
term2
doc2
person1
1
term3
doc3
person2
term4
doc4
term6
term5
doc5
person3
doc6
person4
term7
doc7
doc8
person5
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
●
1
Calculating (PT×D)×q seems like the obvious
solution, but:
–
Keeping PT×D up-to-date is hard
–
Document and person search done using two
different matrices (D and PT×D)
–
Lose non-VSM search engine features (phrase, etc)
We therefore use the equivalent formula:
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Person search
●
●
●
1
Score for person i, (PT×D×q)i =
Already proposed in Balog & de Rijke, with
different (probabilistic) justification.
But how to calculate the paths?
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Faceted search
●
●
●
●
1
Commonly used technique for adding
navigation to a search engine.
A facet is a single attribute of the document.
In a camera search application, documents
might have a “Brand” and “Price” facets.
To each document, several categories are
added. For example “Brand/Sony” or “Price
Range/$90-$40”.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Faceted search
●
1
Simplest faceted search efficiently goes over
matching documents, counting for each
category the number of documents:
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Faceted search
●
●
●
1
In our application, we have “Related Person”
facets.
Categories like “Related Person/John Smith”
attached to document, with a weight.
Instead of just counting, efficiently aggregate
expressions. For person i category:
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Faceted search
●
The actual index contains only one layer,
Person-Document relationships.
●
Document scores based on Lucene queries.
●
Person scores are calculated through facets.
query
doc1
doc2
person1
1
doc3
person2
doc4
doc5
person3
doc6
person4
doc7
doc8
person5
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Tags search
●
●
Tags are also added as facets to documents.
Therefore tag scores can be calculated like
person scores.
query
doc1
doc2
tag1
1
doc3
tag2
doc4
doc5
tag3
doc6
tag4
doc7
doc8
tag5
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Faceted search
●
More faceted search features we use:
–
Query-independent static score for categories
(category boost).
–
Special query for “Facet f” returns all documents in
this category, sorted by the category weight.
doc1
doc2
person1
doc3
person2
doc4
doc5
person3
doc6
person4
doc7
doc8
person5
query
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
The Social Search Application
●
Data from some of IBM's internal Web 2.0 sites:
–
–
–
1
Over 77,000 blog threads (thread = entry +
comments)
●
Content: Blog entry, comments, tags
●
Person facet: author, commenter, bookmarker
Over 370,000 bookmarks to over 230,000 Webpages
●
Content: Titles, user descriptions, tags
●
Person facet: bookmarker
Over 15,500 people who created that content
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Main Features
●
●
●
●
1
Queries can contain a combination of Lucene
textual queries, people and tags.
Result sets can contain a combination of web
pages, people and tags.
Easy navigation to narrow search, or find
evidence to the current results.
Good quality of results...
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Evaluation
●
●
1
We return documents, people and tags for
every query – need to evaluate precision.
Document results evaluated using standard
technique:
–
50 real textual queries chosen from query logs
–
The top results judged by humans as being
“relevant”, “very relevant” or “irrelevant”.
–
Very high precision demonstrated (P@10 ~ 0.8).
–
Much better than full-text enterprise search.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Evaluation
●
1
“Related people” evaluation – large user study
–
60 real textual queries chosen from query logs.
–
100 related people retrieved for each query.
–
Each person was mailed listing 6-15 queries (some
believed to be relevant and some irrelevant):
Rate 1-5 whether the topic is relevant to you.
–
612 people responded, from 116 IBM locations in
38 countries.
–
The ranked list of related people we generate are
compared to these self-ratings using NDCG metric.
–
Compare full scoring formula to simpler ones.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Evaluation
●
Evaluation results:
Aggregation
expression
NDC
G 10
NDC
G 20
NDC
G 30
Count only
“votes”
0.71
0.69
0.68
Sum of scores
“CombSUM”
0.75
0.73
0.72
0.74
0.73
0.76
0.74
+relationship
0.76
weights
+person static
score: ief
0.77
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Conclusions
●
●
●
●
1
Web 2.0 data provides an excellent source for
document and people search in an enterprise.
Unified search can be easily realized using
faceted search.
In a 612-respondent study, the full scoring
formula was shown better than simpler
versions.
Also strengthens previously published results
by using a new data set and evaluation.
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
The real world...
tag3
tag2
doc2
person2
person1
tag5
doc7
doc6
person3
tag7
doc8
person4
tag6
doc1
doc3
doc4
tag1
person5
doc5
tag4
1
©©
2008
2005IBM
IBMCorporation
Corporation
IBM Haifa Research Lab
Thank you!
1
©©
2008
2005IBM
IBMCorporation
Corporation