IBM Haifa Research Lab Social Search and Discovery using a Unified Approach Einat Amitay David Carmel Nadav Golbandi Nadav Har'El Shila Ofek-Koifman Sivan Yogev IBM Haifa Research Lab 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Web 2.0 data gives us ● ● 1 New wealth of information (produced by ordinary users). New types of information – social information: – User-supplied metadata for documents (bookmarks, tags, ratings, comments). – Relationships between people and documents (who wrote a document, who tagged it, etc.). – Relationships between people and other people. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Outline of this talk ● Enterprise search. ● Unified search: document & person. ● 1 Implementation of the unified search using faceted search. ● Adding tags to the equation(s). ● The system and its evaluation. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Enterprise search First goal: use social information to improve search in an enterprise intranet (IBM). ● 1 Improve the relevance of document results: – Add tags and comments to the searched text of the documents. – Important documents can be recognized by user activity around them (bookmarking, comments, etc.) – Allow search outside the enterprise boundaries. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Enterprise search First goal: use social information to improve search in an enterprise intranet (IBM). ● ● ● 1 Implementation: indexing only user-generated data from enterprise sources. Without using popularity – mediocre quality. When using popularity, precision is vastly improved over standard full-text search (P@10 between 0.7-0.8). ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Unified search ● 1 When in need of information, – Some people like to find a written document. – Some people like to find a person to ask. – Most people are between these extremes. – And each source is better in different situations. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Unified search ● ● ● 1 Second goal: return more types of relevant information sources. Given a query, we want the search engine to return: – A ranked list of documents relevant to the query – A ranked list of people with interest in the query topic We also want to use people in queries: – “John Smith” – information retrieval “John Smith” ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● 1 A person is relevant to a query if he or she are related to documents relevant to the query. Therefore, given a query: – Find all documents relevant to this query, – Find people relevant to these documents. ● [McDonald & Ounis, Balog & de Rijke, 2006] ● But how to score? ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● Returning to the Vector Space Model: – In VSM, documents define relevance matrix D, between documents and terms. Can be viewed as a bipartite graph: term1 doc1 1 term2 doc2 term3 doc3 term4 doc4 term5 doc5 term6 doc6 term7 doc7 doc8 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● Returning to the Vector Space Model: – In VSM, documents define relevance matrix D, between documents and terms. A query is a vector of terms q query term1 doc1 term2 doc2 term3 doc3 term4 doc4 term5 doc5 term6 doc6 term7 doc7 doc8 Search results: D×q 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● Returning to the Vector Space Model: – In VSM, documents define relevance matrix D, between documents and terms. A query is a vector of terms q query term1 doc1 term2 doc2 term3 doc3 term4 doc4 term5 doc5 term6 doc6 term7 doc7 doc8 Search results: D×q 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● Returning to the Vector Space Model: – Document-person relationships define relevance matrix P between documents and people. Yet another bipartite graph: And we get a multi-layer graph: term1 doc1 term2 doc2 person1 1 term3 doc3 person2 term4 doc4 term5 doc5 person3 term7 term6 doc6 person4 doc7 doc8 person5 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● Returning to the Vector Space Model: – PT×D is the relevance matrix between terms and people. PT×D×q are (scored) people search results. query term1 doc1 term2 doc2 person1 1 term3 doc3 person2 term4 doc4 term5 doc5 person3 term7 term6 doc6 person4 doc7 doc8 person5 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● Returning to the Vector Space Model: – PT×D is the relevance matrix between terms and people. PT×D×q are (scored) people search results. query term1 doc1 term2 doc2 person1 1 term3 doc3 person2 term4 doc4 term6 term5 doc5 person3 doc6 person4 term7 doc7 doc8 person5 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● ● 1 Calculating (PT×D)×q seems like the obvious solution, but: – Keeping PT×D up-to-date is hard – Document and person search done using two different matrices (D and PT×D) – Lose non-VSM search engine features (phrase, etc) We therefore use the equivalent formula: ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Person search ● ● ● 1 Score for person i, (PT×D×q)i = Already proposed in Balog & de Rijke, with different (probabilistic) justification. But how to calculate the paths? ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Faceted search ● ● ● ● 1 Commonly used technique for adding navigation to a search engine. A facet is a single attribute of the document. In a camera search application, documents might have a “Brand” and “Price” facets. To each document, several categories are added. For example “Brand/Sony” or “Price Range/$90-$40”. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Faceted search ● 1 Simplest faceted search efficiently goes over matching documents, counting for each category the number of documents: ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Faceted search ● ● ● 1 In our application, we have “Related Person” facets. Categories like “Related Person/John Smith” attached to document, with a weight. Instead of just counting, efficiently aggregate expressions. For person i category: ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Faceted search ● The actual index contains only one layer, Person-Document relationships. ● Document scores based on Lucene queries. ● Person scores are calculated through facets. query doc1 doc2 person1 1 doc3 person2 doc4 doc5 person3 doc6 person4 doc7 doc8 person5 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Tags search ● ● Tags are also added as facets to documents. Therefore tag scores can be calculated like person scores. query doc1 doc2 tag1 1 doc3 tag2 doc4 doc5 tag3 doc6 tag4 doc7 doc8 tag5 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Faceted search ● More faceted search features we use: – Query-independent static score for categories (category boost). – Special query for “Facet f” returns all documents in this category, sorted by the category weight. doc1 doc2 person1 doc3 person2 doc4 doc5 person3 doc6 person4 doc7 doc8 person5 query 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab The Social Search Application ● Data from some of IBM's internal Web 2.0 sites: – – – 1 Over 77,000 blog threads (thread = entry + comments) ● Content: Blog entry, comments, tags ● Person facet: author, commenter, bookmarker Over 370,000 bookmarks to over 230,000 Webpages ● Content: Titles, user descriptions, tags ● Person facet: bookmarker Over 15,500 people who created that content ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Main Features ● ● ● ● 1 Queries can contain a combination of Lucene textual queries, people and tags. Result sets can contain a combination of web pages, people and tags. Easy navigation to narrow search, or find evidence to the current results. Good quality of results... ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Evaluation ● ● 1 We return documents, people and tags for every query – need to evaluate precision. Document results evaluated using standard technique: – 50 real textual queries chosen from query logs – The top results judged by humans as being “relevant”, “very relevant” or “irrelevant”. – Very high precision demonstrated (P@10 ~ 0.8). – Much better than full-text enterprise search. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Evaluation ● 1 “Related people” evaluation – large user study – 60 real textual queries chosen from query logs. – 100 related people retrieved for each query. – Each person was mailed listing 6-15 queries (some believed to be relevant and some irrelevant): Rate 1-5 whether the topic is relevant to you. – 612 people responded, from 116 IBM locations in 38 countries. – The ranked list of related people we generate are compared to these self-ratings using NDCG metric. – Compare full scoring formula to simpler ones. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Evaluation ● Evaluation results: Aggregation expression NDC G 10 NDC G 20 NDC G 30 Count only “votes” 0.71 0.69 0.68 Sum of scores “CombSUM” 0.75 0.73 0.72 0.74 0.73 0.76 0.74 +relationship 0.76 weights +person static score: ief 0.77 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Conclusions ● ● ● ● 1 Web 2.0 data provides an excellent source for document and people search in an enterprise. Unified search can be easily realized using faceted search. In a 612-respondent study, the full scoring formula was shown better than simpler versions. Also strengthens previously published results by using a new data set and evaluation. ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab The real world... tag3 tag2 doc2 person2 person1 tag5 doc7 doc6 person3 tag7 doc8 person4 tag6 doc1 doc3 doc4 tag1 person5 doc5 tag4 1 ©© 2008 2005IBM IBMCorporation Corporation IBM Haifa Research Lab Thank you! 1 ©© 2008 2005IBM IBMCorporation Corporation