Parameterized generation of labeled datasets for text categorization based on a hierarchical directory
Evgeniy Gabrilovich ([email protected])
Joint work with Dmitry Davidov and Shaul Markovitch
CS Department, Technion

Outline
• Automatic acquisition of labeled datasets for text categorization from the WWW
– Capitalizing on the knowledge encoded in hierarchical directories such as the ODP
• Parameterized dataset generation → datasets with desired properties
• Novel metrics for predicting dataset difficulty

Text categorization
• Assigning category labels to natural language documents based on their text
– Email filtering
– News routing
– Organizing document collections
• Evaluation is critically dependent on labeled test collections – supervised learning

Problem: existing datasets are scarce
• Few datasets are available
– Enormous labeling and production efforts
– Text domains and languages are limited
• Frequently used datasets
– Reuters-21578, RCV1 (new Reuters corpus)
– 20 Newsgroups
– OHSUMED

Proposed solution
• Automatic acquisition of numerous datasets with predefined characteristics
– Use existing Web directories
– Define parameters of the datasets
– Perform automatic generation of test collections

Benefits of the approach
• Little human supervision required – automatic dataset generation
• Better control over TC experiments
– Parameters on datasets and individual categories
• Data suitable for regular TC, hierarchical TC, and hypertext classification
– Using symbolic links yields multi-labeled datasets

Using classified Web directories
• Publicly available labeling projects
– Large size, comprehensive contents
– Extended through ongoing development
– Examples: ODP, Yahoo!
• Dataset acquisition
– The directory serves as a source of URLs
– Documents are labeled with the corresponding categories (organized in a hierarchy)

Assumptions
• The directory is organized as a tree
– Each node is labeled with a category
• Each category (= directory node) is associated with a collection of documents
• Categories are provided with text descriptions
– Documents are optionally accompanied by short annotations
– Example description: “Sites which offer ways to search the Internet for information, or describe or discuss ways to search it.”

Suitable directories
• Open Directory Project
• Yahoo!
• Medical Subject Headings (MeSH)
• Library classification schemes
– UDC, Dewey
• Wikipedia

Parameterization of dataset generation
• Kinds of parameters:
– Characterizing the dataset as a whole
– Characterizing the individual categories that comprise the dataset
• Cardinality
• Coherence
• Document contents and language

Defining metrics for predicting dataset hardness
• Goal: creating datasets with a varying degree of difficulty for text categorization
• Means: measuring the conceptual distance between categories
– Intuition: larger distance → easier dataset
– Controlling the degree of separability of the two categories
– Creating datasets with predictable difficulty

Maximum Achievable Accuracy (MAA)
• How difficult is a given dataset for learning algorithms?
– Just run the algorithms on it!
• Well, SVM will probably suffice … not really
• No single algorithm is best in all cases
– SVM outperformed by KNN: 22%
– SVM outperformed by C4.5: 29%
– KNN outperformed by C4.5: 39%
– Differences are often statistically significant

Maximum Achievable Accuracy (cont’d)
• Intuition: a dataset is hard if it is difficult even for the best classification algorithm
• dist_MAA(c1, c2) = max_{a ∈ C} Accuracy_a(c1, c2), where c1, c2 are categories in the dataset and C is a set of classifiers
• Problem: very inefficient for “generate-and-test”
– Download, clean, and organize the documents
– Classify the dataset and compute the MAA

Solution: heuristic metrics
• Goal: develop efficient metrics correlated with MAA
1. Edge-counting graph metric
– dist_graph(c1, c2) = # edges on the tree path from c1 to c2
– (Figure: example tree in which dist_graph(c1, c2) = 5)

Heuristic metrics (cont’d)
2.
WordNet-based textual metric
– dist(D1, D2) = (1/|D1|) · Σ_{w ∈ D1} dist(w, D2)
– dist(w, D) = min_{w′ ∈ D} sim(w, w′)
– dist_text(c1, c2) = [dist(D1, D2) + dist(D2, D1)] / 2
– (Figure: categories c1, c2 with their document sets D1, D2)

Methodology for automatic dataset generation
• Raw data acquisition
– Locate a pair of suitable categories
• Graph metric – easy
• Text metric – iterated hill-climbing
– Documents are collected from the chosen categories and all their sub-categories
– Each Web site yields one synthetic document (concatenated individual Web pages)

Methodology for automatic dataset generation (cont’d)
• Filtering the data
– Pre-processing: prune directory branches
– Online: disregard external links
– Post-processing
• Weak filtering: discards pages with little text
• Strong filtering: discards unrelated pages (outliers by content) using the text metric
• The collected data is rendered in an XML-like format

Empirical evaluation
• Using the Open Directory Project
– 6M sites, 550K categories, 61K editors
• Dataset acquisition system: Accio
– (Accio: Latin for “to call, summon”; the incantation for the Summoning Charm, which causes an object to fly to the caster)
– 300 datasets of varying difficulty
• Text categorization infrastructure
– Classifiers: SVM, KNN, C4.5
– Measuring classification accuracy
– 10-fold cross-validation

Distance metrics vs. SVM accuracy
(Figure: SVM classification accuracy plotted against each distance metric)
• Graph distance vs. SVM accuracy: correlation 0.564
• Text distance vs. SVM accuracy: correlation 0.831

Correlation between the two metrics
(Figure: text distance plotted against graph distance; correlation 0.614)

Distance metrics vs.
MAA
(Figure: maximum achievable accuracy plotted against each distance metric)
• Graph distance vs. MAA: correlation 0.776
• Text distance vs. MAA: correlation 0.583

Versatility of dataset generation
(Figure: number of category pairs (millions) by graph distance, 1–21)

Versatility of dataset generation (cont’d)
(Figure: number of category pairs (sampled distribution) by text distance, 1–33)

Summary
• A methodology and system for automatic acquisition of labeled datasets for TC
– Datasets possess predefined characteristics
– Suitable for a variety of TC flavors
• Similarity metrics for document sets as reliable predictors of classification accuracy
– Maximum Achievable Accuracy (MAA)
– Graph and text metrics
– Verified on 300 datasets generated from the ODP

Summary (cont’d)
• Filtering the raw data to cope with noise
• A large collection of TC datasets will be made publicly available shortly
– The Accio system will also be released later

Further research directions
• More sophisticated distance metrics
• Additional parameters of categories
• Advanced filtering techniques
– Focused crawling
• Efficient algorithms for locating categories subject to the user’s requirements
• Datasets with more than 2 categories
– Generalize the distance metrics appropriately

Questions?
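The distance metrics and MAA defined in the talk can be sketched in Python. This is a minimal illustration, not the Accio implementation: the toy taxonomy and the 0/1 word-score function standing in for the WordNet-based sim(w, w′) are assumptions.

```python
# Minimal sketches of the category-distance metrics from the talk.
# Assumptions: the toy child->parent taxonomy and the 0/1 word-score
# function stand in for the real ODP hierarchy and WordNet-based sim(w, w').

def dist_graph(parent, c1, c2):
    """Edge-counting metric: # edges on the tree path from c1 to c2."""
    def path_to_root(c):
        path = [c]
        while c in parent:
            c = parent[c]
            path.append(c)
        return path
    depth_of = {node: d for d, node in enumerate(path_to_root(c1))}
    for d2, node in enumerate(path_to_root(c2)):
        if node in depth_of:  # lowest common ancestor of c1 and c2
            return depth_of[node] + d2
    raise ValueError("categories are not in the same tree")

def dist_text(d1, d2, sim):
    """Symmetrized set distance, following the slide's formulas:
    dist(D1, D2) = (1/|D1|) * sum_{w in D1} min_{w' in D2} sim(w, w'),
    dist_text    = (dist(D1, D2) + dist(D2, D1)) / 2."""
    def one_way(a, b):
        return sum(min(sim(w, w2) for w2 in b) for w in a) / len(a)
    return (one_way(d1, d2) + one_way(d2, d1)) / 2.0

def maa(accuracy_by_classifier):
    """Maximum Achievable Accuracy: best accuracy over a set of classifiers."""
    return max(accuracy_by_classifier.values())

# Toy taxonomy (child -> parent): A is the root; D and F are 5 edges apart.
parent = {"B": "A", "C": "A", "D": "B", "E": "C", "F": "E"}
print(dist_graph(parent, "D", "F"))  # 5, as in the slide's example

# Placeholder word score: 0 for identical words, 1 otherwise.
word_score = lambda w, w2: 0.0 if w == w2 else 1.0
print(dist_text(["search", "engine"], ["engine", "golf"], word_score))  # 0.5

print(maa({"SVM": 0.86, "KNN": 0.81, "C4.5": 0.79}))  # 0.86
```

In the real system the hierarchy comes from the ODP and sim(w, w′) from a WordNet-based word similarity, but the overall structure — tree path length for the graph metric, a symmetrized average of per-word minima for the text metric, and a maximum over classifier accuracies for MAA — follows the slides.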