
Parameterized generation of labeled
datasets for text categorization
based on a hierarchical directory
Evgeniy Gabrilovich
[email protected]
joint work with
Dmitry Davidov and Shaul Markovitch
CS Department, Technion
Outline
• Automatic acquisition of labeled datasets
for text categorization from the WWW
– Capitalizing on the knowledge encoded in
hierarchical directories such as ODP
• Parameterized generation of datasets with desired properties
– Novel metrics for predicting dataset difficulty
Text categorization
• Assigning category labels to natural
language documents based on their text
– Email filtering
– News routing
– Organizing document collections
• Evaluation is critically dependent on
labeled test collections
– supervised learning
Problem: existing
datasets are scarce
• Few datasets are available
– Enormous labeling and production efforts
– Text domains and languages are limited
• Frequently used datasets
– Reuters-21578, RCV1 (new Reuters corpus)
– 20 Newsgroups
– OHSUMED
Proposed solution
• Automatic acquisition of numerous
datasets with predefined characteristics
– Use existing Web directories
– Define parameters of datasets
– Perform automatic generation of test
collections
Benefits of the approach
• Little human supervision required
– automatic dataset generation
• Better control over TC experiments
– Parameters on datasets and individual
categories
• Data suitable for regular TC, hierarchical
TC, hypertext classification
– Using symbolic links yields multi-labeled
datasets
Using classified
Web directories
• Publicly available labeling projects
– Large size, comprehensive contents
– Extended through ongoing development
– Examples: ODP, Yahoo
• Dataset acquisition
– Directory serves as a source of URLs
– Documents are labeled with corresponding
categories (organized in a hierarchy)
Assumptions
• Directory is organized as a tree
– Each node is labeled with a category
• Each category (= directory node) is
associated with a collection of documents
• Categories are provided with text
descriptions
– Documents are optionally accompanied by
short annotations
Example category description: “Sites which offer ways to search the Internet for information, or describe or discuss ways to search it.”
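To make these assumptions concrete, here is a minimal sketch (with purely illustrative names, not the actual Accio data model) of how a directory node with its category label, description, listed sites, and children might be represented:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CategoryNode:
    """One node of the hierarchical directory (e.g., an ODP category)."""
    name: str                                              # category label
    description: str = ""                                  # short textual description
    urls: List[str] = field(default_factory=list)          # sites listed under the category
    annotations: List[str] = field(default_factory=list)   # optional per-site annotations
    children: List["CategoryNode"] = field(default_factory=list)
    parent: Optional["CategoryNode"] = None

    def add_child(self, child: "CategoryNode") -> "CategoryNode":
        child.parent = self
        self.children.append(child)
        return child
```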
Suitable directories
• Open Directory Project
• Yahoo!
• Medical Subject Headings (MeSH)
• Library classification schemes
– UDC, Dewey
• Wikipedia
Parameterization of
dataset generation
• Kinds of parameters:
– Characterizing dataset as a whole
– Characterizing individual categories that
comprise the dataset
• Cardinality
• Coherence
• Document contents and language
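As an illustration only, the kinds of parameters listed above could be gathered into a small specification object; every field name, default, and threshold below is hypothetical and not the actual Accio interface:

```python
from dataclasses import dataclass

@dataclass
class DatasetSpec:
    """Hypothetical parameter set for generating one dataset."""
    num_categories: int = 2         # binary datasets, as in the experiments described here
    min_documents: int = 100        # cardinality: minimum documents per category
    max_documents: int = 300        # cardinality: maximum documents per category
    target_distance: float = 10.0   # desired conceptual distance between the categories
    distance_metric: str = "text"   # "graph" (edge counting) or "text" (WordNet-based)
    language: str = "en"            # restrict document language
    min_words_per_page: int = 50    # weak-filtering threshold on document length
```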
Defining metrics for
predicting dataset hardness
• Goal: creating datasets with varying
degree of difficulty for text categorization
• Means: measuring conceptual distance
between categories
– Intuition: larger distance → easier dataset
– Controlling the degree of separability of the
two categories
– Creating datasets with predictable difficulty
Maximum Achievable
Accuracy (MAA)
• How difficult is a given dataset for learning algorithms?
– Just run the algorithms on it!
– Well, SVM will probably suffice … not really
• No single algorithm is best in all cases
– SVM outperformed by KNN: 22%
– SVM outperformed by C4.5: 29%
– KNN outperformed by C4.5: 39%
– Differences are often statistically significant
Maximum Achievable
Accuracy (cont’d)
• Intuition: dataset is hard if it is difficult
even for the best classification algorithm
• $dist_{MAA}(c_1, c_2) = \max_{a \in C} Accuracy_a(c_1, c_2)$
where $c_1, c_2$ are categories in the dataset, and $C$ is a set of classifiers
• Problem: very inefficient for “generate-and-test”
– Download, clean and organize the documents
– Classify the dataset and compute the MAA
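A rough sketch of how MAA could be computed for one generated two-category dataset, assuming scikit-learn and bag-of-words features; LinearSVC, KNeighborsClassifier, and DecisionTreeClassifier stand in for the SVM, KNN, and C4.5 classifiers mentioned above:

```python
# Sketch: Maximum Achievable Accuracy for one two-category dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def maximum_achievable_accuracy(texts, labels, n_folds=10):
    """MAA = best cross-validated accuracy over a fixed set of classifiers."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    classifiers = [LinearSVC(), KNeighborsClassifier(), DecisionTreeClassifier()]
    return max(cross_val_score(clf, X, labels, cv=n_folds).mean()
               for clf in classifiers)
```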
Solution: heuristic metrics
Goal: develop efficient metrics correlated
with MAA
1. Edge-counting graph metric
$dist_{graph}(c_1, c_2)$ = number of edges in the tree path from $c_1$ to $c_2$
[Figure: example directory tree in which the path between $c_1$ and $c_2$ has 5 edges, i.e. $dist_{graph} = 5$]
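Since the directory is assumed to be a tree, the edge count can be read off the two nodes' paths to the root. A minimal sketch, assuming category nodes with a parent pointer (e.g., the hypothetical CategoryNode above):

```python
# Sketch of the edge-counting graph metric over the directory tree.
def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def dist_graph(c1, c2):
    """Number of edges on the tree path from category c1 to category c2."""
    steps_from_c2 = {id(n): k for k, n in enumerate(ancestors(c2))}
    for steps_from_c1, n in enumerate(ancestors(c1)):
        if id(n) in steps_from_c2:              # lowest common ancestor found
            return steps_from_c1 + steps_from_c2[id(n)]
    raise ValueError("categories do not belong to the same tree")
```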
Heuristic metrics (cont’d)
2. WordNet-based textual metric
$dist(D_1, D_2) = \frac{1}{|D_1|} \sum_{w \in D_1} dist(w, D_2)$
$dist(w, D) = \min_{w' \in D} sim(w, w')$
$dist_{text}(c_1, c_2) = \frac{dist(D_1, D_2) + dist(D_2, D_1)}{2}$
where $D_1$ and $D_2$ are the document collections of categories $c_1$ and $c_2$
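A rough sketch of this metric using NLTK's WordNet interface; the word-level distance used here (one minus the best path similarity) is an assumption standing in for the slide's sim(w, w′), and the measure actually used in the work may differ:

```python
# Sketch of the WordNet-based text metric, assuming NLTK (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def word_distance(w1, w2):
    """Word distance derived from WordNet path similarity (an assumption)."""
    sims = [s1.path_similarity(s2) or 0.0
            for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return 1.0 - max(sims, default=0.0)

def word_to_doc(w, doc_words):
    """dist(w, D): distance from word w to its closest word in document set D."""
    return min(word_distance(w, w2) for w2 in doc_words)

def doc_to_doc(d1, d2):
    """dist(D1, D2): average distance from the words of D1 to document set D2."""
    return sum(word_to_doc(w, d2) for w in d1) / len(d1)

def dist_text(d1, d2):
    """Symmetrized text distance between the two categories' word sets."""
    return (doc_to_doc(d1, d2) + doc_to_doc(d2, d1)) / 2.0
```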
Methodology for automatic
dataset generation
• Raw data acquisition
– Locate a pair of suitable categories
• Graph metric – easy
• Text metric – iterated hill-climbing (see the sketch after this slide)
– Documents are collected from chosen
categories and all their sub-categories
– Each Web site yields one synthetic document
(concatenated individual Web pages)
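A hypothetical sketch of the iterated hill-climbing search referenced above: starting from random category pairs, it repeatedly moves one endpoint to a neighboring category (parent, child, or sibling) whenever that brings the pair's distance closer to the requested target. The dist function, the node structure, and the restart/step limits are all assumptions:

```python
import random

def neighbors(node):
    """Neighboring categories of a node: its children, parent, and siblings."""
    near = list(node.children)
    if node.parent is not None:
        near.append(node.parent)
        near.extend(s for s in node.parent.children if s is not node)
    return near

def hill_climb_pair(all_nodes, dist, target, restarts=10, max_steps=50):
    """Find a category pair whose distance is close to the requested target."""
    best_pair, best_err = None, float("inf")
    for _ in range(restarts):                        # iterated random restarts
        c1, c2 = random.sample(all_nodes, 2)
        for _ in range(max_steps):
            err = abs(dist(c1, c2) - target)
            # candidate moves: shift either endpoint to a neighboring category
            moves = [(n, c2) for n in neighbors(c1)] + [(c1, n) for n in neighbors(c2)]
            if not moves:
                break
            n1, n2 = min(moves, key=lambda p: abs(dist(*p) - target))
            if abs(dist(n1, n2) - target) >= err:
                break                                # local optimum reached
            c1, c2 = n1, n2
        err = abs(dist(c1, c2) - target)
        if err < best_err:
            best_pair, best_err = (c1, c2), err
    return best_pair
```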
Methodology for automatic
dataset generation (cont’d)
• Filtering the data
– Pre-processing: prune directory branches
– Online: disregard external links
– Post-processing
• Weak filtering: discards pages with little text
• Strong filtering: discards unrelated pages
(outliers by content) using the text metric
• Collected data is rendered in an XML-like
format
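An illustrative sketch of the weak and strong filters described above; the document representation, the dist_to_category helper, and both thresholds are assumptions rather than the actual Accio settings:

```python
def weak_filter(documents, min_words=50):
    """Weak filtering: discard pages that contain too little text."""
    return [d for d in documents if len(d.split()) >= min_words]

def strong_filter(documents, dist_to_category, category, max_distance=0.8):
    """Strong filtering: discard pages whose content is far from the category."""
    return [d for d in documents if dist_to_category(d, category) <= max_distance]
```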
Empirical evaluation
• Using the Open Directory Project
– 6M sites, 550K categories, 61K editors
• Dataset acquisition system: Accio
– (Accio: Latin – to call, summon; incantation for the Summoning Charm, which causes an object called for to fly to the caster)
– 300 datasets of varying difficulty
• Text categorization infrastructure
– Classifiers: SVM, KNN, C4.5
– Measuring classification accuracy
• 10-fold cross-validation
Distance metrics vs.
SVM accuracy
[Figure: scatter plots of SVM classification accuracy vs. text distance (correlation: 0.831) and vs. graph distance (correlation: 0.564)]
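For completeness, a correlation of this kind can be computed with a standard Pearson coefficient once per-dataset distances and accuracies are available; the arrays below are placeholders, not the experimental data behind the plots:

```python
# Sketch: Pearson correlation between per-dataset distances and accuracies.
from scipy.stats import pearsonr

graph_distances = [2, 4, 7, 11, 15]               # placeholder per-dataset distances
svm_accuracies = [0.66, 0.71, 0.80, 0.88, 0.93]   # placeholder SVM accuracies

r, p_value = pearsonr(graph_distances, svm_accuracies)
print(f"correlation: {r:.3f} (p = {p_value:.3g})")
```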
Correlation between
the two metrics
[Figure: scatter plot of text distance vs. graph distance across category pairs; correlation: 0.614]
Distance metrics vs. MAA
[Figure: scatter plots of maximum achievable accuracy vs. graph distance (correlation: 0.776) and vs. text distance (correlation: 0.583)]
Versatility of
dataset generation
Number of category pairs (millions)
14
12
10
8
6
4
2
0
1
5
9
13
Graph Distance
17
21
Versatility of
dataset generation (cont’d)
[Figure: sampled distribution of the number of category pairs as a function of text distance]
Summary
• Methodology and system for automatic
acquisition of labeled datasets for TC
– Datasets possess predefined characteristics
– Suitable for a variety of TC flavors
• Similarity metrics for document sets as
reliable predictors of classification accuracy
– Maximum Achievable Accuracy (MAA)
– graph and text metrics
– Verified on 300 datasets generated from ODP
Summary (cont’d)
• Filtering raw data to cope with noise
• A large collection of TC datasets will be made publicly available shortly
– The Accio system will also be released later
Further research
directions
• More sophisticated distance metrics
• Additional parameters of categories
• Advanced filtering techniques
– Focused crawling
• Efficient algorithms for locating categories
subject to user’s requirements
• Datasets with more than 2 categories
– Generalize distance metrics appropriately
Questions?