
Pruning Long Documents for
Distributed Information Retrieval
Jie Lu and Jamie Callan
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA 15213
ABSTRACT
Query-based sampling is a method of discovering the contents of
a text database by submitting queries to a search engine and
observing the documents returned. In prior research sampled
documents were used to build resource descriptions for automatic
database selection, and to build a centralized sample database for
query expansion and result merging. An unstated assumption was
that the associated storage costs were acceptable.
When sampled documents are long, storage costs can be large.
This paper investigates methods of pruning long documents to
reduce storage costs. The experimental results demonstrate that
building resource descriptions and centralized sample databases
from the pruned contents of sampled documents can reduce
storage costs by 54-93% while causing only minor losses in the
accuracy of distributed information retrieval.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval
General Terms
Performance, Experimentation
Keywords
Distributed Information Retrieval, Document Pruning
1. INTRODUCTION
Distributed IR has emerged as a way of providing ad-hoc search
capabilities in environments that contain many searchable text
databases ("search engines") [2, 3, 6, 12]. The goal is to provide a
single interface for ad-hoc, multi-database search. A central site
(a "distributed search engine") selects an appropriate set of
databases for each query ("resource selection"), submits the query
to the selected databases, and merges the ranked lists of
documents returned by each database into a single, integrated
ranked list ("result merging").
Resource selection requires resource descriptions that describe
the contents of each database. In a cooperative environment (e.g.,
an intranet), each database provides an accurate resource
description when requested. In an uncooperative environment
(e.g., the Internet), the distributed search engine must discover the
contents of each database, for example, by submitting queries and
observing what is returned ("query-based sampling"). The two
approaches produce similar results for most queries [3, 4].
Our research interest is distributed IR in large-scale,
uncooperative environments, so we have worked extensively with
query-based sampling of databases. Because the only information
needed to construct the resource descriptions used by most
resource selection algorithms is term and term frequency
information, it is not necessary to store the original (sampled)
documents after they are analyzed. However, some solutions to
other distributed IR tasks, for example query expansion [10] and
result merging [9], require more information. Our solution was to
store the sampled documents in a centralized sample database. A
centralized sample database is a corpus created by sampling
documents from all of the databases available. For some tasks it
is a useful surrogate for the whole distributed IR environment.
The centralized sample database must satisfy two requirements: i)
it must represent each individual database relatively accurately,
and ii) the total centralized sample database size must not be too
large, for efficiency and scalability reasons. If the environment
contains 1,000 databases, and if 300 documents are sampled from
each database (a typical sample size in our research), then the
centralized sample database contains 1,000 × 300 = 300,000
documents. This is not large by IR standards.
The use of query-based sampling to build resource descriptions
and centralized sample databases works well for TREC data [9,
10]. The real world presents different challenges. The U.S.
Government Printing Office (USGPO) provides access to almost
2,400 databases of government documents and publications.
Many of these documents are much longer than TREC documents.
For example, in one of our tests with USGPO data, the average sampled document size was 440 kilobytes (KB). The estimated size of a centralized sample database for this environment is 2,400 databases × 300 documents × 440 KB = 302 gigabytes. We would prefer not to use so much space.
Resource descriptions created by query-based sampling are
partial resource descriptions, because they are based on only a
small subset of the documents in a database. This paper studies
the effects of further reducing the sizes of resource descriptions
and centralized sample databases by pruning terms from the
documents that are used to build them. In particular, we show
that pruned representations allow a large decrease in the sizes of
these data resources while causing only a small reduction in
resource selection accuracy and overall retrieval performance.
The following section describes related work. Section 3 outlines
different methods of pruning documents. Sections 4 and 5 discuss
the data resources and evaluation methodologies used in our
experiments. Experimental results are presented in Section 6.
Section 7 concludes.
2. RELATED WORK
An IR system’s storage space can be reduced by reducing index
sizes. Reducing index sizes is an old topic in IR, encompassing
techniques as varied as compression, term removal (e.g., stopword
lists), and pruned posting lists (i.e., recording some occurrences of
a term but not others). The latter set of techniques is the most
relevant to the research reported in this paper. A recent and
illustrative example of pruning postings is described in [5]. Given
a cutoff threshold, if the score of a term in a document is smaller
than the threshold, the posting is eliminated from the index. The
posting list of each term is pruned independently of other terms.
Index size reductions of 60-70% degraded average precision
about 10% for short queries.
One can view the pruning of posting lists as generating document
summaries by removing terms from documents. There is a large
body of prior research on document summarization, and some
prior research suggests that ad-hoc retrieval from indexes of
document summaries is as effective as retrieval from indexes of
complete document texts [11]. In general index pruning and
document summarization have been viewed as separate problems
with different solutions, because index pruning may produce a
summary that is useful for some task, but unintelligible.
However, the demonstrated utility of human-readable summaries for ad-hoc retrieval [11] and query expansion [8] in prior research suggests that they can also be considered for reducing index sizes.
Much of the recent summarization research has been on
techniques that select sentences from the document to generate
either query-biased or generic summaries. We are particularly
interested in using Luhn’s keyword-based clustering measure,
which was shown recently to produce summaries that support
effective query-expansion [8]. We describe this measure in
Section 3.3. It is an open question whether such summaries
would also be useful for resource selection, or for use in a
centralized sample database that supports result merging.
Pruning terms from documents generates partial document
representations. Craswell et al. [6] showed that result merging in a
distributed IR environment can be performed with only partial
document representations. In their distributed IR experiments,
databases were selected, searched, and then either the complete
document or just the first part of the document was downloaded.
A tf.idf score was calculated for the downloaded texts, and this
“global” document score was used for merging results from
different databases. The use of partial document representations
degraded accuracy by less than 10%. This approach of result
merging is somewhat costly in terms of communication and
computation, but it demonstrates that partial document
representations can be effective for distributed IR tasks.
All in all, the prior research suggests that i) index sizes can be reduced, with little loss of accuracy, by using partial document representations, and ii) document summarization techniques may be effective at creating partial document representations that work well for distributed IR tasks.
3. PRUNING METHODS
Our goal is to reduce the sizes of long documents without
reducing their utility for building resource descriptions and
centralized sample databases that support other retrieval tasks.
We call this content pruning: words that are not selected by a pruning method are removed from the document.
The pruning methods presented here fall into two categories:
frequency-based and location-based. Frequency-based pruning
uses term frequency information to measure the importance of the
terms and to prune relatively unimportant terms or segments of
the document. Frequency-based methods favor concepts that
occur repeatedly in a document, and so might be expected to
better represent its contents as a whole. Location-based pruning
selects terms based on their positions in the document. Location-based pruning is simple and fast, and typically favors the
beginning (or other specified) parts of a document.
The pruning methods can also be distinguished based on whether
multiple occurrences of a term are allowed in the pruned
document (multiple-occurrences methods vs. single-occurrence
methods). A multiple-occurrences method will better reflect term
frequency distributions, but may select fewer unique terms than a
single-occurrence method.
Below we present eight methods of pruning documents. Four are
frequency-based (TF, TFIDF, LUHNM, and LUHNS) and four
are location-based (FIRSTM, FIRSTS, RANDM and RANDS).
Three are multiple-occurrences methods (LUHNM, FIRSTM and
RANDM) and five are single-occurrence methods (TF, TFIDF,
LUHNS, FIRSTS and RANDS). All eight content pruning
methods prune each document independently of other documents.
3.1 Term Frequency (TF)
Given a threshold r, the term frequency pruning method ranks all
unique non-stopword terms in the document by within-document
term frequencies and selects terms ranked higher than r. Each
term occurs only once in the pruned document. The method is
based on the idea that within-document frequency is an indicator
of the importance of a non-stopword term in the document.
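For concreteness, here is a minimal sketch of TF pruning in Python (the code and names are ours, not the authors'); it assumes a pre-tokenized document and a stopword set:

```python
from collections import Counter

def tf_prune(doc_terms, stopwords, r):
    """Keep the r non-stopword terms with the highest within-document
    frequency; each selected term appears once in the pruned document."""
    counts = Counter(t for t in doc_terms if t not in stopwords)
    # most_common() sorts by descending frequency; ties are broken arbitrarily here.
    ranked = [term for term, _ in counts.most_common()]
    return ranked[:r]

# Example: keep the 3 most frequent non-stopword terms.
doc = "the cat sat on the mat the cat slept on the mat near the dog".split()
print(tf_prune(doc, {"the", "on", "near"}, 3))   # e.g. ['cat', 'mat', 'sat']
```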
3.2 Term Frequency and Inverse Document
Frequency (TFIDF)
The motivation of the tf.idf pruning method came from the widely
used tf.idf term weighting scheme in retrieval tasks. It ranks all
unique non-stopword terms in the document by their tf.idf scores
calculated as:
S = (log tf + 1) × log((N + 1) / df)
where tf is the within-document term frequency, df is the number
of sampled documents that contain the term, and N is the number
of documents sampled from the database.
Given a ranking threshold r, terms ranked higher than r are
selected and the rest are discarded. Each term occurs only once in
the pruned document.
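A corresponding sketch of the tf.idf variant (again ours, not the authors' code, using the score as written above); df is assumed to map each term to the number of sampled documents from the same database that contain it:

```python
import math
from collections import Counter

def tfidf_prune(doc_terms, stopwords, df, N, r):
    """Rank unique non-stopword terms by (log tf + 1) * log((N + 1) / df)
    and keep the top r; df[t] is the number of sampled documents containing t."""
    counts = Counter(t for t in doc_terms if t not in stopwords)

    def score(term):
        tf = counts[term]
        # df defaults to 1 as a guard; a term in the document should have df >= 1.
        return (math.log(tf) + 1.0) * math.log((N + 1.0) / df.get(term, 1))

    ranked = sorted(counts, key=score, reverse=True)
    return ranked[:r]
```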
3.3 Luhn’s Keyword-Based Clustering
(LUHNM and LUHNS)
Luhn’s keyword-based clustering measures the importance of a
sentence in the document [8]. It is used, for example, to select
sentences for use in a document summary [8].
Other summarization techniques could also be used. In this paper, we are interested in the effectiveness of using summaries for distributed IR tasks; the comparison of different summarization techniques in this environment is not our research focus here.
A sentence has a high score if it has a dense cluster of
“significant” words. A word is classified as “significant” if it is
not a stopword and its within-document term frequency is larger
than a threshold T defined as T = 7 + I × 0.1 × | L – n | where n is
the number of sentences in the document, L is 25 for n < 25 and
40 for n > 40, | L − n | is the absolute value of L − n, and I is 0
for 25 ≤ n ≤ 40 and 1 otherwise.
Clusters of significant words are created such that significant
words are separated by no more than 5 insignificant words within
the sentence. The significance score for a cluster is S = SW² / TW, where SW is the number of significant words in the cluster (both cluster boundaries are significant words), and TW is the total number of words in the cluster.
The sentence score is defined as the score of the most significant
cluster it contains. Sentences are ranked by their scores. Given a
threshold number of terms t, if multiple occurrences of a unique
term are allowed, non-stopword terms are selected from top-ranked sentences until the number reaches t (LUHNM);
otherwise, unique non-stopword terms from top-ranked sentences
are selected until the number reaches t (LUHNS).
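The sketch below (ours; the paper prescribes the scoring, not this code) illustrates the cluster score for one pre-tokenized sentence, assuming the set of "significant" words has already been determined from the threshold T. Pruning would then take (unique) non-stopword terms from the highest-scoring sentences until the threshold t is reached.

```python
def luhn_sentence_score(sentence_terms, significant, max_gap=5):
    """Score a sentence by its best cluster of significant words.
    A cluster is a span bounded by significant words in which consecutive
    significant words are separated by at most max_gap insignificant words;
    its score is (number of significant words)^2 / (total words in the span)."""
    positions = [i for i, t in enumerate(sentence_terms) if t in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:      # still within the current cluster
            count += 1
        else:                              # close the current cluster, start a new one
            span = prev - start + 1
            best = max(best, count * count / span)
            start, count = pos, 1
        prev = pos
    span = prev - start + 1
    return max(best, count * count / span)
```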
3.4 First Part of the Document (FIRSTM and
FIRSTS)
The first part of the document pruning methods select the first t
non-stopword terms if multiple occurrences of a unique term are
allowed (FIRSTM), or the first t unique non-stopword terms
otherwise (FIRSTS). The rest of the terms are discarded. These
methods simulate an environment in which only the first part of a
document is downloaded during query-based sampling, for
example to reduce communication costs.
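A minimal sketch of the two variants (ours, under the same tokenization assumptions as above):

```python
def first_prune(doc_terms, stopwords, t, unique=False):
    """FIRSTM (unique=False): keep the first t non-stopword tokens.
    FIRSTS (unique=True): keep the first t unique non-stopword terms."""
    selected, seen = [], set()
    for term in doc_terms:
        if term in stopwords:
            continue
        if unique:
            if term in seen:
                continue
            seen.add(term)
        selected.append(term)
        if len(selected) == t:
            break
    return selected
```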
3.5 Randomly Selected Terms of the
Document (RANDM and RANDS)
The randomly selected terms pruning methods choose positions in
the document randomly and select all corresponding non-stopword terms if multiple occurrences of a unique term are
allowed (RANDM), or those corresponding non-stopword terms
that haven’t been selected for this document otherwise (RANDS)
until the number reaches t. The rest of the terms are discarded.
Although the selection criterion is location-based, it can also be
regarded as frequency-based because high frequency terms occur
in more locations and thus are more likely to be selected.
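A sketch of the random variants (ours). The paper does not say whether positions are drawn with or without replacement; this sketch samples positions without replacement:

```python
import random

def random_prune(doc_terms, stopwords, t, unique=False, seed=None):
    """RANDM (unique=False) / RANDS (unique=True): pick document positions at
    random and keep the corresponding non-stopword terms until t are selected."""
    rng = random.Random(seed)
    positions = [i for i, term in enumerate(doc_terms) if term not in stopwords]
    rng.shuffle(positions)                 # random order over candidate positions
    selected, seen = [], set()
    for i in positions:
        term = doc_terms[i]
        if unique and term in seen:
            continue
        seen.add(term)
        selected.append(term)
        if len(selected) == t:
            break
    return selected
```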
4. MULTI-DATABASE TESTBEDS
The pruning methods were tested on two distributed IR testbeds.
Testbed I consists of 100 databases obtained by dividing data
from TREC CDs 1, 2, and 3 based on source and publication date.
This testbed has been used in prior research on query-based sampling, resource selection, and result merging in distributed information retrieval [3, 9]. These 100 databases can be grouped based on average document length and total number of documents. Summary statistics are shown in Table 4.1.

Table 4.1 Summary statistics for the databases in Testbed I.
Type of docs   # dbs   # docs (min - max)   Avg doc len in words (min - max)
short          8       36,946 - 39,713      24 - 129
medium         72      8,039 - 13,602       399 - 603
long           13      2,999 - 4,124        1,210 - 1,701
very long      7       754 - 1,236          4,011 - 6,408
TREC topics 51-150 were used to run our experiments. These
topics were chosen because they have relevance judgments for all
parts of TREC CDs 1, 2, and 3. Only title fields were used in our
experiments, so queries are short. The standard TREC relevance
assessments supplied by the U. S. National Institute for Standards
and Technology were used [7].
Testbed II is a variation of Testbed I created specifically to test the effects of pruning in an environment where some databases have extremely long documents (i.e., to simulate an environment like the USGPO). Documents from the 1988 Federal Register were grouped by content using a k-means clustering algorithm. Documents within a cluster were sorted into chronological order, and then subsequences were concatenated to create very long documents. The resulting fr88_merge database had 2,357 documents with an average length of 12,854 terms. The same process was applied to the 1993 U.S. Patents documents. The resulting patent_merge database had 2,777 documents with an average length of 12,047 terms. The fr88_merge and patent_merge databases replaced 13 corresponding databases in Testbed I, resulting in Testbed II, which contains 89 databases. Summary statistics are shown in Table 4.2.

Table 4.2 Summary statistics for the databases in Testbed II.
Type of docs   # dbs   # docs (min - max)   Avg doc len in words (min - max)
short          8       36,946 - 39,713      24 - 129
medium         72      8,039 - 13,602       399 - 603
long           7       3,547 - 4,124        1,353 - 1,618
very long      2       2,357 - 2,777        12,047 - 12,854
The relevance assessments for Testbed II were adjusted
accordingly. If any component of a very long “concatenated”
document was relevant to a topic, the very long document was
also considered relevant. This is consistent with NIST assessment
methodology in which a document is judged relevant if any part
of it is relevant [7].
5. EVALUATION METHODOLOGY
Distributed IR systems are affected by the performance of each
individual component as well as the interaction between different
components. Past research showed that it is important to evaluate
directly where a change is made in a system, and also to evaluate
the effects of the change on “downstream” components [3]. In
our case, that means evaluating how well pruned resource
descriptions match unpruned resource descriptions, and also how
they affect the accuracies of resource selection and document
retrieval. Each type of evaluation is discussed below.
5.1 Measuring the Accuracy of Learned
Resource Descriptions
Resource descriptions are usually based on vocabulary and term
frequency information [3, 12]. Prior research on query-based
sampling [3] compared resource descriptions using ctf ratio and
the Spearman rank correlation coefficient.
Adopting the
terminology of that research, we call these the “actual”
description (created from all documents of a database) and the
“learned” description (created from sampled documents).
The ctf ratio (collection term frequency ratio) measures how well
the learned vocabulary matches the actual vocabulary. It is the
proportion of term occurrences in the database that are covered by
terms in the learned resource description. A ctf ratio of 80%
means that the learned resource description contains the terms that
account for 80% of all term occurrences in the database. For a
learned vocabulary V’ and an actual vocabulary V, ctf ratio is:
\frac{\sum_{i \in V'} ctf_i}{\sum_{i \in V} ctf_i}

where ctf_i is the collection term frequency of (unique) term i.
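As a small worked example (ours), the ctf ratio can be computed directly from the learned vocabulary and the database's collection term frequencies:

```python
def ctf_ratio(learned_vocab, collection_ctf):
    """Fraction of all term occurrences in the database that are covered by
    terms in the learned resource description.  collection_ctf maps each term
    in the actual vocabulary to its collection term frequency."""
    covered = sum(collection_ctf[t] for t in learned_vocab if t in collection_ctf)
    total = sum(collection_ctf.values())
    return covered / total if total else 0.0

# Example: the learned vocabulary covers 14 of 17 term occurrences (~0.82).
print(ctf_ratio({"tax", "court"}, {"tax": 9, "court": 5, "appeal": 3}))
```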
The Spearman rank correlation coefficient compares the
orderings of terms ranked by frequency in the learned and actual resource descriptions. It is defined as:

\frac{1 - \dfrac{6}{n^3 - n}\left(\sum_i d_i^2 + \dfrac{1}{12}\sum_k (f_k^3 - f_k) + \dfrac{1}{12}\sum_m (g_m^3 - g_m)\right)}{\sqrt{1 - \dfrac{\sum_k (f_k^3 - f_k)}{n^3 - n}} \times \sqrt{1 - \dfrac{\sum_m (g_m^3 - g_m)}{n^3 - n}}}

where:
d_i is the rank difference of common (unique) term i;
n is the number of unique terms in the learned resource description;
f_k is the number of ties in the kth group of ties in the learned resource description; and
g_m is the number of ties in the mth group of ties in the actual resource description.
The rank correlation coefficient has value 1 when two orderings are identical, −1 when they are in reverse order, and 0 when they are uncorrelated.
Stopwords were removed from both the learned and actual
language models prior to comparison with ctf ratio and Spearman
rank correlation coefficient.
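A hedged sketch of the comparison (ours) using scipy, which assigns average ranks to ties and is therefore equivalent to the tie-corrected formula above; note that it restricts the comparison to terms that appear in both descriptions, which approximates the paper's definition of d_i over common terms:

```python
from scipy.stats import spearmanr

def description_rank_correlation(learned_ctf, actual_ctf):
    """Compare the frequency-based ordering of terms in the learned description
    with their ordering in the actual description.  Ties are handled by scipy
    via average ranks, matching the tie-corrected formula."""
    common = [t for t in learned_ctf if t in actual_ctf]
    learned_freqs = [learned_ctf[t] for t in common]
    actual_freqs = [actual_ctf[t] for t in common]
    rho, _ = spearmanr(learned_freqs, actual_freqs)
    return rho
```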
5.2 Measuring the Accuracy of Database
Rankings
The database ranking metric Rn is a standard metric in distributed
IR research [3]. It is analogous to the Recall metric for document
rankings. If the desired database ranking is a relevance-based
ranking that ranks databases by the number of relevant documents
they contain for a query, Rn is defined as:
R_n = \frac{\sum_{i=1}^{n} NumRel(dbe_i)}{\sum_{i=1}^{n} NumRel(dbr_i)}

where:
n is the number of databases to be ranked;
dbe_i is the i-th ranked database in the estimated database ranking;
dbr_i is the i-th ranked database in the relevance-based ranking.
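A minimal sketch of the metric (ours), assuming a mapping from database identifiers to their numbers of relevant documents for the query:

```python
def recall_at_n(estimated_ranking, relevance_ranking, num_rel, n):
    """R_n: relevant documents held by the top n databases of the estimated
    ranking, relative to the top n databases of the relevance-based ranking.
    num_rel maps a database id to its number of relevant documents."""
    got = sum(num_rel.get(db, 0) for db in estimated_ranking[:n])
    best = sum(num_rel.get(db, 0) for db in relevance_ranking[:n])
    return got / best if best else 0.0
```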
5.3 Measuring the Accuracy of Document
Rankings
The quality of document rankings is measured by the well-known
Precision and Recall metrics.
We measured precision at
document ranks 5, 10, 15, 20, 30, and 100.
6. EXPERIMENTAL RESULTS
A series of experiments was conducted to study the effects of the
pruning methods described in Section 3. Resource descriptions
and a centralized sample database were created using (pruned or
unpruned) documents obtained by query-based sampling [4].
Query-based sampling was conducted with single-term queries
selected randomly from documents already sampled from the
database, and 4 documents examined per query. 300 documents
were sampled from each database. This methodology is consistent
with prior research [3, 9, 10].
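The following sketch (ours) outlines this sampling loop; `search(term, k)` is an assumed interface to the remote search engine that returns up to k (doc_id, terms) pairs, and the seeding of the first query term follows [4] and is not shown in detail:

```python
import random

def query_based_sample(search, seed_term, docs_per_query=4, sample_size=300, seed=None):
    """Sample a database with single-term queries drawn from documents
    sampled so far; stop once sample_size documents have been collected."""
    rng = random.Random(seed)
    sampled = {}                      # doc_id -> list of terms
    query_pool = [seed_term]          # first query term; seeding follows [4]
    while len(sampled) < sample_size and query_pool:
        term = rng.choice(query_pool)
        for doc_id, terms in search(term, docs_per_query):
            if doc_id in sampled:
                continue
            sampled[doc_id] = terms
            query_pool.extend(terms)  # later queries come from sampled documents
            if len(sampled) == sample_size:
                break
    return sampled
```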
For Testbed II, sampled documents were pruned only if they
belong to the “very long documents” databases fr88_merge and
patent_merge. This choice guaranteed that any differences would
be due only to pruning very long documents. The average length
of documents sampled from Testbed I was around 1,600 terms
(about 400 unique terms), so the pruning thresholds for the “very
long documents” databases in Testbed II were set to 1,600 for
multiple-occurrences methods and 400 for single-occurrence
methods. The average length of documents sampled from “very
long documents” databases fr88_merge and patent_merge was
13,292 terms, so the pruning ratios were about 8:1 for multiple-occurrences methods and 33:1 for single-occurrence methods.
For Testbed I, all sampled documents were subject to pruning.
The pruning thresholds for Testbed I were chosen to yield the
same pruning ratios on “long documents” and “very long
documents” databases as were used in Testbed II. The average
length of documents sampled from “long documents” and “very
long documents” databases in Testbed I was 3,332 terms. Using
the (approximately) 8:1 pruning ratio (multiple-occurrences
methods) and 33:1 pruning ratio (single-occurrence methods)
results in thresholds of 400 terms for multiple-occurrences
methods and 100 terms otherwise.
Databases were ranked with the well-known CORI database
ranking algorithm. CORI has been described often in the research
literature [2, 3, 9, 10], so we only sketch it briefly. The belief of
database dbi with respect to query term rk is p(rk | dbi) defined as:
p(r_k | db_i) = b + (1 - b) \times T \times I

T = \frac{df_i}{df_i + 50 + 150 \times cw_i / avg\_cw}

I = \frac{\log\left(\dfrac{|DB| + 0.5}{cf}\right)}{\log(|DB| + 1.0)}

where:
df_i is the number of documents in db_i that contain r_k;
cf is the number of databases that contain r_k;
|DB| is the number of databases to be ranked;
cw_i is the number of words in db_i;
avg_cw is the average cw of the databases to be ranked;
b is the default belief, usually 0.4.
The combination of belief values for query terms is the average of
their beliefs in our experiments.
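A sketch of the per-database belief computation (ours); the guard against unseen terms is an implementation choice of this sketch, not part of CORI:

```python
import math

def cori_belief(query_terms, df, cf, cw, avg_cw, num_dbs, b=0.4):
    """Belief of one database for a query, averaged over the query terms.
    df[t]: documents in this database containing t; cf[t]: databases containing t;
    cw: word count of this database; avg_cw: average word count over all databases."""
    beliefs = []
    for t in query_terms:
        T = df.get(t, 0) / (df.get(t, 0) + 50.0 + 150.0 * cw / avg_cw)
        I = math.log((num_dbs + 0.5) / max(cf.get(t, 1), 1)) / math.log(num_dbs + 1.0)
        beliefs.append(b + (1.0 - b) * T * I)
    return sum(beliefs) / len(beliefs) if beliefs else 0.0
```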
The 10 top-ranked databases for a query were selected for search,
as in prior distributed IR research [2, 3, 9, 10]. The selected
databases were searched using the INQUERY search engine [1].
[Figure 6.1a: Average ctf ratio of resource descriptions learned for 13 "long documents" databases in Testbed I (ctf ratio vs. number of documents examined).]
[Figure 6.1b: Average ctf ratio of resource descriptions learned for 7 "very long documents" databases in Testbed I (ctf ratio vs. number of documents examined).]

Document rankings produced by different databases must be merged into a single ranking. The CORI result-merging algorithm is a standard heuristic used with the CORI database ranking and INQUERY document retrieval algorithms [2, 3, 9, 10], so we used it in the research reported here. A "normalized" document score D' suitable for merging is calculated as [3]:
D' = \frac{\dfrac{D - D_{min}}{D_{max} - D_{min}} \times \left(1 + 0.4 \times \dfrac{R_i - R_{min}}{R_{max} - R_{min}}\right)}{1.4}
where:
D is the document score returned by database i;
Ri is the ranking of database i;
Rmax and Rmin are the highest and lowest scores respectively
that the database ranking algorithm could potentially assign
to a database;
Dmax and Dmin are the highest and lowest document scores
respectively that are returned by database i.
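In code form (ours), with guards for degenerate score ranges that the formula itself does not specify:

```python
def cori_merge_score(D, D_min, D_max, R_i, R_min, R_max):
    """Normalize a document score for cross-database merging using the
    heuristic above."""
    doc_part = (D - D_min) / (D_max - D_min) if D_max > D_min else 0.0
    db_part = (R_i - R_min) / (R_max - R_min) if R_max > R_min else 0.0
    return doc_part * (1.0 + 0.4 * db_part) / 1.4
```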
We also used a newer result-merging algorithm that, on a query-by-query basis, uses a centralized sample database to generate unsupervised training data for a regression-based normalizing function [9]. Sampled documents from all databases are combined into a single centralized sample database, and this database is indexed with INQUERY. After 1,000 documents were retrieved from a selected database, another 5,000 documents were retrieved from the centralized sample database. Documents that occurred in both lists ("overlap documents") served as training data for a linear regression algorithm that learned to normalize document scores returned by this database (see [9] for details). Each selected database learned its own normalizing function. Finally, a merged ranking of 100 documents was returned for each query based on the normalized document scores.
The regression-based result-merging algorithm was chosen
because the centralized sample database introduces another
potential sensitivity to pruned documents. Retrieval of pruned
documents might generate less accurate document rankings,
which would reduce the effectiveness of result merging.
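A sketch of the per-database normalization step (ours, not the algorithm of [9] itself) using an ordinary least-squares line over the overlap documents; it assumes at least two overlap documents:

```python
import numpy as np

def fit_score_normalizer(db_scores, csdb_scores):
    """Fit a linear mapping from database-specific scores to centralized sample
    database scores using the overlap documents (doc id -> score in each dict)."""
    overlap = sorted(set(db_scores) & set(csdb_scores))
    x = np.array([db_scores[d] for d in overlap])
    y = np.array([csdb_scores[d] for d in overlap])
    slope, intercept = np.polyfit(x, y, 1)     # simple least-squares line
    return lambda s: slope * s + intercept

# normalize = fit_score_normalizer(scores_from_db, scores_from_csdb)
# merged score for a document with raw score s from this database: normalize(s)
```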
6.1 Effects of Content Pruning on the
Accuracy of Resource Descriptions
[Figure 6.1c: ctf ratio of the resource description learned for fr88_merge in Testbed II (ctf ratio vs. number of documents examined).]

Our first set of experiments measured how well resource descriptions created from pruned documents matched resource descriptions created from unpruned documents. In this set of
experiments we studied only “long documents” and “very long
documents” databases (Section 4), because very few documents in
“short documents” and “medium documents” databases were long
enough to be pruned. The experimental results for both testbeds
are summarized in Figures 6.1 (ctf ratio) and 6.2 (Spearman rank
correlation coefficient). The baselines were generated from
unpruned documents. In some cases it is difficult to tell curves
apart, so each curve’s rank at the “100 sampled documents” point
is shown in parentheses. Because results for fr88_merge and
patent_merge in Testbed II are very similar, only results for
fr88_merge are shown here due to limited space.
The ctf ratio measures how well a resource description represents
the vocabulary of a database. The results for the ctf ratio show
that even using pruned documents, the vocabulary that represents
at least 70% of the non-stopword term occurrences in each
database can still be identified quickly. After sampling 300
documents, the difference between ctf ratios for resource
descriptions produced from pruned documents and those
produced from unpruned documents is less than 15%.
Methods based on randomly selected terms are better than those
based on Luhn’s keyword-based clustering and the first part of the
document. The difference between multiple-occurrences methods
was usually small. The same is true for single-occurrence
methods. The difference between multiple-occurrences and
single-occurrence methods is due to the larger vocabulary found
using multiple-occurrences methods with given thresholds. On
Testbed I the average number of unique terms in pruned
documents using multiple-occurrences methods with threshold
400 is 193 compared with 100 using single-occurrence methods.
On Testbed II, the average number of unique terms of pruned
documents using multiple-occurrences methods with threshold
1,600 is 530 compared with 400 using single-occurrence methods.
The Spearman rank correlation coefficient measures how well a
resource description represents the term frequency distribution in
a database. Accumulating more terms during sampling does not necessarily lead to more accurate orderings of terms by frequency in the learned resource description. This explains why the curves of the Spearman rank correlation coefficient are not always monotonic, and why the unpruned representations occasionally performed slightly worse than the pruned ones.
Methods based on randomly selected terms appear to generate more accurate orderings of terms by frequency than those based on Luhn's keyword-based clustering and the first part of the document. However, unlike with ctf ratio, it is not necessarily the case that multiple-occurrences methods worked better than single-occurrence methods. Fairly accurate estimation of term orderings by frequency at the database level can still be obtained by single-occurrence methods (except TFIDF) even if we ignore within-document term frequency when pruning documents.

[Figure 6.2a: Average Spearman rank correlation coefficient of resource descriptions learned for 13 "long documents" databases in Testbed I.]
[Figure 6.2b: Average Spearman rank correlation coefficient of resource descriptions learned for 7 "very long documents" databases in Testbed I.]
[Figure 6.2c: Spearman rank correlation coefficient of the resource description learned for fr88_merge in Testbed II.]

6.2 Effects of Content Pruning on the Accuracy of Database Rankings
A second set of experiments studied the effects of the pruned resource descriptions on the ability of the CORI algorithm to rank databases. Figure 6.3 summarizes the recall values of database rankings for Testbeds I and II. The rank of each curve at the "10 selected databases" point is provided in parentheses.

[Figure 6.3a: Database ranking recall R(n) for Testbed I (R(n) vs. n = number of databases).]
[Figure 6.3b: Database ranking recall R(n) for Testbed II (R(n) vs. n = number of databases).]

For Testbed I, multiple-occurrences methods had higher recall than single-occurrence methods, presumably due to better term frequency information. The difference between using pruned and unpruned contents for resource descriptions was less than 10%.
For Testbed II, there was hardly any difference in recall using different pruning methods. The difference between using pruned and unpruned contents was also rarely observable. If we focus on the top 10 of the ranking, multiple-occurrences methods still had higher recall than single-occurrence methods. In general, content pruning did not result in much degradation of database rankings.

6.3 Effects of Content Pruning on the Accuracy of Document Rankings
A third set of experiments studied the effects of the pruned resource descriptions on how well two result-merging algorithms merge document rankings. The experimental results are summarized in Tables 6.1 (Testbed I) and 6.2 (Testbed II). The baseline results were generated from unpruned documents.
In general, the effect of pruning on the accuracy of document rankings was consistent with its effect on the accuracy of database rankings. The results indicate that creating resource descriptions from pruned documents led to less than 10% (usually less than 5%, which few users would notice) relative performance loss using the LUHNM, LUHNS, RANDM, and FIRSTM pruning methods. The performance loss using the RANDS, FIRSTS, TF, and TFIDF methods was usually less than 15%.
Table 6.1a Precision of document rankings created by the CORI result-merging algorithm on Testbed I.
Rank      Full content  TF              TFIDF          LUHNM          LUHNS          FIRSTM         FIRSTS          RANDM          RANDS
5 docs    0.438         0.390 (-10.9%)  0.396 (-9.6%)  0.414 (-5.5%)  0.428 (-2.3%)  0.414 (-5.5%)  0.400 (-8.7%)   0.418 (-4.6%)  0.400 (-8.7%)
10 docs   0.404         0.346 (-14.3%)  0.366 (-9.4%)  0.379 (-6.2%)  0.378 (-6.4%)  0.378 (-6.4%)  0.359 (-11.1%)  0.382 (-5.4%)  0.354 (-12.4%)
15 docs   0.379         0.330 (-12.8%)  0.343 (-9.3%)  0.363 (-4.2%)  0.366 (-3.3%)  0.363 (-4.0%)  0.352 (-7.0%)   0.361 (-4.5%)  0.343 (-9.3%)
20 docs   0.361         0.313 (-13.3%)  0.333 (-7.6%)  0.349 (-3.5%)  0.349 (-3.5%)  0.340 (-5.9%)  0.338 (-6.5%)   0.338 (-6.4%)  0.326 (-9.8%)
30 docs   0.337         0.295 (-12.6%)  0.308 (-8.5%)  0.326 (-3.2%)  0.327 (-3.0%)  0.315 (-6.4%)  0.311 (-7.6%)   0.317 (-5.8%)  0.303 (-10.1%)
100 docs  0.260         0.223 (-14.5%)  0.237 (-8.8%)  0.250 (-4.0%)  0.239 (-8.3%)  0.245 (-5.9%)  0.236 (-9.4%)   0.247 (-5.3%)  0.236 (-9.6%)
Table 6.1b Precision of document rankings created by the regression-based result-merging algorithm on Testbed I.
Rank      Full content  TF              TFIDF          LUHNM          LUHNS          FIRSTM         FIRSTS          RANDM          RANDS
5 docs    0.434         0.412 (-5.1%)   0.404 (-6.9%)  0.410 (-5.5%)  0.422 (-2.8%)  0.404 (-6.9%)  0.412 (-5.1%)   0.428 (-1.4%)  0.404 (-6.9%)
10 docs   0.400         0.382 (-4.5%)   0.369 (-7.8%)  0.392 (-2.0%)  0.384 (-4.0%)  0.379 (-5.2%)  0.378 (-5.5%)   0.393 (-1.7%)  0.374 (-6.5%)
15 docs   0.381         0.357 (-6.3%)   0.353 (-7.5%)  0.367 (-3.8%)  0.367 (-3.8%)  0.373 (-2.3%)  0.360 (-5.6%)   0.377 (-1.2%)  0.359 (-5.9%)
20 docs   0.368         0.328 (-10.7%)  0.337 (-8.4%)  0.352 (-4.3%)  0.352 (-4.3%)  0.346 (-5.8%)  0.338 (-8.0%)   0.351 (-4.5%)  0.334 (-9.1%)
30 docs   0.337         0.303 (-10.0%)  0.310 (-7.9%)  0.326 (-3.3%)  0.323 (-4.1%)  0.319 (-5.2%)  0.313 (-7.1%)   0.328 (-2.6%)  0.303 (-10.0%)
100 docs  0.260         0.215 (-17.3%)  0.237 (-8.7%)  0.252 (-3.0%)  0.250 (-3.6%)  0.247 (-4.9%)  0.229 (-11.7%)  0.253 (-2.6%)  0.230 (-14.0%)
Table 6.2a Precision of document rankings created by the CORI result-merging algorithm on Testbed II.
Rank      Full content  TF              TFIDF           LUHNM          LUHNS          FIRSTM         FIRSTS          RANDM          RANDS
5 docs    0.436         0.388 (-11.0%)  0.370 (-15.1%)  0.442 (+1.4%)  0.404 (-7.3%)  0.442 (+1.4%)  0.374 (-14.2%)  0.430 (-1.4%)  0.382 (-12.4%)
10 docs   0.411         0.371 (-9.7%)   0.352 (-14.3%)  0.426 (+3.6%)  0.388 (-5.6%)  0.426 (+3.6%)  0.369 (-10.2%)  0.411 (0.0%)   0.371 (-9.7%)
15 docs   0.396         0.355 (-10.3%)  0.341 (-13.8%)  0.405 (+2.2%)  0.367 (-7.2%)  0.405 (+2.2%)  0.353 (-10.8%)  0.396 (0.0%)   0.355 (-10.4%)
20 docs   0.388         0.353 (-9.2%)   0.341 (-12.2%)  0.401 (+3.1%)  0.366 (-5.7%)  0.400 (+2.9%)  0.352 (-9.5%)   0.391 (+0.7%)  0.353 (-9.1%)
30 docs   0.364         0.327 (-10.1%)  0.316 (-13.2%)  0.371 (+1.8%)  0.339 (-6.9%)  0.369 (+1.4%)  0.328 (-10%)    0.361 (-0.9%)  0.331 (-9.1%)
100 docs  0.274         0.249 (-9.0%)   0.246 (-10.0%)  0.277 (+1.2%)  0.251 (-8.3%)  0.277 (+1.2%)  0.248 (-9.5%)   0.270 (-1.3%)  0.249 (-8.9%)
Table 6.2b Precision of document rankings created by the regression-based result-merging algorithm on Testbed II.
Rank      Full content  TF             TFIDF          LUHNM          LUHNS          FIRSTM         FIRSTS         RANDM          RANDS
5 docs    0.440         0.428 (-2.7%)  0.430 (-2.3%)  0.440 (0.0%)   0.424 (-3.6%)  0.424 (-3.6%)  0.420 (-4.5%)  0.432 (-1.8%)  0.412 (-6.4%)
10 docs   0.415         0.409 (-1.4%)  0.398 (-4.1%)  0.413 (-0.5%)  0.393 (-5.3%)  0.401 (-3.4%)  0.386 (-7.0%)  0.404 (-2.6%)  0.388 (-6.5%)
15 docs   0.403         0.389 (-3.3%)  0.379 (-5.9%)  0.395 (-2.0%)  0.378 (-6.1%)  0.383 (-5.0%)  0.372 (-7.6%)  0.387 (-4.0%)  0.373 (-7.3%)
20 docs   0.378         0.371 (-1.7%)  0.365 (-3.3%)  0.378 (+0.1%)  0.367 (-2.9%)  0.371 (-1.8%)  0.364 (-3.7%)  0.374 (-1.0%)  0.357 (-5.4%)
30 docs   0.352         0.346 (-1.5%)  0.343 (-2.5%)  0.354 (+0.6%)  0.346 (-1.5%)  0.350 (-0.5%)  0.342 (-2.7%)  0.348 (-1.0%)  0.339 (-3.7%)
100 docs  0.260         0.258 (-0.9%)  0.254 (-2.3%)  0.261 (+0.1%)  0.258 (-0.9%)  0.259 (-0.5%)  0.255 (-2.1%)  0.258 (-1.0%)  0.256 (-1.7%)
Generally, multiple-occurrences methods provided better accuracy (less degradation) than single-occurrence methods. However, the thresholds for single-occurrence methods were more aggressive, and produced smaller pruned documents. If the thresholds for single-occurrence methods were increased, the gap between single-occurrence and multiple-occurrences methods might be expected to shrink.
On Testbed I, before content pruning, the CORI and regression-based result-merging algorithms yielded similar results. After content pruning, the regression-based result-merging algorithm was clearly more robust. We had expected the regression-based result-merging algorithm to be more sensitive to pruned documents because the (pruned) documents in the centralized sample database are the training data for learning how to normalize document scores. On Testbed II, in most cases the regression-based algorithm was more effective than the CORI result-merging algorithm. These experimental results are an encouraging sign of stability for the regression-based result-merging algorithm. They also suggest that, for this task, a centralized sample database created from pruned documents is as effective as a database constructed from complete documents.
6.4 Measuring the Effect on Storage Costs
In our experiments, three types of data were affected by pruning: i) the resource selection index, ii) the centralized sample database, and iii) the centralized sample index. Table 6.3 compares the average sizes of these data resources before and after pruning on Testbed I. The percentage figures indicate the reduction in disk space relative to the (unpruned) baseline. Table 6.4 compares the space required to store the sampled documents from the "very long documents" databases in Testbed II (fr88_merge, patent_merge) before and after pruning.

Table 6.3 The sizes of data resources created from pruned and unpruned documents in Testbed I.
                                Full content   Multiple-occurrences   Single-occurrence
Threshold                       N/A            400                    100
Resource selection index size   24 MB          11 MB (54.17%)         9 MB (62.50%)
Centralized sample DB size      258 MB         46 MB (82.17%)         19 MB (92.64%)
Centralized sample index size   110 MB         43 MB (60.91%)         25 MB (77.27%)

Table 6.4 The space taken by documents from fr88_merge and patent_merge in the centralized sample database before and after content pruning in Testbed II.
                                Full content   Multiple-occurrences   Single-occurrence
Threshold                       N/A            1,600                  400
Centralized sample DB size      51 MB          7 MB (86.27%)          2 MB (96.08%)
These results demonstrate that the pruning of long documents
enables dramatic reductions in the storage requirements of three
different types of data used in distributed information retrieval.
7. CONCLUSIONS
This paper investigates the effects of pruning documents obtained
by query-based sampling on the performance of a distributed IR
system. Pruned documents were used to build resource
descriptions for resource selection, and a centralized sample
database for result merging. Our results demonstrate that pruned
documents support resource descriptions, database ranking, and
ad-hoc document retrieval accuracies comparable to the
accuracies obtained using unpruned documents. However, the amount of storage space was reduced by 54-93%, which can be a considerable savings when documents are very long.
Our results show that multiple-occurrences methods are slightly
better than single-occurrence methods.
Single-occurrence
methods are sufficient for database ranking, which ignores within-document term frequency.
Regression-based result-merging
includes a document retrieval step that is more effective using the
within-document term frequency that multiple-occurrences
methods provide. However, single-occurrence methods only
require about half the space of multiple-occurrences methods, and
are surprisingly effective even for the result-merging task.
Location-based pruning methods worked well in our tests, even
though they ignore frequency, which was a surprise. In general,
LUHNM, LUHNS, FIRSTM and RANDM pruning give good and
stable performance.
These results provide additional support for the new regression-based result-merging algorithm [9]. This algorithm provided
higher baseline results than the CORI result-merging algorithm,
and more consistent results across the range of pruning methods.
Future research topics include investigating the effects of pruning
in environments with multiple search engine types and different
resource selection algorithms, characterizing the bias caused by
pruning, and pruning tailored for specific distributed IR tasks.
8. ACKNOWLEDGMENTS
We thank Luo Si and Paul Ogilvie for their help with the research
reported here. This material is based on work supported by NSF
grant IIS-0096139. Any opinions, findings, conclusions or
recommendations expressed in this material are the authors’, and
do not necessarily reflect those of the sponsor.
9. REFERENCES
[1] J. Allan, J. Callan, M. Sanderson, J. Xu and S. Wegmann.
INQUERY and TREC-7. In Proc. of the Seventh Text
Retrieval Conference (TREC-7). 1999.
[2] J. Callan, Z. Lu and W. B. Croft. Searching distributed
collections with inference networks. In Proc. of the 18th
Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval. 1995.
[3] J. Callan. Distributed information retrieval. W. B. Croft,
editor, Advances in information retrieval, chapter 5, pages
127-150. Kluwer Academic Publishers, 2000.
[4] J. Callan and M. Connell. Query-based sampling of text
databases. Transactions on Information Systems, 19(2),
pages 97-130. 2001.
[5] D. Carmel, D. Cohen, R. Fagin, E. Farchi, M. Herscovici, Y.
S. Maarek and A. Soffer.
Static index pruning for
information retrieval systems. In Proc. of the 24th Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval. 2001.
[6] N. Craswell, D. Hawking and P. Thistlewaite. Merging
results from isolated search engines. In Proc. of the Tenth
Australasian Database Conference. 1999.
[7] D. Harman, editor.
Proc. of the Third Text Retrieval
Conference (TREC-3). 1995.
[8] A. Lam-Adesina and G. Jones. Applying summarization
techniques for term selection in relevance feedback. In Proc.
of the 24th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval. 2001.
[9] L. Si and J. Callan. Using sampled data and regression to
merge search engine results. In Proc. of the 25th Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval. 2002.
[10] P. Ogilvie and J. Callan.
The Effectiveness of query
expansion for distributed information retrieval. In Proc. of
the Tenth International Conference on Information
Knowledge Management (CIKM 2001). 2001.
[11] T. Sakai and K. Sparck-Jones.
Generic summaries for
indexing in information retrieval. In Proc. of the 24th Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval. 2001.
[12] B. Yuwono and D. L. Lee. Server ranking for distributed
text retrieval systems on the Internet. In Proc. of the Fifth
International Conference on Database Systems for Advanced
Applications (DASFAA), pages 41-49. 1997.