
RJ 10025 (90521)
Computer Science
May 29, 1996
(Revised 3/20/98)
Research Report
ESTIMATING THE NUMBER OF CLASSES IN
A FINITE POPULATION
Peter J. Haas
IBM Research Division
Almaden Research Center
650 Harry Road
San Jose, CA 95120-6099
Lynne Stokes
Department of Management Science and
Information Systems
University of Texas
Austin, TX 78712
LIMITED DISTRIBUTION NOTICE
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication.
It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the
outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific
requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g.,
payment of royalties).
IBM
Research Division
Yorktown Heights, New York
San Jose, California
Zurich, Switzerland
ESTIMATING THE NUMBER OF CLASSES IN
A FINITE POPULATION
Peter J. Haas
IBM Research Division
Almaden Research Center
650 Harry Road
San Jose, CA 95120-6099
e-mail: [email protected]
Lynne Stokes
Department of Management Science and
Information Systems
University of Texas
Austin, TX 78712
e-mail: [email protected]
ABSTRACT: We use an extension of the generalized jackknife approach of Gray and
Schucany to obtain new nonparametric estimators for the number of classes in a finite
population of known size. We also show that generalized jackknife estimators are closely
related to certain Horvitz-Thompson estimators, to an estimator of Shlosser, and to estimators based on sample coverage. In particular, the generalized jackknife approach leads to
a modification of Shlosser's estimator that does not suffer from the erratic behavior of the
original estimator. The performance of both new and previous estimators is investigated
by means of an asymptotic variance analysis and a Monte Carlo simulation study.
Keywords: jackknife, sample coverage, number of species, number of classes, database,
census
1. Introduction
The problem of estimating the number of classes in a population has been studied for
many years. A recent review article (Bunge and Fitzpatrick 1993) lists more than 125
references. In this article, we consider an important special case of the general problem:
estimating the number of classes in a finite population of known size. Only a handful of
papers have addressed this problem, and none has reached an entirely satisfactory solution,
despite the fact that the first attempt at a solution appeared in the statistical literature nearly
50 years ago (Mosteller 1949). The problem we consider has arisen in the literature in a
variety of applications, including the following.
(i) In a company-sponsored contest, many entries (say several hundred thousand) have
been received. It is known that some people have entered more than once. The goal is
to estimate the number of different people who have entered from a sample of entries
(Mosteller 1949; Sudman 1976).
(ii) A sampling frame is constructed by combining a number of lists that may contain
overlapping entries. It is desired to estimate, using a sample from all lists, the number
of units on the combined list (Deming and Glasser 1959; Goodman 1952; Kish 1965,
Sec. 11.2; Sudman 1976, Sec. 3.6). An important example of such a problem is an
"administrative records census," currently under study by the U.S. Bureau of the
Census. In such a census, several administrative files (such as AFDC or IRS records)
are combined, and the total number of distinct individuals included in the combined
file is determined. Exact computation of the number of distinct individuals in the
combined file is extremely expensive because of the high cost of determining the
number of duplicated entries. A similar problem and proposed solution was discussed
in the London Financial Times (March 2, 1949) by C. F. Carter, who was interested
in estimating the number of different investors in British industrial stocks based on
samples from share registers of companies (Mosteller 1949).
(iii) In a relational database system, data are organized in tables called relations (see,
e.g., Korth and Silberschatz 1991, Chap. 3). In a typical relation, each row might
represent a record for an individual employee in a company, and each column might
correspond to a different attribute of the employee, such as salary, years of experience,
department number, and so forth. A relational query specifies an output relation that
is to be computed from the set of base relations stored by the system. Knowledge
of the number of distinct values for each attribute in the base relations is central
to determining the most efficient method for computing a specified output relation
(Hellerstein and Stonebraker 1994; Selinger, Astrahan, Chamberlin, Lorie, and Price
1979). The size of the base relations in modern database systems often is so large
that exact computation of the distinct-value parameters is prohibitively expensive,
and thus estimation of these parameters is desired (Astrahan, Schkolnick, and Whang
1987; Flajolet and Martin 1985; Gelenbe and Gardy 1982; Hou, Ozsoyoglu, and Taneja
1988, 1989; Naughton and Seshadri 1990; Ozsoyoglu, Du, Tjahjana, Hou, and Rowland
1991; Whang, Vander-Zanden, and Taylor 1990).
In each of these applications, the size of the population (number of contest entries, total
number of units over all lists, and number of rows in the base relation) is known, and this
size is too large for easy computation of the number of classes.
The problem studied in this article can be described formally as follows. A population
of size N consists of D mutually disjoint classes of items, labelled C_1, C_2, \ldots, C_D. Define
N_j to be the size of class C_j, so that N = \sum_{j=1}^{D} N_j. A simple random sample of n
items is selected (without replacement) from the population. This sample includes n_j items
from class C_j. The problem we consider is that of estimating D using information from
the sample along with knowledge of the value of N. We denote by F_i the number of
classes of size i in the population, so that D = \sum_{i=1}^{N} F_i. Similarly, we denote by f_i the
number of classes represented exactly i times in the sample and by d the total number of
classes represented in the sample. Thus d = \sum_{i=1}^{n} f_i and \sum_{i=1}^{n} i f_i = n. Define vectors
N = (N_1, N_2, \ldots, N_D), n = (n_1, n_2, \ldots, n_D), and f = (f_1, f_2, \ldots, f_n). Note that n is not
observable, but f is. Because we sample without replacement, the random vector n has a
multivariate hypergeometric distribution with probability mass function

    P(n \mid D, N) = \binom{N_1}{n_1} \binom{N_2}{n_2} \cdots \binom{N_D}{n_D} \Big/ \binom{N}{n}.   (1)

The probability mass function of the observable random vector f is simply P(n | D, N)
summed over all points n that correspond to f:

    P(f \mid D, N) = \sum_{S} P(n \mid D, N),

where S = \{ n : \#\{ j : n_j = i \} = f_i for 1 \le i \le D \}. The probability mass function P(f | D, N)
does not have a closed-form expression in general.
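The bookkeeping behind d and the frequency-of-frequencies vector f is straightforward. The following Python sketch (function and variable names are ours, not from the paper) tabulates both from a raw sample of class labels:

```python
from collections import Counter

def frequency_stats(sample):
    """Compute d (number of distinct classes seen) and the
    frequency-of-frequencies vector f, where f[i] is the number of
    classes represented exactly i times in the sample."""
    class_counts = Counter(sample)        # n_j for each observed class
    d = len(class_counts)                 # classes represented in the sample
    f = Counter(class_counts.values())    # f_i: classes seen exactly i times
    return d, dict(f)

# A toy sample of n = 10 items: a and c appear 3 times, e twice, b and d once.
sample = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "a"]
d, f = frequency_stats(sample)
# d = 5 and f = {3: 2, 2: 1, 1: 2}; note that sum_i i*f_i = 10 = n.
```

The identity \sum_i i f_i = n from the text provides a cheap consistency check on any such tabulation.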
In Section 2 we review the estimators that have been proposed for estimating D from
data generated under model (1). In Section 3 we provide several new estimators of D based
on an extension of the generalized jackknife approach of Gray and Schucany (1972). We
then show that generalized jackknife estimators of the number of classes in a population are
closely related to certain "Horvitz-Thompson" estimators, to an estimator due to Shlosser
(1981), and to estimators based on the notion of "sample coverage" (Chao and Lee 1992).
In Section 4 we provide and compare approximate expressions for the asymptotic variance
of several of the estimators, and in Section 5 apply our formulas to a well-known example
from the literature. We provide a simulation-based empirical comparison of the various
estimators in Section 6, and summarize our results and give recommendations in Section 7.
2. Previous Estimators
Bunge and Fitzpatrick (1993) mention only two non-Bayesian estimators that have been
developed as estimators of D under model (1). These are the estimators of Goodman (1949)
and Shlosser (1981). Goodman proved that

    \hat{D}_{Good1} = d + \sum_{i=1}^{n} (-1)^{i+1} \frac{(N - n + i - 1)! \, (n - i)!}{(N - n - 1)! \, n!} f_i

is the unique unbiased estimator of D when n > M \stackrel{def}{=} \max(N_1, N_2, \ldots, N_D). He further
proved that no unbiased estimator of D exists when n \le M. Unfortunately, unless the
sampling fraction is quite large, the variance of D̂_Good1 is so great and the numerical difficulties encountered when computing D̂_Good1 are so severe that the estimator is unusable.
Goodman, who made note of the high variance of D̂_Good1 himself, suggested the alternative
estimator

    \hat{D}_{Good2} = N - \frac{N(N-1)}{n(n-1)} f_2

for overcoming the variance problem. Although D̂_Good2 has lower variance than D̂_Good1, it
can take on negative values and can have a large bias for any n if D is small. For example,
consider the case in which D = 1 and n > 2, and observe that f_2 = 0 and D̂_Good2 = N.
Under the assumption that the population size N is large and the sampling fraction
q = n/N is nonnegligible, Shlosser (1981) derived the estimator

    \hat{D}_{Sh} = d + f_1 \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.

For the two examples considered in his paper, Shlosser found that use of D̂_Sh with a 10%
sampling fraction resulted in an error rate below 20%. In our experiments, however, we
observed root mean squared errors (rmse's) exceeding 200%, even for well-behaved populations with relatively little variation among the class sizes (see Sec. 6). Considering the
relationship between D̂_Sh and generalized jackknife estimators (see Sec. 3.4) provides insight
into the source of this erratic behavior and suggests some possible modifications of D̂_Sh to
improve performance.
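As an illustration, D̂_Sh is simply a ratio of two weighted sums over the f_i. A minimal Python sketch (names are ours; it assumes q < 1 or f_1 > 0, so the denominator is nonzero):

```python
from collections import Counter

def shlosser_estimator(sample, N):
    """Shlosser's estimator:
    D_Sh = d + f1 * sum_i (1-q)^i f_i / sum_i i*q*(1-q)^(i-1) f_i."""
    counts = Counter(sample)          # n_j for each observed class
    f = Counter(counts.values())      # frequency-of-frequencies vector
    d, n = len(counts), len(sample)
    q = n / N                         # sampling fraction
    f1 = f.get(1, 0)
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    return d + f1 * num / den
```

For example, with the sample ["a", "b", "b", "c"] drawn from a population of size N = 8, the sketch returns d = 3 plus a positive correction driven by the two singleton classes.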
In related work, Burnham and Overton (1978, 1979) proposed a family of (traditional)
generalized jackknife estimators for estimating the size of a closed population when capture
probabilities vary among animals. The D individuals in the population play the role of our
D classes; a given individual can appear up to n times in the overall sample if captured
on one or more of n possible trapping occasions. The capture probability for an individual
is assumed to be constant over time, and the capture probabilities for the D individuals
are modeled as D iid random samples from a fixed probability distribution. Burnham and
Overton's sample design is clearly different from model (1). Under the Burnham and Overton model, for example, the quantities f_1, f_2, \ldots, f_n have a joint multinomial distribution.
Closely related to the work of Burnham and Overton are the ordinary jackknife estimators
of the number of species in a closed region developed by Heltshe and Forrester (1983) and
Smith and van Belle (1984). The sample data consist of a list of the species that appear in
each of n quadrats. (The number of times that a species is represented in a quadrat is not
recorded.) This setup is essentially identical to that of Burnham and Overton, with the D
species playing the role of the D individuals and the n quadrats playing the role of the n
trapping occasions.
3. Generalized Jackknife Estimators
In this section we outline an extension of the generalized jackknife approach to bias
reduction and then use this approach to derive new estimators for the number of classes
in a finite population. We also point out connections between our generalized jackknife
approach and several other estimation approaches in the literature.
3.1. The Generalized Jackknife Approach
Let θ be an unknown real-valued parameter. A generalized jackknife estimator of θ is
an estimator of the form

    G(\hat{\theta}_1, \hat{\theta}_2) = \frac{\hat{\theta}_1 - R \hat{\theta}_2}{1 - R},   (2)

where θ̂_1 and θ̂_2 are biased estimators of θ and R (≠ 1) is a real number (Gray and Schucany
1972). The idea underlying the generalized jackknife approach is to try to choose R such
that G(θ̂_1, θ̂_2) has lower bias than either θ̂_1 or θ̂_2. To motivate the choice of R, observe that
for

    R = \frac{E[\hat{\theta}_1] - \theta}{E[\hat{\theta}_2] - \theta},   (3)

the estimator G(θ̂_1, θ̂_2) is unbiased for θ. This optimal value of R is typically unknown,
however, and can only be approximated, resulting in bias reduction but not complete bias
elimination. In the following, we extend the original definition of the generalized jackknife
given by Gray and Schucany (1972) by allowing R to depend on the data; that is, we allow
R to be random.
Recall that d is the number of classes represented in the sample. Write d_n for d to
emphasize the dependence of d on the sample size, and denote by d_{n-1}(k) the number of
classes represented in the sample after the kth observation has been removed. Set

    d^{(n-1)} = \frac{1}{n} \sum_{k=1}^{n} d_{n-1}(k).

We focus on generalized jackknife estimators that are obtained by taking θ̂_1 = d_n and
θ̂_2 = d^{(n-1)} in (2); these are the usual choices for θ̂_1 and θ̂_2 in the classical first-order
jackknife estimator (Miller 1974). Observe that d_{n-1}(k) = d_n - 1 if the class for the
kth observation is represented only once in the sample; otherwise, d_{n-1}(k) = d_n. Thus
d^{(n-1)} = d_n - (f_1/n) and, by (2), G(θ̂_1, θ̂_2) = D̂, where

    \hat{D} = d_n + K \frac{f_1}{n}   (4)

and K = R/(1 - R). It follows from (3) that the optimal choice of K is

    K = \frac{E[d_n] - D}{E[d^{(n-1)}] - E[d_n]} = \frac{D - E[d_n]}{E[f_1]/n}.   (5)
To derive a more explicit formula for K, denote by I[A] the indicator of event A and observe
that

    E[d_n] = E\left[ \sum_{j=1}^{D} I[n_j > 0] \right] = \sum_{j=1}^{D} P\{ n_j > 0 \} = D - \sum_{j=1}^{D} P\{ n_j = 0 \}.

Similar reasoning shows that

    E[f_1] = \sum_{j=1}^{D} P\{ n_j = 1 \},   (6)

so that

    K = n \frac{\sum_{j=1}^{D} P\{ n_j = 0 \}}{\sum_{j=1}^{D} P\{ n_j = 1 \}}.   (7)
Following Shlosser (1981), we focus on the case in which the population size N is large and
the sampling fraction q = n/N is nonnegligible, and we make the approximation

    P\{ n_j = k \} \approx \binom{N_j}{k} q^k (1-q)^{N_j - k}   (8)

for 0 \le k \le n and 1 \le j \le D. That is, the probability distribution of each n_j is approximated by the probability distribution of n_j under a Bernoulli sample design in which each
item is included in the sample with probability q, independently of all other items in the
population. Use of this approximation leads to estimators that behave almost identically to
estimators derived using the exact distribution of n but are simpler to compute and derive
(see App. A for further discussion). Substituting (8) into (7), we obtain

    K \approx n \frac{\sum_{j=1}^{D} (1-q)^{N_j}}{\sum_{j=1}^{D} N_j q (1-q)^{N_j - 1}}.   (9)
The quantity K defined in (9) depends on unknown parameters N_1, N_2, \ldots, N_D that are
difficult to estimate. Our approach is to approximate K by a function of D and of other
parameters that are easier to estimate, thereby obtaining an approximate version of (4). The
estimates for these parameters, including D̂ for D, are then substituted into the approximate
version of (4) and the resulting equation is solved for D̂.
We also consider "smoothed" jackknife estimators. The idea is to replace the quantity
f_1/n in (4) by its expected value E[f_1]/n in the hope that the resulting estimator of D will
be more stable than the original "unsmoothed" estimator. As with the parameter K, the
quantity E[f_1]/n depends on the unknown parameters N_1, N_2, \ldots, N_D; see (6) and (8).
Thus our approach to estimating E[f_1]/n is the same as our approach to estimating K.
Estimators also can be based on high-order jackknifing schemes that consider the number of distinct values in the sample when two elements are removed, when three elements
are removed, and so forth. Typically, using a high-order jackknifing scheme requires estimating high-order moments (skewness, kurtosis, and so forth) of the set of numbers
\{ N_1, N_2, \ldots, N_D \}. Initial experiments indicated that the reduction in estimation error
due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates. Thus we do not pursue high-order jackknife schemes
further.
3.2. The Estimators
Different approximations for K and E[f_1]/n lead to different estimators for D. Here
we develop a number of the possible estimators.
3.2.1. First-Order Estimators The simplest estimators of D can be derived using a
first-order approximation to K. Specifically, approximate each N_j in (9) by the average
value

    \bar{N} = \frac{1}{D} \sum_{j=1}^{D} N_j = \frac{N}{D}

and substitute the resulting expression for K into (4) to obtain

    \hat{D} = d_n + \frac{(1-q) f_1 D}{n}.   (10)

Now substitute D̂ for D on the right side of (10) and solve for D̂. The resulting solution,
denoted by D̂_uj1, is given by

    \hat{D}_{uj1} = \left( 1 - \frac{(1-q) f_1}{n} \right)^{-1} d_n.   (11)

We refer to this estimator as the "unsmoothed first-order jackknife estimator."
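Equation (11) is a one-line computation once d_n, f_1, n, and N are in hand. A Python sketch (the function name is ours):

```python
def d_uj1(d_n, f1, n, N):
    """Unsmoothed first-order jackknife estimator, eq. (11):
    D_uj1 = (1 - (1-q)*f1/n)^(-1) * d_n, with q = n/N."""
    q = n / N
    return d_n / (1.0 - (1.0 - q) * f1 / n)
```

Note that when q = 1 (a census) the correction factor vanishes and the estimate reduces to d_n, in line with the consistency property D̂_uj1 → D as q → 1.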
To derive a "smoothed first-order jackknife estimator," observe that by (6) and (8),

    \frac{E[f_1]}{n} \approx \frac{1}{n} \sum_{j=1}^{D} N_j q (1-q)^{N_j - 1}.   (12)

Approximating each N_j in (12) by N̄, we have

    \frac{E[f_1]}{n} \approx (1-q)^{\bar{N} - 1}.   (13)

On the right side of (10), replace f_1/n with the approximate expression for E[f_1]/n given
in (13), yielding

    \hat{D} = d_n + D (1-q)^{\bar{N}}.

Replacing D with D̂ and N̄ with N/D̂ in the foregoing expression leads to the relation

    \hat{D}\left( 1 - (1-q)^{N/\hat{D}} \right) = d_n.

We define the smoothed first-order jackknife estimator D̂_sj1 as the value of D̂ that solves
this equation. Given d_n, n, and N, D̂_sj1 can be computed numerically using standard
root-finding procedures. Observe that if in fact N_1 = N_2 = \cdots = N_D = N/D, then

    E[d_n] \approx D\left( 1 - (1-q)^{N/D} \right).

In this case D̂_sj1 can be viewed as a simple method-of-moments estimator obtained by
replacing E[d_n] with the estimate d_n and solving for D. If, moreover, the sampling fraction
q is small enough so that the distribution of (n_1, n_2, \ldots, n_D) is approximately multinomial
(see Sec. 3.3), then D̂_sj1 is approximately equal to the maximum likelihood estimator for
D (see Good 1950). Observe that both D̂_uj1 and D̂_sj1 are consistent for D: D̂_uj1 → D and
D̂_sj1 → D as q → 1.
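The defining equation for D̂_sj1 can be solved by any standard bracketing root finder; the bisection sketch below (our own, not the authors' numerical procedure) exploits the fact that g(D) = D(1 − (1−q)^{N/D}) − d_n is increasing with g(d_n) ≤ 0 and g(N) = n − d_n ≥ 0, so [d_n, N] always brackets the root:

```python
def d_sj1(d_n, n, N, iters=200):
    """Smoothed first-order jackknife estimator: the root D of
    D * (1 - (1-q)**(N/D)) = d_n, with q = n/N, found by bisection
    on the bracket [d_n, N]."""
    q = n / N
    g = lambda D: D * (1.0 - (1.0 - q) ** (N / D)) - d_n
    lo, hi = float(d_n), float(N)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The method-of-moments interpretation in the text gives a direct check: for equal class sizes, feeding in d_n = D(1 − (1−q)^{N/D}) must recover D exactly.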
3.2.2. Second-Order Estimators A second-order approximation to K can be derived
as follows. Denote by γ² the squared coefficient of variation of the class sizes N_1, N_2, \ldots, N_D:

    \gamma^2 = \frac{(1/D) \sum_{j=1}^{D} (N_j - \bar{N})^2}{\bar{N}^2}.   (14)

Suppose that γ² is relatively small, so that each N_j is close to the average value N̄. Substitute the Taylor approximations

    (1-q)^{N_j} \approx (1-q)^{\bar{N}} + (1-q)^{\bar{N}} \ln(1-q) (N_j - \bar{N})

and

    N_j q (1-q)^{N_j - 1} \approx N_j q \left( (1-q)^{\bar{N} - 1} + (1-q)^{\bar{N} - 1} \ln(1-q) (N_j - \bar{N}) \right)

for 1 \le j \le D into (9) to obtain

    K \approx \frac{D(1-q)}{1 + \ln(1-q) \bar{N} \gamma^2} \approx D(1-q) \left( 1 - \ln(1-q) \bar{N} \gamma^2 \right).   (15)

The unknown parameter γ² can be estimated using the following approach (cf. Chao and
Lee 1992). With the usual convention that \binom{n}{m} = 0 for n < m, we find that

    \sum_{i=1}^{N} i(i-1) E[f_i] \approx \sum_{i=1}^{N} i(i-1) \sum_{j=1}^{D} \binom{N_j}{i} q^i (1-q)^{N_j - i}
        = q^2 \sum_{j=1}^{D} N_j (N_j - 1) \sum_{i=2}^{N_j} \binom{N_j - 2}{i - 2} q^{i-2} (1-q)^{N_j - i}
        = q^2 \sum_{j=1}^{D} N_j (N_j - 1),

so that

    \gamma^2 \approx \frac{D}{n^2} \sum_{i=1}^{N} i(i-1) E[f_i] + \frac{D}{N} - 1.

Thus if D were known, then a natural method-of-moments estimator γ̂²(D) of γ² would be

    \hat{\gamma}^2(D) = \max\left\{ 0, \; \frac{D}{n^2} \sum_{i=1}^{n} i(i-1) f_i + \frac{D}{N} - 1 \right\}.
To develop a second-order estimate of D, substitute (15) into (4) to obtain

    \hat{D} = d_n + \frac{D f_1 (1-q)}{n} \left( 1 - \ln(1-q) \bar{N} \gamma^2 \right),   (16)

from which it follows that

    \hat{D} = d_n + \frac{D f_1 (1-q)}{n} - \frac{f_1 (1-q) \ln(1-q) \gamma^2}{q}.   (17)

Replacing D with D̂ on the right side of this equation and solving for D̂ yields the relation

    \hat{D} = \left( 1 - \frac{f_1 (1-q)}{n} \right)^{-1} \left( d_n - \frac{f_1 (1-q) \ln(1-q) \gamma^2}{q} \right).   (18)

An estimator of D can be obtained by substituting γ̂²(D̂) for γ² in (18) and solving for
D̂ numerically. Alternatively, we can start with a simple initial estimator of D and then
correct this estimator using (18). Following this latter approach, we use D̂_uj1 as our initial
estimator and define

    \hat{D}_{uj2} = \left( 1 - \frac{f_1 (1-q)}{n} \right)^{-1} \left( d_n - \frac{f_1 (1-q) \ln(1-q) \, \hat{\gamma}^2(\hat{D}_{uj1})}{q} \right).

A smoothed second-order jackknife estimator can be obtained by replacing the expression f_1/n in (17) with the approximation to E[f_1]/n given in (13), leading to

    \hat{D} = d_n + D (1-q)^{\bar{N}} \left( 1 - \ln(1-q) \bar{N} \gamma^2 \right).

Replacing D with D̂ and proceeding as before, we obtain the estimator

    \hat{D}_{sj2} = \left( 1 - (1-q)^{\tilde{N}} \right)^{-1} \left( d_n - (1-q)^{\tilde{N}} \ln(1-q) \, N \, \hat{\gamma}^2(\hat{D}_{uj1}) \right),

where Ñ = N/D̂_uj1. As with the first-order estimators D̂_uj1 and D̂_sj1, the second-order
estimators D̂_uj2 and D̂_sj2 are consistent for D.
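The two-stage recipe for D̂_uj2 (estimate D̂_uj1 first, then plug its γ̂² into the bias-corrected formula) can be sketched as follows in Python (names are ours; the sketch assumes 0 < q < 1):

```python
import math
from collections import Counter

def gamma_sq(D, f, n, N):
    """Method-of-moments estimate of the squared CV of the class sizes:
    max{0, (D/n^2) * sum_i i(i-1) f_i + D/N - 1}."""
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D * s / n ** 2 + D / N - 1.0)

def d_uj2(sample, N):
    """Unsmoothed second-order jackknife estimator, using D_uj1 as the
    initial estimate that feeds the gamma^2 bias correction."""
    counts = Counter(sample)
    f = Counter(counts.values())
    d, n = len(counts), len(sample)
    q = n / N
    f1 = f.get(1, 0)
    duj1 = d / (1.0 - (1.0 - q) * f1 / n)     # first-order estimate (11)
    g2 = gamma_sq(duj1, f, n, N)
    # log(1-q) < 0, so this correction term is <= 0 and inflates d_n.
    correction = f1 * (1.0 - q) * math.log(1.0 - q) * g2 / q
    return (d - correction) / (1.0 - (1.0 - q) * f1 / n)
```

When γ̂² = 0 (no detectable variation in class sizes) the correction vanishes and D̂_uj2 collapses to D̂_uj1, which is the intended behavior for equal-class-size populations.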
3.2.3. Horvitz-Thompson Jackknife Estimators In this section we discuss an alternative approach to estimation of K based on a technique of Horvitz and Thompson.
(See Sarndal, Swensson, and Wretman 1992 for a general discussion of Horvitz-Thompson
estimators.)
First, consider the general problem of estimating a parameter of the form
Θ(g) = \sum_{j=1}^{D} g(N_j), where g is a specified function. Observe that because P\{ n_j > 0 \} > 0
for 1 \le j \le D, we have Θ(g) = E[X(g)], where

    X(g) = \sum_{j=1}^{D} \frac{g(N_j) I(n_j > 0)}{P\{ n_j > 0 \}} = \sum_{\{ j : n_j > 0 \}} \frac{g(N_j)}{P\{ n_j > 0 \}}.

It follows from (8) that P\{ n_j > 0 \} \approx 1 - (1-q)^{N_j}, and the foregoing discussion suggests
that we estimate Θ(g) by

    \hat{\Theta}(g) = \sum_{\{ j : n_j > 0 \}} \frac{g(\hat{N}_j)}{1 - (1-q)^{\hat{N}_j}},   (19)

where N̂_j is an estimator for N_j. The key point is that we need to estimate N_j only when
n_j > 0. To do this, observe that

    E[n_j \mid n_j > 0] = \frac{E[n_j]}{P\{ n_j > 0 \}} \approx \frac{q N_j}{1 - (1-q)^{N_j}}.

Replacing E[n_j | n_j > 0] with n_j leads to the estimating equation

    n_j = \frac{q N_j}{1 - (1-q)^{N_j}},   (20)

and a method-of-moments estimator N̂_j can be defined as the value of N_j that solves (20).
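The estimating equation (20) has a unique positive root: the right side is increasing in N_j, tends to a value below 1 as N_j → 0, and is bounded below by qN_j, so the root lies in (0, n_j/q]. A bisection sketch (our own; not a procedure specified in the text):

```python
def solve_class_size(n_j, q, iters=200):
    """Solve the estimating equation (20), n_j = q*Nj / (1 - (1-q)**Nj),
    for Nj by bisection.  h(Nj) is increasing and h(Nj) >= q*Nj, so the
    root is bracketed by (0, n_j/q]."""
    h = lambda x: q * x / (1.0 - (1.0 - q) ** x)
    lo, hi = 1e-12, n_j / q
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) < n_j:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For q = 1 the equation reduces to n_j = N_j, and the root correctly equals the observed count, since every class member is then sampled.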
Now consider the problem of estimating K, and hence D. By (9), K \approx Θ(f)/Θ(g), where
f(x) = (1-q)^x and g(x) = x q (1-q)^{x-1}/n. Thus a natural estimator of K is given by
Θ̂(f)/Θ̂(g), leading to the final estimator

    \hat{D}_{HTj} = d_n + \frac{\hat{\Theta}(f)}{\hat{\Theta}(g)} \frac{f_1}{n}.

A smoothed variant of D̂_HTj can be obtained by replacing f_1/n with the Horvitz-Thompson
estimator of E[f_1]/n, namely Θ̂(g). The resulting estimator, denoted by D̂_HTsj, is given by

    \hat{D}_{HTsj} = d_n + \hat{\Theta}(f).

Finally, a hybrid estimator can be obtained using a first-order approximation for the numerator of K and a Horvitz-Thompson estimator for the denominator. This leads to the
estimator D̂_hj, defined as the solution D̂ of the equation

    \hat{D}\left( 1 - \frac{f_1 (1-q)^{N/\hat{D}}}{n \hat{\Theta}(g)} \right) = d_n.

If we replace f_1/n with the Horvitz-Thompson estimator for E[f_1]/n in the foregoing
equation in order to obtain a smoothed variant of D̂_hj, then the resulting estimator coincides
with D̂_sj1.
Because D = Θ(u), where u(x) ≡ 1, it may appear that a "non-jackknife" Horvitz-Thompson estimator D̂_HT can be defined by setting D̂_HT = Θ̂(u). It is straightforward
to show, however, that D̂_HT = D̂_HTsj, so that D̂_HT can in fact be viewed as a smoothed
jackknife estimator.
Simulation experiments indicate that the behavior of the Horvitz-Thompson jackknife
estimators D̂_HTj and D̂_HTsj is erratic (see App. D for detailed results). Overall, the poor
performance of D̂_HTj and D̂_HTsj is caused by inaccurate estimation of Θ(f). The problem
seems to be that when N_j is small, the estimator N̂_j is unstable and yet typically has a
large effect on the value of Θ̂(f) through the term (1-q)^{\hat{N}_j} \big/ \left( 1 - (1-q)^{\hat{N}_j} \right). The estimator
D̂_hj uses a Taylor approximation in place of Θ̂(f) and hence has lower bias and rmse than
the other two Horvitz-Thompson jackknife estimators. However, other estimators perform
better than D̂_hj, and we do not consider the estimators D̂_HTj, D̂_HTsj, and D̂_hj further.
3.3. Relation to Estimators Based on Sample Coverage
The generalized jackknife approach for deriving an estimator of D works for sample
designs other than hypergeometric sampling. For example, the most thoroughly studied
version of the number-of-classes problem is that in which the population is assumed to
be infinite and n is assumed to have a multinomial distribution with parameter vector
π = (π_1, π_2, \ldots, π_D); that is,

    P(n \mid D, \pi) = \binom{n}{n_1 \, n_2 \, \cdots \, n_D} \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_D^{n_D}.   (21)

When we proceed as in Section 3.1 to derive a generalized jackknife estimator under the
model in (21), the estimator turns out to be nearly identical to the "coverage-based" estimator proposed by Chao and Lee (1992). To see this, start again with (4) and select K as
in (5). Because

    E[d_n] - D = -\sum_{j=1}^{D} (1 - \pi_j)^n

under the model in (21), it follows that

    K = \frac{\sum_{j=1}^{D} v_n(\pi_j)}{\sum_{j=1}^{D} \pi_j v_{n-1}(\pi_j)},

where v_n(x) = (1-x)^n. Set π̄ = 1/D and use the Taylor approximations

    v_n(\pi_j) \approx v_n(\bar{\pi}) + (\pi_j - \bar{\pi}) v_n'(\bar{\pi})

and

    \pi_j v_{n-1}(\pi_j) \approx \pi_j \left( v_{n-1}(\bar{\pi}) + (\pi_j - \bar{\pi}) v_{n-1}'(\bar{\pi}) \right)

in a manner analogous to the derivation in Section 3.2.2 to obtain

    K \approx (D - 1) + (n - 1) \gamma^2,   (22)

where \gamma^2 = -1 + D \sum_{j=1}^{D} \pi_j^2 is the squared coefficient of variation of the numbers π_1, π_2, \ldots,
π_D. Denote by D̂_mult the estimator of D under the multinomial model. Then, by (4),

    \hat{D}_{mult} = d_n + \left( (D - 1) + (n - 1) \gamma^2 \right) \frac{f_1}{n}.   (23)

Replace D with D̂_mult and γ² with an estimator γ̃² in (23) and solve for D̂_mult to obtain

    \hat{D}_{mult} = \frac{d_n}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}} \left( \frac{n-1}{n} \tilde{\gamma}^2 - \frac{1}{n} \right),

where Ĉ = 1 - (f_1/n). When the sample size n is large, the estimator D̂_mult is essentially
the same as the estimator

    \hat{D}_{CL} = \frac{d_n}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}} \tilde{\gamma}^2

proposed by Chao and Lee (1992). The estimator D̂_CL was developed from a different
point of view, using the concept of sample coverage. The sample coverage for an infinite
population is defined as \sum_{j=1}^{D} \pi_j I[n_j > 0], and the quantity Ĉ = 1 - (f_1/n) is a standard
estimator of the sample coverage.
Conversely, when Chao and Lee's derivation is modified to account for hypergeometric sampling, the resulting estimator is equal to D̂_uj2 (see App. B). Thus at least some
estimators based on sample coverage can be viewed as generalized jackknife estimators.
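A coverage-based estimate of the D̂_CL form is easy to sketch in Python. The CV estimate γ̃² used below is one common choice from Chao and Lee (1992) and is an assumption of this sketch, not a formula given in the text; all names are ours:

```python
from collections import Counter

def chao_lee(sample):
    """Coverage-based estimate D_CL = d/C + n(1-C)/C * g2, where
    C = 1 - f1/n estimates the sample coverage.  The CV estimate g2
    below is one common Chao-Lee choice (an assumption of this sketch)."""
    counts = Counter(sample)
    f = Counter(counts.values())
    d, n = len(counts), len(sample)
    f1 = f.get(1, 0)
    C = 1.0 - f1 / n                     # sample-coverage estimate (f1 < n assumed)
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    g2 = max(0.0, (d / C) * s / (n * (n - 1)) - 1.0)
    return d / C + n * (1.0 - C) / C * g2
```

When the estimated CV term is clipped to zero, the formula reduces to the pure coverage estimate d_n/Ĉ, which is the multinomial analogue of the smoothed first-order behavior discussed above.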
3.4. Relation to Shlosser's Estimator
Observe that the estimator D̂_Sh, though not developed from a jackknife perspective, can
be viewed as an estimator of the form (4) with K estimated by

    \hat{K}_{Sh} = n \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.

To analyze the behavior of D̂_Sh, we first rewrite the jackknife quantity K defined in (9) as
follows:

    K \approx n \frac{\sum_{i=1}^{N} (1-q)^i F_i}{\sum_{i=1}^{N} i q (1-q)^{i-1} F_i}.   (24)

Shlosser's justification of D̂_Sh assumes that

    \frac{E[f_i]}{E[f_1]} \approx \frac{F_i}{F_1}   (25)

for 1 \le i \le N. When the assumption in (25) holds and the sample size is large enough so
that

    f_i \approx E[f_i]   (26)

for 1 \le i \le N,

    \hat{K}_{Sh} \approx n \frac{\sum_{i=1}^{N} (1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]}
        = n \frac{\sum_{i=1}^{N} (1-q)^i E[f_i]/E[f_1]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]/E[f_1]}
        \approx n \frac{F_1^{-1} \sum_{i=1}^{N} (1-q)^i F_i}{F_1^{-1} \sum_{i=1}^{N} i q (1-q)^{i-1} F_i} = K,

so that D̂_Sh behaves as a generalized jackknife estimator. Although the relations in (25)
and (26) hold exactly for n = N (implying that D̂_Sh is consistent for D), these relations
can fail drastically for smaller sample sizes. For example, when F_1 = 0 and F_i > 0 for some
i > 1, the right side of (25) is infinite, whereas the left side is finite for n sufficiently small.
This observation leads one to expect that D̂_Sh will not perform well when the sample size is
relatively small and N_1, N_2, \ldots, N_D have similar values (with N_j > 1 for each j). Both the
variance analysis in Section 4 and the simulation experiments described in Section 6 bear
out this conjecture.
The foregoing discussion suggests that replacing K̂_Sh with

    \hat{K}_{Sh}' = \frac{K}{E[\hat{K}_{Sh}]} \hat{K}_{Sh}   (27)

in the formula for D̂_Sh might result in an improved estimator, because K̂'_Sh is unbiased
for K. Of course we cannot perform this replacement exactly, since K and E[K̂_Sh] are
unknown, but we can approximate K̂'_Sh as follows. Using the fact that

    E[f_r] = \sum_{j=1}^{D} P\{ n_j = r \} \approx \sum_{j=1}^{D} \binom{N_j}{r} q^r (1-q)^{N_j - r} = \sum_{i=r}^{N} \binom{i}{r} q^r (1-q)^{i-r} F_i   (28)

for 1 \le r \le n, we have, to first order,

    E[\hat{K}_{Sh}] \approx n \frac{\sum_{i=1}^{N} (1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]}
        = n \frac{\sum_{i=1}^{N} (1-q)^i \left[ (1+q)^i - 1 \right] F_i}{\sum_{i=1}^{N} i q^2 (1-q^2)^{i-1} F_i}.   (29)

Using the first-order approximation N_1 = N_2 = \cdots = N_D = N̄ together with (24), (27),
and (29), we find that

    \hat{K}_{Sh}' \approx \left( \frac{q (1+q)^{\bar{N} - 1}}{(1+q)^{\bar{N}} - 1} \right) \hat{K}_{Sh}.

We thus obtain a modified Shlosser estimator given by

    \hat{D}_{Sh2} = d_n + f_1 \left( \frac{q (1+q)^{\tilde{N} - 1}}{(1+q)^{\tilde{N}} - 1} \right) \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i},

where Ñ is an initial estimate of the average class size N̄ based on an initial estimate of D.
We set Ñ equal to N/D̂_uj1 throughout. As with D̂_Sh, the estimator D̂_Sh2 is consistent for D.
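The estimator D̂_Sh2 is the original Shlosser ratio rescaled by the correction factor derived above. A Python sketch (names are ours; Ñ = N/D̂_uj1 as in the text):

```python
from collections import Counter

def shlosser2(sample, N):
    """Modified Shlosser estimator D_Sh2: Shlosser's ratio rescaled by
    q(1+q)^(Nt-1) / ((1+q)^Nt - 1), with Nt = N / D_uj1 an initial
    estimate of the mean class size."""
    counts = Counter(sample)
    f = Counter(counts.values())
    d, n = len(counts), len(sample)
    q = n / N
    f1 = f.get(1, 0)
    duj1 = d / (1.0 - (1.0 - q) * f1 / n)   # initial estimate (11)
    Nt = N / duj1                            # estimated mean class size
    scale = q * (1.0 + q) ** (Nt - 1.0) / ((1.0 + q) ** Nt - 1.0)
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    return d + f1 * scale * num / den
```

Because the scale factor lies strictly below 1 for q < 1, the correction always shrinks Shlosser's original adjustment, which is the mechanism that tames the instability discussed above.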
An alternative consistent estimator of D can be obtained by directly using the expressions in (24), (27), and (29) with F_i estimated by

    \hat{F}_i = \frac{f_1 f_i}{\sum_{k=1}^{n} k q (1-q)^{k-1} f_k}   (30)

for 1 \le i \le N; these estimators of F_1, F_2, \ldots, F_N were proposed by Shlosser (1981) in
conjunction with the estimator D̂_Sh. Substituting the resulting estimators of K and E[K̂_Sh]
into (27) leads to the final estimator

    \hat{D}_{Sh3} = d_n + f_1 \left( \frac{\sum_{i=1}^{n} i q^2 (1-q^2)^{i-1} f_i}{\sum_{i=1}^{n} (1-q)^i \left[ (1+q)^i - 1 \right] f_i} \right) \left( \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i} \right)^2.

As with the estimator D̂_Sh, Shlosser's justification of the estimators in (30) rests on the
assumption in (25). Thus one might expect that, like D̂_Sh, the estimator D̂_Sh3 will be
unstable when the sample size is relatively small and N_1, N_2, \ldots, N_D have similar values.
On the other hand, the reduction in bias of K̂'_Sh relative to K̂_Sh leads one to expect that
D̂_Sh3 will perform better than D̂_Sh when γ² is sufficiently large. (One might be tempted
to avoid the assumption in (25) when estimating F_1, F_2, \ldots, F_N by taking a method-of-moments approach: replace E[f_r] with f_r in (28) for 1 \le r \le n and solve the resulting
set of linear equations either exactly or approximately. As pointed out by Shlosser (1981),
however, this system of equations is nearly singular, and hence extremely unstable.)
4. Variance and Variance Estimates
Consider an estimator Db that is a function of the sample only through f = (f1 ; f2 ; : : : ;
fM ), where M = max(N1 ; N2 ; : : : ; ND ). All of the estimators introduced in Section 3 are
of this type. In general, we also allow Db to depend explicitly on the population size N and
write Db = Db (f ; N ). Suppose that, for any N > 0 and nonnegative M -dimensional vector
f 6= 0, the function Db is continuously dierentiable at the point (f ; N ) and
Db (cf ; cN ) = cDb (f ; N )
(31)
for c > 0. Approximating the hypergeometric sample design by a Bernoulli sample design
as in (8), we can obtain the following approximate expression for the asymptotic variance
of Db (f ; N ) as D becomes large:
AVar[Db (f ; N )] M
X
i=1
A2i Var [fi] +
X
1i;i 0 M
i6=i
0
Ai Ai0 Cov [fi; fi0 ] ;
(32)
where Ai is the partial derivative of Db with respect to fi , evaluated at the point (f ; N ).
(When computing each Ai , we replace each occurrence of n and dn in the formula for Db by
14
PM if and PM f before taking derivatives.)
i=1 i
i=1 i
The approximation in (32) is valid when
there is not too much variability in the class sizes (see App. C for a precise formulation and
proof of this result). It follows from the proof that, to a good approximation, the variance
of an estimator Db satisfying (31) increases linearly as D increases.
Straightforward calculations show that each of the specific estimators D̂_uj1, D̂_uj2, D̂_Sh,
D̂_Sh2, and D̂_Sh3 is continuously differentiable as stated previously and also satisfies (31).
Thus we can use (32) to study the asymptotic variance of these estimators. We focus on
D̂_uj1, D̂_uj2, D̂_Sh2, and D̂_Sh3 because each of these estimators performs best for at least one
population studied in the simulation experiments described in Section 6; we also consider
D̂_Sh, because D̂_Sh is the most useful of the estimators previously proposed in the literature.
Computation of the A_i coefficients for each estimator is tedious, but straightforward. When
D̂ = D̂_uj2, for example, we obtain

    A_1^{(uj2)} = A_1^{(uj1)} - \frac{N(1-q)\ln(1-q)}{n - (1-q)f_1}
        \left[ \hat{\gamma}^2
             + f_1 \left( \frac{A_1^{(uj1)} (\hat{\gamma}^2 + 1)}{\hat{D}_{uj1}}
                        - \frac{2}{n} \left( \hat{\gamma}^2 + 1 - \frac{\hat{D}_{uj1}}{N} \right) \right)
             - \frac{(1-q)(n - f_1) f_1 \hat{\gamma}^2}{n \left( n - (1-q)f_1 \right)} \right]

and

    A_i^{(uj2)} = A_i^{(uj1)} - \frac{N(1-q)\ln(1-q)}{n - (1-q)f_1}
        \left[ f_1 \left( \frac{A_i^{(uj1)} (\hat{\gamma}^2 + 1)}{\hat{D}_{uj1}}
                        + \frac{i(i-1)\hat{D}_{uj1}}{n^2}
                        - \frac{2i}{n} \left( \hat{\gamma}^2 + 1 - \frac{\hat{D}_{uj1}}{N} \right) \right)
             + \frac{i(1-q) f_1^2 \hat{\gamma}^2}{n \left( n - (1-q)f_1 \right)} \right]

for 1 < i \le n, where \hat{\gamma}^2 = \hat{\gamma}^2(\hat{D}_{uj1}),

    A_1^{(uj1)} = \hat{D}_{uj1} \left( \frac{1}{d_n} + \frac{(1-q)\left( 1 - f_1/n \right)}{n - (1-q)f_1} \right),

and

    A_i^{(uj1)} = \hat{D}_{uj1} \left( \frac{1}{d_n} - \frac{i(1-q)(f_1/n)}{n - (1-q)f_1} \right)

for 1 < i \le n.
Figures 1 and 2 compare the variances of the estimators D̂_uj1, D̂_uj2, D̂_Sh, D̂_Sh2, and
D̂_Sh3 for a number of populations with equal class sizes. For these special populations, D̂_uj1
and D̂_uj2 are approximately unbiased, so that the relative variances of these estimators are
appropriate measures of relative performance. It is particularly instructive to compare the
variances of D̂_uj1 and D̂_uj2, since D̂_uj2 is obtained from D̂_uj1 by adjusting the latter estimator
to compensate for bias induced by the assumption of equal class sizes. This adjustment is
unnecessary for our special populations, and a comparison allows evaluation of the penalty
(i.e., the increase in variance) that is being paid for the adjustment.
[Figure 1: Standard deviation of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ as a function of the sampling fraction $q$ ($N = 15{,}000$ and $\bar{N} = 10$).]

[Figure 2: Standard deviation of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{Sh2}$ as a function of the class size $\bar{N}$ ($D = 1500$ and $q = 0.10$).]
Figure 1 displays the standard deviations of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ for an equal-class-size population with $N = 15{,}000$ and $D = 1500$ (so that $\bar{N} = 10$) as the sampling fraction $q$ varies. Observe that $\hat{D}_{uj2}$ is only slightly less efficient than $\hat{D}_{uj1}$, so that the penalty for bias adjustment is small in this case. Performance of the estimators $\hat{D}_{uj1}$ and $\hat{D}_{Sh2}$ is nearly indistinguishable. The most striking observation is that for this population, $\hat{D}_{Sh}$ and $\hat{D}_{Sh3}$ are not competitive with the other three estimators. The relative performance of $\hat{D}_{Sh}$ and $\hat{D}_{Sh3}$ is especially poor for small sampling fractions. On the other hand, the variance analysis indicates that modification of $\hat{D}_{Sh}$ as in (27) and (29) indeed reduces the instability of the original Shlosser estimator in this case. Thus we focus on the estimators $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{Sh2}$ in the remainder of this section and in the next section. (We return to the estimator $\hat{D}_{Sh3}$ in Section 6, where our simulation experiments indicate that $\hat{D}_{Sh3}$ can exhibit smaller rmse than the other estimators, but only at large sample sizes and for certain "ill-conditioned" populations in which $\gamma^2$ is extremely large.)
Figure 2 compares the three estimators $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{Sh2}$ for equal-class-size populations with a range of class sizes; for these calculations the number of classes and the sampling fraction are held constant at $D = 1500$ and $q = 0.10$. This figure illustrates the difficulty of precisely estimating $D$ when the class size is small (but greater than 1). Again, we see that these three estimators perform similarly, with nearly equal variability when $\bar{N}$ exceeds about 40.
We checked the accuracy of the variance approximation in some example populations by comparing the values computed from (32) with results of a simulation experiment. (This experiment is discussed more completely in Section 6 below.) Simulated sampling with $q = 0.05$, $0.10$, and $0.20$ from the population examined in Figure 1 ($N = 15{,}000$, $D = 1500$) yields variance estimates within 10% (on average) of those calculated from (32). Similar results were found in sampling from an equal-class-size population with $N = 15{,}000$ and $D = 150$. The only difficulties we encountered occurred for equal-class-size populations with class sizes of $\bar{N} = 1$ and $\bar{N} = 2$. For these small class sizes the variance approximation, which is based on the approximation of the hypergeometric sample design by a Bernoulli sample design, is not sufficiently accurate. In particular, the approximate variance strongly reflects random fluctuations in the sample size due to the Bernoulli sample design; such fluctuations are not present in the actual hypergeometric sample design. Simulation experiments indicate that for $\bar{N} \ge 3$ the differences caused by Bernoulli versus hypergeometric sampling become negligible. (Of course, if the sample design is in fact Bernoulli, then this problem does not occur.)
In practice, we estimate the asymptotic variance of an estimator $\hat{D}$ by substituting estimates for $\{\mathrm{Var}[f_i] : 1 \le i \le M\}$ and $\{\mathrm{Cov}[f_i, f_{i'}] : 1 \le i \ne i' \le M\}$ into (32). To obtain such estimates, we approximate the true population by a population with $D$ classes, each of size $N/D$. Under this approximation and the assumption in (8) of a Bernoulli sample design, the random vector $f$ has a multinomial distribution with parameters $D$ and $p = (p_1, p_2, \ldots, p_n)$, where
$$
p_i = \binom{N/D}{i} q^i (1-q)^{(N/D)-i}
$$
for $1 \le i \le n$. It follows that $\mathrm{Var}[f_i] = Dp_i(1-p_i)$ and $\mathrm{Cov}[f_i, f_{i'}] = -Dp_ip_{i'}$. Each $p_i$ can be estimated either by
$$
\hat{p}_i = \binom{N/\hat{D}}{i} q^i (1-q)^{(N/\hat{D})-i}
$$
or simply by $f_i/\hat{D}$. It turns out that the latter formula yields better variance estimates, and so we take
$$
\widehat{\mathrm{Var}}[f_i] = f_i\left(1 - \frac{f_i}{\hat{D}}\right)
\qquad\text{and}\qquad
\widehat{\mathrm{Cov}}[f_i, f_{i'}] = -\frac{f_i f_{i'}}{\hat{D}}
$$
for $1 \le i, i' \le n$. These formulas coincide with the estimators obtained using the "unconditional approach" of Chao and Lee (1992). A computer program that calculates $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh2}$, and their estimated standard errors from sample data can be obtained from the second author.
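To make the variance recipe concrete, the following Python sketch (our illustration, not the authors' program) computes $\hat{D}_{uj1} = d/[1-(1-q)f_1/n]$ and its delta-method standard error, plugging the estimated variances and covariances above and the $A_i^{(uj1)}$ coefficients (as reconstructed in this section) into the quadratic form (32):

```python
def d_uj1(f, N):
    """First-order unsmoothed jackknife estimate of D from the
    frequency-of-frequencies f: a dict mapping i -> f_i, the number of
    classes appearing exactly i times in the sample."""
    n = sum(i * fi for i, fi in f.items())   # sample size
    d = sum(f.values())                      # number of distinct classes observed
    q = n / N                                # sampling fraction
    f1 = f.get(1, 0)
    return d / (1.0 - (1.0 - q) * f1 / n)

def se_uj1(f, N):
    """Delta-method standard error for d_uj1: substitute
    Var[f_i] ~ f_i (1 - f_i/D_hat) and Cov[f_i, f_i'] ~ -f_i f_i'/D_hat
    into the quadratic form (32) with coefficients A_i."""
    n = sum(i * fi for i, fi in f.items())
    d = sum(f.values())
    q = n / N
    f1 = f.get(1, 0)
    D = d / (1.0 - (1.0 - q) * f1 / n)
    denom = n - (1.0 - q) * f1
    A = {i: D * (1.0 / d + (1.0 - q) * (1.0 - f1 / n) / denom) if i == 1
         else D * (1.0 / d - i * (1.0 - q) * (f1 / n) / denom)
         for i in f}
    var = sum(A[i] ** 2 * f[i] * (1.0 - f[i] / D) for i in f)
    var -= sum(A[i] * A[j] * f[i] * f[j] / D
               for i in f for j in f if i != j)
    return var ** 0.5
```

For the numismatic data of Section 5 ($f_1 = 156$, $f_2 = 19$, $f_3 = 2$, $f_4 = 1$) with $N = 10{,}000$, this returns an estimate of about 709 with a standard error of about 125, consistent with Table 1.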
5. An Example
The following example illustrates how knowledge of the population size $N$ can affect estimates of the number of classes. When the population size $N$ is unknown, Chao and Lee (1992, Sec. 3) have proposed that the estimator $\hat{D}_{CL}$ defined in Section 3.3 be used to
      N      $\hat{D}_{uj1}$   $\hat{D}_{uj2}$   $\hat{D}_{Sh2}$
    1,000       455  (47)        502  (60)        455  (51)
   10,000       709 (125)        788 (161)        707 (128)
  100,000       752 (141)        835 (183)        749 (144)

Table 1: Values of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{Sh2}$ for three hypothetical combined lists. (Standard errors are in parentheses.)
estimate the number of classes, because the formula for $\hat{D}_{CL}$ does not involve the unknown parameter $N$. When $N$ is known, a slight modification of the derivation of $\hat{D}_{CL}$ leads to the unsmoothed second-order jackknife estimator $\hat{D}_{uj2}$ (see App. B).
Our example is based on one discussed by Chao and Lee (1992), who borrowed data first described and analyzed by Holst (1981). These data arose from an application in numismatics in which 204 ancient coins were classified according to die type in order to estimate the number of different dies used in the minting process. Among the die types on the reverse sides of the 204 coins were 156 singletons, 19 pairs, 2 triplets, and 1 quadruplet ($f_1 = 156$, $f_2 = 19$, $f_3 = 2$, $f_4 = 1$, $d = 178$). Because the total number of coins minted is unknown in this case, model (1) is inappropriate for analyzing these data. But suppose that the same data had arisen from an application in which $N$ was known. For example, suppose that the data were obtained by selecting a simple random sample of 204 names from a sampling frame that had been constructed by combining 5 lists of 200 names each ($N = 1000$), 50 lists of 200 names each ($N = 10{,}000$), or 500 lists of 200 names each ($N = 100{,}000$). In each case our object is to estimate the number of unique individuals on the combined list, based on the sample results. We focus on the three estimators $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{Sh2}$. The estimates for the three cases are given in Table 1; the standard errors displayed in Table 1 are estimated using the procedure outlined in Section 4.
We would expect similar inferences to be made from the same data under the multinomial model and the finite population model when $N$ is very large. Indeed, the value $\hat{D}_{uj2} = 835$ agrees closely with Chao and Lee's estimate $\hat{D}_{CL} = 844$ (s.e. = 187) when $N = 100{,}000$. Moreover, when $N = 100{,}000$ we find that $\hat{\gamma}^2(\hat{D}_{uj1}) \approx 0.13$, which is the same estimate of $\gamma^2$ given by Chao and Lee. As the population size decreases, however, both our assessment of the magnitude of $D$ and our uncertainty about that magnitude decrease, because we are observing a larger and larger fraction of both the population and the classes.
The most extreme divergence between the estimate obtained using $\hat{D}_{CL}$ and estimates obtained using $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, or $\hat{D}_{Sh2}$ occurs when the sample consists of all singletons ($f_1 = n$). In that case, $\hat{D}_{CL} = \infty$, whereas $\hat{D}_{uj1} = \hat{D}_{uj2} = \hat{D}_{Sh2} = N$. This result indicates that when the population size $N$ is known, it is better to use an estimator that exploits knowledge
of $N$ than to sample with replacement and use the estimator $\hat{D}_{CL}$. In some applications, sampling with replacement is not even an option. For example, the only available sampling mechanism in at least one current database system is a one-pass reservoir algorithm (as in Vitter 1985).
The empirical results in Section 6 indicate that, of the three estimators displayed in Table 1, $\hat{D}_{uj2}$ is the superior estimator when $\gamma^2$ is small ($< 1$). Thus for our example, $\hat{D}_{uj2}$ would be the preferred estimator, since $\hat{\gamma}^2(\hat{D}_{uj1}) \approx 0.13$ in all three cases. Note that $\hat{D}_{uj2}$ consistently has the highest variance of the three estimators in Table 1. The bias of $\hat{D}_{uj2}$ is typically lower than that of $\hat{D}_{uj1}$ or $\hat{D}_{Sh2}$ when $\gamma^2$ is small, however, so that the overall rmse is lower.
6. Simulation Results
This section describes the results of a simulation study done to compare the performance of the various estimators described in Section 3. Our comparison is based on the performance of the estimators for sampling fractions of 5%, 10%, and 20% in 52 populations. (Initial experiments indicated that the performance of the various estimators is best viewed as a function of sampling fraction, rather than absolute sample size. This is in contrast to estimators of, for example, population averages.)
We consider several sets of populations. The first set comprises synthetic populations of the type considered in the literature. Populations EQ10 and EQ100 have equal class sizes of 10 and 100. In populations NGB/1, NGB/2, and NGB/4, the class sizes follow a negative binomial distribution. Specifically, the fraction $f(m)$ of classes in population NGB/$k$ with class size equal to $m$ is given by
$$
f(m) = \binom{m-1}{k-1}\, r^k (1-r)^{m-k}
$$
for $m \ge k$, where $r = 0.04$. Chao and Lee (1992) considered populations of this type.
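The pmf above is that of the number of Bernoulli($r$) trials needed to obtain $k$ successes, so populations of this kind can be regenerated by direct simulation. The following Python sketch (our illustration; the function name and seeding are ours) draws $D$ class sizes this way:

```python
import random

def ngb_class_sizes(D, k, r=0.04, seed=1):
    """Draw D class sizes from the negative binomial pmf
    f(m) = C(m-1, k-1) r^k (1-r)^(m-k), m >= k, by counting the number
    of Bernoulli(r) trials needed to obtain k successes."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(D):
        m = 0
        successes = 0
        while successes < k:
            m += 1
            if rng.random() < r:
                successes += 1
        sizes.append(m)
    return sizes
```

The mean class size is $k/r$ (e.g., 50 for NGB/2 with $r = 0.04$), so the population totals in Table 2 can be checked against $D \cdot k/r$.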
The populations in the second set are meant to be representative of data that could be encountered when a sampling frame for a population census is constructed by combining a number of lists which may contain overlapping entries. Populations GOOD and SUDM were studied by Goodman (1949) and Sudman (1976). Population FRAME2 mimics a sampling frame that might arise in an administrative records census of the type described in Section 1. One approach to such a census is to augment the usual census address list with a small number of relatively large administrative records files, such as AFDC or Food Stamps, and then estimate the number of distinct individuals on the combined list from a sample. We have constructed FRAME2 so that a given individual can appear at most five times, but most individuals appear exactly once, mimicking the case in which four administrative lists are used to supplement the census address list. Population FRAME3 is similar to FRAME2, but for the FRAME3 population it is assumed that the combined list is made up of a number of small lists (perhaps obtained from neighborhood-level organizations) rather than a few
Name      N      D     γ²    Skew
EQ10    15000  1500   0.00   0.00
EQ100   15000   150   0.00   0.00
NGB/4   82135   874   0.18   0.50
NGB/2   41197   906   0.37   0.81
NGB/1   20213   930   0.75   1.25

Table 2: Characteristics of synthetic populations.

Name        N       D      γ²    Skew
GOOD      10000    9595   0.04   5.64
FRAME2    33750   19000   0.31   1.18
FRAME3   111500   36000   0.52   1.92
SUDM     330000  100000   1.87   2.71

Table 3: Characteristics of "merged list" populations.

Name      N      D       γ²     Skew
Z20A    50000    247  114.38   14.60
Z15     50000    772  166.18   23.44
Z20B    50000  10384  234.81   73.54

Table 4: Characteristics of "ill-conditioned" populations.
large lists. The populations in the third set, denoted by Z20A, Z20B, and Z15, are used to study the behavior of the estimators when the data are extremely ill-conditioned. The class sizes in each of these populations follow a generalized Zipf distribution (see Knuth 1973, p. 398). Specifically, $N_j/N \propto j^{-\theta}$, where $\theta$ equals 1.5 or 2.0. These populations have extremely high values of $\gamma^2$. Descriptive statistics for these three sets of populations are given in Tables 2, 3, and 4. The column entitled "Skew" displays the dimensionless coefficient of skewness, which is defined by
$$
\mathrm{skew} = \frac{\sum_{j=1}^{D} (N_j - \bar{N})^3 / D}{\left[\sum_{j=1}^{D} (N_j - \bar{N})^2 / D\right]^{3/2}}.
$$
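The descriptive statistics in the tables are straightforward to compute from a list of class sizes; the following sketch (our illustration) takes $\gamma^2$ to be the variance of the class sizes divided by the squared mean, together with the skewness defined above:

```python
def gamma2_and_skew(sizes):
    """Squared coefficient of variation (gamma^2) and dimensionless
    skewness of a list of class sizes N_1, ..., N_D."""
    D = len(sizes)
    mean = sum(sizes) / D
    m2 = sum((x - mean) ** 2 for x in sizes) / D   # central second moment
    m3 = sum((x - mean) ** 3 for x in sizes) / D   # central third moment
    gamma2 = m2 / mean ** 2
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return gamma2, skew
```

For an equal-class-size population such as EQ10 both quantities are 0, matching the first rows of Table 2.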
The final set comprises 40 real populations that demonstrate the type of distributions encountered when estimating the number of distinct values of an attribute in a relational database. Specifically, the populations studied correspond to various relational attributes from a database of enrollment records for students at the University of Wisconsin and a database of billing records from a large insurance company. The population size $N$ ranges from 15,469 to 1,654,700, with $D$ ranging from 3 to 1,547,606 and $\gamma^2$ ranging from 0 to 81.63 (see App. D for further details). It is notable that values of $\gamma^2$ encountered in the literature (Chao and Lee 1992; Goodman 1949; Shlosser 1981; Sudman 1976) tend not to exceed the value 2, and are typically less than 1, whereas the value of $\gamma^2$ exceeds 2 for more than 50% of the real populations.
For each estimator, population, and sampling fraction, we estimated the bias and rmse by repeatedly drawing a simple random sample from the population, evaluating the estimator, and then computing the error of each estimate. (When evaluating the estimator, we truncated each estimate below at $d$ and above at $N$.) The final estimate of bias was obtained by averaging the error over all of the experimental replications, and rmse was
sampling
fraction   γ² range                  D̂uj1    D̂sj1     D̂uj2    D̂sj2      D̂Sh   D̂Sh2     D̂Sh3      γ̂²
5%    0 ≤ γ² < 1    Average        13.48   14.20    11.84   12.27    79.17   13.23   202.16    56.65
                    Maximum        43.81   45.14    39.56   39.67   428.25   46.59  3299.10    96.72
      1 ≤ γ² < 50   Average        38.14   39.17    65.34   45.25    54.30   36.67    93.92    46.51
                    Maximum        70.47   70.48   186.15  186.15   218.02   66.82  1042.73    91.70
      γ² ≥ 50       Average        74.11   75.92   388.77   77.78    28.13   71.23    21.45    74.72
                    Maximum        85.09   88.49   564.57  112.13    47.63   83.71    38.58    85.55
      all           Average        30.95   31.91    68.61   34.44    62.33   29.86   132.06    52.78
                    Maximum        85.09   88.49   564.57  186.15   428.25   83.71  3299.10    96.72
10%   0 ≤ γ² < 1    Average        11.30   12.14     9.05    9.71    33.09   11.19    22.68    49.68
                    Maximum        39.80   42.32    31.73   31.90   200.79   44.83   131.15    90.68
      1 ≤ γ² < 50   Average        31.41   32.59    90.96   38.74    34.96   29.16    50.17    38.34
                    Maximum        61.27   61.28   267.08  186.15   107.16   54.03   357.43    83.12
      γ² ≥ 50       Average        63.92   65.88   682.55  115.77    15.50   58.82    11.51    64.43
                    Maximum        76.47   81.21  1133.61  281.98    28.97   73.14    21.81    76.89
      all           Average        25.79   26.89   103.38   32.94    32.71   24.18    36.10    44.93
                    Maximum        76.47   81.21  1133.61  281.98   200.79   73.14   357.43    90.68
20%   0 ≤ γ² < 1    Average         8.89    9.86     5.77    6.53    12.91    8.30     9.05    40.65
                    Maximum        33.01   37.28    29.82   27.49    79.16   30.14    79.16    81.03
      1 ≤ γ² < 50   Average        23.44   24.81   123.00   32.79    18.14   20.88    17.91    28.65
                    Maximum        46.77   49.73   369.77  186.15    49.20   43.38    74.99    67.42
      γ² ≥ 50       Average        50.10   52.19  1093.07  130.30     7.73   42.58     6.32    50.51
                    Maximum        62.96   69.06  2010.61  381.51    15.12   56.72    10.62    63.37
      all           Average        19.62   20.88   150.28   29.69    15.23   17.47    13.44    35.18
                    Maximum        62.96   69.06  2010.61  381.51    79.16   56.72    79.16    81.03

Table 5: Average and maximum rmse (%) for various estimators.
estimated as the square root of the averaged squared error. We used 100 replications, which was sufficient to estimate the rmse with a standard error below 5% in nearly all cases; typically the standard error was much less.
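The bias/rmse estimation loop just described can be sketched in Python as follows (an illustration under the stated design, not the authors' simulator); here `estimator` maps a frequency dictionary $\{i \mapsto f_i\}$ and the population size $N$ to an estimate of $D$:

```python
import random

def bias_rmse(estimator, sizes, frac, reps=100, seed=7):
    """Monte Carlo estimates of bias and rmse (as % of the true D) for an
    estimator of the number of classes, under simple random sampling
    without replacement; estimates are truncated below at d and above at N."""
    rng = random.Random(seed)
    pop = [j for j, Nj in enumerate(sizes) for _ in range(Nj)]
    D, N = len(sizes), len(pop)
    n = int(frac * N)
    errs = []
    for _ in range(reps):
        counts = {}
        for c in rng.sample(pop, n):           # simple random sample
            counts[c] = counts.get(c, 0) + 1
        f = {}
        for k in counts.values():              # frequency of frequencies
            f[k] = f.get(k, 0) + 1
        est = min(max(estimator(f, N), len(counts)), N)
        errs.append(est - D)
    bias = sum(errs) / reps
    rmse = (sum(e * e for e in errs) / reps) ** 0.5
    return 100.0 * bias / D, 100.0 * rmse / D
```

By construction the reported rmse always dominates the absolute bias, which is why the tables below can attribute most of the rmse to bias simply by comparison.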
Summary results from the simulations are displayed in Tables 5 and 6. Table 5 gives the average and maximum rmse's for each estimator of $D$ over all populations with $0 \le \gamma^2 < 1$, with $1 \le \gamma^2 < 50$, and with $\gamma^2 \ge 50$, as well as the average and maximum rmse's for each estimator over all populations combined. Similarly, Table 6 gives the average and maximum bias for each estimator. In these tables, the rmse and bias are each expressed as a percentage of the true number of classes. Tables 5 and 6 also display the rmse and bias of the estimator $\hat{\gamma}^2(\hat{D}_{uj1})$ used in the second-order jackknife estimators; the rmse and bias are expressed as a percentage of the true value $\gamma^2$ and are displayed in the column labelled $\hat{\gamma}^2$.
Comparing Tables 5 and 6 indicates that for each estimator the major component of the rmse is almost always bias, not variance. Thus, even though the standard error can be estimated as in Section 4, this estimated standard error usually does not give an accurate picture of the error in estimation of $D$. Another consequence of the predominance of bias is that when $\gamma^2$ is large, the rmse for the second-order estimator $\hat{D}_{uj2}$ does not decrease
sampling
fraction   γ² range                  D̂uj1    D̂sj1     D̂uj2    D̂sj2      D̂Sh   D̂Sh2     D̂Sh3      γ̂²
5%    0 ≤ γ² < 1    Average       -12.71  -13.43   -10.76  -11.38    71.11  -10.98    90.57   -55.75
                    Maximum       -43.77  -45.10   -39.51  -39.62   427.53  -46.59   958.74   -94.97
      1 ≤ γ² < 50   Average       -37.95  -38.99    42.77  -16.83    39.13  -36.35    61.98   -46.19
                    Maximum       -70.32  -70.32   186.15  186.15   218.01  -66.49   663.26   -91.70
      γ² ≥ 50       Average       -74.10  -75.91   382.88  -22.16    22.92  -71.22     3.17   -74.71
                    Maximum       -85.09  -88.49   556.68  110.28    44.65  -83.71    33.44   -85.54
      all           Average       -30.54  -31.51    47.31  -15.04    50.80  -28.79    69.00   -52.24
                    Maximum       -85.09  -88.49   556.68  186.15   427.53  -83.71   958.74   -94.97
10%   0 ≤ γ² < 1    Average       -10.93  -11.78    -8.38   -9.31    28.12   -9.47    17.66   -48.87
                    Maximum       -39.79  -42.31   -31.47  -31.88   200.49  -44.83   130.80   -90.59
      1 ≤ γ² < 50   Average       -31.16  -32.34    74.62  -10.44    25.41  -28.61    35.20   -37.98
                    Maximum       -61.00  -61.00   261.47  186.15   107.16  -53.88   264.38   -83.12
      γ² ≥ 50       Average       -63.91  -65.87   677.18   24.90    11.57  -58.78     3.10   -64.41
                    Maximum       -76.47  -81.21  1125.89  280.63    27.09  -73.13    18.47   -76.88
      all           Average       -25.51  -26.62    87.45   -7.26    25.44  -23.20    25.65   -44.41
                    Maximum       -76.47  -81.21  1125.89  280.63   200.49  -73.13   264.38   -90.59
20%   0 ≤ γ² < 1    Average        -8.57   -9.55    -4.99   -6.20     9.99   -6.75     5.73   -39.71
                    Maximum       -33.01  -37.27   -17.83  -22.67    45.86  -28.38    28.17   -81.00
      1 ≤ γ² < 50   Average       -23.12  -24.49   112.39   -3.41    12.09  -20.13    10.02   -28.23
                    Maximum       -46.54  -49.73   362.12  186.15    49.20  -43.38    49.34   -67.36
      γ² ≥ 50       Average       -50.09  -52.17  1087.89   60.47     5.03  -42.53     1.72   -50.49
                    Maximum       -62.96  -69.06  2003.12  381.51    13.90  -56.71     8.23   -63.36
      all           Average       -19.32  -20.59   140.02    0.38    10.70  -16.45     7.65   -34.58
                    Maximum       -62.96  -69.06  2003.12  381.51    49.20  -56.71    49.34   -81.00

Table 6: Average and maximum bias (%) for various estimators.
monotonically as the sampling fraction increases. (In all other cases the rmse decreases
monotonically.)
Comparing $\hat{D}_{uj1}$ with $\hat{D}_{sj1}$ and then comparing $\hat{D}_{uj2}$ with $\hat{D}_{sj2}$, we see that smoothing a first-order jackknife estimator never results in a better first-order estimator. On the other hand, smoothing a second-order jackknife estimator can result in significant performance improvement when $\gamma^2$ is large.
Similarly, using higher-order Taylor expansions leads to mixed results. Second-order estimators perform better than first-order estimators when $\gamma^2$ is relatively small, but not when $\gamma^2$ is large. The difficulty is partially that the estimator $\hat{\gamma}^2(\hat{D}_{uj1})$ tends to underestimate $\gamma^2$ when $\gamma^2$ is large, leading to underestimates of the number of classes. Moreover, the Taylor approximations underlying $\hat{D}_{uj1}$, $\hat{D}_{sj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{sj2}$ are derived under the assumption of not too much variability between class sizes; this assumption is violated when $\gamma^2$ is large. There apparently is no systematic relation between the coefficient of skewness for the class sizes and the performance of second-order jackknife estimators.
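For concreteness, the plug-in $\hat{\gamma}^2(\hat{D}_{uj1})$ can be sketched as follows (our reconstruction, assuming the finite-population form $\hat{\gamma}^2(\hat{D}) = \max\{0,\ \hat{D}\sum_i i(i-1)f_i/n^2 + \hat{D}/N - 1\}$; this form reproduces the value $\hat{\gamma}^2 \approx 0.13$ reported in Section 5 for the numismatic data with $N = 100{,}000$):

```python
def gamma2_uj1(f, N):
    """Estimated squared coefficient of variation of the class sizes,
    evaluated at the first-order unsmoothed jackknife estimate D_uj1.
    f maps each frequency i to f_i."""
    n = sum(i * fi for i, fi in f.items())
    d = sum(f.values())
    q = n / N
    D = d / (1.0 - (1.0 - q) * f.get(1, 0) / n)   # D_uj1
    S = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D * S / n ** 2 + D / N - 1.0)
```

The truncation at 0 is one source of the underestimation noted above: when the true $\gamma^2$ is small the estimate can only err upward from 0, but when $\gamma^2$ is huge the sample rarely contains enough repeated classes to push the estimate high enough.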
As predicted in Sections 3.4 and 4, the estimators $\hat{D}_{Sh}$ and $\hat{D}_{Sh3}$ behave poorly when $\gamma^2$ is relatively small, and $\hat{D}_{Sh3}$ performs better than $\hat{D}_{Sh}$ when $\gamma^2$ is large. For small to medium values of $\gamma^2$, the modified estimator $\hat{D}_{Sh2}$ has a smaller rmse than $\hat{D}_{Sh}$ or $\hat{D}_{Sh3}$, and
sampling
fraction   γ² range                  D̂uj2   D̂uj2a    D̂Sh2     D̂Sh3  D̂hybrid
5%    0 ≤ γ² < 1    Average        11.84   19.46   13.23   202.16    11.84
                    Maximum        39.56  192.64   46.59  3299.10    39.56
      1 ≤ γ² < 50   Average        65.34   27.47   36.67    93.92    27.47
                    Maximum       186.15   54.51   66.82  1042.73    54.51
      γ² ≥ 50       Average       388.77   23.00   71.23    21.45    26.17
                    Maximum       564.57   36.60   83.71    38.58    39.20
      all           Average        68.61   23.89   29.86   132.06    21.06
                    Maximum       564.57  192.64   83.71  3299.10    54.51
10%   0 ≤ γ² < 1    Average         9.05   13.26   11.19    22.68     9.05
                    Maximum        31.73  120.14   44.83   131.15    31.73
      1 ≤ γ² < 50   Average        90.96   19.22   29.16    50.17    19.55
                    Maximum       267.08   48.12   54.03   357.43    48.12
      γ² ≥ 50       Average       682.55   17.82   58.82    11.51    11.51
                    Maximum      1133.61   27.30   73.14    21.81    21.81
      all           Average       103.38   16.71   24.18    36.10    14.69
                    Maximum      1133.61  120.14   73.14   357.43    48.12
20%   0 ≤ γ² < 1    Average         5.77    8.12    8.30     9.05     5.77
                    Maximum        29.82   79.16   30.14    79.16    29.82
      1 ≤ γ² < 50   Average       123.00   17.44   20.88    17.91    17.69
                    Maximum       369.77   76.57   43.38    74.99    76.57
      γ² ≥ 50       Average      1093.07   37.30   42.58     6.32     6.32
                    Maximum      2010.61   83.69   56.72    10.62    10.62
      all           Average       150.28   15.20   17.47    13.44    12.00
                    Maximum      2010.61   83.69   56.72    79.16    76.57

Table 7: Average and maximum rmse (%) of $\hat{D}_{uj2}$, $\hat{D}_{uj2a}$, $\hat{D}_{Sh2}$, $\hat{D}_{Sh3}$, and $\hat{D}_{hybrid}$.
its performance is comparable to that of the generalized jackknife estimators. For extremely large values of $\gamma^2$ and also for large sample sizes, the estimator $\hat{D}_{Sh3}$ has the best performance of the three Shlosser-type estimators. (For a 20% sampling fraction, $\hat{D}_{Sh3}$ in fact has the lowest average rmse of all the estimators considered.)
As indicated earlier, smoothing can improve the performance of the second-order jackknife estimator $\hat{D}_{uj2}$. An alternative ad hoc technique for improving performance is to "stabilize" $\hat{D}_{uj2}$ using a method suggested by Chao, Ma, and Yang (1993). Fix $c \ge 1$ and remove any class whose frequency in the sample exceeds $c$; that is, remove from the sample all members of classes $\{C_j : j \in B\}$, where $B = \{1 \le j \le D : n_j > c\}$. Then compute the estimator $\hat{D}_{uj2}$ from the reduced sample and subsequently increment it by $|B|$ to produce the final estimate, denoted by $\hat{D}_{uj2a}$. (Here $|B|$ denotes the number of elements in the set $B$.) When computing $\hat{D}_{uj2}$ from the reduced sample, take the population size as $N - \sum_{j \in B} \hat{N}_j$, where each $\hat{N}_j$ is a method-of-moments estimator of $N_j$ as in Section 3.2.3. If $n - \sum_{j \in B} n_j = 0$, then simply compute $\hat{D}_{uj2}$ from the full sample. The idea behind this procedure is as follows. When $\gamma^2$ is large, the population consists of a few large classes and many smaller classes. By in effect removing the largest classes from the population,
we obtain a reduced population for which $\gamma^2$ is smaller, so that $D$ is easier to estimate; the contribution to $D$ from the $|B|$ removed classes is then added back at the final step of the estimation process. (We also experimented with another stabilization technique in which the $k$ most frequent classes are removed for some fixed $k$, but this technique is not as effective.) Preliminary experiments indicated that $c$ approximately equal to 50 yields the best performance. For larger values of $c$, not enough of the frequent classes are removed; for smaller $c$, the size of the reduced sample is too small, and the resulting inaccuracy of $\hat{D}_{uj2}$ when computed from this sample offsets the benefits of the reduction in $\gamma^2$. We therefore take $c = 50$ in our experiments. As can be seen from Table 7, the rmse for $\hat{D}_{uj2a}$ is indeed much lower than that for $\hat{D}_{uj2}$ when $\gamma^2$ exceeds 1. Moreover, by comparing the rmse of $\hat{D}_{sj2}$ and $\hat{D}_{uj2a}$ in Tables 5 and 7, respectively, it can be seen that stabilization is more effective than smoothing. Observe, however, that the performance of $\hat{D}_{uj2a}$ is worse than that of $\hat{D}_{uj2}$ when $\gamma^2$ is small. Interestingly, experiments indicate that none of the other estimators that we consider appears to benefit from stabilization, and we apply this technique only to $\hat{D}_{uj2}$. Overall, the most effective estimators appear to be $\hat{D}_{uj2a}$, which has the smallest average rmse over the various populations, and $\hat{D}_{Sh2}$, which has the smallest worst-case rmse.
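The stabilization recipe can be sketched generically in Python as follows (our illustration; the base estimator is passed in as a function, and we assume the method-of-moments size estimate $\hat{N}_j = n_j/q$ for each removed class, which is one natural choice consistent with Section 3.2.3):

```python
def stabilized(estimator, f, N, c=50):
    """Stabilization in the spirit of Chao, Ma, and Yang (1993): drop all
    classes seen more than c times, apply the base estimator to the reduced
    sample with the population size reduced by estimates of the removed
    class sizes, then add back the number |B| of removed classes."""
    n = sum(i * fi for i, fi in f.items())
    q = n / N
    big = {i: fi for i, fi in f.items() if i > c}      # removed frequencies
    reduced = {i: fi for i, fi in f.items() if i <= c}
    B = sum(big.values())                              # |B|: removed classes
    if sum(i * fi for i, fi in reduced.items()) == 0:
        return estimator(f, N)                         # reduced sample empty
    # method-of-moments size estimate N_hat_j = n_j / q for removed classes
    N_red = N - sum(i * fi / q for i, fi in big.items())
    return estimator(reduced, N_red) + B
```

When no sampled class exceeds $c$, the function reduces exactly to the base estimator, so stabilization can only change the answer on samples that actually contain very frequent classes.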
Our next observation is based on a comparison of the bias and rmse of $\hat{D}_{uj1}$ and $\hat{D}_{Sh2}$ for all of the populations studied. The behavior of the two estimators is quite similar: the correlation between the bias of the estimators is 0.990 and the correlation between the rmse is 0.993. The rmse and bias of $\hat{D}_{uj1}$ are usually slightly greater than the rmse and bias, respectively, of $\hat{D}_{Sh2}$. On the other hand, using $\hat{D}_{Sh2}$ requires computation of $f_1, f_2, \ldots, f_n$, whereas using $\hat{D}_{uj1}$ requires computation only of $f_1$. Thus, if computational resources are limited, then it may be desirable to use $\hat{D}_{uj1}$ as a surrogate for $\hat{D}_{Sh2}$; the quantity $f_1$ can be computed efficiently using "Bloom filter" techniques as described by Ramakrishna (1989).
The experimental results show that the relative performance of the estimators is strongly influenced by the value of $\gamma^2$. As can be seen from Table 7, the estimator $\hat{D}_{uj2}$ has the smallest average rmse when $0 \le \gamma^2 < 1$, the estimator $\hat{D}_{uj2a}$ has the smallest average rmse when $1 \le \gamma^2 < 50$, and the estimator $\hat{D}_{Sh3}$ has the smallest average rmse when $\gamma^2 \ge 50$. These results indicate that it may be desirable to allow an estimator to depend explicitly on the (estimated) value of $\gamma^2$. To illustrate this idea, we consider a simple ad hoc branching estimator, denoted by $\hat{D}_{hybrid}$. The idea is to estimate $\gamma^2$ by $\hat{\gamma}^2(\hat{D}_{uj1})$, fix parameters $0 < \alpha_1 < \alpha_2$, and set
$$
\hat{D}_{hybrid} =
\begin{cases}
\hat{D}_{uj2} & \text{if } 0 \le \hat{\gamma}^2(\hat{D}_{uj1}) < \alpha_1, \\
\hat{D}_{uj2a} & \text{if } \alpha_1 \le \hat{\gamma}^2(\hat{D}_{uj1}) < \alpha_2, \\
\hat{D}_{Sh3} & \text{if } \hat{\gamma}^2(\hat{D}_{uj1}) \ge \alpha_2.
\end{cases}
\tag{33}
$$
Table 7 displays the estimated rmse for $\hat{D}_{hybrid}$ when $\alpha_1 = 0.9$ and $\alpha_2 = 30$. As can be seen, the rmse for the combined estimator $\hat{D}_{hybrid}$ almost never exceeds that for $\hat{D}_{uj2}$, $\hat{D}_{uj2a}$, or $\hat{D}_{Sh3}$ separately.
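The branching rule (33) is trivial to implement once the three base estimators and the $\hat{\gamma}^2$ plug-in are available; the sketch below (our illustration) takes them as function arguments so that any implementations can be supplied:

```python
def d_hybrid(f, N, est_uj2, est_uj2a, est_sh3, gamma2_hat, a1=0.9, a2=30.0):
    """Branching estimator of (33): select a base estimator according to
    the estimated squared coefficient of variation of the class sizes."""
    g2 = gamma2_hat(f, N)
    if g2 < a1:
        return est_uj2(f, N)
    elif g2 < a2:
        return est_uj2a(f, N)
    return est_sh3(f, N)
```

Because the branch is chosen from an estimate of $\gamma^2$, misclassification near the thresholds $\alpha_1$ and $\alpha_2$ is possible, which is one reason the hybrid's worst-case rmse can slightly exceed that of the best branch.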
7. Conclusions
Both new and previous nonparametric estimators of the number of classes in a finite population can be viewed as generalized jackknife estimators. This viewpoint has suggested ways to improve Shlosser's original estimator and has shed new light on certain Horvitz-Thompson estimators as well as estimators based on notions of "sample coverage." We have used delta-method arguments to develop estimators of the standard error of generalized jackknife estimators. As indicated by the example in Section 5, knowledge of the population size can lead to more precise estimation of the number of classes.
Of the estimators considered, the best appears to be the branching estimator $\hat{D}_{hybrid}$ defined by (33), in which a modified Shlosser estimator is used when the coefficient of variation of the class sizes is estimated to be extremely large and unsmoothed second-order jackknife estimators are used otherwise. The systematic development of such branching estimators is a topic for future research. If a nonbranching estimator is desired, then we recommend the stabilized unsmoothed second-order jackknife estimator $\hat{D}_{uj2a}$, followed by the modified Shlosser estimator $\hat{D}_{Sh2}$. If computing resources are scarce, then $\hat{D}_{uj1}$ is a reasonable estimator.
The various estimators of $D$ discussed in this article embody different approaches for dealing with the difficulties caused by variation in the class sizes $N_1, N_2, \ldots, N_D$. Such variation is reflected by large values of $\gamma^2$. First-order estimators simply approximate each $N_j$ by $\bar{N}$. It is well known in the literature that such an approach tends to yield downwardly biased estimates (see Bunge and Fitzpatrick 1993). More sophisticated approaches considered here include

- Taylor corrections to the first-order approximation, as in the estimators $\hat{D}_{uj2}$ and $\hat{D}_{sj2}$;
- the stabilization technique of Section 6, in which the population is in effect modified so that the variation in class sizes is reduced;
- the Horvitz-Thompson approach, in which the first-order assumption is avoided by estimating explicitly each $N_j$ such that $n_j > 0$; and
- Shlosser's approach, which replaces the first-order assumption with the assumption in (25) and in its purest form results in the estimators $\hat{D}_{Sh}$ and $\hat{D}_{Sh3}$.
The poor performance of the Horvitz-Thompson estimators indicates that approaches based on direct estimation of the $N_j$'s are unlikely to be successful. The second-order Taylor correction is effective mainly for small values of $\gamma^2$, and both the stabilization technique and Shlosser's approach are effective mainly for large values of $\gamma^2$. Thus, until a better solution is found, the best estimators will result from a judicious combination of the various approaches considered here.
Acknowledgements
Hongmin Lu made substantial contributions to the early phases of the work reported here, and an anonymous reviewer suggested the "stabilization" technique for $\hat{D}_{uj2}$ used in Section 6. The Wisconsin student database was graciously provided by Bob Nolan of the University of Wisconsin-Madison Department of Information Technology (DoIT) and Jeff Naughton of the University of Wisconsin-Madison Computer Sciences Department.
References
Astrahan, M., Schkolnick, M., and Whang, K. (1987), "Approximating the Number of Unique Values of an Attribute Without Sorting," Information Systems, 12, 11-15.

Billingsley, P. (1986), Convergence of Probability Measures (2nd ed.), New York: Wiley.

Bishop, Y., Fienberg, S., and Holland, P. (1975), Discrete Multivariate Analysis, Cambridge, MA: MIT Press.

Bunge, J., and Fitzpatrick, M. (1993), "Estimating the Number of Species: A Review," Journal of the American Statistical Association, 88, 364-373.

Burnham, K. P., and Overton, W. S. (1978), "Estimation of the Size of a Closed Population When Capture Probabilities Vary Among Animals," Biometrika, 65, 625-633.

--- (1979), "Robust Estimation of Population Size When Capture Probabilities Vary Among Animals," Ecology, 60, 927-936.

Chao, A., and Lee, S. (1992), "Estimating the Number of Classes via Sample Coverage," Journal of the American Statistical Association, 87, 210-217.

Chao, A., Ma, M.-C., and Yang, M. C. K. (1993), "Stopping Rules and Estimation for Recapture Debugging With Unequal Failure Rates," Biometrika, 80, 193-201.

Chung, K. L. (1974), A Course in Probability Theory (2nd ed.), New York: Academic Press.

Deming, W. E., and Glasser, G. J. (1959), "On the Problem of Matching Lists by Samples," Journal of the American Statistical Association, 54, 403-415.

Flajolet, P., and Martin, G. N. (1985), "Probabilistic Counting Algorithms for Data Base Applications," Journal of Computer and System Sciences, 31, 182-209.

Gelenbe, E., and Gardy, D. (1982), "On the Sizes of Projections: I," Information Processing Letters, 14, 18-21.

Good, I. J. (1950), Probability and the Weighing of Evidence, London: Charles Griffin.

Goodman, L. A. (1949), "On the Estimation of the Number of Classes in a Population," Annals of Mathematical Statistics, 20, 572-579.

--- (1952), "On the Analysis of Samples From k Lists," Annals of Mathematical Statistics, 23, 632-634.

Gray, H. L., and Schucany, W. R. (1972), The Generalized Jackknife Statistic, New York: Marcel Dekker.

Hellerstein, J. M., and Stonebraker, M. (1994), "Predicate Migration: Optimizing Queries With Expensive Predicates," in Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pp. 267-276.

Heltshe, J. F., and Forrester, N. E. (1983), "Estimating Species Richness Using the Jackknife Procedure," Biometrics, 39, 1-11.

Holst, L. (1981), "Some Asymptotic Results for Incomplete Multinomial or Poisson Samples," Scandinavian Journal of Statistics, 8, 243-246.

Hou, W., Ozsoyoglu, G., and Taneja, B. (1988), "Statistical Estimators for Relational Algebra Expressions," in Proceedings of the Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 276-287.

--- (1989), "Processing Aggregate Relational Queries With Hard Time Constraints," in Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, pp. 68-77.

Kish, L. (1965), Survey Sampling, New York: Wiley.

Knuth, D. E. (1973), The Art of Computer Programming, Vol. 3: Sorting and Searching, Reading, MA: Addison-Wesley.

Korth, H. F., and Silberschatz, A. (1991), Database System Concepts (2nd ed.), New York: McGraw-Hill.

Miller, R. G. (1974), "The Jackknife - A Review," Biometrika, 61, 1-17.

Mosteller, F. (1949), "Questions and Answers," American Statistician, 3, 12-13.

Naughton, J. F., and Seshadri, S. (1990), "On Estimating the Size of Projections," in Proceedings of the Third International Conference on Database Theory, pp. 499-513.

Ozsoyoglu, G., Du, K., Tjahjana, A., Hou, W., and Rowland, D. Y. (1991), "On Estimating COUNT, SUM, and AVERAGE Relational Algebra Queries," in Database and Expert Systems Applications, Proceedings of the International Conference in Berlin, Germany, 1991 (DEXA 91), pp. 406-412.

Ramakrishna, M. V. (1989), "Practical Performance of Bloom Filters and Parallel Free-Text Searching," Communications of the ACM, 32, 1237-1239.

Särndal, C.-E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. (1979), "Access Path Selection in a Relational Database Management System," in Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pp. 23-34.

Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, New York: Wiley.

Shlosser, A. (1981), "On Estimation of the Size of the Dictionary of a Long Text on the Basis of a Sample," Engineering Cybernetics, 19, 97-102.

Smith, E. P., and van Belle, G. (1984), "Nonparametric Estimation of Species Richness," Biometrics, 40, 119-129.

Sudman, S. (1976), Applied Sampling, New York: Academic Press.

Vitter, J. S. (1985), "Random Sampling With a Reservoir," ACM Transactions on Mathematical Software, 11, 37-57.

Whang, K., Vander-Zanden, B. T., and Taylor, H. M. (1990), "A Linear-Time Probabilistic Counting Algorithm for Database Applications," ACM Transactions on Database Systems, 15, 208-229.
A. Estimators Based On Hypergeometric Probabilities
As in Section 3, denote by $n_j$ the number of elements in the sample that belong to class $j$ for $1 \le j \le D$. Under the hypergeometric model (1), we have

\[
P\{\, n_j = k \,\}
= \frac{\binom{N_j}{k}\binom{N-N_j}{n-k}}{\binom{N}{n}}
= \binom{N_j}{k}\,
  \frac{n(n-1)\cdots(n-k+1)\,(N-n)(N-n-1)\cdots(N-n-N_j+k+1)}
       {N(N-1)\cdots(N-k+1)\,(N-k)(N-k-1)\cdots(N-N_j+1)}
\]
\[
= \binom{N_j}{k}\,
  \frac{q\,\bigl(q-\tfrac{1}{N}\bigr)\cdots\bigl(q-\tfrac{k-1}{N}\bigr)\,
        (1-q)\bigl(1-q-\tfrac{1}{N}\bigr)\cdots\bigl(1-q-\tfrac{N_j-k-1}{N}\bigr)}
       {1\,\bigl(1-\tfrac{1}{N}\bigr)\cdots\bigl(1-\tfrac{k-1}{N}\bigr)\,
        \bigl(1-\tfrac{k}{N}\bigr)\bigl(1-\tfrac{k+1}{N}\bigr)\cdots\bigl(1-\tfrac{N_j-1}{N}\bigr)}
\]

for $1 \le j \le D$ and $0 \le k \le \min(n, N_j)$, where $q = n/N$. When $N$ is large relative to $N_j$, we have the approximate equality given in (8). That is, $P\{\, n_j = k \,\}$ is approximately equal to the probability that $n_j = k$ under the Bernoulli sampling model.
Estimators analogous to those in Section 3 can be derived using the exact hypergeometric probabilities. The starting point in such a derivation is the pair of identities

\[
P\{\, n_j = 0 \,\} = h_n(N_j)
\qquad\text{and}\qquad
P\{\, n_j = 1 \,\} = \frac{n N_j}{N-n+1}\, h_{n-1}(N_j),
\]

where

\[
h_n(x) =
\begin{cases}
\dfrac{\Gamma(N-x+1)\,\Gamma(N-n+1)}{\Gamma(N-n-x+1)\,\Gamma(N+1)} & \text{if } x \le N-n; \\[1ex]
0 & \text{if } x > N-n
\end{cases}
\]

for $x \ge 0$. (By an elementary property of the gamma function, $\binom{N}{n} h_n(N_j) = \binom{N-N_j}{n}$ for $1 \le j \le D$.) It follows that the optimal value of the parameter $K$ in (4) is given by

\[
K = \frac{\sum_{j=1}^{D} h_n(N_j)}
         {\sum_{j=1}^{D} \frac{N_j}{N-n+1}\, h_{n-1}(N_j)}.
\]
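The gamma-function form of $h_n$ and the binomial identity noted in parentheses above can be checked numerically. A minimal sketch (the values of $N$, $n$, and $N_j$ below are illustrative, not taken from the paper), evaluating $h_n$ through log-gamma to avoid overflow:

```python
from math import lgamma, exp, comb

def h(n, x, N):
    # h_n(x): probability that a class of size x is absent from a simple
    # random sample of size n drawn from a population of size N
    if x > N - n:
        return 0.0
    return exp(lgamma(N - x + 1) + lgamma(N - n + 1)
               - lgamma(N - n - x + 1) - lgamma(N + 1))

# Check the identity binom(N, n) * h_n(N_j) = binom(N - N_j, n)
N, n, Nj = 1000, 100, 7
assert abs(comb(N, n) * h(n, Nj, N) / comb(N - Nj, n) - 1.0) < 1e-9
```

Because only ratios of gamma functions appear, the log-space evaluation stays accurate even when the binomial coefficients themselves are astronomically large.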
First-order and second-order jackknife estimators can now be derived using arguments parallel to those in Section 3.2. The second-order Taylor approximations use the identity

\[
h_n'(x) = -h_n(x)\, g_n(x)
\]

for $x > 0$ and $n \ge 1$, where

\[
g_n(x) = \sum_{k=1}^{n} \frac{1}{N - x - n + k}.
\]
The estimators analogous to those in Section 3.2 are
,1
bDuj1 = dn , f1 1 , (N , n + 1)f1 ;
n
nN
!
(N , N~ , n + 1)f1 ,1
2 (D
~
~
b
(
N
,
N
,
n
+
1)
g
(
N
)^
)
f
n
,
1
uj1
1
dn +
;
Db uj2 = 1 ,
and
nN
n
Db sj2 = (1 , hn(N~ )),1 dn + N ^2 (Db uj1 )gn,1 (N~ )hn (N~ ) ;
where N~ = N=Db uj1 and
^2 (D) = max
n
X
(
N
,
1)
D
D , 1:
0; Nn(n , 1) i(i , 1)fi + N
i=1
Moreover, the smoothed rst-order jackknife estimator Db sj1 is dened as the value of Db
that solves the equation
,
Db 1 , hn (N=Db ) = dn:
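The first-order jackknife and $\hat{\gamma}^2$ formulas above can be evaluated directly from the sample frequency counts. A minimal sketch on a synthetic population (the class sizes below are invented for illustration):

```python
import random
from collections import Counter

def d_uj1(counts, n, N):
    # (d_n - f_1/n) * (1 - (N - n + 1) f_1 / (n N))^{-1}
    d_n = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return (d_n - f1 / n) / (1.0 - (N - n + 1) * f1 / (n * N))

def gamma2_hat(D, counts, n, N):
    # max{0, (N-1) D / (N n (n-1)) * sum_i i(i-1) f_i + D/N - 1}
    s = sum(c * (c - 1) for c in counts.values())
    return max(0.0, (N - 1) * D / (N * n * (n - 1)) * s + D / N - 1.0)

random.seed(1)
population = [j for j in range(200) for _ in range(1 + j % 10)]  # 200 classes
N, n = len(population), 100
counts = Counter(random.sample(population, n))
D_hat = d_uj1(counts, n, N)
print(D_hat, gamma2_hat(D_hat, counts, n, N))
```

Here `len(counts)` is $d_n$ and `f1` counts the singleton classes; in the second-order formulas, $\hat{D}_{uj1}$ would then supply $\tilde N = N/\hat{D}_{uj1}$ and the plug-in argument of $\hat{\gamma}^2$.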
Finally, Horvitz-Thompson estimators can be derived in a manner similar to that in Section 3.2.3. For each $j$ such that $n_j > 0$, define the method-of-moments estimator $\hat{N}_j$ of $N_j$ as the value of $\hat{N}$ that solves the equation

\[
n_j = \frac{q \hat{N}}{1 - h_n(\hat{N})}.
\]

Then define the estimator $\hat{\theta}(g)$ of $\theta(g) = \sum_{j=1}^{D} g(N_j)$ as

\[
\hat{\theta}(g) = \sum_{\{j \,:\, n_j > 0\}} \frac{g(\hat{N}_j)}{1 - h_n(\hat{N}_j)}.
\]
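A numerical sketch of this Horvitz-Thompson construction, using bisection for the method-of-moments root and taking $g \equiv 1$ so that $\hat{\theta}(g)$ estimates $D$ (the population below is invented for illustration):

```python
import random
from collections import Counter
from math import lgamma, exp

def h(n, x, N):
    # h_n(x), extended to noninteger x via the gamma function
    if x > N - n:
        return 0.0
    return exp(lgamma(N - x + 1) + lgamma(N - n + 1)
               - lgamma(N - n - x + 1) - lgamma(N + 1))

def mom_class_size(nj, n, N, iters=100):
    # Bisection for the root of q*x / (1 - h_n(x)) = nj on (0, N]
    q = n / N
    lo, hi = 1e-9, float(N)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q * mid / (1.0 - h(n, mid, N)) < nj:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def d_ht(counts, n, N):
    # Each observed class j contributes 1 / P{class j appears in the
    # sample}, with N_j replaced by its method-of-moments estimate
    return sum(1.0 / (1.0 - h(n, mom_class_size(nj, n, N), N))
               for nj in counts.values())

random.seed(2)
population = [j for j in range(200) for _ in range(1 + j % 10)]
N, n = len(population), 100
counts = Counter(random.sample(population, n))
print(d_ht(counts, n, N))
```

Bisection is a safe choice here because $q x / (1 - h_n(x))$, the conditional expectation of $n_j$ given $n_j > 0$, grows with the class size $x$; note that for $n_j = 1$ the root is exactly $\hat N_j = 1$, since $h_n(1) = (N-n)/N$.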
We compared the estimators based on the Bernoulli approximation (8) against the estimators based on the exact hypergeometric probabilities, using the populations described
in Section 6. The error induced by the approximation (8) turned out to be less than 1% in
all cases.
The derivation of $\hat{D}_{uj2}$ and $\hat{D}_{sj2}$ using the exact hypergeometric probabilities assumes that $N_j \le N - n$ for $1 \le j \le D$. Without this assumption, Taylor approximations of $h_n(N_j)$ fail because $h_n$ is not continuous, and the subsequent derivation for each estimator is inappropriate. We conclude by providing a technique for modifying $\hat{D}_{uj2}$ and $\hat{D}_{sj2}$ to deal with this problem. For concreteness, we focus on the unsmoothed second-order jackknife estimator $\hat{D}_{uj2}$. Denote by $J$ the set of indices of the "big" classes: $J = \{\, j : N_j > N - n \,\}$. (Observe that if $n < N/2$, then $J$ can contain at most one element.) If $j \in J$, then with probability 1 class $j$ is represented in the sample. We can decompose $D$ according to

\[
D = |J| + \bigl| \{1, 2, \ldots, D\} - J \bigr|.
\tag{34}
\]

The first term on the right side of (34) is the number of big classes, and the second term represents the number of classes in the reduced population that is formed by removing the big classes. We can estimate $|J|$ by the number of elements in the set $\hat{J} = \{\, j : \hat{N}_j > N - n \,\}$, where $\hat{N}_j$ is a method-of-moments estimator of $N_j$ defined as the numerical solution of the equation $E[n_j \mid n_j > 0] = n_j$ (cf. Sec. 3.2.3). Since we assume a hypergeometric sampling model, $\hat{N}_j$ is defined more precisely as the solution of the equation

\[
\frac{\hat{N}_j\, (n/N)}{1 - h_n(\hat{N}_j)} = n_j.
\]
To estimate the remaining term in (34), apply the unsmoothed second-order jackknife estimator to the reduced population obtained by removing the classes in $\hat{J}$. Set

\[
\bar{N} = N - \sum_{j \in \hat{J}} \hat{N}_j, \qquad
\bar{n} = n - \sum_{j \in \hat{J}} n_j, \qquad
\bar{d}_n = d_n - |\hat{J}|, \qquad
\bar{f}_1 = \bigl| \{\, j : n_j = 1 \text{ and } j \notin \hat{J} \,\} \bigr|,
\]

and

\[
\bar{\hat{D}}_{uj1} = \left( \bar{d}_n - \frac{\bar{f}_1}{\bar{n}} \right)
\left( 1 - \frac{(\bar{N} - \bar{n} + 1)\, \bar{f}_1}{\bar{n}\bar{N}} \right)^{-1}.
\]

(Observe that $\hat{N}_j = 1$, and hence $j \notin \hat{J}$, whenever $n_j = 1$, so that $\bar{f}_1 = f_1$.) The modified version of $\hat{D}_{uj2}$ is then given by

\[
\bar{\hat{D}}_{uj2} = |\hat{J}| +
\left( 1 - \frac{(\bar{N} - \tilde{\bar{N}} - \bar{n} + 1)\, \bar{f}_1}{\bar{n}\bar{N}} \right)^{-1}
\left( \bar{d}_n + \frac{(\bar{N} - \tilde{\bar{N}} - \bar{n} + 1)\,
g_{\bar{n}-1}(\tilde{\bar{N}})\, \hat{\gamma}^2(\bar{\hat{D}}_{uj1})\, \bar{f}_1}{\bar{n}} \right),
\]

where $\tilde{\bar{N}} = \bar{N}/\bar{\hat{D}}_{uj1}$. If $n_j = n$ for some $j$, then $\hat{N}_j = N$ and $\bar{\hat{D}}_{uj2} = 1$. If $\hat{N}_j > \bar{N} - \bar{n}$ for some $j \notin \hat{J}$, then the foregoing process can be repeated. Similar modifications can be made to the estimators $\hat{D}_{sj1}$ and $\hat{D}_{sj2}$.
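The big-class decomposition can be sketched as code. For simplicity this sketch applies only the first-order estimator $\bar{\hat{D}}_{uj1}$ to the reduced population, rather than the full second-order $\bar{\hat{D}}_{uj2}$; the root solver for $\hat{N}_j$ and the toy population are illustrative assumptions:

```python
import random
from collections import Counter
from math import lgamma, exp

def h(n, x, N):
    # h_n(x) via the gamma function; zero when x exceeds N - n
    if x > N - n:
        return 0.0
    return exp(lgamma(N - x + 1) + lgamma(N - n + 1)
               - lgamma(N - n - x + 1) - lgamma(N + 1))

def mom_class_size(nj, n, N, iters=100):
    # Bisection for the root of (n/N) x / (1 - h_n(x)) = nj
    q = n / N
    lo, hi = 1e-9, float(N)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q * mid / (1.0 - h(n, mid, N)) < nj:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def d_uj1_big(counts, n, N):
    # Estimate |J| by |Jhat| = #{j : Nhat_j > N - n}, remove those classes,
    # and apply the first-order jackknife to the reduced population
    nhat = {j: mom_class_size(nj, n, N) for j, nj in counts.items()}
    big = {j for j, x in nhat.items() if x > N - n}
    N_bar = N - sum(nhat[j] for j in big)
    n_bar = n - sum(counts[j] for j in big)
    d_bar = len(counts) - len(big)
    f1_bar = sum(1 for j, c in counts.items() if c == 1 and j not in big)
    denom = 1.0 - (N_bar - n_bar + 1) * f1_bar / (n_bar * N_bar)
    return len(big) + (d_bar - f1_bar / n_bar) / denom

# One "big" class of size 700 plus 100 classes of size 3 (invented numbers)
random.seed(3)
population = [0] * 700 + [j for j in range(1, 101) for _ in range(3)]
N, n = len(population), 400
counts = Counter(random.sample(population, n))
print(d_uj1_big(counts, n, N))  # true D is 101
```

With $n = 400$ and $N = 1000$, the class of size 700 exceeds $N - n = 600$, so its method-of-moments size estimate lands in $\hat{J}$ and the jackknife is applied only to the remaining small classes.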
B. Derivation of $\hat{D}_{uj1}$ and $\hat{D}_{uj2}$ Based on Sample Coverage

For a finite population of size $N$, the sample coverage is defined as

\[
C = \sum_{j=1}^{D} (N_j/N)\, I[n_j > 0].
\]
Using the approximation in (8), we have to first order

\[
E[C] = \sum_{j=1}^{D} \frac{N_j}{N}\, P\{\, n_j > 0 \,\}
\approx \sum_{j=1}^{D} \frac{N_j}{N} \bigl( 1 - (1-q)^{N_j} \bigr)
\approx 1 - (1-q)^{\bar{N}},
\tag{35}
\]

where $\bar{N} = N/D$. Similarly, $E[d_n] \approx D \bigl( 1 - (1-q)^{\bar{N}} \bigr)$, so that $D \approx E[d_n]/E[C]$. Observe that by (35) and (13),

\[
E[C] \approx 1 - \frac{(1-q)\, E[f_1]}{n}.
\]

The foregoing relations suggest the method-of-moments estimator $\hat{D} = d_n/\hat{C}$, where

\[
\hat{C} = 1 - \frac{(1-q)\, f_1}{n}.
\]

This estimator is identical to $\hat{D}_{uj1}$.
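The coverage-based first-order estimator is easy to check by simulation; a minimal sketch on a synthetic population (the class sizes are invented for illustration):

```python
import random
from collections import Counter

def d_cov(sample, N):
    # d_n / Chat, with Chat = 1 - (1 - q) f1 / n
    n = len(sample)
    q = n / N
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return len(counts) / (1.0 - (1.0 - q) * f1 / n)

random.seed(7)
population = [j for j in range(500) for _ in range(1 + j % 5)]  # 500 classes
N = len(population)
est = d_cov(random.sample(population, N // 10), N)
print(est)  # true D is 500
```

The estimate divides the number of observed classes by the estimated coverage; with many small classes, as here, both $d_n$ and $\hat{C}$ fall well below their census values and the ratio partially compensates.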
To derive a second-order estimator, use a Taylor approximation as in Section 3.2.2 to obtain

\[
E[C] \approx \sum_{j=1}^{D} \frac{N_j}{N} \bigl( 1 - (1-q)^{N_j} \bigr)
\approx 1 - \frac{1}{N} \sum_{j=1}^{D} N_j
\bigl[ (1-q)^{\bar{N}} + (1-q)^{\bar{N}} \ln(1-q)\, (N_j - \bar{N}) \bigr]
= 1 - (1-q)^{\bar{N}} - (1-q)^{\bar{N}} \ln(1-q)\, \bar{N} \gamma^2,
\]

where $\gamma^2$ is the squared coefficient of variation of $N_1, N_2, \ldots, N_D$. It follows that

\[
\frac{E[d_n]}{E[C]}
\approx \frac{D \bigl( 1 - (1-q)^{\bar{N}} \bigr)}
{1 - (1-q)^{\bar{N}} - (1-q)^{\bar{N}} \ln(1-q)\, \bar{N} \gamma^2}
\approx D \left( 1 + \frac{(1-q)^{\bar{N}}\, \bar{N} \ln(1-q)\, \gamma^2}
{1 - (1-q)^{\bar{N}}} \right),
\]

and hence

\[
D \approx \frac{E[d_n]}{E[C]}
- (1-q)^{\bar{N}} \ln(1-q)\, \bar{N} \gamma^2\, \frac{1}{E[C]}\,
\frac{E[d_n]}{1 - (1-q)^{\bar{N}}}
\approx \frac{E[d_n]}{E[C]}
- \frac{E[f_1]\, (1-q) \ln(1-q)\, \bar{N} \gamma^2}{n}\,
\frac{E[d_n]}{E[C]^2},
\tag{36}
\]

where we have used the relations (35) and (13). Define $\hat{\gamma}^2$ as in (16). Estimating $E[d_n]$ by $d_n$, $E[f_1]$ by $f_1$, $\gamma^2$ by $\hat{\gamma}^2(\hat{D}_{uj1})$, and $E[C]$ by $\hat{C}$ in (36), we obtain the formula for $\hat{D}_{uj2}$.
C. Asymptotic Variance

In this appendix we study the asymptotic variance of an estimator $\hat{D}$ as $D$ becomes large. Consider an infinite sequence $C_1, C_2, \ldots$ of classes with corresponding class sizes $N_1, N_2, \ldots$ and construct a sequence of increasing populations in which the $D$th population comprises classes $C_1, C_2, \ldots, C_D$. As in (8), approximate the hypergeometric sample design by a Bernoulli sample design. Although the population size $N$ depends on $D$, as does each sample statistic $f_i$, we suppress this dependence in our notation. Suppose that there exists a finite, positive integer $M$ and a positive real number $\mu$ such that

\[
N_j \le M
\tag{37}
\]

for $j \ge 1$ and

\[
\lim_{D \to \infty} \sqrt{D} \left( \frac{N}{D} - \mu \right)
= \lim_{D \to \infty} \frac{1}{\sqrt{D}} \sum_{j=1}^{D} (N_j - \mu) = 0.
\tag{38}
\]
Also suppose that there exists a nonnegative vector $\pi = (\pi_1, \pi_2, \ldots, \pi_M) \ne 0$ and a nonnegative symmetric matrix $\Sigma = \| \sigma_{i,i'} \| \ne 0$ such that

\[
\lim_{D \to \infty} \sqrt{D} \left( \frac{E[f_i]}{D} - \pi_i \right)
= \lim_{D \to \infty} \sqrt{D} \left( \frac{1}{D}
\sum_{j=1}^{D} (\pi_{j,i} - \pi_i) \right) = 0,
\tag{39}
\]

\[
\lim_{D \to \infty} \frac{\mathrm{Var}[f_i]}{D}
= \lim_{D \to \infty} \frac{1}{D}
\sum_{j=1}^{D} \pi_{j,i} (1 - \pi_{j,i}) = \sigma_{i,i},
\tag{40}
\]

and

\[
\lim_{D \to \infty} \frac{\mathrm{Cov}[f_i, f_{i'}]}{D}
= \lim_{D \to \infty} \left( -\frac{1}{D}
\sum_{j=1}^{D} \pi_{j,i}\, \pi_{j,i'} \right) = \sigma_{i,i'}
\tag{41}
\]

for $1 \le i, i' \le M$, where

\[
\pi_{j,i} = \binom{N_j}{i}\, q^i (1-q)^{N_j - i}.
\]
The conditions in (37)-(41) are satisfied, for example, when the class-size sequence $N_1, N_2, \ldots$ is of the form $I_1, I_2, \ldots, I_r, I_1, I_2, \ldots, I_r, \ldots$, where $I_1, I_2, \ldots, I_r$ are fixed nonnegative integers; in effect, this sequence of populations is obtained from an initial population by uniformly scaling up the initial $F_i$'s.
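For such a periodic class-size sequence, the limit $\pi_i$ in (39) is simply the average of $\pi_{j,i}$ over one period. A small numerical check of the binomial form of $\pi_{j,i}$ (the sampling fraction $q$ and the period are invented values):

```python
from math import comb

q = 0.1
period = [2, 5, 9]  # I_1, I_2, I_3

def pi_ji(Nj, i):
    # pi_{j,i} = C(N_j, i) q^i (1 - q)^{N_j - i}
    return comb(Nj, i) * q ** i * (1.0 - q) ** (Nj - i)

# Limit pi_i: the average of pi_{j,i} over one period
pi = {i: sum(pi_ji(Nj, i) for Nj in period) / len(period) for i in (1, 2)}

# E[f_i]/D for a population of 10000 whole periods agrees with pi_i,
# since E[f_i] = sum_j pi_{j,i} for independent Bernoulli indicators
sizes = period * 10000
D = len(sizes)
for i in (1, 2):
    Efi = sum(pi_ji(Nj, i) for Nj in sizes)
    assert abs(Efi / D - pi[i]) < 1e-9
```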
As in Section 4, suppose that the estimator $\hat{D}$ is a function of the sample only through $f = (f_1, f_2, \ldots, f_M)$ and satisfies the condition in (31). Also suppose that the differentiability assumption in Section 4 holds, so that $\hat{D}$ is continuously differentiable at the point $(\pi, \mu)$.
Write $f_i = \sum_{j=1}^{D} I[n_j = i]$ for $1 \le i \le M$ and observe that, under the foregoing assumptions, each $f_i$ is the sum of $D$ independent (but not identically distributed) Bernoulli random variables. An application of Theorem 5.1.2 in Chung (1974) followed by (39) shows that

\[
\lim_{D \to \infty} (\bar{f}, \bar{N}) = (\pi, \mu)
\tag{42}
\]

with probability 1, where $\bar{f} = f/D$ and $\bar{N} = N/D$. Similarly, since (39)-(41) hold by assumption, an application of Theorem B in Serfling (1980, Sec. 1.9.2) and then Slutsky's Theorem (see Serfling 1980, Sec. 1.5.4) shows that $\sqrt{D}(\bar{f} - \pi) \Rightarrow N(0, \Sigma)$ as $D \to \infty$, where "$\Rightarrow$" denotes convergence in distribution and $N$ denotes a multivariate normal random variable. It then follows from (38) and Theorem 4.4 in Billingsley (1986) that

\[
\sqrt{D} \bigl( (\bar{f}, \bar{N}) - (\pi, \mu) \bigr)
\Rightarrow \bigl( N(0, \Sigma),\, 0 \bigr).
\]
Since $\hat{D}$ is assumed differentiable at $(\pi, \mu)$, an application of the Delta Method (see Bishop, Fienberg, and Holland 1975, Sec. 14.6) shows that

\[
\sqrt{D} \bigl( \hat{D}(\bar{f}, \bar{N}) - \hat{D}(\pi, \mu) \bigr)
\Rightarrow N(0,\, B^t \Sigma B)
\tag{43}
\]

as $D \to \infty$, where $B = \nabla_1 \hat{D}(\pi, \mu)$, $\nabla_1 \hat{D}$ denotes the gradient of $\hat{D}(u, k)$ with respect to $u$, and $N(0, B^t \Sigma B)$ is a univariate normal random variable. Using (31) we can rewrite the foregoing limit as

\[
\frac{1}{\sqrt{D}} \bigl( \hat{D}(f, N) - \hat{D}(D\pi, D\mu) \bigr)
\Rightarrow N(0,\, B^t \Sigma B),
\]

so that the asymptotic variance of $\hat{D}(f, N)$ is equal to $(B^t \Sigma B) D$.

To approximate this asymptotic variance, set $A = \nabla_1 \hat{D}(f, N)$ and let $C = C(f)$ be the covariance matrix of the random vector $f$. It follows from (31) that $\nabla_1 \hat{D}(cu, ck) = \nabla_1 \hat{D}(u, k)$ for any $c, k > 0$ and nonnegative $M$-dimensional vector $u$. Thus,

\[
\lim_{D \to \infty} A
= \lim_{D \to \infty} \nabla_1 \hat{D}(f, N)
= \lim_{D \to \infty} \nabla_1 \hat{D}(\bar{f}, \bar{N})
= \nabla_1 \hat{D}(\pi, \mu) = B
\tag{44}
\]
Name    N         D         gamma2   skew      Name    N         D         gamma2   skew
DB01    15469     15469     0.00     0.00      DB21    15469     131       3.76     3.79
DB02    1288928   1288928   0.00     0.00      DB22    624473    168       3.90     3.06
DB03    624473    624473    0.00     0.00      DB23    1547606   21        6.30     3.26
DB04    597382    591564    0.01     17.61     DB24    1547606   49        6.55     2.70
DB05    113600    110074    0.04     6.87      DB25    1463974   535328    7.60     639.24
DB06    621498    591564    0.05     4.70      DB26    1547606   909       7.99     7.70
DB07    1341544   1288927   0.05     9.95      DB27    1463974   10        8.12     2.66
DB08    1547606   51168     0.23     0.24      DB28    931174    73        12.96    6.43
DB09    1547606   3         0.38     -0.67     DB29    597382    17        14.27    3.73
DB10    147811    110076    0.47     7.41      DB30    633756    221480    15.68    454.61
DB11    113600    3         0.70     0.08      DB31    633756    213       16.16    7.36
DB12    173805    109688    0.93     4.84      DB32    173805    72        16.98    7.14
DB13    1463974   624472    0.94     4.77      DB33    931174    398       19.70    7.89
DB14    1654700   624473    1.13     4.38      DB34    113600    6155      24.17    54.66
DB15    633756    202462    1.19     3.53      DB35    1654700   235       30.85    10.35
DB16    597382    437654    1.53     114.62    DB36    173805    61        31.71    7.04
DB17    931174    110076    1.63     4.51      DB37    1341544   37        33.03    5.82
DB18    931174    29        3.22     4.29      DB38    147811    62        34.68    7.22
DB19    1547606   33        3.33     1.66      DB39    1463974   233       37.75    11.06
DB20    1547606   194       3.35     2.97      DB40    624473    14047     81.63    69.00

Table 8: Characteristics of "database" populations
with probability 1, where the third equality follows by (42) and the assumed continuity of $\nabla_1 \hat{D}$. Using (40), (41), and (44), we find that

\[
\lim_{D \to \infty} \frac{A^t C A}{B^t (D\Sigma) B} = 1
\]

with probability 1, and the asymptotic variance of $\hat{D}$ can be approximated by $A^t C A$.
D. Detailed Experimental Results

This section contains further details about the experiments described in Section 6. Table 8 displays characteristics of the "database" populations used in the experiments. The printouts on the following pages contain simulation results for all of the estimators and for each experimental population. In the printouts, "Psize" denotes the population size, "Nclass" denotes the number of classes in the population, and "gm2hat" denotes the estimator $\hat{\gamma}^2(\hat{D}_{uj1})$ used to estimate $\gamma^2$ in the second-order jackknife estimators.
skew
0.00
0.00
0.00
0.00
0.00
17.61
5.64
6.87
4.70
9.95
0.50
0.24
1.18
0.81
-0.67
7.41
1.92
0.08
1.25
4.84
4.77
4.38
3.53
114.62
4.51
2.71
4.29
1.66
2.97
3.79
3.06
3.26
2.70
639.24
7.70
2.66
6.43
3.73
454.61
7.36
7.14
7.89
54.66
10.35
7.04
5.82
7.22
11.06
69.00
14.60
23.44
73.54
-12.71
-43.77
Duj1
0.00
0.00
0.00
0.10
0.83
-1.27
-4.31
-3.35
-4.32
-4.23
-1.57
-10.89
-22.87
-10.54
0.00
-28.55
-30.75
-30.33
-28.54
-43.77
-42.46
-46.24
-47.79
-35.18
-43.84
-59.61
-4.86
-9.00
-4.81
-19.87
-12.13
-8.90
-23.52
-36.72
-15.75
-37.30
-24.68
-51.11
-27.57
-53.16
-44.38
-54.02
-61.03
-49.14
-68.09
-70.32
-66.47
-49.24
-62.37
-75.27
-73.68
-85.09
-38.99
-70.32
-13.43
-45.10
Dsj1
0.00
0.00
0.00
0.10
0.82
-1.30
-4.39
-3.41
-4.40
-4.32
-2.59
-12.77
-23.23
-13.56
0.00
-29.46
-31.75
-30.33
-32.10
-45.10
-44.34
-48.36
-49.85
-39.18
-48.86
-61.38
-4.86
-9.00
-4.82
-22.41
-12.15
-8.90
-23.53
-38.84
-15.84
-37.30
-24.70
-51.12
-30.89
-53.20
-44.42
-54.05
-65.59
-49.16
-68.11
-70.32
-66.50
-49.26
-64.81
-75.54
-74.80
-88.49
382.88
556.68
42.77
186.15
-10.76
-39.51
Duj2
0.00
0.00
0.00
0.13
1.97
-1.17
-4.06
-3.16
-4.10
-3.93
-0.20
-7.44
-21.74
-4.77
0.00
-25.48
-27.83
-28.27
-20.02
-39.51
-36.34
-39.39
-41.36
3.48
-24.72
-54.50
0.47
2.45
8.92
35.93
8.17
14.88
35.95
172.75
43.50
18.93
108.93
31.84
186.15
28.46
34.32
14.34
115.69
132.99
39.91
39.74
64.00
172.90
426.37
242.22
556.68
306.23
-14.31
-54.50
-19.57
-33.52
-18.33
-54.50
-8.14
-39.51
Duj2a
0.00
0.00
0.00
0.13
1.97
-1.17
-4.06
-3.16
-4.10
-3.93
-0.20
-7.44
-21.74
-4.77
0.00
-25.48
-27.83
26.67
-20.02
-39.51
-36.34
-39.39
-41.36
3.48
-24.72
-54.50
27.93
-6.26
-0.64
21.21
-4.96
-4.15
-7.84
-32.01
-6.87
-16.75
5.57
-25.49
-22.81
-25.68
-29.41
-40.71
-32.23
-23.09
-29.89
-27.22
-33.42
-23.60
-17.84
-33.52
-5.17
-21.75
-15.04
186.15
-22.16
110.28
-16.83
186.15
-11.38
-39.62
Dsj2
0.00
0.00
0.00
0.12
1.96
-1.18
-4.06
-3.16
-4.10
-3.93
-1.97
-8.89
-21.76
-8.79
0.00
-25.53
-27.95
-30.33
-23.26
-39.62
-36.56
-39.70
-41.73
3.11
-27.82
-54.93
-4.86
-9.00
-4.82
-21.50
-12.15
-8.90
-23.53
172.67
-15.84
-37.30
-24.70
-51.12
186.15
-53.20
-44.42
-54.05
16.66
-49.16
-68.11
-70.32
-66.50
-49.26
-48.66
-75.54
-74.74
110.28
50.80
427.53
22.92
44.65
39.13
218.01
71.11
427.53
DSh
0.00
0.00
-0.00
12.00
427.53
0.81
3.50
2.69
4.31
3.42
33.14
252.82
61.36
150.37
0.00
25.88
140.58
26.67
222.19
38.67
87.27
98.96
117.75
25.81
218.01
96.75
26.02
-1.79
5.30
74.24
3.52
0.99
28.21
115.19
6.10
13.11
41.47
8.30
131.99
-1.47
-14.83
-26.37
71.92
3.21
3.25
10.27
-1.61
2.26
31.33
3.48
44.65
12.22
-28.79
-83.71
-71.22
-83.71
-36.35
-66.49
-10.98
-46.59
DSh2
-0.00
-0.00
-0.00
0.10
0.93
0.81
3.50
2.69
4.31
3.42
-1.53
-11.02
-16.55
-10.57
0.00
-32.32
-29.48
-27.62
-28.79
-46.59
-41.79
-46.91
-47.98
-32.33
-44.26
-59.32
-3.39
-8.66
-4.34
-17.84
-11.41
-8.43
-21.07
-33.04
-14.80
-34.90
-21.55
-48.29
-28.02
-50.73
-43.01
-52.73
-60.40
-46.67
-64.72
-66.49
-63.41
-46.81
-60.27
-71.78
-69.11
-83.71
69.00
958.74
3.17
33.44
61.98
663.26
90.57
958.74
DSh3
-0.00
0.00
0.00
2.43
340.21
0.76
3.28
2.54
4.08
3.23
5.96
117.91
56.81
49.40
0.00
23.45
123.23
958.74
101.73
33.21
75.06
82.59
95.45
22.83
140.08
68.93
663.26
-5.37
3.96
15.94
0.04
-1.64
19.34
100.92
-4.20
33.20
115.70
19.54
119.35
18.52
-16.30
-26.08
2.50
18.10
74.30
65.82
28.43
18.18
-18.40
-0.05
33.44
-2.31
116.06
487.54
131.25
183.86
128.39
295.35
97.31
487.54
DHTj
0.00
0.00
0.00
19.45
487.54
0.85
3.63
2.79
4.45
3.55
77.76
386.55
64.26
289.71
0.00
27.48
152.26
26.67
358.24
42.40
95.91
110.89
134.26
28.09
295.35
118.32
28.37
54.99
54.42
247.70
72.16
61.76
160.92
125.13
118.14
157.30
200.83
153.49
140.80
141.56
98.23
78.73
206.19
146.53
121.53
133.29
131.95
145.56
183.86
116.23
169.56
55.36
144.25
542.21
160.16
231.50
155.66
360.32
126.55
542.21
DHTsj
0.00
0.00
0.00
62.52
542.17
0.87
3.73
2.85
4.55
3.63
163.84
542.21
66.38
436.00
0.00
28.53
160.71
26.67
466.70
44.77
101.35
118.25
144.71
29.21
335.89
132.15
28.61
66.41
81.22
360.32
108.39
71.26
217.77
132.05
161.16
168.50
248.43
174.30
147.32
172.69
124.28
102.37
254.04
183.96
147.01
151.62
158.39
182.41
231.50
145.22
200.36
63.57
-39.64
-88.60
-75.94
-88.60
-43.74
-71.81
-27.44
-69.15
Dhj
0.00
0.00
0.00
-0.34
-26.66
-13.51
-25.00
-26.83
-32.89
-31.26
-2.95
-18.76
-59.37
-15.81
0.00
-62.42
-58.45
-30.33
-36.67
-69.15
-65.94
-66.70
-66.14
-64.76
-55.92
-71.81
-4.86
-9.00
-4.82
-22.42
-12.15
-8.90
-23.53
-62.27
-15.84
-37.30
-24.70
-51.12
-57.85
-53.20
-44.42
-54.05
-66.00
-49.16
-68.11
-70.32
-66.50
-49.26
-64.82
-75.54
-74.80
-88.60
-14.16
-54.50
-3.92
33.44
-18.33
-54.50
-10.76
-39.51
Hybrid
0.00
0.00
0.00
0.13
1.97
-1.17
-4.06
-3.16
-4.10
-3.93
-0.20
-7.44
-21.74
-4.77
0.00
-25.48
-27.83
-28.27
-20.02
-39.51
-36.34
-39.39
-41.36
3.48
-24.72
-54.50
27.93
-6.26
-0.64
21.21
-4.96
-4.15
-7.84
-32.01
-6.87
-16.75
5.57
-25.49
-22.81
-25.68
-29.41
-40.71
-32.23
-23.09
-29.89
-27.22
-33.42
-23.60
-18.19
-26.99
33.44
-3.94
-52.24
-94.97
-74.71
-85.54
-46.19
-91.70
-55.75
-94.97
gm2hat
0.00
0.00
0.00
0.75
1.41
-92.73
-93.62
-94.40
-94.96
-93.14
-10.81
-57.99
-94.97
-39.92
-0.03
-90.37
-90.82
-73.51
-66.92
-91.20
-87.53
-87.23
-87.95
-58.85
-70.66
-91.70
-6.19
-11.74
-6.23
-26.47
-15.21
-10.33
-27.13
-41.33
-17.76
-41.90
-26.64
-54.70
-28.81
-56.60
-47.18
-56.75
-63.23
-50.65
-70.31
-72.45
-68.45
-50.55
-63.03
-75.97
-74.29
-85.54
Performance Measure: Bias (%)
Sample size: 5.0%
gamma2
0.00
0.00
0.00
0.00
0.00
0.01
0.04
0.04
0.05
0.05
0.18
0.23
0.31
0.37
0.38
0.47
0.52
0.70
0.75
0.93
0.94
1.13
1.19
1.53
1.63
1.87
3.22
3.33
3.35
3.76
3.90
6.30
6.55
7.60
7.99
8.12
12.96
14.27
15.68
16.16
16.98
19.70
24.17
30.85
31.71
33.03
34.68
37.75
81.63
114.38
166.18
234.81
3.20
17.61
-37.95
-70.32
-75.91
-88.49
47.31
556.68
(50
<= gamma2 < inf)
Nclass
15469
1288928
624473
150
1500
591564
9595
110074
591564
1288927
874
51168
19000
906
3
110076
36000
3
930
109688
624472
624473
202462
437654
110076
100000
29
33
194
131
168
21
49
535328
909
10
73
17
221480
213
72
398
6155
235
61
37
62
233
14047
247
772
10384
Average:
Maximum:
51.27
639.24
-74.10
-85.09
-31.51
-88.49
Psize
15469
1288928
624473
15000
15000
597382
10000
113600
621498
1341544
82135
1547606
33750
41197
1547606
147811
111500
113600
20213
173805
1463974
1654700
633756
597382
931174
330000
931174
1547606
1547606
15469
624473
1547606
1547606
1463974
1547606
1463974
931174
597382
633756
633756
173805
931174
113600
1654700
173805
1341544
147811
1463974
624473
50000
50000
50000
(0.0 <= gamma2 < 1.0)
Average:
Maximum:
45.14
73.54
-30.54
-85.09
Name
DB01
DB02
DB03
EQ100
EQ10
DB04
GOOD
DB05
DB06
DB07
NGB/4
DB08
FRAME2
NGB/2
DB09
DB10
FRAME3
DB11
NGB/1
DB12
DB13
DB14
DB15
DB16
DB17
SUDM
DB18
DB19
DB20
DB21
DB22
DB23
DB24
DB25
DB26
DB27
DB28
DB29
DB30
DB31
DB32
DB33
DB34
DB35
DB36
DB37
DB38
DB39
DB40
Z20A
Z15
Z20B
50)
Average:
Maximum:
31.39
639.24
(1.0 <= gamma2 <
Average:
Maximum:
36
37
(50
<= gamma2 < inf)
50)
45.14
73.54
31.39
639.24
Average:
Maximum:
51.27
639.24
3.20
17.61
skew
0.00
0.00
0.00
0.00
0.00
17.61
5.64
6.87
4.70
9.95
0.50
0.24
1.18
0.81
-0.67
7.41
1.92
0.08
1.25
4.84
4.77
4.38
3.53
114.62
4.51
2.71
4.29
1.66
2.97
3.79
3.06
3.26
2.70
639.24
7.70
2.66
6.43
3.73
454.61
7.36
7.14
7.89
54.66
10.35
7.04
5.82
7.22
11.06
69.00
14.60
23.44
73.54
Average:
Maximum:
Average:
Maximum:
gamma2
0.00
0.00
0.00
0.00
0.00
0.01
0.04
0.04
0.05
0.05
0.18
0.23
0.31
0.37
0.38
0.47
0.52
0.70
0.75
0.93
0.94
1.13
1.19
1.53
1.63
1.87
3.22
3.33
3.35
3.76
3.90
6.30
6.55
7.60
7.99
8.12
12.96
14.27
15.68
16.16
16.98
19.70
24.17
30.85
31.71
33.03
34.68
37.75
81.63
114.38
166.18
234.81
(1.0 <= gamma2 <
Nclass
15469
1288928
624473
150
1500
591564
9595
110074
591564
1288927
874
51168
19000
906
3
110076
36000
3
930
109688
624472
624473
202462
437654
110076
100000
29
33
194
131
168
21
49
535328
909
10
73
17
221480
213
72
398
6155
235
61
37
62
233
14047
247
772
10384
Average:
Maximum:
Psize
15469
1288928
624473
15000
15000
597382
10000
113600
621498
1341544
82135
1547606
33750
41197
1547606
147811
111500
113600
20213
173805
1463974
1654700
633756
597382
931174
330000
931174
1547606
1547606
15469
624473
1547606
1547606
1463974
1547606
1463974
931174
597382
633756
633756
173805
931174
113600
1654700
173805
1341544
147811
1463974
624473
50000
50000
50000
(0.0 <= gamma2 < 1.0)
Name
DB01
DB02
DB03
EQ100
EQ10
DB04
GOOD
DB05
DB06
DB07
NGB/4
DB08
FRAME2
NGB/2
DB09
DB10
FRAME3
DB11
NGB/1
DB12
DB13
DB14
DB15
DB16
DB17
SUDM
DB18
DB19
DB20
DB21
DB22
DB23
DB24
DB25
DB26
DB27
DB28
DB29
DB30
DB31
DB32
DB33
DB34
DB35
DB36
DB37
DB38
DB39
DB40
Z20A
Z15
Z20B
Performance Measure: RMS Error (%)
Sample size: 5.0%
30.95
85.09
74.11
85.09
38.14
70.47
13.48
43.81
Duj1
0.00
0.00
0.00
0.65
7.02
1.40
8.21
3.96
4.46
4.27
1.72
10.89
23.76
10.64
0.00
28.66
30.84
31.80
28.63
43.81
42.46
46.24
47.79
35.20
43.84
59.61
5.31
9.38
4.95
20.03
12.21
9.63
23.87
36.72
15.77
38.33
24.98
51.75
27.59
53.21
44.49
54.03
61.03
49.19
68.21
70.47
66.59
49.29
62.37
75.30
73.69
85.09
31.91
88.49
75.92
88.49
39.17
70.48
14.20
45.14
Dsj1
0.00
0.00
0.00
0.63
6.84
1.43
8.30
4.02
4.53
4.36
2.67
12.77
24.12
13.63
0.00
29.57
31.83
31.80
32.16
45.14
44.34
48.36
49.86
39.19
48.86
61.38
5.31
9.38
4.96
22.53
12.24
9.63
23.87
38.85
15.86
38.33
25.00
51.75
30.90
53.24
44.52
54.07
65.60
49.20
68.23
70.48
66.62
49.31
64.81
75.57
74.81
88.49
68.61
564.57
388.77
564.57
65.34
186.15
11.84
39.56
Duj2
0.00
0.00
0.00
0.67
7.97
1.31
8.02
3.79
4.23
3.97
0.88
7.45
22.67
5.24
0.00
25.66
27.94
32.53
20.32
39.56
36.35
39.40
41.37
6.86
24.74
54.51
8.19
10.21
10.14
38.42
10.37
29.24
43.56
172.77
44.00
68.99
115.76
80.45
186.15
35.61
48.98
19.77
116.75
138.32
71.65
78.77
89.35
179.98
426.73
255.60
564.57
308.17
23.89
192.64
23.00
36.60
27.47
54.51
19.46
192.64
Duj2a
0.00
0.00
0.00
0.67
7.97
1.31
8.02
3.79
4.23
3.97
0.88
7.45
22.67
5.24
0.00
25.66
27.94
192.64
20.32
39.56
36.35
39.40
41.37
6.86
24.74
54.51
49.23
7.75
2.41
23.87
5.98
9.14
12.99
32.01
7.09
36.54
14.86
40.61
22.85
27.61
31.49
40.93
32.31
24.53
41.81
44.96
40.49
25.30
17.92
36.60
14.66
22.80
34.44
186.15
77.78
112.13
45.25
186.15
12.27
39.67
Dsj2
0.00
0.00
0.00
0.64
7.86
1.31
8.02
3.79
4.24
3.97
2.09
8.90
22.69
8.96
0.00
25.70
28.06
31.80
23.46
39.67
36.57
39.71
41.74
6.64
27.84
54.94
5.31
9.38
4.96
21.65
12.24
9.63
23.87
172.70
15.86
38.33
25.00
51.75
186.15
53.24
44.52
54.07
18.42
49.20
68.23
70.48
66.62
49.31
48.67
75.57
74.75
112.13
62.33
428.25
28.13
47.63
54.30
218.02
79.17
428.25
DSh
0.00
0.00
0.00
12.99
428.25
0.82
3.55
2.69
4.31
3.42
33.44
252.83
61.41
150.80
0.00
25.89
140.61
192.64
222.91
38.68
87.27
98.96
117.76
25.81
218.02
96.77
46.24
7.98
6.69
76.75
6.65
13.29
35.90
115.19
6.81
65.06
47.64
57.13
131.99
15.56
22.44
27.23
72.11
13.83
39.12
52.61
33.69
14.77
31.43
20.53
47.63
12.92
29.86
83.71
71.23
83.71
36.67
66.82
13.23
46.59
DSh2
0.00
0.00
0.00
0.65
7.35
0.82
3.55
2.69
4.31
3.42
1.69
11.02
18.96
10.67
0.00
32.33
30.07
33.06
28.88
46.59
41.79
46.91
47.98
32.33
44.26
59.32
5.04
9.11
4.52
18.05
11.51
9.35
21.59
33.04
14.82
36.66
22.04
49.36
28.02
50.81
43.16
52.75
60.41
46.73
64.95
66.82
63.63
46.88
60.27
71.83
69.13
83.71
132.06
3299.10
21.45
38.58
93.92
1042.73
202.16
3299.10
DSh3
0.00
0.00
0.00
2.76
341.21
0.77
3.38
2.55
4.09
3.23
6.13
117.92
56.88
49.75
0.00
23.47
123.27
3299.10
102.49
33.23
75.07
82.59
95.46
22.84
140.10
68.96
1042.73
7.73
5.68
18.83
5.27
11.88
28.85
100.92
4.73
114.18
129.57
78.89
119.36
31.78
26.52
27.35
4.29
27.63
117.57
124.91
67.58
29.64
18.47
23.60
38.58
5.15
127.09
488.14
133.63
183.95
142.86
295.36
105.57
488.14
DHTj
0.00
0.00
0.00
23.10
488.14
0.85
3.67
2.79
4.45
3.55
79.09
386.56
64.29
290.68
0.00
27.48
152.27
192.64
359.19
42.41
95.91
110.89
134.27
28.09
295.36
118.33
49.45
73.93
59.44
255.23
78.15
93.95
178.57
125.13
119.48
240.99
211.36
218.24
140.80
148.75
115.88
82.94
206.36
151.93
147.06
166.86
153.62
152.03
183.95
123.22
171.82
55.54
154.46
542.60
162.00
231.56
168.64
365.49
134.79
542.60
DHTsj
0.00
0.00
0.00
66.77
542.60
0.87
3.75
2.86
4.55
3.63
164.87
542.22
66.40
436.68
0.00
28.53
160.72
192.64
467.39
44.78
101.35
118.25
144.72
29.21
335.89
132.16
49.67
84.79
85.03
365.49
112.72
103.58
231.67
132.05
162.06
250.47
256.76
234.39
147.32
178.34
138.43
105.50
254.18
188.20
167.64
180.87
176.40
187.37
231.56
150.60
202.10
63.72
39.94
88.60
75.95
88.60
43.93
71.81
27.96
69.16
Dhj
0.00
0.00
0.00
0.69
26.94
13.84
32.29
27.65
32.98
31.29
3.02
18.76
59.45
15.85
0.00
62.43
58.46
31.80
36.71
69.16
65.94
66.70
66.14
64.77
55.92
71.81
5.31
9.38
4.96
22.54
12.24
9.63
23.87
62.27
15.86
38.33
25.00
51.75
57.85
53.24
44.52
54.07
66.01
49.20
68.23
70.48
66.62
49.31
64.82
75.57
74.81
88.60
21.06
54.51
26.17
39.20
27.47
54.51
11.84
39.56
Hybrid
0.00
0.00
0.00
0.67
7.97
1.31
8.02
3.79
4.23
3.97
0.88
7.45
22.67
5.24
0.00
25.66
27.94
32.53
20.32
39.56
36.35
39.40
41.37
6.86
24.74
54.51
49.23
7.75
2.41
23.87
5.98
9.14
12.99
32.01
7.09
36.54
14.86
40.61
22.85
27.61
31.49
40.93
32.31
24.53
41.81
44.96
40.49
25.30
18.26
39.20
38.58
8.65
52.78
96.72
74.72
85.55
46.51
91.70
56.65
96.72
gm2hat
0.00
0.00
0.00
1.37
2.65
93.03
96.72
94.60
94.98
93.20
14.45
58.02
95.12
40.58
0.51
90.47
90.86
77.15
67.18
91.23
87.53
87.24
87.95
59.20
70.67
91.70
6.83
12.25
6.43
27.14
15.35
11.20
27.52
41.98
17.79
43.05
26.96
55.38
30.19
56.65
47.31
56.77
63.30
50.70
70.43
72.61
68.57
50.60
63.04
75.99
74.31
85.55
skew
0.00
0.00
0.00
0.00
0.00
17.61
5.64
6.87
4.70
9.95
0.50
0.24
1.18
0.81
-0.67
7.41
1.92
0.08
1.25
4.84
4.77
4.38
3.53
114.62
4.51
2.71
4.29
1.66
2.97
3.79
3.06
3.26
2.70
639.24
7.70
2.66
6.43
3.73
454.61
7.36
7.14
7.89
54.66
10.35
7.04
5.82
7.22
11.06
69.00
14.60
23.44
73.54
-10.93
-39.79
Duj1
0.00
0.00
0.00
0.00
-0.03
-1.16
-3.87
-2.93
-4.11
-3.89
-0.19
-6.38
-21.96
-4.93
0.00
-26.35
-28.01
-28.00
-20.51
-39.79
-37.52
-40.58
-42.05
-29.84
-32.67
-54.07
-3.28
-6.48
-2.45
-11.17
-8.49
-6.14
-15.57
-32.80
-10.08
-29.40
-14.38
-40.76
-24.17
-43.73
-37.61
-47.94
-49.84
-40.16
-58.77
-61.00
-57.91
-40.07
-51.27
-65.65
-62.26
-76.47
-32.34
-61.00
-11.78
-42.31
Dsj1
0.00
0.00
0.00
0.00
0.04
-1.22
-4.01
-3.05
-4.26
-4.06
-0.32
-8.24
-22.68
-6.65
0.00
-27.91
-29.87
-28.00
-24.22
-42.31
-40.72
-44.21
-45.68
-33.79
-39.35
-57.54
-3.28
-6.48
-2.45
-11.89
-8.50
-6.14
-15.57
-35.43
-10.12
-29.40
-14.38
-40.76
-26.71
-43.76
-37.64
-47.96
-54.19
-40.17
-58.79
-61.00
-57.94
-40.09
-53.09
-65.89
-63.29
-81.21
677.18
1125.89
74.62
261.47
-8.38
-31.47
Duj2
0.00
0.00
0.00
0.00
0.48
-0.97
-3.46
-2.57
-3.68
-3.33
0.05
-3.81
-19.92
-1.37
0.00
-20.90
-22.68
-24.68
-11.34
-31.47
-26.42
-27.98
-30.19
19.12
-2.27
-43.85
4.88
3.29
5.41
21.14
6.00
17.17
40.04
173.44
40.71
46.62
132.39
110.46
186.15
83.32
73.71
49.73
202.17
201.07
138.09
151.05
151.60
261.47
665.09
536.21
1125.89
381.51
-5.76
-43.85
7.78
25.86
-7.27
-43.85
-6.40
-31.47
Duj2a
0.00
0.00
0.00
0.00
0.48
-0.97
-3.46
-2.57
-3.68
-3.33
0.05
-3.81
-19.92
-1.37
0.00
-20.90
-22.68
17.00
-11.34
-31.47
-26.42
-27.98
-30.19
8.02
-2.29
-43.85
18.42
-2.82
0.05
3.26
-4.04
-1.40
-3.34
-25.55
-1.83
-6.94
18.59
-2.87
-19.43
-2.73
-15.99
-28.34
-6.56
-1.29
-7.88
1.13
-9.94
-0.52
-3.78
-8.82
17.84
25.86
-7.26
280.63
24.90
280.63
-10.44
186.15
-9.31
-31.88
Dsj2
0.00
0.00
0.00
0.00
0.50
-0.98
-3.47
-2.57
-3.68
-3.34
-0.31
-6.36
-20.01
-5.64
0.00
-21.06
-23.12
-28.00
-18.34
-31.88
-27.20
-29.09
-31.53
18.35
-11.42
-45.63
-3.28
-6.48
-2.45
-11.88
-8.50
-6.14
-15.57
173.41
-10.12
-29.40
-14.38
-40.76
186.15
-43.76
-37.64
-47.96
-5.75
-40.17
-58.79
-61.00
-57.94
-40.09
-51.85
-65.89
-63.29
280.63
25.44
200.49
11.57
27.09
25.41
107.16
28.12
200.49
DSh
0.00
0.00
-0.00
0.02
200.49
0.67
2.96
2.28
3.64
2.86
1.42
39.52
48.24
20.32
0.00
19.43
95.48
17.00
51.05
25.75
59.49
63.90
70.90
19.54
107.16
43.60
20.15
1.73
2.12
12.11
-0.59
3.52
7.70
78.56
5.82
20.36
32.16
26.59
95.08
13.88
-2.69
-16.20
24.00
10.16
7.52
24.43
3.37
11.13
12.19
1.31
27.09
5.67
-23.20
-73.13
-58.78
-73.13
-28.61
-53.88
-9.47
-44.83
DSh2
-0.00
0.00
0.00
0.00
-0.11
0.67
2.96
2.28
3.64
2.86
-0.17
-6.26
-14.64
-4.67
0.00
-31.32
-25.33
-23.91
-20.17
-44.83
-39.85
-37.53
-40.22
-31.23
-32.71
-53.88
-1.15
-5.74
-2.04
-9.71
-7.78
-5.26
-13.46
-32.21
-8.67
-24.88
-10.15
-34.64
-26.43
-38.52
-34.46
-45.08
-47.45
-35.59
-52.76
-53.23
-52.36
-35.43
-47.15
-59.78
-55.07
-73.13
25.65
264.38
3.10
18.47
35.20
264.38
17.66
130.80
DSh3
0.00
-0.00
-0.00
0.00
130.80
0.59
2.62
2.03
3.26
2.53
0.04
8.58
41.33
1.78
0.00
15.77
73.83
17.00
8.04
18.33
44.43
45.04
46.85
15.77
54.90
17.24
264.38
8.73
2.46
1.13
-0.87
10.63
-3.43
60.22
6.92
47.47
41.65
38.63
77.14
41.49
23.94
-2.31
-8.44
31.16
16.37
62.39
17.50
33.48
-3.77
-1.45
18.47
-0.86
55.64
253.36
69.23
92.74
65.79
172.39
40.01
253.36
DHTj
0.00
0.00
0.00
0.10
253.36
0.73
3.18
2.44
3.89
3.08
4.50
83.31
52.94
58.09
0.00
22.06
111.81
17.00
120.48
31.51
71.70
79.95
92.06
22.62
172.39
69.41
20.50
19.42
12.98
46.28
19.90
27.72
58.98
92.61
39.86
80.06
83.82
114.30
107.92
76.73
52.30
34.93
100.57
64.91
71.51
82.52
64.90
67.27
86.59
63.76
92.74
33.84
70.56
317.77
87.89
114.95
81.72
208.23
52.92
317.77
DHTsj
0.00
0.00
0.00
0.32
317.77
0.77
3.34
2.56
4.07
3.23
13.48
146.14
56.59
104.97
0.00
23.81
124.32
17.07
178.68
35.00
79.30
89.58
105.01
24.38
208.23
84.07
21.40
24.08
21.15
85.67
31.67
31.70
83.42
102.76
59.26
92.30
114.22
127.52
118.24
96.32
65.15
47.19
131.22
87.57
87.39
96.03
81.14
89.68
114.80
81.09
114.95
40.72
-31.84
-81.30
-65.89
-81.30
-35.56
-63.40
-20.58
-59.08
Dhj
0.00
0.00
0.00
0.00
-14.46
-7.44
-19.43
-16.84
-21.94
-20.24
-0.32
-9.48
-48.51
-6.88
0.00
-51.37
-47.44
-28.00
-25.44
-59.08
-55.33
-56.25
-55.91
-53.19
-43.01
-63.40
-3.28
-6.48
-2.45
-11.89
-8.50
-6.14
-15.57
-51.48
-10.12
-29.40
-14.38
-40.76
-46.21
-43.76
Performance Measure: Bias (%)
Sample size: 10.0%

[Table: bias (%) of the estimators Duj1, Dsj1, Duj2, Duj2a, Dsj2, DSh, DSh2, DSh3, DHTj, DHTsj, Dhj, Hybrid, and gm2hat for each of the 52 test populations at a 10% sample, with Average and Maximum summaries reported within the strata 0.0 <= gamma2 < 1.0, 1.0 <= gamma2 < 50, and 50 <= gamma2 < inf. The individual table entries could not be reliably recovered from the source.]

The 52 test populations, with population size (Psize), number of classes (Nclass), skewness of the class sizes (skew), and squared coefficient of variation of the class sizes (gamma2):

Name      Psize     Nclass    skew     gamma2
DB01      15469     15469     0.00     0.00
DB02      1288928   1288928   0.00     0.00
DB03      624473    624473    0.00     0.00
EQ100     15000     150       0.00     0.00
EQ10      15000     1500      0.00     0.00
DB04      597382    591564    17.61    0.01
GOOD      10000     9595      5.64     0.04
DB05      113600    110074    6.87     0.04
DB06      621498    591564    4.70     0.05
DB07      1341544   1288927   9.95     0.05
NGB/4     82135     874       0.50     0.18
DB08      1547606   51168     0.24     0.23
FRAME2    33750     19000     1.18     0.31
NGB/2     41197     906       0.81     0.37
DB09      1547606   3         -0.67    0.38
DB10      147811    110076    7.41     0.47
FRAME3    111500    36000     1.92     0.52
DB11      113600    3         0.08     0.70
NGB/1     20213     930       1.25     0.75
DB12      173805    109688    4.84     0.93
DB13      1463974   624472    4.77     0.94
DB14      1654700   624473    4.38     1.13
DB15      633756    202462    3.53     1.19
DB16      597382    437654    114.62   1.53
DB17      931174    110076    4.51     1.63
SUDM      330000    100000    2.71     1.87
DB18      931174    29        4.29     3.22
DB19      1547606   33        1.66     3.33
DB20      1547606   194       2.97     3.35
DB21      15469     131       3.79     3.76
DB22      624473    168       3.06     3.90
DB23      1547606   21        3.26     6.30
DB24      1547606   49        2.70     6.55
DB25      1463974   535328    639.24   7.60
DB26      1547606   909       7.70     7.99
DB27      1463974   10        2.66     8.12
DB28      931174    73        6.43     12.96
DB29      597382    17        3.73     14.27
DB30      633756    221480    454.61   15.68
DB31      633756    213       7.36     16.16
DB32      173805    72        7.14     16.98
DB33      931174    398       7.89     19.70
DB34      113600    6155      54.66    24.17
DB35      1654700   235       10.35    30.85
DB36      173805    61        7.04     31.71
DB37      1341544   37        5.82     33.03
DB38      147811    62        7.22     34.68
DB39      1463974   233       11.06    37.75
DB40      624473    14047     69.00    81.63
Z20A      50000     247       14.60    114.38
Z15       50000     772       23.44    166.18
Z20B      50000     10384     73.54    234.81
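The tables in this appendix stratify the test populations by gamma2, the squared coefficient of variation of the class sizes, and also report their skewness. As a minimal sketch (assuming the standard definitions gamma2 = sum_j (N_j - Nbar)^2 / (D * Nbar^2) and moment-based skewness; the report's exact definitions may differ slightly), these two statistics can be computed from a list of class sizes as follows:

```python
# Hypothetical sketch: the two statistics used to characterize the test
# populations in these tables, computed from the class sizes N_1, ..., N_D.
# Assumes gamma^2 is the squared coefficient of variation of the class sizes.

def gamma_squared(class_sizes):
    """Squared coefficient of variation of the class sizes."""
    D = len(class_sizes)
    nbar = sum(class_sizes) / D
    return sum((n - nbar) ** 2 for n in class_sizes) / (D * nbar ** 2)

def skewness(class_sizes):
    """Moment-based (population) skewness of the class sizes."""
    D = len(class_sizes)
    nbar = sum(class_sizes) / D
    m2 = sum((n - nbar) ** 2 for n in class_sizes) / D
    m3 = sum((n - nbar) ** 3 for n in class_sizes) / D
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

# A population whose classes all have equal size has gamma^2 = 0 and
# skew = 0, matching the EQ100 and EQ10 rows of the tables.
print(gamma_squared([100] * 150))  # -> 0.0
```

Note that gamma2 = 0 characterizes the equal-class-size populations (DB01-DB03, EQ100, EQ10), while the Zipf-like populations (Z20A, Z15, Z20B) sit at the high end of the gamma2 range.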
Performance Measure: RMS Error (%)
Sample size: 10.0%

[Table: RMS error (%) of the estimators Duj1, Dsj1, Duj2, Duj2a, Dsj2, DSh, DSh2, DSh3, DHTj, DHTsj, Dhj, Hybrid, and gm2hat for the same 52 test populations at a 10% sample, with Average and Maximum summaries reported within the strata 0.0 <= gamma2 < 1.0, 1.0 <= gamma2 < 50, and 50 <= gamma2 < inf. The individual table entries could not be reliably recovered from the source.]
Performance Measure: Bias (%)
Sample size: 20.0%

[Table: bias (%) of the estimators Duj1, Dsj1, Duj2, Duj2a, Dsj2, DSh, DSh2, DSh3, DHTj, DHTsj, Dhj, Hybrid, and gm2hat for the same 52 test populations at a 20% sample, with Average and Maximum summaries reported within the strata 0.0 <= gamma2 < 1.0, 1.0 <= gamma2 < 50, and 50 <= gamma2 < inf. The individual table entries could not be reliably recovered from the source.]
Performance Measure: RMS Error (%)
Sample size: 20.0%

[Table: RMS error (%) of the estimators Duj1, Dsj1, Duj2, Duj2a, Dsj2, DSh, DSh2, DSh3, DHTj, DHTsj, Dhj, Hybrid, and gm2hat for the same 52 test populations at a 20% sample, with Average and Maximum summaries reported within the strata 0.0 <= gamma2 < 1.0, 1.0 <= gamma2 < 50, and 50 <= gamma2 < inf. The individual table entries could not be reliably recovered from the source.]