RJ 10025 (90521) Computer Science May 29, 1996 (Revised 3/20/98) Research Report

ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, e-mail: [email protected]

Lynne Stokes, Department of Management Science and Information Systems, University of Texas, Austin, TX 78712, e-mail: [email protected]

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

ABSTRACT: We use an extension of the generalized jackknife approach of Gray and Schucany to obtain new nonparametric estimators for the number of classes in a finite population of known size. We also show that generalized jackknife estimators are closely related to certain Horvitz-Thompson estimators, to an estimator of Shlosser, and to estimators based on sample coverage. In particular, the generalized jackknife approach leads to a modification of Shlosser's estimator that does not suffer from the erratic behavior of the original estimator. The performance of both new and previous estimators is investigated by means of an asymptotic variance analysis and a Monte Carlo simulation study.
Keywords: jackknife, sample coverage, number of species, number of classes, database, census

1. Introduction

The problem of estimating the number of classes in a population has been studied for many years. A recent review article (Bunge and Fitzpatrick 1993) lists more than 125 references. In this article, we consider an important special case of the general problem: estimating the number of classes in a finite population of known size. Only a handful of papers have addressed this problem, and none has reached an entirely satisfactory solution, despite the fact that the first attempt at a solution appeared in the statistical literature nearly 50 years ago (Mosteller 1949). The problem we consider has arisen in the literature in a variety of applications, including the following. (i) In a company-sponsored contest, many entries (say several hundred thousand) have been received. It is known that some people have entered more than once. The goal is to estimate the number of different people who have entered from a sample of entries (Mosteller 1949; Sudman 1976). (ii) A sampling frame is constructed by combining a number of lists that may contain overlapping entries. It is desired to estimate, using a sample from all lists, the number of units on the combined list (Deming and Glasser 1959; Goodman 1952; Kish 1965, Sec. 11.2; Sudman 1976, Sec. 3.6). An important example of such a problem is an "administrative records census," currently under study by the U.S. Bureau of the Census. In such a census, several administrative files (such as AFDC or IRS records) are combined, and the total number of distinct individuals included in the combined file is determined. Exact computation of the number of distinct individuals in the combined file is extremely expensive because of the high cost of determining the number of duplicated entries. A similar problem and proposed solution was discussed in the London Financial Times (March 2, 1949) by C. F.
Carter, who was interested in estimating the number of different investors in British industrial stocks based on samples from share registers of companies (Mosteller 1949). (iii) In a relational database system, data are organized in tables called relations (see, e.g., Korth and Silberschatz 1991, Chap. 3). In a typical relation, each row might represent a record for an individual employee in a company, and each column might correspond to a different attribute of the employee, such as salary, years of experience, department number, and so forth. A relational query specifies an output relation that is to be computed from the set of base relations stored by the system. Knowledge of the number of distinct values for each attribute in the base relations is central to determining the most efficient method for computing a specified output relation (Hellerstein and Stonebraker 1994; Selinger, Astrahan, Chamberlain, Lorie, and Price 1979). The size of the base relations in modern database systems often is so large that exact computation of the distinct-value parameters is prohibitively expensive, and thus estimation of these parameters is desired (Astrahan, Schkolnick, and Whang 1987; Flajolet and Martin 1985; Gelenbe and Gardy 1982; Hou, Ozsoyoglu, and Taneja 1988, 1989; Naughton and Seshadri 1990; Ozsoyoglu, Du, Tjahjana, Hou, and Rowland 1991; Whang, Vander-Zanden, and Taylor 1990). In each of these applications, the size of the population (number of contest entries, total number of units over all lists, and number of rows in the base relation) is known, and this size is too large for easy computation of the number of classes. The problem studied in this article can be described formally as follows. A population of size $N$ consists of $D$ mutually disjoint classes of items, labelled $C_1, C_2, \ldots, C_D$. Define $N_j$ to be the size of class $C_j$, so that $N = \sum_{j=1}^D N_j$. A simple random sample of $n$ items is selected (without replacement) from the population.
This sample includes $n_j$ items from class $C_j$. The problem we consider is that of estimating $D$ using information from the sample along with knowledge of the value of $N$. We denote by $F_i$ the number of classes of size $i$ in the population, so that $D = \sum_{i=1}^N F_i$. Similarly, we denote by $f_i$ the number of classes represented exactly $i$ times in the sample and by $d$ the total number of classes represented in the sample. Thus $d = \sum_{i=1}^n f_i$ and $\sum_{i=1}^n i f_i = n$. Define vectors $\mathbf{N} = (N_1, N_2, \ldots, N_D)$, $\mathbf{n} = (n_1, n_2, \ldots, n_D)$, and $\mathbf{f} = (f_1, f_2, \ldots, f_n)$. Note that $\mathbf{n}$ is not observable, but $\mathbf{f}$ is. Because we sample without replacement, the random vector $\mathbf{n}$ has a multivariate hypergeometric distribution with probability mass function
$$P(\mathbf{n} \mid D, \mathbf{N}) = \binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_D}{n_D} \Big/ \binom{N}{n}. \quad (1)$$
The probability mass function of the observable random vector $\mathbf{f}$ is simply $P(\mathbf{n} \mid D, \mathbf{N})$ summed over all points $\mathbf{n}$ that correspond to $\mathbf{f}$:
$$P(\mathbf{f} \mid D, \mathbf{N}) = \sum_{\mathbf{n} \in S} P(\mathbf{n} \mid D, \mathbf{N}),$$
where $S = \{\,\mathbf{n} : \#(n_j = i) = f_i \text{ for } 1 \le i \le n\,\}$. The probability mass function $P(\mathbf{f} \mid D, \mathbf{N})$ does not have a closed-form expression in general. In Section 2 we review the estimators that have been proposed for estimating $D$ from data generated under model (1). In Section 3 we provide several new estimators of $D$ based on an extension of the generalized jackknife approach of Gray and Schucany (1972). We then show that generalized jackknife estimators of the number of classes in a population are closely related to certain "Horvitz-Thompson" estimators, to an estimator due to Shlosser (1981), and to estimators based on the notion of "sample coverage" (Chao and Lee 1992). In Section 4 we provide and compare approximate expressions for the asymptotic variance of several of the estimators, and in Section 5 apply our formulas to a well-known example from the literature. We provide a simulation-based empirical comparison of the various estimators in Section 6, and summarize our results and give recommendations in Section 7.
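To make the sampling model concrete, the following minimal sketch (a hedged illustration in Python; the function name and the example class sizes are our own choices, not part of the paper) draws a simple random sample of $n$ items without replacement from a population with given class sizes and computes the observable quantities $d$ and $f_1, f_2, \ldots$:

```python
import random
from collections import Counter

def sample_class_frequencies(class_sizes, n, seed=None):
    """Draw a simple random sample of n items without replacement from a
    population whose j-th class contains class_sizes[j] items; return (d, f),
    where d is the number of classes represented in the sample and f[i] is
    the number of classes represented exactly i times."""
    rng = random.Random(seed)
    # Materialize the population as one class label per item.
    population = [j for j, Nj in enumerate(class_sizes) for _ in range(Nj)]
    sample = rng.sample(population, n)      # sampling without replacement
    counts = Counter(sample)                # n_j for the classes with n_j > 0
    f = Counter(counts.values())            # f_i = #{j : n_j = i}
    return len(counts), dict(f)

# Example: four classes of sizes 3, 1, 5, 2 (so N = 11); sample n = 6 items.
d, f = sample_class_frequencies([3, 1, 5, 2], n=6, seed=1)
```

Whatever the realized sample, the identities $d = \sum_i f_i$ and $\sum_i i f_i = n$ hold by construction.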
2. Previous Estimators

Bunge and Fitzpatrick (1993) mention only two non-Bayesian estimators that have been developed as estimators of $D$ under model (1). These are the estimators of Goodman (1949) and Shlosser (1981). Goodman proved that
$$\hat{D}_{Good1} = d + \sum_{i=1}^n (-1)^{i+1}\, \frac{(N-n+i-1)!\,(n-i)!}{(N-n-1)!\,n!}\, f_i$$
is the unique unbiased estimator of $D$ when $n > M \stackrel{\text{def}}{=} \max(N_1, N_2, \ldots, N_D)$. He further proved that no unbiased estimator of $D$ exists when $n \le M$. Unfortunately, unless the sampling fraction is quite large, the variance of $\hat{D}_{Good1}$ is so great and the numerical difficulties encountered when computing $\hat{D}_{Good1}$ are so severe that the estimator is unusable. Goodman, who himself noted the high variance of $\hat{D}_{Good1}$, suggested the alternative estimator
$$\hat{D}_{Good2} = N - \frac{N(N-1)}{n(n-1)}\, f_2$$
for overcoming the variance problem. Although $\hat{D}_{Good2}$ has lower variance than $\hat{D}_{Good1}$, it can take on negative values and can have a large bias for any $n$ if $D$ is small. For example, consider the case in which $D = 1$ and $n > 2$, and observe that $f_2 = 0$ and $\hat{D}_{Good2} = N$. Under the assumption that the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, Shlosser (1981) derived the estimator
$$\hat{D}_{Sh} = d + f_1\, \frac{\sum_{i=1}^n (1-q)^i f_i}{\sum_{i=1}^n iq(1-q)^{i-1} f_i}.$$
For the two examples considered in his paper, Shlosser found that use of $\hat{D}_{Sh}$ with a 10% sampling fraction resulted in an error rate below 20%. In our experiments, however, we observed root mean squared errors (rmse's) exceeding 200%, even for well-behaved populations with relatively little variation among the class sizes (see Sec. 6). Considering the relationship between $\hat{D}_{Sh}$ and generalized jackknife estimators (see Sec. 3.4) provides insight into the source of this erratic behavior and suggests some possible modifications of $\hat{D}_{Sh}$ to improve performance.
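As an illustration, the two previously proposed estimators transcribe directly into code (a sketch; the function names are ours, and `f` is a map from multiplicity $i$ to $f_i$):

```python
def goodman2(N, n, f2):
    """Goodman's low-variance alternative: D_Good2 = N - N(N-1) f2 / (n(n-1))."""
    return N - N * (N - 1) * f2 / (n * (n - 1))

def shlosser(d, f, q):
    """Shlosser's estimator: d + f1 * sum_i (1-q)^i f_i / sum_i i q (1-q)^(i-1) f_i,
    where f maps each multiplicity i to f_i and q = n/N is the sampling fraction."""
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    return d + f.get(1, 0) * num / den

# With N = 100, n = 10, and f_2 = 3, Goodman's estimator is negative:
# goodman2(100, 10, 3) returns 100 - 330 = -230.
```

The hypothetical numeric case in the comment illustrates the negativity problem noted in the text: $f_2$ enters with a large multiplier $N(N-1)/(n(n-1))$, so even a modest doubleton count can drive the estimate below zero.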
In related work, Burnham and Overton (1978, 1979) proposed a family of (traditional) generalized jackknife estimators for estimating the size of a closed population when capture probabilities vary among animals. The $D$ individuals in the population play the role of our $D$ classes; a given individual can appear up to $n$ times in the overall sample if captured on one or more of $n$ possible trapping occasions. The capture probability for an individual is assumed to be constant over time, and the capture probabilities for the $D$ individuals are modeled as $D$ iid random samples from a fixed probability distribution. Burnham and Overton's sample design is clearly different from model (1). Under the Burnham and Overton model, for example, the quantities $f_1, f_2, \ldots, f_n$ have a joint multinomial distribution. Closely related to the work of Burnham and Overton are the ordinary jackknife estimators of the number of species in a closed region developed by Heltshe and Forrester (1983) and Smith and van Belle (1984). The sample data consist of a list of the species that appear in each of $n$ quadrats. (The number of times that a species is represented in a quadrat is not recorded.) This setup is essentially identical to that of Burnham and Overton, with the $D$ species playing the role of the $D$ individuals and the $n$ quadrats playing the role of the $n$ trapping occasions.

3. Generalized Jackknife Estimators

In this section we outline an extension of the generalized jackknife approach to bias reduction and then use this approach to derive new estimators for the number of classes in a finite population. We also point out connections between our generalized jackknife approach and several other estimation approaches in the literature.

3.1. The Generalized Jackknife Approach

Let $\theta$ be an unknown real-valued parameter.
A generalized jackknife estimator of $\theta$ is an estimator of the form
$$G(\hat\theta_1, \hat\theta_2) = \frac{\hat\theta_1 - R\,\hat\theta_2}{1 - R}, \quad (2)$$
where $\hat\theta_1$ and $\hat\theta_2$ are biased estimators of $\theta$ and $R$ ($\ne 1$) is a real number (Gray and Schucany 1972). The idea underlying the generalized jackknife approach is to try and choose $R$ such that $G(\hat\theta_1, \hat\theta_2)$ has lower bias than either $\hat\theta_1$ or $\hat\theta_2$. To motivate the choice of $R$, observe that for
$$R = \frac{E[\hat\theta_1] - \theta}{E[\hat\theta_2] - \theta}, \quad (3)$$
the estimator $G(\hat\theta_1, \hat\theta_2)$ is unbiased for $\theta$. This optimal value of $R$ is typically unknown, however, and can only be approximated, resulting in bias reduction but not complete bias elimination. In the following, we extend the original definition of the generalized jackknife given by Gray and Schucany (1972) by allowing $R$ to depend on the data; that is, we allow $R$ to be random. Recall that $d$ is the number of classes represented in the sample. Write $d_n$ for $d$ to emphasize the dependence of $d$ on the sample size, and denote by $d_{n-1}(k)$ the number of classes represented in the sample after the $k$th observation has been removed. Set
$$d_{(n-1)} = \frac{1}{n} \sum_{k=1}^n d_{n-1}(k).$$
We focus on generalized jackknife estimators that are obtained by taking $\hat\theta_1 = d_n$ and $\hat\theta_2 = d_{(n-1)}$ in (2); these are the usual choices for $\hat\theta_1$ and $\hat\theta_2$ in the classical first-order jackknife estimator (Miller 1974). Observe that $d_{n-1}(k) = d_n - 1$ if the class for the $k$th observation is represented only once in the sample; otherwise, $d_{n-1}(k) = d_n$. Thus $d_{(n-1)} = d_n - (f_1/n)$ and, by (2), $G(\hat\theta_1, \hat\theta_2) = \hat{D}$, where
$$\hat{D} = d_n + K\, \frac{f_1}{n} \quad (4)$$
and $K = R/(1 - R)$.
It follows from (3) that the optimal choice of $K$ is
$$K = \frac{E[d_n] - D}{E[d_{(n-1)}] - E[d_n]} = \frac{D - E[d_n]}{E[f_1]/n}. \quad (5)$$
To derive a more explicit formula for $K$, denote by $I[A]$ the indicator of event $A$ and observe that
$$E[d_n] = E\left[\sum_{j=1}^D I[n_j > 0]\right] = \sum_{j=1}^D P\{n_j > 0\} = D - \sum_{j=1}^D P\{n_j = 0\}.$$
Similar reasoning shows that
$$E[f_1] = \sum_{j=1}^D P\{n_j = 1\}, \quad (6)$$
so that
$$K = n\, \frac{\sum_{j=1}^D P\{n_j = 0\}}{\sum_{j=1}^D P\{n_j = 1\}}. \quad (7)$$
Following Shlosser (1981), we focus on the case in which the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, and we make the approximation
$$P\{n_j = k\} \approx \binom{N_j}{k} q^k (1-q)^{N_j - k} \quad (8)$$
for $0 \le k \le n$ and $1 \le j \le D$. That is, the probability distribution of each $n_j$ is approximated by the probability distribution of $n_j$ under a Bernoulli sample design in which each item is included in the sample with probability $q$, independently of all other items in the population. Use of this approximation leads to estimators that behave almost identically to estimators derived using the exact distribution of $\mathbf{n}$ but are simpler to compute and derive (see App. A for further discussion). Substituting (8) into (7), we obtain
$$K \approx n\, \frac{\sum_{j=1}^D (1-q)^{N_j}}{\sum_{j=1}^D N_j q (1-q)^{N_j - 1}}. \quad (9)$$
The quantity $K$ defined in (9) depends on unknown parameters $N_1, N_2, \ldots, N_D$ that are difficult to estimate. Our approach is to approximate $K$ by a function of $D$ and of other parameters that are easier to estimate, thereby obtaining an approximate version of (4). The estimates for these parameters, including $\hat{D}$ for $D$, are then substituted into the approximate version of (4) and the resulting equation is solved for $\hat{D}$. We also consider "smoothed" jackknife estimators. The idea is to replace the quantity $f_1/n$ in (4) by its expected value $E[f_1]/n$ in the hope that the resulting estimator of $D$ will be more stable than the original "unsmoothed" estimator.
As with the parameter $K$, the quantity $E[f_1]/n$ depends on the unknown parameters $N_1, N_2, \ldots, N_D$; see (6) and (8). Thus our approach to estimating $E[f_1]/n$ is the same as our approach to estimating $K$. Estimators also can be based on high-order jackknifing schemes that consider the number of distinct values in the sample when two elements are removed, when three elements are removed, and so forth. Typically, using a high-order jackknifing scheme requires estimating high-order moments (skewness, kurtosis, and so forth) of the set of numbers $\{N_1, N_2, \ldots, N_D\}$. Initial experiments indicated that the reduction in estimation error due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates. Thus we do not pursue high-order jackknife schemes further.

3.2. The Estimators

Different approximations for $K$ and $E[f_1]/n$ lead to different estimators for $D$. Here we develop a number of the possible estimators.

3.2.1. First-Order Estimators

The simplest estimators of $D$ can be derived using a first-order approximation to $K$. Specifically, approximate each $N_j$ in (9) by the average value
$$\bar{N} = \frac{1}{D} \sum_{j=1}^D N_j = \frac{N}{D}$$
and substitute the resulting expression for $K$ into (4) to obtain
$$\hat{D} = d_n + \frac{(1-q) f_1}{n}\, D. \quad (10)$$
Now substitute $\hat{D}$ for $D$ on the right side of (10) and solve for $\hat{D}$. The resulting solution, denoted by $\hat{D}_{uj1}$, is given by
$$\hat{D}_{uj1} = \left(1 - \frac{(1-q) f_1}{n}\right)^{-1} d_n. \quad (11)$$
We refer to this estimator as the "unsmoothed first-order jackknife estimator."
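The closed form (11) is a one-line computation; a minimal sketch (the function name is our own):

```python
def d_uj1(d, f1, n, N):
    """Unsmoothed first-order jackknife estimator (11):
    D_uj1 = d / (1 - (1-q) f1 / n), where q = n/N."""
    q = n / N
    return d / (1.0 - (1.0 - q) * f1 / n)

# Example: d = 50 distinct classes seen, f1 = 20 of them singletons,
# in a sample of n = 100 items from a population of N = 1000 items:
est = d_uj1(50, 20, 100, 1000)   # 50 / (1 - 0.9 * 0.2) = 50 / 0.82
```

Note that when $q = 1$ (a census), the correction factor vanishes and the estimator returns $d_n = D$, illustrating the consistency property stated in the text.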
To derive a "smoothed first-order jackknife estimator," observe that by (6) and (8),
$$\frac{E[f_1]}{n} \approx \frac{1}{n} \sum_{j=1}^D N_j q (1-q)^{N_j - 1}. \quad (12)$$
Approximating each $N_j$ in (12) by $\bar{N}$, we have
$$\frac{E[f_1]}{n} \approx (1-q)^{\bar{N} - 1}. \quad (13)$$
On the right side of (10), replace $f_1/n$ with the approximate expression for $E[f_1]/n$ given in (13), yielding
$$\hat{D} = d_n + D(1-q)^{\bar{N}}.$$
Replacing $D$ with $\hat{D}$ and $\bar{N}$ with $N/\hat{D}$ in the foregoing expression leads to the relation
$$\hat{D}\left(1 - (1-q)^{N/\hat{D}}\right) = d_n.$$
We define the smoothed first-order jackknife estimator $\hat{D}_{sj1}$ as the value of $\hat{D}$ that solves this equation. Given $d_n$, $n$, and $N$, $\hat{D}_{sj1}$ can be computed numerically using standard root-finding procedures. Observe that if in fact $N_1 = N_2 = \cdots = N_D = N/D$, then
$$E[d_n] \approx D\left(1 - (1-q)^{N/D}\right).$$
In this case $\hat{D}_{sj1}$ can be viewed as a simple method-of-moments estimator obtained by replacing $E[d_n]$ with the estimate $d_n$ and solving for $D$. If, moreover, the sampling fraction $q$ is small enough so that the distribution of $(n_1, n_2, \ldots, n_D)$ is approximately multinomial (see Sec. 3.3), then $\hat{D}_{sj1}$ is approximately equal to the maximum likelihood estimator for $D$ (see Good 1950). Observe that both $\hat{D}_{uj1}$ and $\hat{D}_{sj1}$ are consistent for $D$: $\hat{D}_{uj1} \to D$ and $\hat{D}_{sj1} \to D$ as $q \to 1$.

3.2.2. Second-Order Estimators

A second-order approximation to $K$ can be derived as follows. Denote by $\gamma^2$ the squared coefficient of variation of the class sizes $N_1, N_2, \ldots, N_D$:
$$\gamma^2 = \frac{(1/D) \sum_{j=1}^D (N_j - \bar{N})^2}{\bar{N}^2}. \quad (14)$$
Suppose that $\gamma^2$ is relatively small, so that each $N_j$ is close to the average value $\bar{N}$. Substitute the Taylor approximations
$$(1-q)^{N_j} \approx (1-q)^{\bar{N}} + (1-q)^{\bar{N}} \ln(1-q)\,(N_j - \bar{N})$$
and
$$N_j q (1-q)^{N_j - 1} \approx N_j q \left[(1-q)^{\bar{N}-1} + (1-q)^{\bar{N}-1} \ln(1-q)\,(N_j - \bar{N})\right]$$
for $1 \le j \le D$ into (9) to obtain
$$K \approx \frac{D(1-q)}{1 + \ln(1-q)\,\bar{N}\gamma^2} \approx D(1-q)\left(1 - \ln(1-q)\,\bar{N}\gamma^2\right). \quad (15)$$
The unknown parameter $\gamma^2$ can be estimated using the following approach (cf. Chao and Lee 1992).
With the usual convention that $\binom{n}{m} = 0$ for $n < m$, we find that
$$\sum_{i=1}^N i(i-1) E[f_i] \approx \sum_{i=1}^N i(i-1) \sum_{j=1}^D \binom{N_j}{i} q^i (1-q)^{N_j - i} = q^2 \sum_{j=1}^D N_j(N_j - 1) \sum_{i=2}^{N_j} \binom{N_j - 2}{i-2} q^{i-2} (1-q)^{N_j - i} = q^2 \sum_{j=1}^D N_j(N_j - 1),$$
so that
$$\gamma^2 \approx \frac{D}{n^2} \sum_{i=1}^N i(i-1) E[f_i] + \frac{D}{N} - 1.$$
Thus if $D$ were known, then a natural method-of-moments estimator $\hat\gamma^2(D)$ of $\gamma^2$ would be
$$\hat\gamma^2(D) = \max\left\{0,\; \frac{D}{n^2} \sum_{i=1}^n i(i-1) f_i + \frac{D}{N} - 1\right\}.$$
To develop a second-order estimate of $D$, substitute (15) into (4) to obtain
$$\hat{D} = d_n + \frac{D f_1 (1-q)}{n}\left(1 - \ln(1-q)\,\bar{N}\gamma^2\right), \quad (16)$$
from which it follows that
$$\hat{D} = d_n + \frac{D f_1 (1-q)}{n} - \frac{f_1 (1-q) \ln(1-q)\,\gamma^2}{q}. \quad (17)$$
Replacing $D$ with $\hat{D}$ on the right side of this equation and solving for $\hat{D}$ yields the relation
$$\hat{D} = \left(1 - \frac{f_1 (1-q)}{n}\right)^{-1} \left(d_n - \frac{f_1 (1-q) \ln(1-q)\,\gamma^2}{q}\right). \quad (18)$$
An estimator of $D$ can be obtained by substituting $\hat\gamma^2(\hat{D})$ for $\gamma^2$ in (18) and solving for $\hat{D}$ numerically. Alternatively, we can start with a simple initial estimator of $D$ and then correct this estimator using (18). Following this latter approach, we use $\hat{D}_{uj1}$ as our initial estimator and define
$$\hat{D}_{uj2} = \left(1 - \frac{f_1 (1-q)}{n}\right)^{-1} \left(d_n - \frac{f_1 (1-q) \ln(1-q)\,\hat\gamma^2(\hat{D}_{uj1})}{q}\right).$$
A smoothed second-order jackknife estimator can be obtained by replacing the expression $f_1/n$ in (17) with the approximation to $E[f_1]/n$ given in (13), leading to
$$\hat{D} = d_n + D(1-q)^{\bar{N}}\left(1 - \ln(1-q)\,\bar{N}\gamma^2\right).$$
Replacing $D$ with $\hat{D}$ and proceeding as before, we obtain the estimator
$$\hat{D}_{sj2} = \left(1 - (1-q)^{\tilde{N}}\right)^{-1} \left(d_n - (1-q)^{\tilde{N}} \ln(1-q)\, N \hat\gamma^2(\hat{D}_{uj1})\right),$$
where $\tilde{N} = N/\hat{D}_{uj1}$. As with the first-order estimators $\hat{D}_{uj1}$ and $\hat{D}_{sj1}$, the second-order estimators $\hat{D}_{uj2}$ and $\hat{D}_{sj2}$ are consistent for $D$.

3.2.3. Horvitz-Thompson Jackknife Estimators

In this section we discuss an alternative approach to estimation of $K$ based on a technique of Horvitz and Thompson. (See Sarndal, Swensson, and Wretman 1992 for a general discussion of Horvitz-Thompson estimators.)
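Before turning to the Horvitz-Thompson approach, the second-order estimator $\hat{D}_{uj2}$ of Section 3.2.2, together with the moment estimate $\hat\gamma^2(\hat{D}_{uj1})$, can be sketched as follows (a hedged illustration; the function names are ours, and `f` again maps each multiplicity $i$ to $f_i$):

```python
import math

def gamma2_hat(D, f, n, N):
    """Method-of-moments estimate of the squared coefficient of variation
    of the class sizes: max(0, (D/n^2) * sum_i i(i-1) f_i + D/N - 1)."""
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D * s / n ** 2 + D / N - 1.0)

def d_uj2(d, f, n, N):
    """Unsmoothed second-order jackknife: start from the first-order
    estimate D_uj1 and correct it via (18), with gamma^2 estimated at D_uj1."""
    q = n / N
    f1 = f.get(1, 0)
    duj1 = d / (1.0 - (1.0 - q) * f1 / n)           # estimator (11)
    g2 = gamma2_hat(duj1, f, n, N)
    correction = f1 * (1.0 - q) * math.log(1.0 - q) * g2 / q
    return (d - correction) / (1.0 - (1.0 - q) * f1 / n)
```

When $\hat\gamma^2 = 0$ (no detectable variation in class sizes), $\hat{D}_{uj2}$ reduces to $\hat{D}_{uj1}$; since $\ln(1-q) < 0$, a positive $\hat\gamma^2$ inflates the estimate.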
First, consider the general problem of estimating a parameter of the form $\Theta(g) = \sum_{j=1}^D g(N_j)$, where $g$ is a specified function. Observe that because $P\{n_j > 0\} > 0$ for $1 \le j \le D$, we have $\Theta(g) = E[X(g)]$, where
$$X(g) = \sum_{j=1}^D \frac{g(N_j)\, I(n_j > 0)}{P\{n_j > 0\}} = \sum_{\{j : n_j > 0\}} \frac{g(N_j)}{P\{n_j > 0\}}.$$
It follows from (8) that $P\{n_j > 0\} \approx 1 - (1-q)^{N_j}$, and the foregoing discussion suggests that we estimate $\Theta(g)$ by
$$\hat\Theta(g) = \sum_{\{j : n_j > 0\}} \frac{g(\hat{N}_j)}{1 - (1-q)^{\hat{N}_j}}, \quad (19)$$
where $\hat{N}_j$ is an estimator for $N_j$. The key point is that we need to estimate $N_j$ only when $n_j > 0$. To do this, observe that
$$E[n_j \mid n_j > 0] = \frac{E[n_j]}{P\{n_j > 0\}} \approx \frac{q N_j}{1 - (1-q)^{N_j}}.$$
Replacing $E[n_j \mid n_j > 0]$ with $n_j$ leads to the estimating equation
$$n_j = \frac{q N_j}{1 - (1-q)^{N_j}}, \quad (20)$$
and a method-of-moments estimator $\hat{N}_j$ can be defined as the value of $N_j$ that solves (20). Now consider the problem of estimating $K$, and hence $D$. By (9), $K \approx \Theta(f)/\Theta(g)$, where $f(x) = (1-q)^x$ and $g(x) = xq(1-q)^{x-1}/n$. Thus a natural estimator of $K$ is given by $\hat\Theta(f)/\hat\Theta(g)$, leading to the final estimator
$$\hat{D}_{HTj} = d_n + \frac{\hat\Theta(f)}{\hat\Theta(g)}\, \frac{f_1}{n}.$$
A smoothed variant of $\hat{D}_{HTj}$ can be obtained by replacing $f_1/n$ with the Horvitz-Thompson estimator of $E[f_1]/n$, namely $\hat\Theta(g)$. The resulting estimator, denoted by $\hat{D}_{HTsj}$, is given by
$$\hat{D}_{HTsj} = d_n + \hat\Theta(f).$$
Finally, a hybrid estimator can be obtained using a first-order approximation for the numerator of $K$ and a Horvitz-Thompson estimator for the denominator. This leads to the estimator $\hat{D}_{hj}$, defined as the solution $\hat{D}$ of the equation
$$\hat{D}\left(1 - \frac{f_1 (1-q)^{N/\hat{D}}}{n\, \hat\Theta(g)}\right) = d_n.$$
If we replace $f_1/n$ with the Horvitz-Thompson estimator for $E[f_1]/n$ in the foregoing equation in order to obtain a smoothed variant of $\hat{D}_{hj}$, then the resulting estimator coincides with $\hat{D}_{sj1}$. Because $D = \Theta(u)$, where $u(x) \equiv 1$, it may appear that a "non-jackknife" Horvitz-Thompson estimator $\hat{D}_{HT}$ can be defined by setting $\hat{D}_{HT} = \hat\Theta(u)$.
It is straightforward to show, however, that $\hat{D}_{HT} = \hat{D}_{HTsj}$, so that $\hat{D}_{HT}$ can in fact be viewed as a smoothed jackknife estimator. Simulation experiments indicate that the behavior of the Horvitz-Thompson jackknife estimators $\hat{D}_{HTj}$ and $\hat{D}_{HTsj}$ is erratic (see App. D for detailed results). Overall, the poor performance of $\hat{D}_{HTj}$ and $\hat{D}_{HTsj}$ is caused by inaccurate estimation of $\hat\Theta(f)$. The problem seems to be that when $N_j$ is small, the estimator $\hat{N}_j$ is unstable and yet typically has a large effect on the value of $\hat\Theta(f)$ through the term $(1-q)^{\hat{N}_j} \big/ \bigl(1 - (1-q)^{\hat{N}_j}\bigr)$. The estimator $\hat{D}_{hj}$ uses a Taylor approximation in place of $\hat\Theta(f)$ and hence has lower bias and rmse than the other two Horvitz-Thompson jackknife estimators. However, other estimators perform better than $\hat{D}_{hj}$, and we do not consider the estimators $\hat{D}_{HTj}$, $\hat{D}_{HTsj}$, and $\hat{D}_{hj}$ further.

3.3. Relation to Estimators Based on Sample Coverage

The generalized jackknife approach for deriving an estimator of $D$ works for sample designs other than hypergeometric sampling. For example, the most thoroughly studied version of the number-of-classes problem is that in which the population is assumed to be infinite and $\mathbf{n}$ is assumed to have a multinomial distribution with parameter vector $\boldsymbol\pi = (\pi_1, \pi_2, \ldots, \pi_D)$; that is,
$$P(\mathbf{n} \mid D, \boldsymbol\pi) = \binom{n}{n_1\, n_2\, \cdots\, n_D}\, \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_D^{n_D}. \quad (21)$$
When we proceed as in Section 3.1 to derive a generalized jackknife estimator under the model in (21), the estimator turns out to be nearly identical to the "coverage-based" estimator proposed by Chao and Lee (1992). To see this, start again with (4) and select $K$ as in (5). Because
$$E[d_n] - D = -\sum_{j=1}^D (1 - \pi_j)^n$$
under the model in (21), it follows that
$$K = \frac{\sum_{j=1}^D v_n(\pi_j)}{\sum_{j=1}^D \pi_j v_{n-1}(\pi_j)},$$
where $v_n(x) = (1-x)^n$.
Set $\bar\pi = 1/D$ and use the Taylor approximations
$$v_n(\pi_j) \approx v_n(\bar\pi) + (\pi_j - \bar\pi)\, v_n'(\bar\pi)$$
and
$$\pi_j v_{n-1}(\pi_j) \approx \pi_j\left[v_{n-1}(\bar\pi) + (\pi_j - \bar\pi)\, v_{n-1}'(\bar\pi)\right]$$
in a manner analogous to the derivation in Section 3.2.2 to obtain
$$K \approx (D - 1) + (n - 1)\gamma^2, \quad (22)$$
where $\gamma^2 = -1 + D \sum_{j=1}^D \pi_j^2$ is the squared coefficient of variation of the numbers $\pi_1, \pi_2, \ldots, \pi_D$. Denote by $\hat{D}_{mult}$ the estimator of $D$ under the multinomial model. Then, by (4),
$$\hat{D}_{mult} = d_n + \bigl((D - 1) + (n - 1)\gamma^2\bigr)\, \frac{f_1}{n}. \quad (23)$$
Replace $D$ with $\hat{D}_{mult}$ and $\gamma^2$ with an estimator $\tilde\gamma^2$ in (23) and solve for $\hat{D}_{mult}$ to obtain
$$\hat{D}_{mult} = \frac{d_n}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}}\, \frac{n-1}{n}\, \tilde\gamma^2 - \frac{1 - \hat{C}}{\hat{C}},$$
where $\hat{C} = 1 - (f_1/n)$. When the sample size $n$ is large, the estimator $\hat{D}_{mult}$ is essentially the same as the estimator
$$\hat{D}_{CL} = \frac{d_n}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}}\, \tilde\gamma^2$$
proposed by Chao and Lee (1992). The estimator $\hat{D}_{CL}$ was developed from a different point of view, using the concept of sample coverage. The sample coverage for an infinite population is defined as $\sum_{j=1}^D \pi_j I[n_j > 0]$, and the quantity $\hat{C} = 1 - (f_1/n)$ is a standard estimator of the sample coverage. Conversely, when Chao and Lee's derivation is modified to account for hypergeometric sampling, the resulting estimator is equal to $\hat{D}_{uj2}$ (see App. B). Thus at least some estimators based on sample coverage can be viewed as generalized jackknife estimators.

3.4. Relation to Shlosser's Estimator

Observe that the estimator $\hat{D}_{Sh}$, though not developed from a jackknife perspective, can be viewed as an estimator of the form (4) with $K$ estimated by
$$\hat{K}_{Sh} = n\, \frac{\sum_{i=1}^n (1-q)^i f_i}{\sum_{i=1}^n iq(1-q)^{i-1} f_i}.$$
To analyze the behavior of $\hat{D}_{Sh}$, we first rewrite the jackknife quantity $K$ defined in (9) as follows:
$$K \approx n\, \frac{\sum_{i=1}^N (1-q)^i F_i}{\sum_{i=1}^N iq(1-q)^{i-1} F_i}. \quad (24)$$
Shlosser's justification of $\hat{D}_{Sh}$ assumes that
$$\frac{E[f_i]}{E[f_1]} \approx \frac{F_i}{F_1} \quad (25)$$
for $1 \le i \le N$.
When the assumption in (25) holds and the sample size is large enough so that
$$f_i \approx E[f_i] \quad (26)$$
for $1 \le i \le N$,
$$\hat{K}_{Sh} \approx n\, \frac{\sum_{i=1}^N (1-q)^i E[f_i]}{\sum_{i=1}^N iq(1-q)^{i-1} E[f_i]} = n\, \frac{\sum_{i=1}^N (1-q)^i E[f_i]/E[f_1]}{\sum_{i=1}^N iq(1-q)^{i-1} E[f_i]/E[f_1]} \approx n\, \frac{F_1^{-1} \sum_{i=1}^N (1-q)^i F_i}{F_1^{-1} \sum_{i=1}^N iq(1-q)^{i-1} F_i} = K,$$
so that $\hat{D}_{Sh}$ behaves as a generalized jackknife estimator. Although the relations in (25) and (26) hold exactly for $n = N$ (implying that $\hat{D}_{Sh}$ is consistent for $D$), these relations can fail drastically for smaller sample sizes. For example, when $F_1 = 0$ and $F_i > 0$ for some $i > 1$, the right side of (25) is infinite, whereas the left side is finite for $n$ sufficiently small. This observation leads one to expect that $\hat{D}_{Sh}$ will not perform well when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values (with $N_j > 1$ for each $j$). Both the variance analysis in Section 4 and the simulation experiments described in Section 6 bear out this conjecture. The foregoing discussion suggests that replacing $\hat{K}_{Sh}$ in the formula for $\hat{D}_{Sh}$ with
$$\hat{K}_{Sh}^* = \frac{K}{E[\hat{K}_{Sh}]}\, \hat{K}_{Sh} \quad (27)$$
might result in an improved estimator, because $\hat{K}_{Sh}^*$ is unbiased for $K$. Of course we cannot perform this replacement exactly, since $K$ and $E[\hat{K}_{Sh}]$ are unknown, but we can approximate $\hat{K}_{Sh}^*$ as follows. Using the fact that
$$E[f_r] = \sum_{j=1}^D P\{n_j = r\} \approx \sum_{j=1}^D \binom{N_j}{r} q^r (1-q)^{N_j - r} = \sum_{i=r}^N \binom{i}{r} q^r (1-q)^{i-r} F_i \quad (28)$$
for $1 \le r \le n$, we have, to first order,
$$E[\hat{K}_{Sh}] \approx n\, \frac{\sum_{i=1}^N (1-q)^i E[f_i]}{\sum_{i=1}^N iq(1-q)^{i-1} E[f_i]} = n\, \frac{\sum_{i=1}^N (1-q)^i \bigl((1+q)^i - 1\bigr) F_i}{\sum_{i=1}^N iq^2 (1-q^2)^{i-1} F_i}. \quad (29)$$
Using the first-order approximation $N_1 = N_2 = \cdots = N_D = \bar{N}$ together with (24), (27), and (29), we find that
$$\hat{K}_{Sh}^* \approx \frac{q(1+q)^{\bar{N}-1}}{(1+q)^{\bar{N}} - 1}\, \hat{K}_{Sh}.$$
We thus obtain a modified Shlosser estimator given by
$$\hat{D}_{Sh2} = d_n + f_1\, \frac{q(1+q)^{\tilde{N}-1}}{(1+q)^{\tilde{N}} - 1}\, \frac{\sum_{i=1}^n (1-q)^i f_i}{\sum_{i=1}^n iq(1-q)^{i-1} f_i},$$
where $\tilde{N}$ is an initial estimate of $\bar{N}$ based on an initial estimate of $D$. We set $\tilde{N}$ equal to $N/\hat{D}_{uj1}$ throughout.
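Putting the pieces together, $\hat{D}_{Sh2}$ can be sketched as follows (a minimal illustration, using $\hat{D}_{uj1}$ for the first-stage estimate as in the text; the function name and the dictionary representation of `f` are our own choices):

```python
def d_sh2(d, f, n, N):
    """Modified Shlosser estimator D_Sh2: Shlosser's estimator with the
    jackknife quantity rescaled by q(1+q)^(Ntilde-1) / ((1+q)^Ntilde - 1),
    where Ntilde = N / D_uj1 is a first-stage estimate of the mean class size."""
    q = n / N
    f1 = f.get(1, 0)
    duj1 = d / (1.0 - (1.0 - q) * f1 / n)      # first-stage estimate (11)
    ntilde = N / duj1
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    scale = q * (1 + q) ** (ntilde - 1) / ((1 + q) ** ntilde - 1)
    return d + f1 * scale * num / den
```

The rescaling factor is strictly less than 1 for $\tilde{N} > 1$, so $\hat{D}_{Sh2}$ shrinks the (often unstable) Shlosser correction term toward zero.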
As with $\hat{D}_{Sh}$, the estimator $\hat{D}_{Sh2}$ is consistent for $D$. An alternative consistent estimator of $D$ can be obtained by directly using the expressions in (24), (27), and (29) with $F_i$ estimated by
$$\hat{F}_i = \frac{f_1 f_i}{\sum_{i=1}^n iq(1-q)^{i-1} f_i} \quad (30)$$
for $1 \le i \le N$; these estimators of $F_1, F_2, \ldots, F_N$ were proposed by Shlosser (1981) in conjunction with the estimator $\hat{D}_{Sh}$. Substituting the resulting estimators of $K$ and $E[\hat{K}_{Sh}]$ into (27) leads to the final estimator
$$\hat{D}_{Sh3} = d_n + f_1\, \frac{\sum_{i=1}^n iq^2(1-q^2)^{i-1} f_i}{\sum_{i=1}^n (1-q)^i \bigl((1+q)^i - 1\bigr) f_i} \left(\frac{\sum_{i=1}^n (1-q)^i f_i}{\sum_{i=1}^n iq(1-q)^{i-1} f_i}\right)^2.$$
As with the estimator $\hat{D}_{Sh}$, Shlosser's justification of the estimators in (30) rests on the assumption in (25). Thus one might expect that, like $\hat{D}_{Sh}$, the estimator $\hat{D}_{Sh3}$ will be unstable when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values. On the other hand, the reduction in bias of $\hat{K}_{Sh}^*$ relative to $\hat{K}_{Sh}$ leads one to expect that $\hat{D}_{Sh3}$ will perform better than $\hat{D}_{Sh}$ when $\gamma^2$ is sufficiently large. (One might be tempted to avoid the assumption in (25) when estimating $F_1, F_2, \ldots, F_N$ by taking a method-of-moments approach: replace $E[f_r]$ with $f_r$ in (28) for $1 \le r \le n$ and solve the resulting set of linear equations either exactly or approximately. As pointed out by Shlosser (1981), however, this system of equations is nearly singular, and hence extremely unstable.)

4. Variance and Variance Estimates

Consider an estimator $\hat{D}$ that is a function of the sample only through $\mathbf{f} = (f_1, f_2, \ldots, f_M)$, where $M = \max(N_1, N_2, \ldots, N_D)$. All of the estimators introduced in Section 3 are of this type. In general, we also allow $\hat{D}$ to depend explicitly on the population size $N$ and write $\hat{D} = \hat{D}(\mathbf{f}, N)$. Suppose that, for any $N > 0$ and nonnegative $M$-dimensional vector $\mathbf{f} \ne 0$, the function $\hat{D}$ is continuously differentiable at the point $(\mathbf{f}, N)$ and
$$\hat{D}(c\mathbf{f}, cN) = c\hat{D}(\mathbf{f}, N) \quad (31)$$
for $c > 0$.
Approximating the hypergeometric sample design by a Bernoulli sample design as in (8), we can obtain the following approximate expression for the asymptotic variance of $\hat{D}(\mathbf{f}, N)$ as $D$ becomes large:
$$\mathrm{AVar}[\hat{D}(\mathbf{f}, N)] \approx \sum_{i=1}^M A_i^2\, \mathrm{Var}[f_i] + \sum_{\substack{1 \le i, i' \le M \\ i \ne i'}} A_i A_{i'}\, \mathrm{Cov}[f_i, f_{i'}], \quad (32)$$
where $A_i$ is the partial derivative of $\hat{D}$ with respect to $f_i$, evaluated at the point $(\mathbf{f}, N)$. (When computing each $A_i$, we replace each occurrence of $n$ and $d_n$ in the formula for $\hat{D}$ by $\sum_{i=1}^M i f_i$ and $\sum_{i=1}^M f_i$ before taking derivatives.) The approximation in (32) is valid when there is not too much variability in the class sizes (see App. C for a precise formulation and proof of this result). It follows from the proof that, to a good approximation, the variance of an estimator $\hat{D}$ satisfying (31) increases linearly as $D$ increases. Straightforward calculations show that each of the specific estimators $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ is continuously differentiable as stated previously and also satisfies (31). Thus we can use (32) to study the asymptotic variance of these estimators. We focus on $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ because each of these estimators performs best for at least one population studied in the simulation experiments described in Section 6; we also consider $\hat{D}_{Sh}$, because $\hat{D}_{Sh}$ is the most useful of the estimators previously proposed in the literature. Computation of the $A_i$ coefficients for each estimator is tedious, but straightforward. When $\hat{D} = \hat{D}_{uj2}$, for example, write $T = n - (1-q)f_1$ and $\hat\gamma^2 = \hat\gamma^2(\hat{D}_{uj1})$, so that $\hat{D}_{uj2} = \hat{D}_{uj1} - N(1-q)\ln(1-q)\, f_1 \hat\gamma^2 / T$; differentiating this expression (and assuming $\hat\gamma^2 > 0$, so that the maximum in the definition of $\hat\gamma^2$ is attained away from 0) yields
$$A_i^{(uj2)} = A_i^{(uj1)} + i\bigl(\ln(1-q) + 1\bigr)\frac{f_1 \hat\gamma^2}{T} - \frac{N(1-q)\ln(1-q)}{T}\left[\delta_{i1}\hat\gamma^2 + f_1\,\frac{\partial \hat\gamma^2}{\partial f_i} - \frac{f_1 \hat\gamma^2 \bigl(i(1 + f_1/N) - (1-q)\delta_{i1}\bigr)}{T}\right]$$
for $1 \le i \le n$, where $\delta_{i1}$ equals 1 if $i = 1$ and 0 otherwise,
$$\frac{\partial \hat\gamma^2}{\partial f_i} = \frac{A_i^{(uj1)}(\hat\gamma^2 + 1)}{\hat{D}_{uj1}} + \frac{i(i-1)\hat{D}_{uj1}}{n^2} - \frac{2i}{n}\left(\hat\gamma^2 + 1 - \frac{\hat{D}_{uj1}}{N}\right),$$
$$A_1^{(uj1)} = \hat{D}_{uj1}\left[\frac{1}{d_n} + \frac{(1-q) - f_1/n}{n - (1-q)f_1}\right],$$
and
$$A_i^{(uj1)} = \hat{D}_{uj1}\left[\frac{1}{d_n} - \frac{i f_1/n}{n - (1-q)f_1}\right]$$
for $1 < i \le n$. Figures 1 and 2 compare the variances of the estimators $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ for a number of populations with equal class sizes. For these special populations, $\hat{D}_{uj1}$ and $\hat{D}_{uj2}$ are approximately unbiased, so that the relative variances of these estimators are appropriate measures of relative performance. It is particularly instructive to compare the variance of $\hat{D}_{uj1}$ and $\hat{D}_{uj2}$, since $\hat{D}_{uj2}$ is obtained from $\hat{D}_{uj1}$ by adjusting the latter estimator to compensate for bias induced by the assumption of equal class sizes. This adjustment is unnecessary for our special populations, and a comparison allows evaluation of the penalty (i.e., the increase in variance) that is being paid for the adjustment.

[Figure 1: Standard deviation of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ as a function of the sampling fraction $q$ ($N = 15{,}000$ and $\bar{N} = 10$).]

[Figure 2: Standard deviation of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, and $\hat{D}_{Sh2}$ as a function of the class size $\bar{N}$ ($D = 1500$ and $q = 0.10$).]

Figure 1 displays the standard deviations of $\hat{D}_{uj1}$, $\hat{D}_{uj2}$, $\hat{D}_{Sh}$, $\hat{D}_{Sh2}$, and $\hat{D}_{Sh3}$ for an equal-class-size population with $N = 15{,}000$ and $D = 1500$ (so that $\bar{N} = 10$) as the sampling fraction $q$ varies. Observe that $\hat{D}_{uj2}$ is only slightly less efficient than $\hat{D}_{uj1}$, so that the penalty for bias adjustment is small in this case. Performance of the estimators $\hat{D}_{uj1}$ and $\hat{D}_{Sh2}$ is nearly indistinguishable. The most striking observation is that for this population, $\hat{D}_{Sh}$ and $\hat{D}_{Sh3}$ are not competitive with the other three estimators.
The relative performance of D̂_Sh and D̂_Sh3 is especially poor for small sampling fractions. On the other hand, the variance analysis indicates that modification of D̂_Sh as in (27) and (29) indeed reduces the instability of the original Shlosser estimator in this case. Thus we focus on the estimators D̂_uj1, D̂_uj2, and D̂_Sh2 in the remainder of this section and in the next section. (We return to the estimator D̂_Sh3 in Section 6, where our simulation experiments indicate that D̂_Sh3 can exhibit smaller rmse than the other estimators, but only at large sample sizes and for certain "ill-conditioned" populations in which γ² is extremely large.) Figure 2 compares the three estimators D̂_uj1, D̂_uj2, and D̂_Sh2 for equal-class-size populations with a range of class sizes; for these calculations the number of classes and the sampling fraction are held constant at D = 1500 and q = 0.10. This figure illustrates the difficulty of precisely estimating D when the class size is small (but greater than 1). Again, we see that these three estimators perform similarly, with nearly equal variability when N̄ exceeds about 40.

We checked the accuracy of the variance approximation in some example populations by comparing the values computed from (32) with the results of a simulation experiment. (This experiment is discussed more completely in Section 6 below.) Simulated sampling with q = 0.05, 0.10, and 0.20 from the population examined in Figure 1 (N = 15,000, D = 1500) yields variance estimates within 10% (on average) of those calculated from (32). Similar results were found in sampling from an equal-class-size population with N = 15,000 and D = 150. The only difficulties we encountered occurred for equal-class-size populations with class sizes of N̄ = 1 and N̄ = 2. For these small class sizes the variance approximation, which is based on the approximation of the hypergeometric sample design by a Bernoulli sample design, is not sufficiently accurate.
In particular, the approximate variance strongly reflects random fluctuations in the sample size due to the Bernoulli sample design; such fluctuations are not present in the actual hypergeometric sample design. Simulation experiments indicate that for N̄ ≥ 3 the differences caused by Bernoulli versus hypergeometric sampling become negligible. (Of course, if the sample design is in fact Bernoulli, then this problem does not occur.)

In practice, we estimate the asymptotic variance of an estimator D̂ by substituting estimates for { Var[f_i] : 1 ≤ i ≤ M } and { Cov[f_i, f_{i'}] : 1 ≤ i ≠ i' ≤ M } into (32). To obtain such estimates, we approximate the true population by a population with D classes, each of size N/D. Under this approximation and the assumption in (8) of a Bernoulli sample design, the random vector f has a multinomial distribution with parameters D and p = (p_1, p_2, ..., p_n), where

    p_i = C(N/D, i) q^i (1−q)^{(N/D)−i}

for 1 ≤ i ≤ n. It follows that Var[f_i] = D p_i (1−p_i) and Cov[f_i, f_{i'}] = −D p_i p_{i'}. Each p_i can be estimated either by C(N/D̂, i) q^i (1−q)^{(N/D̂)−i} or simply by f_i/D̂. It turns out that the latter formula yields better variance estimates, and so we take

    V̂ar[f_i] = f_i (1 − f_i/D̂)    and    Ĉov[f_i, f_{i'}] = −f_i f_{i'}/D̂

for 1 ≤ i, i' ≤ n. These formulas coincide with the estimators obtained using the "unconditional approach" of Chao and Lee (1992). A computer program that calculates D̂_uj1, D̂_uj2, D̂_Sh2, and their estimated standard errors from sample data can be obtained from the second author.

5. An Example

The following example illustrates how knowledge of the population size N can affect estimates of the number of classes. When the population size N is unknown, Chao and Lee (1992, Sec.
3) have proposed that the estimator D̂_CL defined in Section 3.3 be used to estimate the number of classes, because the formula for D̂_CL does not involve the unknown parameter N. When N is known, a slight modification of the derivation of D̂_CL leads to the unsmoothed second-order jackknife estimator D̂_uj2 (see App. B).

Our example is based on one discussed by Chao and Lee (1992), who borrowed data first described and analyzed by Holst (1981). These data arose from an application in numismatics in which 204 ancient coins were classified according to die type in order to estimate the number of different dies used in the minting process. Among the die types on the reverse sides of the 204 coins were 156 singletons, 19 pairs, 2 triplets, and 1 quadruplet (f_1 = 156, f_2 = 19, f_3 = 2, f_4 = 1, d = 178). Because the total number of coins minted is unknown in this case, model (1) is inappropriate for analyzing these data. But suppose that the same data had arisen from an application in which N was known. For example, suppose that the data were obtained by selecting a simple random sample of 204 names from a sampling frame that had been constructed by combining 5 lists of 200 names each (N = 1000), 50 lists of 200 names each (N = 10,000), or 500 lists of 200 names each (N = 100,000). In each case our object is to estimate the number of unique individuals on the combined list, based on the sample results. We focus on the three estimators D̂_uj1, D̂_uj2, and D̂_Sh2.

           N     D̂_uj1      D̂_uj2      D̂_Sh2
       1,000   455  (47)   502  (60)   455  (51)
      10,000   709 (125)   788 (161)   707 (128)
     100,000   752 (141)   835 (183)   749 (144)

    Table 1: Values of D̂_uj1, D̂_uj2, and D̂_Sh2 for three hypothetical
    combined lists. (Standard errors are in parentheses.)

The estimates for the three cases are given in Table 1; the standard errors displayed in Table 1 are estimated using the procedure outlined in Section 4.
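For concreteness, the first-order estimate and its estimated standard error can be computed directly from the frequency counts. The following sketch (the function names are ours, and a numerical gradient stands in for the closed-form A_i coefficients of Section 4) reproduces the D̂_uj1 column of Table 1, including a standard error close to the reported 141 for N = 100,000:

```python
import math

def d_uj1(f, N):
    """Unsmoothed first-order jackknife estimate D_uj1 = d / (1 - (1-q) f1 / n),
    computed from f[i-1] = f_i, the number of classes seen exactly i times."""
    n = sum(i * fi for i, fi in enumerate(f, start=1))  # sample size
    d = sum(f)                                          # distinct classes observed
    q = n / N                                           # sampling fraction
    return d / (1.0 - (1.0 - q) * f[0] / n)

def se_uj1(f, N, h=1e-4):
    """Standard error via (32): numerical coefficients A_i = dD/df_i combined
    with the moment estimates Var[f_i] ~ f_i(1 - f_i/D), Cov ~ -f_i f_j / D."""
    D, M = d_uj1(f, N), len(f)
    A = []
    for i in range(M):
        fp = [float(x) for x in f]; fp[i] += h
        fm = [float(x) for x in f]; fm[i] -= h
        A.append((d_uj1(fp, N) - d_uj1(fm, N)) / (2.0 * h))
    v = sum(A[i] * A[i] * f[i] * (1.0 - f[i] / D) for i in range(M))
    v -= sum(A[i] * A[j] * f[i] * f[j] / D
             for i in range(M) for j in range(M) if i != j)
    return math.sqrt(v)

# Coin data: f1 = 156 singletons, f2 = 19, f3 = 2, f4 = 1 (n = 204, d = 178)
freqs = [156, 19, 2, 1]
table1_uj1 = [round(d_uj1(freqs, N)) for N in (1000, 10000, 100000)]
```

Running the sketch gives table1_uj1 == [455, 709, 752], matching Table 1.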
We would expect similar inferences to be made from the same data under the multinomial model and the finite population model when N is very large. Indeed, the value D̂_uj2 = 835 agrees closely with Chao and Lee's estimate D̂_CL = 844 (se = 187) when N = 100,000. Moreover, when N = 100,000 we find that γ̂²(D̂_uj1) ≈ 0.13, which is the same estimate of γ² given by Chao and Lee. As the population size decreases, however, both our assessment of the magnitude of D and our uncertainty about that magnitude decrease, because we are observing a larger and larger fraction of both the population and the classes. The most extreme divergence between the estimate obtained using D̂_CL and estimates obtained using D̂_uj1, D̂_uj2, or D̂_Sh2 occurs when the sample consists of all singletons (f_1 = n). In that case, D̂_CL = ∞, whereas D̂_uj1 = D̂_uj2 = D̂_Sh2 = N. This result indicates that when the population size N is known, it is better to use an estimator that exploits knowledge of N than to sample with replacement and use the estimator D̂_CL. In some applications, sampling with replacement is not even an option. For example, the only available sampling mechanism in at least one current database system is a one-pass reservoir algorithm (as in Vitter 1985).

The empirical results in Section 6 indicate that, of the three estimators displayed in Table 1, D̂_uj2 is the superior estimator when γ² is small (< 1). Thus for our example, D̂_uj2 would be the preferred estimator, since γ̂²(D̂_uj1) ≈ 0.13 in all three cases. Note that D̂_uj2 consistently has the highest variance of the three estimators in Table 1. The bias of D̂_uj2 is typically lower than that of D̂_uj1 or D̂_Sh2 when γ² is small, however, so that the overall rmse is lower.

6. Simulation Results

This section describes the results of a simulation study done to compare the performance of the various estimators described in Section 3.
Our comparison is based on the performance of the estimators for sampling fractions of 5%, 10%, and 20% in 52 populations. (Initial experiments indicated that the performance of the various estimators is best viewed as a function of the sampling fraction, rather than of the absolute sample size. This is in contrast to estimators of, for example, population averages.) We consider several sets of populations.

The first set comprises synthetic populations of the type considered in the literature. Populations EQ10 and EQ100 have equal class sizes of 10 and 100, respectively. In populations NGB/1, NGB/2, and NGB/4, the class sizes follow a negative binomial distribution. Specifically, the fraction f(m) of classes in population NGB/k with class size equal to m is given by

    f(m) = C(m−1, k−1) r^k (1−r)^{m−k}

for m ≥ k, where r = 0.04. Chao and Lee (1992) considered populations of this type.

The populations in the second set are meant to be representative of data that could be encountered when a sampling frame for a population census is constructed by combining a number of lists that may contain overlapping entries. Populations GOOD and SUDM were studied by Goodman (1949) and Sudman (1976). Population FRAME2 mimics a sampling frame that might arise in an administrative records census of the type described in Section 1. One approach to such a census is to augment the usual census address list with a small number of relatively large administrative records files, such as AFDC or Food Stamps, and then estimate the number of distinct individuals on the combined list from a sample. We have constructed FRAME2 so that a given individual can appear at most five times, but most individuals appear exactly once, mimicking the case in which four administrative lists are used to supplement the census address list.
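The negative binomial class-size law used for the NGB/k populations above can be sampled by counting Bernoulli(r) trials until k successes occur; a minimal sketch (the function name is ours):

```python
import random

def ngb_class_size(k, r=0.04, rng=random):
    """One draw from the NGB/k class-size law: P{size = m} =
    C(m-1, k-1) r^k (1-r)^(m-k) for m >= k, realized as the number of
    Bernoulli(r) trials needed to collect k successes."""
    m, successes = 0, 0
    while successes < k:
        m += 1
        if rng.random() < r:
            successes += 1
    return m
```

The mean class size under this law is k/r (e.g., 50 for NGB/2 with r = 0.04).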
Population FRAME3 is similar to FRAME2, but for the FRAME3 population it is assumed that the combined list is made up of a number of small lists (perhaps obtained from neighborhood-level organizations) rather than a few large lists.

The populations in the third set, denoted by Z20A, Z20B, and Z15, are used to study the behavior of the estimators when the data are extremely ill-conditioned. The class sizes in each of these populations follow a generalized Zipf distribution (see Knuth 1973, p. 398). Specifically, N_j/N ∝ j^{−θ}, where θ equals 1.5 or 2.0. These populations have extremely high values of γ².

Descriptive statistics for these three sets of populations are given in Tables 2, 3, and 4. The column entitled "Skew" displays the dimensionless coefficient of skewness, which is defined by

    [ Σ_{j=1}^{D} (N_j − N̄)³ / D ] / [ Σ_{j=1}^{D} (N_j − N̄)² / D ]^{3/2}.

      Name       N       D     γ²    Skew
      EQ10    15,000  1,500   0.00   0.00
      EQ100   15,000    150   0.00   0.00
      NGB/4   82,135    874   0.18   0.50
      NGB/2   41,197    906   0.37   0.81
      NGB/1   20,213    930   0.75   1.25

    Table 2: Characteristics of synthetic populations.

      Name        N        D      γ²    Skew
      GOOD     10,000    9,595   0.04   5.64
      FRAME2   33,750   19,000   0.31   1.18
      FRAME3  111,500   36,000   0.52   1.92
      SUDM    330,000  100,000   1.87   2.71

    Table 3: Characteristics of "merged list" populations.

      Name      N        D      γ²     Skew
      Z20A   50,000     247  114.38   14.60
      Z15    50,000     772  166.18   23.44
      Z20B   50,000  10,384  234.81   73.54

    Table 4: Characteristics of "ill-conditioned" populations.

The final set comprises 40 real populations that demonstrate the type of distributions encountered when estimating the number of distinct values of an attribute in a relational database. Specifically, the populations studied correspond to various relational attributes from a database of enrollment records for students at the University of Wisconsin and a database of billing records from a large insurance company. The population size N ranges from 15,469 to 1,654,700, with D ranging from 3 to 1,547,606 and γ² ranging from 0 to 81.63 (see App. D for further details).
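The two descriptive statistics reported in Tables 2-4 are simple functionals of the class sizes; a sketch (function names ours), together with a generalized-Zipf example of the kind used for the ill-conditioned populations:

```python
def gamma_sq(sizes):
    """Squared coefficient of variation of the class sizes N_1, ..., N_D."""
    D = len(sizes)
    mean = sum(sizes) / D
    var = sum((s - mean) ** 2 for s in sizes) / D
    return var / (mean * mean)

def skewness(sizes):
    """Dimensionless coefficient of skewness, as defined above."""
    D = len(sizes)
    mean = sum(sizes) / D
    m2 = sum((s - mean) ** 2 for s in sizes) / D
    m3 = sum((s - mean) ** 3 for s in sizes) / D
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

# Generalized Zipf class sizes, N_j proportional to j^(-1.5) (illustrative scale)
zipf_sizes = [1000.0 / j ** 1.5 for j in range(1, 201)]
```

Equal class sizes give γ² = 0 and skewness 0, while the Zipf sizes are heavily skewed with γ² well above 1, as in Table 4.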
It is notable that the values of γ² encountered in the literature (Chao and Lee 1992; Goodman 1949; Shlosser 1981; Sudman 1976) tend not to exceed 2, and are typically less than 1, whereas γ² exceeds 2 for more than 50% of the real populations.

For each estimator, population, and sampling fraction, we estimated the bias and rmse by repeatedly drawing a simple random sample from the population, evaluating the estimator, and then computing the error of each estimate. (When evaluating the estimator, we truncated each estimate below at d and above at N.) The final estimate of bias was obtained by averaging the error over all of the experimental replications, and the rmse was estimated as the square root of the averaged squared error. We used 100 replications, which was sufficient to estimate the rmse with a standard error below 5% in nearly all cases; typically the standard error was much less.

    sampling  γ² range      stat  D̂_uj1  D̂_sj1    D̂_uj2   D̂_sj2    D̂_Sh  D̂_Sh2    D̂_Sh3     γ̂²
    5%    0 ≤ γ² < 1   Average  13.48  14.20   11.84   12.27   79.17  13.23   202.16  56.65
                       Maximum  43.81  45.14   39.56   39.67  428.25  46.59  3299.10  96.72
          1 ≤ γ² < 50  Average  38.14  39.17   65.34   45.25   54.30  36.67    93.92  46.51
                       Maximum  70.47  70.48  186.15  186.15  218.02  66.82  1042.73  91.70
          γ² ≥ 50      Average  74.11  75.92  388.77   77.78   28.13  71.23    21.45  74.72
                       Maximum  85.09  88.49  564.57  112.13   47.63  83.71    38.58  85.55
          all          Average  30.95  31.91   68.61   34.44   62.33  29.86   132.06  52.78
                       Maximum  85.09  88.49  564.57  186.15  428.25  83.71  3299.10  96.72
    10%   0 ≤ γ² < 1   Average  11.30  12.14    9.05    9.71   33.09  11.19    22.68  49.68
                       Maximum  39.80  42.32   31.73   31.90  200.79  44.83   131.15  90.68
          1 ≤ γ² < 50  Average  31.41  32.59   90.96   38.74   34.96  29.16    50.17  38.34
                       Maximum  61.27  61.28  267.08  186.15  107.16  54.03   357.43  83.12
          γ² ≥ 50      Average  63.92  65.88  682.55  115.77   15.50  58.82    11.51  64.43
                       Maximum  76.47  81.21 1133.61  281.98   28.97  73.14    21.81  76.89
          all          Average  25.79  26.89  103.38   32.94   32.71  24.18    36.10  44.93
                       Maximum  76.47  81.21 1133.61  281.98  200.79  73.14   357.43  90.68
    20%   0 ≤ γ² < 1   Average   8.89   9.86    5.77    6.53   12.91   8.30     9.05  40.65
                       Maximum  33.01  37.28   29.82   27.49   79.16  30.14    79.16  81.03
          1 ≤ γ² < 50  Average  23.44  24.81  123.00   32.79   18.14  20.88    17.91  28.65
                       Maximum  46.77  49.73  369.77  186.15   49.20  43.38    74.99  67.42
          γ² ≥ 50      Average  50.10  52.19 1093.07  130.30    7.73  42.58     6.32  50.51
                       Maximum  62.96  69.06 2010.61  381.51   15.12  56.72    10.62  63.37
          all          Average  19.62  20.88  150.28   29.69   15.23  17.47    13.44  35.18
                       Maximum  62.96  69.06 2010.61  381.51   79.16  56.72    79.16  81.03

    Table 5: Average and maximum rmse (%) for various estimators.

Summary results from the simulations are displayed in Tables 5 and 6. Table 5 gives the average and maximum rmse's for each estimator of D over all populations with 0 ≤ γ² < 1, with 1 ≤ γ² < 50, and with γ² ≥ 50, as well as the average and maximum rmse's for each estimator over all populations combined. Similarly, Table 6 gives the average and maximum bias for each estimator. In these tables, the rmse and bias are each expressed as a percentage of the true number of classes. Tables 5 and 6 also display the rmse and bias of the estimator γ̂²(D̂_uj1) used in the second-order jackknife estimators; this rmse and bias are expressed as a percentage of the true value of γ² and are displayed in the column labelled γ̂².

Comparing Tables 5 and 6 indicates that for each estimator the major component of the rmse is almost always bias, not variance. Thus, even though the standard error can be estimated as in Section 4, this estimated standard error usually does not give an accurate picture of the error in estimation of D.
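The bias/rmse estimation procedure described above can be sketched as a generic Monte Carlo loop (Python; `estimator(sample, N)` is a hypothetical interface standing in for any of the estimators of Section 3):

```python
import math, random

def bias_and_rmse_pct(population, q, estimator, reps=100, rng=random):
    """Monte Carlo bias and rmse, each as a percentage of the true number of
    classes D: draw a simple random sample of n = qN items, truncate the
    estimate to [d, N], and average the errors over the replications."""
    N, D = len(population), len(set(population))
    n = int(round(q * N))
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)
        d = len(set(sample))
        est = min(max(estimator(sample, N), d), N)  # truncate below at d, above at N
        errors.append(est - D)
    bias = sum(errors) / reps
    rmse = math.sqrt(sum(e * e for e in errors) / reps)
    return 100.0 * bias / D, 100.0 * rmse / D
```

Applied to the naive estimator D̂ = d on an equal-class-size population, this exhibits the expected downward bias, and the rmse never falls below the absolute bias.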
Another consequence of the predominance of bias is that when γ² is large, the rmse for the second-order estimator D̂_uj2 does not decrease monotonically as the sampling fraction increases. (In all other cases the rmse decreases monotonically.)

    sampling  γ² range      stat   D̂_uj1   D̂_sj1    D̂_uj2   D̂_sj2    D̂_Sh   D̂_Sh2   D̂_Sh3      γ̂²
    5%    0 ≤ γ² < 1   Average  -12.71  -13.43  -10.76  -11.38   71.11  -10.98   90.57  -55.75
                       Maximum  -43.77  -45.10  -39.51  -39.62  427.53  -46.59  958.74  -94.97
          1 ≤ γ² < 50  Average  -37.95  -38.99   42.77  -16.83   39.13  -36.35   61.98  -46.19
                       Maximum  -70.32  -70.32  186.15  186.15  218.01  -66.49  663.26  -91.70
          γ² ≥ 50      Average  -74.10  -75.91  382.88  -22.16   22.92  -71.22    3.17  -74.71
                       Maximum  -85.09  -88.49  556.68  110.28   44.65  -83.71   33.44  -85.54
          all          Average  -30.54  -31.51   47.31  -15.04   50.80  -28.79   69.00  -52.24
                       Maximum  -85.09  -88.49  556.68  186.15  427.53  -83.71  958.74  -94.97
    10%   0 ≤ γ² < 1   Average  -10.93  -11.78   -8.38   -9.31   28.12   -9.47   17.66  -48.87
                       Maximum  -39.79  -42.31  -31.47  -31.88  200.49  -44.83  130.80  -90.59
          1 ≤ γ² < 50  Average  -31.16  -32.34   74.62  -10.44   25.41  -28.61   35.20  -37.98
                       Maximum  -61.00  -61.00  261.47  186.15  107.16  -53.88  264.38  -83.12
          γ² ≥ 50      Average  -63.91  -65.87  677.18   24.90   11.57  -58.78    3.10  -64.41
                       Maximum  -76.47  -81.21 1125.89  280.63   27.09  -73.13   18.47  -76.88
          all          Average  -25.51  -26.62   87.45   -7.26   25.44  -23.20   25.65  -44.41
                       Maximum  -76.47  -81.21 1125.89  280.63  200.49  -73.13  264.38  -90.59
    20%   0 ≤ γ² < 1   Average   -8.57   -9.55   -4.99   -6.20    9.99   -6.75    5.73  -39.71
                       Maximum  -33.01  -37.27  -17.83  -22.67   45.86  -28.38   28.17  -81.00
          1 ≤ γ² < 50  Average  -23.12  -24.49  112.39   -3.41   12.09  -20.13   10.02  -28.23
                       Maximum  -46.54  -49.73  362.12  186.15   49.20  -43.38   49.34  -67.36
          γ² ≥ 50      Average  -50.09  -52.17 1087.89   60.47    5.03  -42.53    1.72  -50.49
                       Maximum  -62.96  -69.06 2003.12  381.51   13.90  -56.71    8.23  -63.36
          all          Average  -19.32  -20.59  140.02    0.38   10.70  -16.45    7.65  -34.58
                       Maximum  -62.96  -69.06 2003.12  381.51   49.20  -56.71   49.34  -81.00

    Table 6: Average and maximum bias (%) for various estimators.
Comparing D̂_uj1 with D̂_sj1 and then comparing D̂_uj2 with D̂_sj2, we see that smoothing a first-order jackknife estimator never results in a better first-order estimator. On the other hand, smoothing a second-order jackknife estimator can result in significant performance improvement when γ² is large. Similarly, using higher-order Taylor expansions leads to mixed results. Second-order estimators perform better than first-order estimators when γ² is relatively small, but not when γ² is large. The difficulty is partially that the estimator γ̂²(D̂_uj1) tends to underestimate γ² when γ² is large, leading to underestimates of the number of classes. Moreover, the Taylor approximations underlying D̂_uj1, D̂_sj1, D̂_uj2, and D̂_sj2 are derived under the assumption of not too much variability between class sizes; this assumption is violated when γ² is large. There apparently is no systematic relation between the coefficient of skewness of the class sizes and the performance of second-order jackknife estimators. As predicted in Sections 3.4 and 4, the estimators D̂_Sh and D̂_Sh3 behave poorly when γ² is relatively small, and D̂_Sh3 performs better than D̂_Sh when γ² is large.
For small to medium values of γ², the modified estimator D̂_Sh2 has a smaller rmse than D̂_Sh or D̂_Sh3, and its performance is comparable to that of the generalized jackknife estimators. For extremely large values of γ² and also for large sample sizes, the estimator D̂_Sh3 has the best performance of the three Shlosser-type estimators. (For a 20% sampling fraction, D̂_Sh3 in fact has the lowest average rmse of all the estimators considered.)

    sampling  γ² range      stat   D̂_uj2  D̂_uj2a  D̂_Sh2    D̂_Sh3  D̂_hybrid
    5%    0 ≤ γ² < 1   Average   11.84   19.46  13.23   202.16     11.84
                       Maximum   39.56  192.64  46.59  3299.10     39.56
          1 ≤ γ² < 50  Average   65.34   27.47  36.67    93.92     27.47
                       Maximum  186.15   54.51  66.82  1042.73     54.51
          γ² ≥ 50      Average  388.77   23.00  71.23    21.45     26.17
                       Maximum  564.57   36.60  83.71    38.58     39.20
          all          Average   68.61   23.89  29.86   132.06     21.06
                       Maximum  564.57  192.64  83.71  3299.10     54.51
    10%   0 ≤ γ² < 1   Average    9.05   13.26  11.19    22.68      9.05
                       Maximum   31.73  120.14  44.83   131.15     31.73
          1 ≤ γ² < 50  Average   90.96   19.22  29.16    50.17     19.55
                       Maximum  267.08   48.12  54.03   357.43     48.12
          γ² ≥ 50      Average  682.55   17.82  58.82    11.51     11.51
                       Maximum 1133.61   27.30  73.14    21.81     21.81
          all          Average  103.38   16.71  24.18    36.10     14.69
                       Maximum 1133.61  120.14  73.14   357.43     48.12
    20%   0 ≤ γ² < 1   Average    5.77    8.12   8.30     9.05      5.77
                       Maximum   29.82   79.16  30.14    79.16     29.82
          1 ≤ γ² < 50  Average  123.00   17.44  20.88    17.91     17.69
                       Maximum  369.77   76.57  43.38    74.99     76.57
          γ² ≥ 50      Average 1093.07   37.30  42.58     6.32      6.32
                       Maximum 2010.61   83.69  56.72    10.62     10.62
          all          Average  150.28   15.20  17.47    13.44     12.00
                       Maximum 2010.61   83.69  56.72    79.16     76.57

    Table 7: Average and maximum rmse (%) of D̂_uj2, D̂_uj2a, D̂_Sh2, D̂_Sh3, and D̂_hybrid.

As indicated earlier, smoothing can improve the performance of the second-order jackknife estimator D̂_uj2. An alternative ad hoc technique for improving performance is to "stabilize" D̂_uj2 using a method suggested by Chao, Ma, and Yang (1993).
Fix c ≥ 1 and remove any class whose frequency in the sample exceeds c; that is, remove from the sample all members of the classes { C_j : j ∈ B }, where B = { 1 ≤ j ≤ D : n_j > c }. Then compute the estimator D̂_uj2 from the reduced sample and subsequently increment it by |B| to produce the final estimate, denoted by D̂_uj2a. (Here |B| denotes the number of elements in the set B.) When computing D̂_uj2 from the reduced sample, take the population size as N − Σ_{j∈B} N̂_j, where each N̂_j is a method-of-moments estimator of N_j as in Section 3.2.3. If n − Σ_{j∈B} n_j = 0, then simply compute D̂_uj2 from the full sample.

The idea behind this procedure is as follows. When γ² is large, the population consists of a few large classes and many smaller classes. By in effect removing the largest classes from the population, we obtain a reduced population for which γ² is smaller, so that D is easier to estimate; the contribution to D from the |B| removed classes is then added back at the final step of the estimation process. (We also experimented with another stabilization technique in which the k most frequent classes are removed for some fixed k, but this technique is not as effective.) Preliminary experiments indicated that taking c approximately equal to 50 yields the best performance. For larger values of c, not enough of the frequent classes are removed; for smaller values of c, the size of the reduced sample is too small, and the resulting inaccuracy of D̂_uj2 when computed from this sample offsets the benefits of the reduction in γ². We therefore take c = 50 in our experiments.

As can be seen from Table 7, the rmse for D̂_uj2a is indeed much lower than that for D̂_uj2 when γ² exceeds 1. Moreover, by comparing the rmse of D̂_sj2 and D̂_uj2a in Tables 5 and 7, respectively, it can be seen that stabilization is more effective than smoothing. Observe, however, that the performance of D̂_uj2a is worse than that of D̂_uj2 when γ² is small.
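The stabilization step can be sketched as a wrapper around any base estimator (Python; the interface and the simplified moment estimate N̂_j ≈ n_j/q are our assumptions, standing in for the exact method-of-moments estimator of Section 3.2.3):

```python
def stabilized_estimate(counts, N, base_estimator, c=50):
    """Set aside classes whose sample frequency exceeds c, apply the base
    estimator to the reduced sample with a correspondingly reduced population
    size, then add back the number of removed classes.
    `counts` maps class label -> sample frequency n_j;
    `base_estimator(counts, N)` stands in for D_uj2."""
    n = sum(counts.values())
    q = n / N
    B = {j for j, nj in counts.items() if nj > c}
    reduced = {j: nj for j, nj in counts.items() if j not in B}
    if not reduced:  # reduced sample is empty: fall back to the full sample
        return base_estimator(counts, N)
    # crude method-of-moments class-size estimates for the removed classes
    N_reduced = N - sum(counts[j] / q for j in B)
    return len(B) + base_estimator(reduced, N_reduced)
```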
Interestingly, experiments indicate that none of the other estimators that we consider appears to benefit from stabilization, and we apply this technique only to D̂_uj2. Overall, the most effective estimators appear to be D̂_uj2a, which has the smallest average rmse over the various populations, and D̂_Sh2, which has the smallest worst-case rmse.

Our next observation is based on a comparison of the bias and rmse of D̂_uj1 and D̂_Sh2 for all of the populations studied. The behavior of the two estimators is quite similar: the correlation between the bias of the estimators is 0.990 and the correlation between the rmse is 0.993. The rmse and bias of D̂_uj1 are usually slightly greater than the rmse and bias, respectively, of D̂_Sh2. On the other hand, using D̂_Sh2 requires computation of f_1, f_2, ..., f_n, whereas using D̂_uj1 requires computation only of f_1. Thus, if computational resources are limited, then it may be desirable to use D̂_uj1 as a surrogate for D̂_Sh2; the quantity f_1 can be computed efficiently using "Bloom filter" techniques as described by Ramakrishna (1989).

The experimental results show that the relative performance of the estimators is strongly influenced by the value of γ². As can be seen from Table 7, the estimator D̂_uj2 has the smallest average rmse when 0 ≤ γ² < 1, the estimator D̂_uj2a has the smallest average rmse when 1 ≤ γ² < 50, and the estimator D̂_Sh3 has the smallest average rmse when γ² ≥ 50. These results indicate that it may be desirable to allow an estimator to depend explicitly on the (estimated) value of γ². To illustrate this idea, we consider a simple ad hoc branching estimator, denoted by D̂_hybrid. The idea is to estimate γ² by γ̂²(D̂_uj1), fix parameters 0 < θ_1 < θ_2, and set

    D̂_hybrid = D̂_uj2    if 0 ≤ γ̂²(D̂_uj1) < θ_1;
               D̂_uj2a   if θ_1 ≤ γ̂²(D̂_uj1) < θ_2;        (33)
               D̂_Sh3    if γ̂²(D̂_uj1) ≥ θ_2.

Table 7 displays the estimated rmse for D̂_hybrid when θ_1 = 0.9 and θ_2 = 30.
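The branching rule (33) itself is simply a three-way switch on γ̂²(D̂_uj1); a minimal sketch (argument names ours):

```python
def d_hybrid(gamma2_hat, est_uj2, est_uj2a, est_sh3, theta1=0.9, theta2=30.0):
    """Branching estimator (33): choose among the three candidate estimates
    according to the estimated squared coefficient of variation gamma2_hat."""
    if gamma2_hat < theta1:
        return est_uj2
    if gamma2_hat < theta2:
        return est_uj2a
    return est_sh3
```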
As can be seen, the rmse for the combined estimator D̂_hybrid almost never exceeds that for D̂_uj2, D̂_uj2a, or D̂_Sh3 separately.

7. Conclusions

Both new and previous nonparametric estimators of the number of classes in a finite population can be viewed as generalized jackknife estimators. This viewpoint has suggested ways to improve Shlosser's original estimator and has shed new light on certain Horvitz-Thompson estimators as well as estimators based on notions of "sample coverage." We have used delta-method arguments to develop estimators of the standard error of generalized jackknife estimators. As indicated by the example in Section 5, knowledge of the population size can lead to more precise estimation of the number of classes.

Of the estimators considered, the best appears to be the branching estimator D̂_hybrid defined by (33), in which a modified Shlosser estimator is used when the coefficient of variation of the class sizes is estimated to be extremely large and unsmoothed second-order jackknife estimators are used otherwise. The systematic development of such branching estimators is a topic for future research. If a nonbranching estimator is desired, then we recommend the stabilized unsmoothed second-order jackknife estimator D̂_uj2a, followed by the modified Shlosser estimator D̂_Sh2. If computing resources are scarce, then D̂_uj1 is a reasonable estimator.

The various estimators of D discussed in this article embody different approaches for dealing with the difficulties caused by variation in the class sizes N_1, N_2, ..., N_D. Such variation is reflected by large values of γ². First-order estimators simply approximate each N_j by N̄. It is well known in the literature that such an approach tends to yield downwardly biased estimates (see Bunge and Fitzpatrick 1993).
More sophisticated approaches considered here include Taylor corrections to the first-order approximation, as in the estimators D̂_uj2 and D̂_sj2; the stabilization technique of Section 6, in which the population is in effect modified so that the variation in class sizes is reduced; the Horvitz-Thompson approach, in which the first-order assumption is avoided by explicitly estimating each N_j such that n_j > 0; and Shlosser's approach, which replaces the first-order assumption with the assumption in (25) and in its purest form results in the estimators D̂_Sh and D̂_Sh3. The poor performance of the Horvitz-Thompson estimators indicates that approaches based on direct estimation of the N_j's are unlikely to be successful. The second-order Taylor correction is effective mainly for small values of γ², and both the stabilization technique and Shlosser's approach are effective mainly for large values of γ². Thus, until a better solution is found, the best estimators will result from a judicious combination of the various approaches considered here.

Acknowledgements

Hongmin Lu made substantial contributions to the early phases of the work reported here, and an anonymous reviewer suggested the "stabilization" technique for D̂_uj2 used in Section 6. The Wisconsin student database was graciously provided by Bob Nolan of the University of Wisconsin-Madison Department of Information Technology (DoIT) and Jeff Naughton of the University of Wisconsin-Madison Computer Sciences Department.

References

Astrahan, M., Schkolnick, M., and Whang, K. (1987), "Approximating the Number of Unique Values of an Attribute Without Sorting," Information Systems, 12, 11-15.

Billingsley, P. (1986), Probability and Measure (2nd ed.), New York: Wiley.

Bishop, Y., Fienberg, S., and Holland, P. (1975), Discrete Multivariate Analysis, Cambridge, MA: MIT Press.

Bunge, J., and Fitzpatrick, M.
(1993), "Estimating the Number of Species: A Review," Journal of the American Statistical Association, 88, 364-373.

Burnham, K. P., and Overton, W. S. (1978), "Estimation of the Size of a Closed Population When Capture Probabilities Vary Among Animals," Biometrika, 65, 625-633.

--- (1979), "Robust Estimation of Population Size When Capture Probabilities Vary Among Animals," Ecology, 60, 927-936.

Chao, A., and Lee, S. (1992), "Estimating the Number of Classes via Sample Coverage," Journal of the American Statistical Association, 87, 210-217.

Chao, A., Ma, M.-C., and Yang, M. C. K. (1993), "Stopping Rules and Estimation for Recapture Debugging With Unequal Failure Rates," Biometrika, 80, 193-201.

Chung, K. L. (1974), A Course in Probability Theory (2nd ed.), New York: Academic Press.

Deming, W. E., and Glasser, G. J. (1959), "On the Problem of Matching Lists by Samples," Journal of the American Statistical Association, 54, 403-415.

Flajolet, P., and Martin, G. N. (1985), "Probabilistic Counting Algorithms for Data Base Applications," Journal of Computer and System Sciences, 31, 182-209.

Gelenbe, E., and Gardy, D. (1982), "On the Sizes of Projections: I," Information Processing Letters, 14, 18-21.

Good, I. J. (1950), Probability and the Weighing of Evidence, London: Charles Griffin.

Goodman, L. A. (1949), "On the Estimation of the Number of Classes in a Population," Annals of Mathematical Statistics, 20, 572-579.

--- (1952), "On the Analysis of Samples From k Lists," Annals of Mathematical Statistics, 23, 632-634.

Gray, H. L., and Schucany, W. R. (1972), The Generalized Jackknife Statistic, New York: Marcel Dekker.

Hellerstein, J. M., and Stonebraker, M. (1994), "Predicate Migration: Optimizing Queries With Expensive Predicates," in Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pp. 267-276.

Heltshe, J. F., and Forrester, N. E. (1983), "Estimating Species Richness Using the Jackknife Procedure," Biometrics, 39, 1-11.
Holst, L. (1981), "Some Asymptotic Results for Incomplete Multinomial or Poisson Samples," Scandinavian Journal of Statistics, 8, 243-246.

Hou, W., Ozsoyoglu, G., and Taneja, B. (1988), "Statistical Estimators for Relational Algebra Expressions," in Proceedings of the Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 276-287.

--- (1989), "Processing Aggregate Relational Queries With Hard Time Constraints," in Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, pp. 68-77.

Kish, L. (1965), Survey Sampling, New York: Wiley.

Knuth, D. E. (1973), The Art of Computer Programming, Vol. 3: Sorting and Searching, Reading, MA: Addison-Wesley.

Korth, H. F., and Silberschatz, A. (1991), Database System Concepts (2nd ed.), New York: McGraw-Hill.

Miller, R. G. (1974), "The Jackknife - A Review," Biometrika, 61, 1-17.

Mosteller, F. (1949), "Questions and Answers," American Statistician, 3, 12-13.

Naughton, J. F., and Seshadri, S. (1990), "On Estimating the Size of Projections," in Proceedings of the Third International Conference on Database Theory, pp. 499-513.

Ozsoyoglu, G., Du, K., Tjahjana, A., Hou, W., and Rowland, D. Y. (1991), "On Estimating COUNT, SUM, and AVERAGE Relational Algebra Queries," in Database and Expert Systems Applications, Proceedings of the International Conference in Berlin, Germany, 1991 (DEXA 91), pp. 406-412.

Ramakrishna, M. V. (1989), "Practical Performance of Bloom Filters and Parallel Free-Text Searching," Communications of the ACM, 32, 1237-1239.

Särndal, C.-E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. (1979), "Access Path Selection in a Relational Database Management System," in Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pp. 23-34.

Serfling, R. J.
(1980), Approximation Theorems of Mathematical Statistics, New York: Wiley.

Shlosser, A. (1981), "On Estimation of the Size of the Dictionary of a Long Text on the Basis of a Sample," Engineering Cybernetics, 19, 97-102.

Smith, E. P., and van Belle, G. (1984), "Nonparametric Estimation of Species Richness," Biometrics, 40, 119-129.

Sudman, S. (1976), Applied Sampling, New York: Academic Press.

Vitter, J. S. (1985), "Random Sampling With a Reservoir," ACM Transactions on Mathematical Software, 11, 37-57.

Whang, K., Vander-Zanden, B. T., and Taylor, H. M. (1990), "A Linear-Time Probabilistic Counting Algorithm for Database Applications," ACM Transactions on Database Systems, 15, 208-229.

A. Estimators Based On Hypergeometric Probabilities

As in Section 3, denote by n_j the number of elements in the sample that belong to class j for 1 ≤ j ≤ D. Under the hypergeometric model (1), we have

    P{ n_j = k } = C(N_j, k) C(N−N_j, n−k) / C(N, n)

                 = C(N_j, k) × [n(n−1)···(n−k+1) / (N(N−1)···(N−k+1))]
                   × [(N−n)(N−n−1)···(N−n−N_j+k+1) / ((N−k)(N−k−1)···(N−N_j+1))]

                 = C(N_j, k) × [q(q−1/N)···(q−(k−1)/N) / (1(1−1/N)···(1−(k−1)/N))]
                   × [(1−q)(1−q−1/N)···(1−q−(N_j−k−1)/N) / ((1−k/N)(1−(k+1)/N)···(1−(N_j−1)/N))]

for 1 ≤ j ≤ D and 0 ≤ k ≤ min(n, N_j), where q = n/N. When N is large relative to N_j, we have the approximate equality given in (8); that is, P{ n_j = k } is approximately equal to the probability that n_j = k under the Bernoulli sampling model.

Estimators analogous to those in Section 3 can be derived using the exact hypergeometric probabilities. The starting point in such a derivation is the pair of identities

    P{ n_j = 0 } = h_n(N_j)    and    P{ n_j = 1 } = [nN_j / (N−n+1)] h_{n−1}(N_j),

where

    h_n(x) = Γ(N−x+1)Γ(N−n+1) / (Γ(N−n−x+1)Γ(N+1))    if x ≤ N−n,
    h_n(x) = 0                                          if x > N−n,

for x ≥ 0. (By an elementary property of the gamma function,

    h_n(N_j) = C(N−n, N_j) / C(N, N_j)

for 1 ≤ j ≤ D.)
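The exact hypergeometric probability and the approximation (8) are easy to compare numerically; a sketch (function names ours) using exact integer binomials:

```python
import math

def hyper_pmf(k, N, Nj, n):
    """P{n_j = k} under the hypergeometric model (1)."""
    return math.comb(Nj, k) * math.comb(N - Nj, n - k) / math.comb(N, n)

def bernoulli_pmf(k, N, Nj, n):
    """The Bernoulli (binomial) approximation (8), with q = n/N."""
    q = n / N
    return math.comb(Nj, k) * q ** k * (1 - q) ** (Nj - k)
```

For N = 100,000, N_j = 10, and n = 1000 the two pmfs agree to within about 10^-4, consistent with the small approximation error reported below for the populations of Section 6.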
It follows that the optimal value of the parameter K in (4) is given by

$$K = \frac{\sum_{j=1}^D h_n(N_j)}{\sum_{j=1}^D \dfrac{n N_j}{N - n + 1}\, h_{n-1}(N_j)}.$$

First-order and second-order jackknife estimators can now be derived using arguments parallel to those in Section 3.2. The second-order Taylor approximations use the identity $h_n'(x) = -h_n(x)\, g_n(x)$ for $x > 0$ and $n \ge 1$, where

$$g_n(x) = \sum_{k=1}^n \frac{1}{N - x - n + k}.$$

The estimators analogous to those in Section 3.2 are

$$\hat D_{uj1} = \left( 1 - \frac{(N - n + 1) f_1}{nN} \right)^{-1} d_n,$$

$$\hat D_{uj2} = \left( 1 - \frac{(N - \tilde N - n + 1) f_1}{nN} \right)^{-1} \left( d_n + \frac{(N - \tilde N - n + 1)\, g_{n-1}(\tilde N)\, \hat\gamma^2(\hat D_{uj1})\, f_1}{n} \right),$$

and

$$\hat D_{sj2} = \bigl( 1 - h_n(\tilde N) \bigr)^{-1} \Bigl( d_n + N\, \hat\gamma^2(\hat D_{uj1})\, g_{n-1}(\tilde N)\, h_n(\tilde N) \Bigr),$$

where $\tilde N = N / \hat D_{uj1}$ and

$$\hat\gamma^2(D) = \max\left\{\, 0,\ \frac{D (N-1)}{N n (n-1)} \sum_{i=1}^n i (i-1) f_i + \frac{D}{N} - 1 \,\right\}.$$

Moreover, the smoothed first-order jackknife estimator $\hat D_{sj1}$ is defined as the value of $\hat D$ that solves the equation

$$\hat D \bigl( 1 - h_n(N / \hat D) \bigr) = d_n.$$

Finally, Horvitz-Thompson estimators can be derived in a manner similar to that in Section 3.2.3. For each $j$ such that $n_j > 0$, define the method-of-moments estimator $\hat N_j$ of $N_j$ as the value of $\hat N$ that solves the equation

$$n_j = \frac{q \hat N}{1 - h_n(\hat N)}.$$

Then define the estimator $\hat\Phi(g)$ of $\Phi(g) = \sum_{j=1}^D g(N_j)$ as

$$\hat\Phi(g) = \sum_{\{j : n_j > 0\}} \frac{g(\hat N_j)}{1 - h_n(\hat N_j)}.$$

We compared the estimators based on the Bernoulli approximation (8) against the estimators based on the exact hypergeometric probabilities, using the populations described in Section 6. The error induced by the approximation (8) turned out to be less than 1% in all cases.

The derivation of $\hat D_{uj2}$ and $\hat D_{sj2}$ using the exact hypergeometric probabilities assumes that $N_j \le N - n$ for $1 \le j \le D$. Without this assumption, Taylor approximations of $h_n(N_j)$ fail because $h_n$ is not continuous, and the subsequent derivation of each estimator is inappropriate. We conclude by providing a technique for modifying $\hat D_{uj2}$ and $\hat D_{sj2}$ to deal with this problem. For concreteness, we focus on the unsmoothed second-order jackknife estimator $\hat D_{uj2}$.
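To make the formulas concrete, here is a minimal Python sketch of the first-order jackknife estimator $\hat D_{uj1}$ and of $\hat\gamma^2(D)$ as displayed above; the function names and the toy population are our own, and the code illustrates the reconstructed formulas rather than reproducing the authors' implementation:

```python
import random
from collections import Counter

def d_uj1(sample, N):
    """Unsmoothed first-order jackknife estimate of D from a sample
    drawn without replacement from a population of size N."""
    n = len(sample)
    counts = Counter(sample)                  # n_j for each observed class
    d_n = len(counts)                         # distinct classes in sample
    f1 = sum(1 for c in counts.values() if c == 1)   # singleton classes
    return d_n / (1.0 - (N - n + 1) * f1 / (n * N))

def gamma2_hat(D, freqs, n, N):
    """Estimated squared coefficient of variation; freqs maps i to f_i,
    the number of classes appearing exactly i times in the sample."""
    s = sum(i * (i - 1) * f for i, f in freqs.items())
    return max(0.0, D * (N - 1) / (N * n * (n - 1)) * s + D / N - 1.0)

# Toy population: D = 60 classes, each of size 10 (so N = 600)
random.seed(1)
pop = [j for j in range(60) for _ in range(10)]
sample = random.sample(pop, 120)              # 20% sample, no replacement
D_hat = d_uj1(sample, len(pop))
print(round(D_hat, 1))                        # typically close to D = 60
```

Only the low-order frequency counts $f_i$, the sample size $n$, and the known population size $N$ are needed, which is what makes these estimators attractive for the database applications cited earlier.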
Denote by $J$ the set of indices of the "big" classes: $J = \{\, j : N_j > N - n \,\}$. (Observe that if $n < N/2$, then $J$ can contain at most one element.) If $j \in J$, then with probability 1 class $j$ is represented in the sample. We can decompose $D$ according to

$$D = |J| + |\{1, 2, \ldots, D\} \setminus J|. \tag{34}$$

The first term on the right side of (34) is the number of big classes, and the second term represents the number of classes in the reduced population that is formed by removing the big classes. We can estimate $|J|$ by the number of elements in the set $\hat J = \{\, j : \hat N_j > N - n \,\}$, where $\hat N_j$ is a method-of-moments estimator of $N_j$ defined as the numerical solution of the equation $E[n_j \mid n_j > 0] = n_j$ (cf. Sec. 3.2.3). Since we assume a hypergeometric sampling model, $\hat N_j$ is defined more precisely as the solution of the equation

$$\frac{\hat N_j (n/N)}{1 - h_n(\hat N_j)} = n_j.$$

To estimate the remaining term in (34), apply the unsmoothed second-order jackknife estimator to the reduced population obtained by removing the classes in $\hat J$. Set $N^* = N - \sum_{j \in \hat J} \hat N_j$, $n^* = n - \sum_{j \in \hat J} n_j$, $d_n^* = d_n - |\hat J|$, $f_1^* = |\{\, j : n_j = 1 \text{ and } j \notin \hat J \,\}|$, and

$$\hat D^*_{uj1} = \left( 1 - \frac{(N^* - n^* + 1) f_1^*}{n^* N^*} \right)^{-1} d_n^*.$$

(Observe that $\hat N_j = 1$, and hence $j \notin \hat J$, whenever $n_j = 1$, so that $f_1^* = f_1$.) The modified version of $\hat D_{uj2}$ is then given by

$$\hat D^*_{uj2} = |\hat J| + \left( 1 - \frac{(N^* - \tilde N^* - n^* + 1) f_1^*}{n^* N^*} \right)^{-1} \left( d_n^* + \frac{(N^* - \tilde N^* - n^* + 1)\, g_{n^*-1}(\tilde N^*)\, \hat\gamma^2(\hat D^*_{uj1})\, f_1^*}{n^*} \right),$$

where $\tilde N^* = N^* / \hat D^*_{uj1}$. If $n_j = n$ for some $j$, then $\hat N_j = N$ and $\hat D^*_{uj2} = 1$. If $\hat N_j > N^* - n^*$ for some $j \notin \hat J$, then the foregoing process can be repeated. Similar modifications can be made to the estimators $\hat D_{sj1}$ and $\hat D_{sj2}$.

B. Derivation of $\hat D_{uj1}$ and $\hat D_{uj2}$ Based on Sample Coverage

For a finite population of size $N$, the sample coverage is defined as

$$C = \sum_{j=1}^D (N_j / N)\, I[n_j > 0].$$

Using the approximation in (8), we have to first order

$$E[C] = \sum_{j=1}^D \frac{N_j}{N}\, P\{\, n_j > 0 \,\} \approx \sum_{j=1}^D \frac{N_j}{N} \bigl( 1 - (1-q)^{N_j} \bigr) \approx 1 - (1-q)^{\bar N}, \tag{35}$$

where $\bar N = N/D$. Similarly, $E[d_n] \approx D \bigl( 1 - (1-q)^{\bar N} \bigr)$, so that $D \approx E[d_n] / E[C]$.
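The relation $D \approx E[d_n]/E[C]$ is easy to probe by simulation. The following Python sketch (the population sizes and sampling rate are arbitrary choices of ours) draws repeated Bernoulli samples and compares the ratio of the average $d_n$ to the average coverage $C$ with the true $D$; because the class sizes are unequal, the ratio falls somewhat below $D$, which is precisely the first-order bias that the second-order derivation below corrects:

```python
import random

random.seed(42)
sizes = [5 + (j % 10) * 3 for j in range(50)]   # D = 50 classes, N_j in 5..32
N, D, q = sum(sizes), len(sizes), 0.15          # Bernoulli sampling rate q

tot_dn = tot_C = 0.0
for _ in range(1000):
    # each population element is retained independently with probability q
    nj = [sum(random.random() < q for _ in range(s)) for s in sizes]
    tot_dn += sum(k > 0 for k in nj)                         # d_n: classes seen
    tot_C += sum(s / N for s, k in zip(sizes, nj) if k > 0)  # coverage C

print(D, round(tot_dn / tot_C, 1))   # ratio approximates (underestimates) D
```

With equal class sizes the two printed numbers would essentially agree; the gap grows with the squared coefficient of variation $\gamma^2$ of the class sizes.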
Observe that by (35) and (13),

$$E[C] \approx 1 - \frac{(1-q)\, E[f_1]}{n}.$$

The foregoing relations suggest the method-of-moments estimator $\hat D = d_n / \hat C$, where

$$\hat C = 1 - \frac{(1-q) f_1}{n}.$$

This estimator is identical to $\hat D_{uj1}$. To derive a second-order estimator, use a Taylor approximation as in Section 3.2.2 to obtain

$$E[C] \approx \sum_{j=1}^D \frac{N_j}{N} \bigl( 1 - (1-q)^{N_j} \bigr) \approx 1 - \frac{1}{N} \sum_{j=1}^D N_j \Bigl( (1-q)^{\bar N} + (1-q)^{\bar N} \ln(1-q) (N_j - \bar N) \Bigr) = 1 - (1-q)^{\bar N} - (1-q)^{\bar N} \ln(1-q)\, \bar N \gamma^2,$$

where $\gamma^2$ is the squared coefficient of variation of $N_1, N_2, \ldots, N_D$ and $\bar N = N/D$. It follows that

$$\frac{E[d_n]}{E[C]} \approx \frac{D \bigl( 1 - (1-q)^{\bar N} \bigr)}{1 - (1-q)^{\bar N} - (1-q)^{\bar N} \ln(1-q) \bar N \gamma^2} \approx D \left( 1 + \frac{(1-q)^{\bar N}\, \bar N \ln(1-q)\, \gamma^2}{1 - (1-q)^{\bar N}} \right),$$

and hence

$$D \approx \frac{E[d_n]}{E[C]} - \frac{(1-q)^{\bar N} \bar N \ln(1-q) \gamma^2}{1 - (1-q)^{\bar N}}\, D \approx \frac{1}{E[C]} \left( E[d_n] - \frac{N (1-q) \ln(1-q)\, \gamma^2}{n}\, E[f_1] \right), \tag{36}$$

where we have used the relations (35) and (13). Define $\hat\gamma^2$ as in (16). Estimating $E[d_n]$ by $d_n$, $E[f_1]$ by $f_1$, $\gamma^2$ by $\hat\gamma^2(\hat D_{uj1})$, and $E[C]$ by $\hat C$ in (36), we obtain the formula for $\hat D_{uj2}$.

C. Asymptotic Variance

In this appendix we study the asymptotic variance of an estimator $\hat D$ as $D$ becomes large. Consider an infinite sequence $C_1, C_2, \ldots$ of classes with corresponding class sizes $N_1, N_2, \ldots$, and construct a sequence of increasing populations in which the $D$th population comprises classes $C_1, C_2, \ldots, C_D$. As in (8), approximate the hypergeometric sample design by a Bernoulli sample design. Although the population size $N$ depends on $D$, as does each sample statistic $f_i$, we suppress this dependence in our notation. Suppose that there exists a finite, positive integer $M$ and a positive real number $\mu$ such that

$$N_j \le M \tag{37}$$

for $j \ge 1$ and
$$\lim_{D\to\infty} \sqrt{D} \left( \frac{N}{D} - \mu \right) = \lim_{D\to\infty} \frac{1}{\sqrt{D}} \sum_{j=1}^D (N_j - \mu) = 0. \tag{38}$$

Also suppose that there exists a nonnegative vector $\pi = (\pi_1, \pi_2, \ldots, \pi_M) \ne 0$ and a nonnegative symmetric matrix $\Sigma = \| \sigma_{i,i'} \| \ne 0$ such that

$$\lim_{D\to\infty} \sqrt{D} \left( \frac{E[f_i]}{D} - \pi_i \right) = \lim_{D\to\infty} \frac{1}{\sqrt{D}} \sum_{j=1}^D (\pi_{j,i} - \pi_i) = 0, \tag{39}$$

$$\lim_{D\to\infty} \frac{\mathrm{Var}[f_i]}{D} = \lim_{D\to\infty} \frac{1}{D} \sum_{j=1}^D \pi_{j,i} (1 - \pi_{j,i}) = \sigma_{i,i}, \tag{40}$$

and

$$\lim_{D\to\infty} \frac{\mathrm{Cov}[f_i, f_{i'}]}{D} = -\lim_{D\to\infty} \frac{1}{D} \sum_{j=1}^D \pi_{j,i}\, \pi_{j,i'} = \sigma_{i,i'} \tag{41}$$

for $1 \le i, i' \le M$ with $i \ne i'$, where

$$\pi_{j,i} = \binom{N_j}{i} q^i (1-q)^{N_j - i}.$$

The conditions in (37)-(41) are satisfied, for example, when the class-size sequence $N_1, N_2, \ldots$ is of the form $I_1, I_2, \ldots, I_r, I_1, I_2, \ldots, I_r, \ldots$, where $I_1, I_2, \ldots, I_r$ are fixed nonnegative integers; in effect, this sequence of populations is obtained from an initial population by uniformly scaling up the initial $F_i$'s.

As in Section 4, suppose that the estimator $\hat D$ is a function of the sample only through $f = (f_1, f_2, \ldots, f_M)$ and satisfies the condition in (31). Also suppose that the differentiability assumption in Section 4 holds, so that $\hat D$ is continuously differentiable at the point $(\pi, \mu)$.

Write $f_i = \sum_{j=1}^D I[n_j = i]$ for $1 \le i \le M$ and observe that, under the foregoing assumptions, each $f_i$ is the sum of $D$ independent (but not identically distributed) Bernoulli random variables. An application of Theorem 5.1.2 in Chung (1974) followed by (39) shows that

$$\lim_{D\to\infty} (\bar f, \bar N) = (\pi, \mu) \tag{42}$$

with probability 1, where $\bar f = f/D$ and $\bar N = N/D$. Similarly, since (39)-(41) hold by assumption, an application of Theorem B in Serfling (1980, Sec. 1.9.2) and then Slutsky's Theorem (see Serfling 1980, Sec. 1.5.4) shows that $\sqrt{D} (\bar f - \pi) \Rightarrow \mathrm{N}(0, \Sigma)$ as $D \to \infty$, where "$\Rightarrow$" denotes convergence in distribution and $\mathrm{N}$ denotes a multivariate normal random variable. It then follows from (38) and Theorem 4.4 in Billingsley (1986) that

$$\sqrt{D} \bigl( (\bar f, \bar N) - (\pi, \mu) \bigr) \Rightarrow \bigl( \mathrm{N}(0, \Sigma),\, 0 \bigr).$$

Since $\hat D$ is assumed differentiable at $(\pi, \mu)$, an application of the Delta Method (see Bishop, Fienberg, and Holland 1975, Sec.
14.6) shows that

$$\sqrt{D}\, \bigl( \hat D(\bar f, \bar N) - \hat D(\pi, \mu) \bigr) \Rightarrow \mathrm{N}(0,\, B^t \Sigma B) \tag{43}$$

as $D \to \infty$, where $B = \nabla_1 \hat D(\pi, \mu)$, $\nabla_1 \hat D$ denotes the gradient of $\hat D(u, k)$ with respect to $u$, and $\mathrm{N}(0, B^t \Sigma B)$ is a univariate normal random variable. Using (31) we can rewrite the foregoing limit as

$$\frac{1}{\sqrt{D}}\, \bigl( \hat D(f, N) - \hat D(D\pi, D\mu) \bigr) \Rightarrow \mathrm{N}(0,\, B^t \Sigma B),$$

so that the asymptotic variance of $\hat D(f, N)$ is equal to $(B^t \Sigma B) D$. To approximate this asymptotic variance, set $A = \nabla_1 \hat D(f, N)$ and let $C = C(f)$ be the covariance matrix of the random vector $f$. It follows from (31) that $\nabla_1 \hat D(cu, ck) = \nabla_1 \hat D(u, k)$ for any $c, k > 0$ and nonnegative $M$-dimensional vector $u$. Thus,

$$\lim_{D\to\infty} A = \lim_{D\to\infty} \nabla_1 \hat D(f, N) = \lim_{D\to\infty} \nabla_1 \hat D(\bar f, \bar N) = \nabla_1 \hat D(\pi, \mu) = B \tag{44}$$

with probability 1, where the third equality follows by (42) and the assumed continuity of $\nabla_1 \hat D$. Using (40), (41), and (44), we find that

$$\lim_{D\to\infty} \frac{A^t C A}{B^t (D\Sigma) B} = 1$$

with probability 1, and the asymptotic variance of $\hat D$ can be approximated by $A^t C A$.

D. Detailed Experimental Results

This section contains further details about the experiments described in Section 6. Table 8 displays characteristics of the "database" populations used in the experiments.

Table 8: Characteristics of "database" populations

Name   N        D        gamma^2  Skew      Name   N        D       gamma^2  Skew
DB01   15469    15469    0.00     0.00      DB21   15469    131     3.76     3.79
DB02   1288928  1288928  0.00     0.00      DB22   624473   168     3.90     3.06
DB03   624473   624473   0.00     0.00      DB23   1547606  21      6.30     3.26
DB04   597382   591564   0.01     17.61     DB24   1547606  49      6.55     2.70
DB05   113600   110074   0.04     6.87      DB25   1463974  535328  7.60     639.24
DB06   621498   591564   0.05     4.70      DB26   1547606  909     7.99     7.70
DB07   1341544  1288927  0.05     9.95      DB27   1463974  10      8.12     2.66
DB08   1547606  51168    0.23     0.24      DB28   931174   73      12.96    6.43
DB09   1547606  3        0.38     -0.67     DB29   597382   17      14.27    3.73
DB10   147811   110076   0.47     7.41      DB30   633756   221480  15.68    454.61
DB11   113600   3        0.70     0.08      DB31   633756   213     16.16    7.36
DB12   173805   109688   0.93     4.84      DB32   173805   72      16.98    7.14
DB13   1463974  624472   0.94     4.77      DB33   931174   398     19.70    7.89
DB14   1654700  624473   1.13     4.38      DB34   113600   6155    24.17    54.66
DB15   633756   202462   1.19     3.53      DB35   1654700  235     30.85    10.35
DB16   597382   437654   1.53     114.62    DB36   173805   61      31.71    7.04
DB17   931174   110076   1.63     4.51      DB37   1341544  37      33.03    5.82
DB18   931174   29       3.22     4.29      DB38   147811   62      34.68    7.22
DB19   1547606  33       3.33     1.66      DB39   1463974  233     37.75    11.06
DB20   1547606  194      3.35     2.97      DB40   624473   14047   81.63    69.00

The printouts on the following pages contain simulation results for all of the estimators and for each experimental population. In the printouts, "Psize" denotes the population size, "Nclass" denotes the number of classes in the population, and "gm2hat" denotes the estimator $\hat\gamma^2(\hat D_{uj1})$ used to estimate $\gamma^2$ in the second-order jackknife estimators.
[Printouts: for each experimental population (DB01-DB40, EQ100, EQ10, GOOD, NGB/1, NGB/2, NGB/4, FRAME2, FRAME3, SUDM, Z15, Z20A, Z20B), the printouts tabulate the bias (%) and RMS error (%) of the estimators Duj1, Dsj1, Duj2, Duj2a, Dsj2, DSh, DSh2, DSh3, DHTj, DHTsj, Dhj, Hybrid, and gm2hat at sample sizes of 5.0% and 10.0%, together with averages and maxima over the ranges 0.0 <= gamma^2 < 1.0, 1.0 <= gamma^2 < 50, and 50 <= gamma^2 < infinity.]
0.00 0.00 0.00 0.00 14.60 7.49 21.19 17.00 21.96 20.25 0.37 9.48 48.55 6.92 0.00 51.38 47.44 30.55 25.46 59.08 55.33 56.25 55.91 53.19 43.01 63.40 4.07 6.88 2.57 12.02 8.57 6.82 16.03 51.48 10.14 30.98 14.74 41.77 46.22 43.81 37.80 47.99 54.30 40.22 58.97 61.28 58.11 40.14 53.09 65.93 63.30 81.30 14.69 48.12 11.51 21.81 19.55 48.12 9.05 31.73 Hybrid 0.00 0.00 0.00 0.01 3.28 1.01 4.91 2.72 3.70 3.35 0.22 3.81 20.23 1.78 0.00 20.95 22.72 31.73 11.61 31.49 26.43 27.99 30.19 18.52 2.37 43.85 30.18 6.48 1.55 6.02 4.73 6.46 9.33 25.55 2.41 37.63 23.01 36.24 19.44 11.64 20.60 28.85 6.95 10.13 30.53 48.12 28.57 10.38 4.16 16.96 21.81 3.11 44.93 90.68 64.43 76.89 38.34 83.12 49.68 90.68 gm2hat 0.00 0.00 0.00 0.58 1.73 86.27 90.31 89.06 90.01 86.85 6.23 34.01 90.68 19.35 0.36 82.64 82.02 74.31 48.77 82.93 77.26 76.55 77.43 49.85 52.72 83.12 5.34 8.95 3.38 15.23 10.71 7.94 18.49 37.42 11.42 34.79 15.88 44.71 26.08 46.57 40.13 50.37 51.68 41.50 60.85 63.13 59.80 41.24 51.73 66.32 62.79 76.89 skew 0.00 0.00 0.00 0.00 0.00 17.61 5.64 6.87 4.70 9.95 0.50 0.24 1.18 0.81 -0.67 7.41 1.92 0.08 1.25 4.84 4.77 4.38 3.53 114.62 4.51 2.71 4.29 1.66 2.97 3.79 3.06 3.26 2.70 639.24 7.70 2.66 6.43 3.73 454.61 7.36 7.14 7.89 54.66 10.35 7.04 5.82 7.22 11.06 69.00 14.60 23.44 73.54 -8.57 -33.01 Duj1 0.00 0.00 0.00 0.00 0.12 -0.99 -3.41 -2.66 -3.68 -3.37 -0.01 -2.43 -19.47 -1.50 0.00 -22.22 -22.70 -22.67 -12.06 -33.01 -29.89 -31.83 -32.79 -23.64 -19.23 -43.88 -1.62 -4.88 -1.24 -6.86 -5.71 -3.81 -9.00 -26.76 -5.74 -19.10 -5.97 -27.94 -19.91 -31.91 -29.05 -39.45 -36.42 -29.91 -46.54 -46.19 -45.45 -29.50 -38.07 -52.58 -46.73 -62.96 -24.49 -49.73 -9.55 -37.27 Dsj1 0.00 0.00 0.00 0.00 0.12 -1.10 -3.69 -2.89 -3.96 -3.68 -0.01 -3.20 -20.87 -1.86 0.00 -24.77 -25.77 -22.67 -14.31 -37.27 -34.52 -37.10 -38.24 -27.77 -25.81 -49.73 -1.62 -4.88 -1.24 -7.01 -5.71 -3.81 -9.00 -30.41 -5.76 -19.10 -5.97 -27.94 -22.56 -31.93 -29.07 -39.46 -39.54 -29.91 -46.56 -46.19 -45.47 -29.51 -39.26 -52.79 -47.58 -69.06 
1087.89 2003.12 112.39 362.12 -4.99 -17.83 Duj2 0.00 0.00 0.00 0.00 0.25 -0.65 -2.61 -1.98 -2.86 -2.40 0.00 -1.23 -15.73 -0.17 0.00 -13.05 -14.15 -16.24 -4.24 -17.83 -11.92 -11.12 -12.91 33.89 19.30 -23.64 5.06 0.39 2.77 7.61 5.26 10.93 36.46 173.47 29.01 79.01 95.24 161.34 186.15 151.13 131.47 110.95 290.87 276.82 269.05 340.60 303.29 362.12 940.41 1026.51 2003.12 381.51 3.78 83.17 35.11 83.17 4.66 58.20 -3.33 18.67 Duj2a 0.00 0.00 0.00 0.00 0.25 -0.65 -2.61 -1.98 -2.86 -2.40 0.00 -1.23 -15.73 -0.17 0.00 -13.05 -14.15 18.67 -4.24 -17.83 -11.92 -11.12 -12.91 14.14 18.83 -23.64 4.08 -2.43 -0.29 -2.93 -3.40 -0.90 -1.28 -15.80 -0.44 13.27 11.00 13.51 -13.89 22.76 2.57 -8.45 12.08 9.09 16.10 58.20 15.68 12.05 6.70 13.17 37.38 83.17 0.38 381.51 60.47 381.51 -3.41 186.15 -6.20 -22.67 Dsj2 0.00 0.00 0.00 0.00 0.25 -0.67 -2.64 -2.01 -2.89 -2.44 -0.01 -3.06 -16.08 -1.85 0.00 -13.60 -15.68 -22.67 -13.20 -19.32 -14.36 -14.64 -17.34 32.63 -1.62 -30.62 -1.62 -4.88 -1.24 -7.01 -5.71 -3.81 -9.00 173.47 -5.76 -19.10 -5.97 -27.94 186.15 -31.93 -29.07 -39.46 -29.97 -29.91 -46.56 -46.19 -45.47 -29.51 -39.26 -52.79 -47.58 381.51 10.70 49.20 5.03 13.90 12.09 49.20 9.99 45.86 DSh 0.00 0.00 0.00 0.00 43.52 0.44 2.01 1.53 2.51 1.93 0.02 3.19 29.19 1.81 0.00 10.71 45.86 18.67 7.08 11.20 30.15 30.51 30.69 11.21 43.39 10.38 6.10 -1.70 0.08 -2.64 -1.84 1.32 3.30 37.85 1.10 22.98 8.76 26.43 49.20 10.52 4.27 -6.78 7.53 1.51 4.96 19.23 4.48 3.50 3.22 0.62 13.90 2.37 -16.45 -56.71 -42.53 -56.71 -20.13 -43.38 -6.75 -28.38 DSh2 -0.00 0.00 -0.00 0.00 0.11 0.44 2.01 1.53 2.51 1.93 -0.01 -2.21 -15.36 -1.25 0.00 -28.38 -22.55 -15.78 -10.95 -27.06 -26.74 -32.00 -33.41 -28.05 -16.77 -43.38 -0.33 -4.35 -1.02 -6.28 -5.07 -2.95 -6.95 -28.07 -4.61 -12.09 -3.52 -18.88 -22.48 -24.85 -23.51 -34.02 -31.71 -24.68 -37.97 -35.29 -37.14 -24.01 -32.18 -43.88 -37.33 -56.71 7.65 49.34 1.72 8.23 10.02 49.34 5.73 28.17 DSh3 -0.00 -0.00 0.00 0.00 20.81 0.33 1.55 1.19 1.99 1.49 -0.00 -0.72 21.34 -0.27 0.00 6.64 28.17 18.67 
-2.94 4.28 17.83 16.53 14.40 7.31 22.65 -4.97 5.25 -0.83 -0.35 -4.84 -2.72 3.42 2.22 22.51 0.25 41.20 5.09 49.34 32.89 11.65 10.83 0.75 0.04 -1.89 8.34 22.98 8.09 0.36 -1.64 0.17 8.23 0.11 22.79 76.21 30.22 42.59 27.81 76.21 14.93 62.53 DHTj 0.00 0.00 0.00 0.00 62.53 0.53 2.35 1.79 2.90 2.26 0.08 12.41 35.30 7.62 0.00 14.19 62.08 18.67 30.45 17.99 42.45 45.54 49.32 14.80 76.21 32.07 6.98 1.43 2.59 5.25 4.67 5.64 17.52 52.17 9.20 39.11 20.55 42.51 62.89 34.46 25.36 13.27 38.29 24.78 28.48 41.71 28.62 27.33 33.51 27.52 42.59 17.26 30.72 103.25 41.25 56.84 37.01 101.60 20.61 103.25 DHTsj 0.00 0.00 0.00 0.00 103.25 0.59 2.61 1.99 3.21 2.52 0.31 25.39 40.67 16.48 0.00 16.61 76.25 18.73 51.14 22.02 51.05 55.53 61.52 17.20 101.60 42.64 8.11 2.93 4.51 12.96 8.30 7.31 26.64 63.81 16.02 47.41 33.21 56.43 75.88 46.03 32.20 19.47 55.11 35.68 38.94 52.81 38.81 38.32 48.07 37.74 56.84 22.36 -23.33 -69.12 -52.19 -69.12 -26.24 -52.10 -14.08 -46.60 Dhj 0.00 0.00 0.00 0.00 -4.79 -3.71 -11.73 -9.60 -12.97 -11.61 -0.01 -3.24 -36.35 -1.86 0.00 -38.46 -35.01 -22.67 -14.40 -46.60 -42.69 -43.56 -43.34 -39.77 -27.24 -52.10 -1.62 -4.88 -1.24 -7.01 -5.71 -3.81 -9.00 -39.15 -5.76 -19.10 -5.97 -27.94 -33.67 -31.93 -29.07 -39.46 -39.55 -29.91 -46.56 -46.19 -45.47 -29.51 -39.26 -52.79 -47.58 -69.12 0.65 58.20 1.72 8.23 4.87 58.20 -4.99 -17.83 Hybrid 0.00 0.00 0.00 0.00 0.25 -0.65 -2.61 -1.98 -2.86 -2.40 0.00 -1.23 -15.73 -0.17 0.00 -13.05 -14.15 -16.24 -4.24 -17.83 -11.92 -11.12 -12.91 19.76 18.83 -23.64 4.08 -2.43 -0.29 -2.93 -3.40 -0.90 -1.28 -15.80 -0.44 13.27 11.00 13.51 -13.89 22.76 2.57 -8.45 12.08 9.09 16.10 58.20 15.68 12.05 -1.64 0.17 8.23 0.11 -34.58 -81.00 -50.49 -63.36 -28.23 -67.41 -39.71 -81.00 gm2hat 0.00 0.00 0.00 0.16 0.48 -72.86 -78.61 -77.92 -80.05 -75.63 -0.02 -13.04 -81.00 -5.60 0.01 -69.47 -66.64 -55.18 -28.33 -68.64 -61.56 -60.13 -60.41 -39.40 -31.09 -67.41 -2.15 -6.35 -1.63 -8.80 -7.15 -4.41 -10.39 -30.23 -6.46 -21.45 -6.47 -29.91 -20.84 -33.94 -30.83 -41.43 -37.78 -30.87 
-48.03 -47.59 -46.77 -30.31 -38.42 -53.08 -47.09 -63.36 Performance Measure: Bias (%) Sample size: 20.0% gamma2 0.00 0.00 0.00 0.00 0.00 0.01 0.04 0.04 0.05 0.05 0.18 0.23 0.31 0.37 0.38 0.47 0.52 0.70 0.75 0.93 0.94 1.13 1.19 1.53 1.63 1.87 3.22 3.33 3.35 3.76 3.90 6.30 6.55 7.60 7.99 8.12 12.96 14.27 15.68 16.16 16.98 19.70 24.17 30.85 31.71 33.03 34.68 37.75 81.63 114.38 166.18 234.81 3.20 17.61 -23.12 -46.54 -52.17 -69.06 140.02 2003.12 (50 <= gamma2 < inf) Nclass 15469 1288928 624473 150 1500 591564 9595 110074 591564 1288927 874 51168 19000 906 3 110076 36000 3 930 109688 624472 624473 202462 437654 110076 100000 29 33 194 131 168 21 49 535328 909 10 73 17 221480 213 72 398 6155 235 61 37 62 233 14047 247 772 10384 Average: Maximum: 51.27 639.24 -50.09 -62.96 -20.59 -69.06 Psize 15469 1288928 624473 15000 15000 597382 10000 113600 621498 1341544 82135 1547606 33750 41197 1547606 147811 111500 113600 20213 173805 1463974 1654700 633756 597382 931174 330000 931174 1547606 1547606 15469 624473 1547606 1547606 1463974 1547606 1463974 931174 597382 633756 633756 173805 931174 113600 1654700 173805 1341544 147811 1463974 624473 50000 50000 50000 (0.0 <= gamma2 < 1.0) Average: Maximum: 45.14 73.54 -19.32 -62.96 Name DB01 DB02 DB03 EQ100 EQ10 DB04 GOOD DB05 DB06 DB07 NGB/4 DB08 FRAME2 NGB/2 DB09 DB10 FRAME3 DB11 NGB/1 DB12 DB13 DB14 DB15 DB16 DB17 SUDM DB18 DB19 DB20 DB21 DB22 DB23 DB24 DB25 DB26 DB27 DB28 DB29 DB30 DB31 DB32 DB33 DB34 DB35 DB36 DB37 DB38 DB39 DB40 Z20A Z15 Z20B 50) Average: Maximum: 31.39 639.24 (1.0 <= gamma2 < Average: Maximum: 40 41 (50 <= gamma2 < inf) 50) 45.14 73.54 31.39 639.24 Average: Maximum: 51.27 639.24 3.20 17.61 skew 0.00 0.00 0.00 0.00 0.00 17.61 5.64 6.87 4.70 9.95 0.50 0.24 1.18 0.81 -0.67 7.41 1.92 0.08 1.25 4.84 4.77 4.38 3.53 114.62 4.51 2.71 4.29 1.66 2.97 3.79 3.06 3.26 2.70 639.24 7.70 2.66 6.43 3.73 454.61 7.36 7.14 7.89 54.66 10.35 7.04 5.82 7.22 11.06 69.00 14.60 23.44 73.54 Average: Maximum: Average: Maximum: gamma2 0.00 
0.00 0.00 0.00 0.00 0.01 0.04 0.04 0.05 0.05 0.18 0.23 0.31 0.37 0.38 0.47 0.52 0.70 0.75 0.93 0.94 1.13 1.19 1.53 1.63 1.87 3.22 3.33 3.35 3.76 3.90 6.30 6.55 7.60 7.99 8.12 12.96 14.27 15.68 16.16 16.98 19.70 24.17 30.85 31.71 33.03 34.68 37.75 81.63 114.38 166.18 234.81 (1.0 <= gamma2 < Nclass 15469 1288928 624473 150 1500 591564 9595 110074 591564 1288927 874 51168 19000 906 3 110076 36000 3 930 109688 624472 624473 202462 437654 110076 100000 29 33 194 131 168 21 49 535328 909 10 73 17 221480 213 72 398 6155 235 61 37 62 233 14047 247 772 10384 Average: Maximum: Psize 15469 1288928 624473 15000 15000 597382 10000 113600 621498 1341544 82135 1547606 33750 41197 1547606 147811 111500 113600 20213 173805 1463974 1654700 633756 597382 931174 330000 931174 1547606 1547606 15469 624473 1547606 1547606 1463974 1547606 1463974 931174 597382 633756 633756 173805 931174 113600 1654700 173805 1341544 147811 1463974 624473 50000 50000 50000 (0.0 <= gamma2 < 1.0) Name DB01 DB02 DB03 EQ100 EQ10 DB04 GOOD DB05 DB06 DB07 NGB/4 DB08 FRAME2 NGB/2 DB09 DB10 FRAME3 DB11 NGB/1 DB12 DB13 DB14 DB15 DB16 DB17 SUDM DB18 DB19 DB20 DB21 DB22 DB23 DB24 DB25 DB26 DB27 DB28 DB29 DB30 DB31 DB32 DB33 DB34 DB35 DB36 DB37 DB38 DB39 DB40 Z20A Z15 Z20B Performance Measure: RMS Error (%) Sample size: 20.0% 19.62 62.96 50.10 62.96 23.44 46.77 8.89 33.01 Duj1 0.00 0.00 0.00 0.00 1.02 0.99 3.86 2.69 3.68 3.37 0.04 2.44 19.52 1.56 0.00 22.23 22.71 27.49 12.10 33.01 29.89 31.83 32.79 23.65 19.23 43.88 2.65 5.24 1.34 6.97 5.81 4.67 9.61 26.76 5.77 21.24 6.43 29.29 19.91 32.01 29.34 39.49 36.42 29.98 46.77 46.63 45.70 29.58 38.07 52.63 46.76 62.96 20.88 69.06 52.19 69.06 24.81 49.73 9.86 37.28 Dsj1 0.00 0.00 0.00 0.00 0.98 1.10 4.14 2.92 3.97 3.68 0.04 3.20 20.92 1.90 0.00 24.77 25.78 27.49 14.34 37.28 34.52 37.11 38.24 27.77 25.82 49.73 2.65 5.24 1.34 7.11 5.81 4.67 9.61 30.41 5.78 21.24 6.44 29.29 22.57 32.02 29.36 39.50 39.54 29.99 46.79 46.63 45.72 29.59 39.26 52.84 47.60 69.06 150.28 2010.61 
1093.07 2010.61 123.00 369.77 5.77 29.82 Duj2 0.00 0.00 0.00 0.00 1.13 0.66 3.09 2.01 2.87 2.40 0.04 1.24 15.81 0.56 0.00 13.08 14.18 29.82 4.55 17.84 11.92 11.12 12.91 34.01 19.31 23.65 9.22 8.16 3.61 9.71 7.05 22.87 41.51 173.47 29.36 105.63 102.01 187.15 186.15 154.43 140.80 113.29 291.28 281.68 289.60 369.77 325.72 367.45 940.73 1039.43 2010.61 381.51 15.20 83.69 37.30 83.69 17.44 76.57 8.12 79.16 Duj2a 0.00 0.00 0.00 0.00 1.13 0.66 3.09 2.01 2.87 2.40 0.04 1.24 15.81 0.56 0.00 13.08 14.18 79.16 4.55 17.84 11.92 11.12 12.91 14.32 18.84 23.65 10.28 5.63 0.94 3.84 3.86 6.18 6.18 15.80 1.23 35.57 13.84 33.61 13.90 26.26 16.11 11.07 12.21 13.33 34.66 76.57 33.26 15.62 6.83 19.98 38.68 83.69 29.69 381.51 130.30 381.51 32.79 186.15 6.53 27.49 Dsj2 0.00 0.00 0.00 0.00 1.04 0.68 3.12 2.04 2.90 2.44 0.04 3.06 16.15 1.89 0.00 13.63 15.70 27.49 13.24 19.34 14.36 14.64 17.35 32.79 1.67 30.62 2.65 5.24 1.34 7.11 5.81 4.67 9.61 173.47 5.78 21.24 6.44 29.29 186.15 32.02 29.36 39.50 29.98 29.99 46.79 46.63 45.72 29.59 39.26 52.84 47.60 381.51 15.23 79.16 7.73 15.12 18.14 49.20 12.91 79.16 DSh 0.00 0.00 0.00 0.00 43.60 0.44 2.08 1.54 2.51 1.93 0.06 3.19 29.22 1.97 0.00 10.72 45.88 79.16 7.46 11.21 30.15 30.51 30.70 11.21 43.40 10.40 11.32 6.12 1.11 3.68 3.12 8.97 8.49 37.85 1.75 43.31 11.39 41.68 49.20 13.98 14.66 9.45 7.68 7.60 21.33 32.27 20.49 8.17 3.41 9.37 15.12 3.03 17.47 56.72 42.58 56.72 20.88 43.38 8.30 30.14 DSh2 0.00 0.00 0.00 0.00 0.99 0.44 2.08 1.54 2.51 1.93 0.04 2.21 15.38 1.32 0.00 28.38 22.56 30.14 11.01 27.06 26.74 32.00 33.41 28.05 16.77 43.38 2.92 5.01 1.18 6.43 5.22 4.60 8.00 28.07 4.66 17.80 4.59 22.13 22.48 25.09 24.18 34.11 31.71 24.84 38.62 36.50 37.81 24.18 32.19 44.02 37.39 56.72 13.44 79.16 6.32 10.62 17.91 74.99 9.05 79.16 DSh3 0.00 0.00 0.00 0.00 20.90 0.33 1.69 1.20 1.99 1.49 0.04 0.73 21.39 0.60 0.00 6.65 28.20 79.16 3.43 4.31 17.84 16.53 14.41 7.31 22.65 5.01 10.76 7.79 1.03 5.20 3.66 13.91 8.41 22.51 1.41 74.99 8.33 71.90 32.89 15.94 21.07 8.98 
1.54 8.18 26.77 38.65 25.80 7.96 2.03 10.62 10.47 2.15 26.24 79.16 30.99 43.23 32.05 76.22 17.86 79.16 DHTj 0.00 0.00 0.00 0.00 62.66 0.53 2.39 1.79 2.90 2.26 0.20 12.41 35.32 7.93 0.00 14.20 62.10 79.16 30.75 17.99 42.45 45.54 49.32 14.80 76.22 32.08 12.07 9.77 3.78 8.38 7.02 15.15 22.33 52.17 9.58 58.57 24.32 56.68 62.89 36.48 30.92 15.62 38.35 27.07 38.15 50.75 38.19 29.27 33.55 29.81 43.23 17.36 33.76 103.33 41.81 57.28 40.54 101.60 23.52 103.33 DHTsj 0.00 0.00 0.00 0.00 103.33 0.59 2.64 1.99 3.21 2.52 0.47 25.39 40.69 16.68 0.00 16.61 76.26 79.16 51.34 22.02 51.05 55.53 61.52 17.21 101.60 42.64 12.97 11.04 5.40 14.80 9.92 16.63 30.43 63.81 16.26 65.09 35.85 67.79 75.88 47.54 36.96 21.15 55.15 37.22 46.20 60.33 45.90 39.69 48.10 39.42 57.28 22.44 23.60 69.12 52.20 69.12 26.56 52.10 14.34 46.61 Dhj 0.00 0.00 0.00 0.00 4.87 3.72 12.16 9.62 12.98 11.61 0.04 3.25 36.36 1.91 0.00 38.47 35.01 27.49 14.43 46.61 42.69 43.56 43.34 39.77 27.24 52.10 2.65 5.24 1.34 7.11 5.81 4.67 9.61 39.15 5.78 21.24 6.44 29.29 33.67 32.02 29.36 39.50 39.55 29.99 46.79 46.63 45.72 29.59 39.26 52.84 47.60 69.12 12.00 76.57 6.32 10.62 17.69 76.57 5.77 29.82 Hybrid 0.00 0.00 0.00 0.00 1.13 0.66 3.09 2.01 2.87 2.40 0.04 1.24 15.81 0.56 0.00 13.08 14.18 29.82 4.55 17.84 11.92 11.12 12.91 21.24 18.84 23.65 10.28 5.63 0.94 3.84 3.86 6.18 6.18 15.80 1.23 35.57 13.84 33.61 13.90 26.26 16.11 11.07 12.21 13.33 34.66 76.57 33.26 15.62 2.03 10.62 10.47 2.15 35.18 81.03 50.51 63.37 28.65 67.42 40.65 81.03 gm2hat 0.00 0.00 0.00 0.29 0.94 72.95 79.17 78.00 80.05 75.65 3.86 13.06 81.03 6.57 0.25 69.52 66.66 66.90 28.63 68.65 61.56 60.13 60.41 39.59 31.10 67.42 3.48 6.81 1.77 9.47 7.28 5.42 11.09 30.43 6.50 23.85 6.97 31.35 21.39 34.05 31.13 41.47 37.86 30.95 48.26 48.05 47.04 30.39 38.43 53.14 47.11 63.37
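The two performance measures reported in the tables are standard: bias (%) is the average deviation of the replicate estimates from the true number of classes, and RMS error (%) is the root mean squared deviation, each expressed as a percentage of the true value. A minimal sketch of these computations (the function names are illustrative, not from the paper):

```python
import math

def bias_pct(estimates, true_value):
    """Bias (%): mean deviation of the estimates from the true
    number of classes, as a percentage of the true value."""
    n = len(estimates)
    return 100.0 * (sum(estimates) / n - true_value) / true_value

def rms_error_pct(estimates, true_value):
    """RMS error (%): root mean squared deviation of the estimates
    from the true number of classes, as a percentage of the true value."""
    n = len(estimates)
    mse = sum((d - true_value) ** 2 for d in estimates) / n
    return 100.0 * math.sqrt(mse) / true_value
```

For example, replicate estimates of 90 and 110 against a true class count of 100 give a bias of 0% but an RMS error of 10%, which is why the tables report both measures: an estimator can be nearly unbiased yet highly variable.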
