MCDB-R: Risk Analysis in the Database

Subi Arumugam¹, Fei Xu³∗, Ravi Jampani¹, Christopher Jermaine², Luis L. Perez², Peter J. Haas⁴

¹ University of Florida, Gainesville, FL, USA. {sa2,jampani}@cise.ufl.edu
² Rice University, Houston, TX, USA. {lp6,cmj4}@rice.edu
³ Microsoft Corporation, Redmond, WA, USA. [email protected]
⁴ IBM Research - Almaden, San Jose, CA, USA. [email protected]
∗ Fei Xu performed this work while at U. Florida.

Proceedings of the VLDB Endowment, Vol. 3, No. 1. Articles from this volume were presented at the 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore.

ABSTRACT
Enterprises often need to assess and manage the risk arising from uncertainty in their data. Such uncertainty is typically modeled as a probability distribution over the uncertain data values, specified by means of a complex (often predictive) stochastic model. The probability distribution over data values leads to a probability distribution over database query results, and risk assessment amounts to exploration of the upper or lower tail of a query-result distribution. In this paper, we extend the Monte Carlo Database System to efficiently obtain a set of samples from the tail of a query-result distribution by adapting recent "Gibbs cloning" ideas from the simulation literature to a database setting.

1. INTRODUCTION
In the face of regulatory processes such as Basel II and Solvency 2, enterprises are becoming increasingly concerned with managing and assessing the credit, financial, engineering, and operational risk arising from uncertain data [16]. Examples of uncertain data include future values of financial assets, customer order quantities under hypothetical price changes, and transportation times for future shipments under alternative shipping schemes. Data uncertainty is usually modeled as a probability distribution over possible data values, and such probabilities are often specified via complex stochastic models. E.g., we might specify the foregoing uncertain financial, retail, and logistics data sets using Euler approximations to stochastic differential equations, Bayesian demand models, and stochastic network models, respectively. This specification in turn leads to the representation of data uncertainty as a probability distribution over possible database instances ("possible DBs," also called "possible worlds"). Running an aggregation query such as "select sum of sales" over uncertain data therefore does not yield a single, deterministic result for total sales; rather, there is a probability distribution over all possible query results. In this setting, risk assessment typically corresponds to computing interesting properties of the upper or lower tails of the query-result distribution; for example, computing the probability of a large investment loss.
The problem is made more complex by the fact that the tails of the distribution are hard to specify a priori, and so risk analysis frequently focuses on the inverse problem of determining an extreme quantile, e.g., determining the value γ such that there is a 0.1% probability of seeing a loss of γ or more. Such a "value at risk" can be viewed as a probabilistic worst-case scenario, and might be used to specify precisely where the "upper tail" of the loss distribution starts. Proceeding further, one might be interested in computing a more sophisticated "coherent" risk measure such as "expected shortfall" [16], which in this example is defined as the expected total loss, given that this loss exceeds γ. More generally, the entire conditional distribution of the loss—given that the loss exceeds γ—might be of interest.

MCDB
In prior work, the Monte Carlo Database System (MCDB) of Jampani et al. [13] was designed for flexible exploration of query-result distributions under arbitrary SQL queries and a broad range of complex, user-defined stochastic models. MCDB uses (possibly user-defined) "variable generation" (VG) functions to pseudorandomly generate instances of each uncertain data value in a database, yielding a sample from the possible-DB distribution. Repeating this process multiple times (i.e., executing multiple "Monte Carlo repetitions") generates a set of independent and identically distributed (i.i.d.) samples from this distribution. Given an SQL aggregation query of interest, MCDB executes the query on each sampled DB instance, thereby generating i.i.d. samples from the query-result distribution. MCDB uses Monte Carlo techniques to estimate interesting features of the query-result distribution—the expected value, variance, and quantiles of the query answer—along with probabilistic error bounds on the estimates. Importantly, a VG function takes as input one or more parameter tables (ordinary relations) that control the function's behavior, and produces as output a table containing one or more correlated data values.

For n Monte Carlo repetitions, MCDB does not actually materialize an uncertain database n times; the costs for such a naive approach would be exorbitant. Instead, MCDB executes a query plan only once over a set of tuple bundles rather than over ordinary tuples—no matter how many Monte Carlo repetitions are required—often leading to significant time savings. A tuple bundle encapsulates the instantiations of a tuple over a set of generated DB instances, and carries along the pseudorandom number seeds used by the VG functions to instantiate the uncertain data values.

Limitations of MCDB for Risk Analysis
Unfortunately, naive Monte Carlo, as implemented in the original MCDB prototype, is not the best tool for exploring the tails of a query-result distribution. Consider the query SELECT SUM(loss) AS totalLoss FROM t, where t.loss is an uncertain attribute, perhaps representing a future financial loss. Suppose that the query-result distribution is normally distributed, with a mean of $10 million and a standard deviation of $1 million. Suppose that interest centers on the upper tail of this distribution that corresponds to totalLoss values of $15 million or more. On average, roughly 3.5 million Monte Carlo repetitions are required before such an extremely high loss is observed even once. If our goal is simply to estimate the area of the tail (i.e., the probability of seeing a totalLoss value of $15 million or more), then 130 billion repetitions are required to estimate the desired probability to within ±1% with a confidence of 95%. If the user instead wants to define the tail indirectly starting at the value γ, where γ is the 0.999 quantile of the totalLoss distribution, then standard quantile-estimation techniques [19, Sec. 2.6] require roughly ten million Monte Carlo repetitions to estimate γ to within ±1% with a confidence of 95%.
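As a quick sanity check of these sample-size figures, the short sketch below reproduces the arithmetic in plain Python. This is back-of-the-envelope calculation rather than MCDB code; the 1.96 factor is the usual 95% normal critical value.

```python
# Back-of-the-envelope check of the sample sizes quoted above (a sketch,
# not MCDB code). Assumes totalLoss ~ Normal(mean = 10e6, sd = 1e6).
import math

mean, sd, threshold = 10e6, 1e6, 15e6
z = (threshold - mean) / sd                    # 5 standard deviations
p = 0.5 * math.erfc(z / math.sqrt(2.0))        # upper-tail probability, about 2.9e-7

print(f"P(totalLoss >= 15M) = {p:.3e}")
print(f"expected repetitions until one tail sample: {1.0 / p:,.0f}")   # ~3.5 million

# Naive Monte Carlo sample size for +/-1% relative error at 95% confidence
# (normal-approximation bound for estimating a proportion).
eps, z95 = 0.01, 1.96
n = (z95 ** 2) * (1.0 - p) / (p * eps ** 2)
print(f"repetitions to estimate the tail probability within 1%: {n:.2e}")   # ~1.3e11
```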
Our Contributions
In this paper, we describe an enhanced version of MCDB for risk analysis called MCDB-R. MCDB-R inherits the key strengths of MCDB. Chief among these is generality. MCDB can handle almost any stochastic model, as well as database operations such as aggregation, grouping, and so forth. However, unlike MCDB, MCDB-R supports efficient risk analysis. This presents several technical challenges: efficiently locating a tail of interest and efficiently sampling from the tail, all in the presence of black-box VG functions and complex SQL queries. Our contributions are as follows.
1. We adapt "cloning" and Gibbs-sampling ideas from the rare-event simulation literature to develop a statistical method for both estimating a user-specified quantile on a query-result distribution and then generating a set of samples from the tail.
2. We show how the tail-sampling method can be integrated into MCDB's current tuple-bundle processing infrastructure.
3. In the Appendix, we provide guidance on both setting the sampling parameters for MCDB-R and identifying situations in which MCDB-R will be effective.

Related Work
The original MCDB system was partially inspired by several efforts to push analytic capabilities into the database or "close to the data," such as MauveDB [10], which was oriented toward sensor data and integrated smoothing and interpolation methods. These ideas have been further developed in subsequent systems such as FunctionDB [21] and the SciDB project [20]. A recent discussion of analytics in the database can also be found in [6], and there have been a number of efforts to push statistical analyses into Map-Reduce data processing environments; see, e.g., [5, 12]. MCDB is also related to "probabilistic databases" (PrDBs) [7, 8]. Uncertainty in a PrDB is typically represented not by general stochastic models as in MCDB, but rather by explicit probabilities on alternative tuple values. MCDB can capture the functionality of a PrDB, but will not be as efficient or precise in cases where a PrDB can compute answers exactly and efficiently. Overall, MCDB is quite different from PrDBs in terms of motivation, design, and application space. The recent PIP system [14] combines PrDB and Monte Carlo techniques, yielding superior performance for certain MCDB-style queries with simple VG-function parameterizations. The current paper is also related to the general problem of "conditioning" a probabilistic database, as discussed in Koch and Olteanu [15]. The work in [15] focuses on exact probability calculations in the setting of PrDBs, whereas the current paper emphasizes Monte-Carlo-based approximations.

2. SPECIFYING QUERIES IN MCDB-R
In this section, we indicate what MCDB-R looks like to a user, via a simple example. Suppose that we want to assess potential total financial loss over a specified subset of our clients.
The steps in the analysis are to (1) define a stochastic loss model, (2) execute a SUM query over the uncertain loss values specified by the model, and then (3) report pertinent features of the total-loss distribution. For step (1), the user defines an uncertain loss table, just as in MCDB. Suppose for simplicity that we "model" the loss for a customer as being distributed according to a normal distribution with variance 1 and a mean that is customer specific. For this simple model, we use the built-in Normal VG function that generates normally distributed random numbers. The data needed to parameterize this function is stored on disk in a "parameter table" (an ordinary SQL table) called means(CID,m), which stores the mean loss for each customer. The uncertain table Losses(CID,val) that we wish to query is not stored on disk; only the schema is stored. The schema is specified by the following SQL-like statement:

    CREATE TABLE Losses (CID,val) AS
      FOR EACH CID IN means
        WITH myVal AS Normal(VALUES(m, 1.0))
      SELECT CID, myVal.*
      FROM myVal

This statement is identical to a standard SQL CREATE TABLE statement, except for the FOR EACH clause. The statement gives a recipe for constructing a sample instance of Losses. Scan through the CIDs in the means table. For each CID, create a one-row, one-column table myVal by invoking the Normal VG function, parameterized with the mean loss for the CID and the default variance value of 1.0. Form a row of the DB instance by joining myVal with the CID from the outer FOR EACH loop, as specified in the final SELECT clause. See [13] for more details on schema specification. For steps (2) and (3), we write a query as follows:

    SELECT SUM(val) as totalLoss
    FROM Losses
    WHERE CID < 10010
    WITH RESULTDISTRIBUTION MONTECARLO(100)
      DOMAIN totalLoss >= QUANTILE(0.99)
      FREQUENCYTABLE totalLoss

The first three lines correspond to a standard SQL (aggregation) query. Since this query is being executed over uncertain data, there is a probability distribution over the query result. The DOMAIN clause indicates that we want to condition the query-result distribution: we discard all possible query results not in the specified domain and then we renormalize the query-result distribution so that the total probability over the domain is 1. In our example, the domain coincides with the upper tail of the loss distribution corresponding to the highest 1% of losses. The MONTECARLO keyword indicates that we will use an approximate (conditioned) query-result distribution estimated from 100 Monte Carlo samples. The remaining lines of the WITH clause specify the statistical features of the query-result distribution that are to be computed. The FREQUENCYTABLE keyword causes creation of an ordinary table FTABLE(totalLoss,FRAC) in which the first column contains the distinct values of totalLoss observed over the 100 tail samples, and FRAC is the fraction of Monte Carlo samples in which each value was observed. By running the query SELECT MIN(totalLoss) FROM FTABLE we obtain an estimate of the lower tail-boundary, i.e., the extreme quantile of interest, which is accurate when the number l of Monte Carlo samples is large.¹ Similarly, by executing a query such as SELECT SUM(totalLoss * FRAC) FROM FTABLE we can estimate the expected shortfall, i.e., the expected loss, given that the loss was among the highest 1% of losses.

¹ As discussed in the next section, our tail-sampling algorithm also produces an estimate of the extreme quantile, but the two estimates will be almost identical for large l.
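To make the semantics of DOMAIN and FREQUENCYTABLE concrete, here is a minimal stand-alone Python sketch of what the query computes, using plain (naive) Monte Carlo rather than MCDB-R's tail-sampling machinery; the customer IDs and mean losses are invented for illustration.

```python
# Minimal sketch (plain Python, not MCDB-R) of the query's semantics:
# condition the query-result distribution on its upper 1% tail, then
# summarize the tail samples. Customer means are made-up values.
import random

random.seed(42)
means = {cid: 3.0 + (cid % 3) for cid in range(10000, 10010)}   # hypothetical means(CID,m)

def one_instance():
    """Generate one possible DB instance of Losses and evaluate the SUM query."""
    return sum(random.gauss(m, 1.0) for m in means.values())

# Plain Monte Carlo: many i.i.d. repetitions; keep the top 1% as the DOMAIN.
results = sorted(one_instance() for _ in range(100_000))
tail = results[int(0.99 * len(results)):]             # totalLoss >= QUANTILE(0.99)

# With continuous values every tail sample is distinct, so FRAC = 1/len(tail)
# for each row of the frequency table; the two queries over FTABLE become:
gamma_hat = min(tail)                                  # SELECT MIN(totalLoss) FROM FTABLE
shortfall = sum(tail) / len(tail)                      # SELECT SUM(totalLoss * FRAC) FROM FTABLE
print(f"estimated 0.99-quantile: {gamma_hat:.2f}")
print(f"expected shortfall:      {shortfall:.2f}")
```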
3. MONTE CARLO BACKGROUND
We next review results from the Monte Carlo literature and show how they are relevant to the problem of estimating an extreme quantile and efficiently sampling from the tail defined by the quantile. The key methods that we adapt are Gibbs sampling and cloning-based methods for rare-event simulation.

3.1 Gibbs Sampling
Gibbs sampling [11] is a technique that is designed for generating samples from a high-dimensional probability distribution function that is known only up to a normalizing constant. For simplicity of notation, we describe the method in the setting of simple discrete random variables; the technique extends straightforwardly to continuous and mixed random variables. Let X = (X_1, X_2, ..., X_r) be an r-dimensional random vector taking values in a discrete set X^r and distributed according to h, so that h(x_1, ..., x_r) = P(X_1 = x_1, ..., X_r = x_r) for x = (x_1, ..., x_r) ∈ X^r. A key requirement of Gibbs sampling is that we can efficiently generate samples from the conditional distributions h*_i(u|v) = P(X_i = u | X_j = v_j for j ≠ i), where 1 ≤ i ≤ r. According to this definition, h*_i is the conditional distribution of X_i, given the values of X_j for j ≠ i. For a vector x ∈ X^r, we set x_{−i} = (x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_r), and then we can define h and h*_i more concisely as h(x) = P(X = x) and h*_i(u|v) = P(X_i = u | X_{−i} = v) for u ∈ X and v ∈ X^{r−1}.

Given an initial value X^(0) = (X_1^(0), ..., X_r^(0)), the "systematic" Gibbs sampler in Algorithm 1 generates a sequence X^(0), ..., X^(k) of random vectors. The function GenCond(x_{−i}, i) invoked in Step 12 generates a sample from h*_i(·|x_{−i}). Since each new sample is generated recursively from the previous sample, the sequence {X^(j)} forms a Markov chain. The Gibbs sampler is a special case of a Markov Chain Monte Carlo (MCMC) sampling technique. In our database setting, the Gibbs sampler is more appealing than other MCMC methods because it is extremely generic in nature, requiring no special knowledge about the distribution h.

Algorithm 1 Systematic Gibbs sampler
 1: Inputs:
 2:   X^(0): initial random element of X^r
 3:   k: number of Gibbs updating steps
 4: Output:
 5:   X^(k): updated value of X^(0)
 6:
 7: Gibbs(X^(0), k):
 8:   x ← X^(0)  // Initialize
 9:   for j ← 1 to k do
10:     // Perform one systematic updating step: X^(j−1) → X^(j)
11:     for i ← 1 to r do
12:       x_i ← GenCond(x_{−i}, i)  // Generate from h*_i(·|x_{−i})
13:     end for
14:   end for
15:   Return x  // = X^(k)

Crucially, if the initial sample X^(0) is generated from h, then the chain will be stationary in that every subsequent sample will be distributed according to h [1, Th. XIII.5.1]. Although the samples are not statistically independent, under mild regularity conditions the random vectors X^(0) and X^(k) become increasingly independent as k increases.² This "convergence to independence" is usually exponentially fast, so that k need not be very large. (In our experiments taking k = 1 sufficed.)

² More precisely, the Markov chain associated with the Gibbs sampler is usually "irreducible" and "aperiodic" in an appropriate sense, and the chain "mixes" at an exponential rate; see [3].
Thus, if two chains {X^(j)} and {Y^(j)} start from the same state X^(0) = Y^(0) but the respective Gibbs updates are performed independently for the chains, then X^(k) and Y^(k) will become approximately independent as k increases, again typically at an exponentially fast rate. We illustrate the method further with an example that is pertinent to our tail-sampling problem. Suppose that r is large and that the components of X are statistically independent, so that h(x) = ∏_{i=1}^r h_i(x_i), where h_i(u) = P(X_i = u) is the marginal distribution of X_i for 1 ≤ i ≤ r. We assume that it is straightforward to generate samples from each h_i. Let Q be a real-valued function defined on X^r, and suppose that for some value c we want to generate samples from the conditional distribution h(x; c) = P(X = x | Q(X) ≥ c) = h(x) I(Q(x) ≥ c)/p_c. Here I(A) = 1 if event A occurs and I(A) = 0 otherwise, and p_c = P(Q(X) ≥ c) = Σ_{x∈X^r : Q(x)≥c} h(x). If the cardinality of X is very large, then p_c is extremely hard to compute, so that h(x; c) is known only up to the constant of proportionality p_c, and direct generation of samples from h(x; c) is nontrivial.

We can, however, apply the Gibbs sampler. Note that, by the independence of the components of X, we have h*_i(u|v) = P(X_i = u | Q(X_i ⊕_i v) ≥ c). Here u ⊕_i v denotes the vector (v_1, ..., v_{i−1}, u, v_i, ..., v_{r−1}). Thus, samples from h*_i in Step 12 can be generated by a rejection algorithm (Algorithm 2).

Algorithm 2 Rejection algorithm for example Gibbs sampler
1: GenCond(v, i):
2:   repeat
3:     Generate u according to h_i
4:   until Q(u ⊕_i v) ≥ c
5:   Return u

To illustrate one especially simple case, suppose that Q(x) = x_1 + ··· + x_r. Then Q(u ⊕_i v) = u + Σ_j v_j, so that h*_i(u|v) = P(X_i = u | X_i ≥ c − Σ_j v_j). An efficient implementation of Algorithm 2 would operate on an existing Gibbs iterate x ∈ X^r by subtracting x_i from Q(x), thereby computing q^− = Σ_{j≠i} x_j, and then generating samples u from h_i until u + q^− ≥ c. In the following, we write Gibbs(X^(0), k, c) and GenCond(v, i, c) to indicate the dependence of these two algorithms on the parameter c.

To see the relevance of the above example to MCDB-R, consider a database consisting of a single table R having a single, random attribute A and exactly r tuples in all possible worlds (e.g., like the Losses table in Section 2 with the CID attribute dropped). Suppose that we interpret X_i above as the random variable corresponding to the value of the ith tuple (1 ≤ i ≤ r) and Q as an aggregation query over the values in R. Then the foregoing discussion shows that if we have somehow obtained a DB instance D^(0) sampled from h(x; c)—i.e., an instance that is "large" in the sense that Q(D^(0)) ≥ c—we can run the Gibbs sampler with the rejection algorithm for k steps to obtain another random instance D^(k) that is (approximately) independent of D^(0) and also "large"—i.e., also distributed according to h(x; c). That is, once we have obtained a DB instance corresponding to an upper tail of the query-result distribution, we can generate additional, independent instances that also yield query results lying in the tail.

In the general MCDB-R setting, we can still interpret X as a random vector in which each component corresponds to an uncertain value in the database, but the components of X decompose into mutually independent blocks, where the variables within a block are dependent and are all generated via a call to a specified VG function. Moreover, the number of variables in a block—i.e., the number of values returned by a VG function—may vary over DB instances. Although formal discussion of Gibbs sampling at this level of generality would lead to very cumbersome notation, the basic idea is the same as above.
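The following is a compact, self-contained sketch (plain Python, not MCDB-R code) of Algorithms 1 and 2 for the simple SUM case just described, with independent Normal(m_i, 1) components; the means and the cutoff c are arbitrary illustrative values.

```python
# Sketch of Algorithms 1 and 2 for the SUM example: independent components
# X_i ~ Normal(means[i], 1), Q(x) = sum(x), and we perturb a vector x that
# already satisfies Q(x) >= c so that it stays in the conditioned set.
import random

means = [3.0, 4.0, 5.0]                      # marginal h_i = Normal(means[i], 1)

def Q(x):
    return sum(x)

def gen_cond(x, i, c):
    """Rejection step (Algorithm 2): resample coordinate i of x, keeping Q >= c."""
    q_minus = Q(x) - x[i]                    # sum of the other coordinates
    while True:
        u = random.gauss(means[i], 1.0)      # propose from the unconditional h_i
        if u + q_minus >= c:                 # accept only if the query result stays "large"
            return u

def gibbs(x, k, c):
    """Systematic Gibbs sweep (Algorithm 1), k full updating steps."""
    x = list(x)
    for _ in range(k):
        for i in range(len(x)):
            x[i] = gen_cond(x, i, c)
    return x

# Example: start from a state already in the tail {Q(x) >= 14} and perturb it.
x0 = [4.5, 4.8, 5.2]                         # Q(x0) = 14.5 >= c
x1 = gibbs(x0, k=1, c=14.0)
print(x1, Q(x1))                             # the new state still satisfies Q >= 14
```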
3.2 Rare-Event Simulation
The foregoing discussion shows that Gibbs sampling can be used to generate essentially independent samples from a tail of the query-answer distribution, starting from a "large" DB instance. However, such a starter instance can be hard to obtain, even when the quantile that defines the tail is known a priori. To address this challenging problem, we adapt ideas from the literature on "rare-event simulation"—see, e.g., [17] for an overview. The primary goal of rare-event simulation is to estimate the probabilities of very infrequent events; quantile estimation has received very little attention. The two main techniques for rare-event simulation are importance sampling (IS) and cloning, also called splitting. The idea in IS is to sample from a modified possible-DB distribution (by suitably modifying the VG functions) such that extreme DB instances are generated with high frequency. Due to numerical instability, however, IS is known to behave badly for high-dimensional problems [2]. In our setting, the effective dimension typically is roughly equal to the number of tuples in a relation, rendering IS unusable. We therefore look to cloning techniques. The cloning approach has been of interest to the simulation community for a long time, but only recent work [2, 4, 18] has considered the type of "static" Monte Carlo simulations that are of interest in our setting. These algorithms start with a set of n randomly generated "particles." Each particle gives rise to a Markov chain that is obtained by repeatedly applying a random MCMC update ("perturbation") to the particle. The set of particles is "enriched" by deleting "non-extreme" particles and cloning "extreme" particles. The "Gibbs cloning" framework of Rubinstein [18] provides a general formulation of these ideas that we adapt for our purposes.

3.3 Adapting the Gibbs Cloner
Adapting and specializing the Gibbs cloning framework to our setting—the application to quantiles is novel—yields the basic procedure of Algorithm 3. In the algorithm, we represent a random database D as a vector of random variables, as per our prior discussion. (We treat each deterministic data value c as a random variable that is equal to c with probability 1.) In the algorithm, we maintain a set S of DB instances over a sequence of bootstrapping steps, so called because we are "bootstrapping" our way out to the tail. At the start of the ith step, S contains n_i instances, and the algorithm proceeds by (1) purging all but the top 100p_i% "elite" (most extreme) instances, (2) cloning the elite instances to increase the size of S up to n_{i+1}, and then (3) perturbing each instance in S using the Gibbs sampler, in order to re-establish (approximate) independence while ensuring that all instances yield query results that lie in the current tail. The function Clone(S, n) operates by simply duplicating each DB instance in S approximately n/|S| times.
Algorithm 3 Basic tail-sampling algorithm
 1: Inputs:
 2:   p: target upper-tail probability
 3:   l: desired number of tail samples
 4: Outputs:
 5:   γ̂: estimate of (1 − p)-quantile
 6:   S: set of l samples from h(·; γ̂)
 7: Parameters:
 8:   k: number of Gibbs updating steps
 9:   m: number of bootstrapping steps
10:   n_1, n_2, ..., n_m: intermediate sample sizes
11:   p_1, p_2, ..., p_m: intermediate tail probabilities
12:
13: // Initialize
14: Generate databases D^(1), ..., D^(n_1) i.i.d. according to h(·)
15: S ← {D^(1), D^(2), ..., D^(n_1)}
16: n_{m+1} ← l
17: // Execute m bootstrapping steps
18: for i ← 1 to m do
19:   γ̂_i ← the (p_i|S|)-largest element of {Q(D) : D ∈ S}
20:   Discard all elements D ∈ S with Q(D) < γ̂_i
21:   S ← Clone(S, n_{i+1})
22:   for D ∈ S do  // Gibbs-update step
23:     D ← Gibbs(D, k, γ̂_i)  // Gibbs defined as in Sec. 3.1
24:   end for
25: end for
26: return γ̂ = γ̂_m and S

The Appendix describes how to choose the algorithm parameters to efficiently control the relative difference between the desired upper-tail probability p and the actual probability of the upper tail defined by the quantile estimator γ̂. In particular, we show that this relative error is minimized by setting n_i = n and p_i = p^{1/m} for 1 ≤ i ≤ m. Here n = N/m, where N is the total number of samples over all bootstrapping steps; increasing N increases cost but improves accuracy. We assume these values throughout. After the ith purge, the smallest query-result value for the retained elite samples is an estimate of γ_i, defined to be the (1 − p^{i/m})-quantile of the query-result distribution; thus the γ_i's are increasing and γ_m corresponds to the extreme quantile of interest. At each step i, we estimate a (1 − p^{1/m})-quantile of F_{i−1}, the conditional distribution of the query result Q(D), given that Q(D) ≥ γ_{i−1}. (Here we take γ_0 = −∞.) For typical values of, say, p = 0.001 and m = 4, we see that even though we ultimately need to estimate an extreme 0.999-quantile, at each step we merely need to estimate a 0.82-quantile, a much easier problem.
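Putting Algorithm 3 together with the conditioned Gibbs/rejection step of Section 3.1, the following is a self-contained Python sketch for the toy SUM-of-independent-normals model. It is a simplification (in-memory instances, equal n per step) rather than the MCDB-R implementation, and the parameter values are illustrative.

```python
# Self-contained sketch of Algorithm 3 for a SUM over independent normals:
# purge the non-elite instances, clone the elites, and re-randomize each clone
# with a conditioned Gibbs sweep. Plain Python, not the MCDB-R engine.
import random

means = [3.0, 4.0, 5.0]                              # one independent Normal(m_i, 1) per value

def Q(x):                                            # the aggregation query (SUM)
    return sum(x)

def gibbs(x, k, cutoff):
    """Conditioned Gibbs sweep of Sec. 3.1: k systematic passes, rejection per coordinate."""
    x = list(x)
    for _ in range(k):
        for i in range(len(x)):
            rest = Q(x) - x[i]
            while True:                              # rejection step (Algorithm 2)
                u = random.gauss(means[i], 1.0)
                if u + rest >= cutoff:
                    x[i] = u
                    break
    return x

def tail_sample(p, l, m=4, n=200, k=1):
    """Return (quantile estimate, l approximate samples from the upper-p tail of Q)."""
    p_i = p ** (1.0 / m)                             # per-step tail probability p^(1/m)
    S = [[random.gauss(mu, 1.0) for mu in means] for _ in range(n)]
    cutoff = float("-inf")
    for step in range(m):
        S.sort(key=Q, reverse=True)
        elite = S[: max(1, int(p_i * len(S)))]       # keep the top 100*p^(1/m) percent
        cutoff = Q(elite[-1])                        # current quantile estimate (gamma_hat_i)
        target = l if step == m - 1 else n
        S = [list(elite[j % len(elite)]) for j in range(target)]   # Clone
        S = [gibbs(x, k, cutoff) for x in S]         # re-establish approximate independence
    return cutoff, S

random.seed(1)
gamma_hat, tail = tail_sample(p=0.001, l=100)
print(f"estimated 0.999-quantile of SUM: {gamma_hat:.2f}")   # true value is about 17.35
print(f"smallest tail sample:            {min(map(Q, tail)):.2f}")
```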
4. IMPLEMENTATION OVERVIEW
In the following sections, we describe how we implemented tail sampling (or TS for short) as in Algorithm 3. The goal is to preserve, to the extent possible, the computational efficiency of the MCDB tuple-bundle processing mechanism as discussed previously.

4.1 The Gibbs Looper and Data Streams
In the MCDB-R setting, we call the iterative process in Algorithm 3 the Gibbs Looper. There are two key variables that track the progress of this algorithm: cutoff and curQuantile. In terms of our previous notation, cutoff is the γ_i value that defines the current tail and, after i iterations, curQuantile = p^{i/m} is the probability that a query-result instance falls in the current tail. At each iteration of the Gibbs Looper, each of the n versions (i.e., realized instances) of the database is randomly perturbed by the Gibbs sampler to form n new DB versions. Initially, a cloned database has the same contents as its parent, but as each DB version is perturbed by its Gibbs sampler, parent and child will drift apart. To "fuel" the perturbation, the Gibbs sampler needs access to a stream of random data. There is a data stream associated with every uncertain data value (or correlated set of uncertain data values) in the database. E.g., consider the random Losses table from Section 2. The tuple bundle corresponding to a given customer is initially enhanced with a seed for use by the pseudorandom number generator (PRNG). Repeated execution of the Normal VG function, parameterized with the customer's mean loss value m, produces a stream of realized loss values for the customer. In MCDB, the first n elements of the stream simply correspond to the customer's losses in n Monte Carlo repetitions, i.e., in n generated DB instances. In MCDB-R, because of the rejection algorithm (Algorithm 2) used in the Gibbs sampler, there is a more complicated mapping from DB instance to position in the stream, but the Gibbs sampler nonetheless goes to the stream whenever it needs a loss value for the customer. We will often refer to a stream of data via the PRNG seed that was used to produce it. In general, a VG function may produce a table of correlated data values when called, so an "element" of a stream may actually comprise a set of values. Also, a given PRNG seed may occur in multiple tuples—due to a 1-to-m join of the random table to another table—or multiple times in a tuple bundle due to a self-join.

4.2 An Example
We illustrate the Gibbs Looper in more detail using the total-loss query of Section 2 as an example; for brevity, we drop the CID column from the Losses and means tables, as well as the selection predicate on CID from the query itself. Suppose that there are three customers and that the corresponding three rows in the means table have values 3.0, 4.0, and 5.0, respectively. Also suppose that p = 1/32 and n = 4, so that we want to generate a sample of four total losses that are each in the top 3.125% of the total-loss (i.e., query-result) distribution. We will use m = 5 iterations of the Gibbs Looper to generate the samples. First, the underlying VG function is used to generate three streams of random data, as depicted in Fig. 1. The first four values produced by each VG function are assigned, in sequence, to each of the n = 4 DB versions, as in Fig. 1(a). After this initialization, iteration i = 1 of the Gibbs Looper begins. First, the two DB versions with final SUM values in the lowest 100(1 − p^{1/m})% (50% in our example) are discarded, and the remaining two elite DB versions are cloned. "Cloning" is accomplished by duplicating the assignment of random values from each stream of random data to the various DB versions, as depicted in Fig. 1(b). The current value for cutoff at this point is set to be 12.07, since this is the aggregate value associated with the least-extreme remaining DB version. cutoff also serves as an estimate for the point at which there is only a 100p^{1/m}% (= 50%) chance of seeing a larger value for the answer to the SUM query. Since two of the DB versions have been duplicated, it is necessary to perturb all of the versions so that they are sufficiently differentiated. In our example, the Gibbs sampler begins the perturbation by attempting to replace the value 3.26 that is associated with the first stream of loss values (the stream associated with mean value 3.0). This is done by considering the next unassigned random value associated with mean 3.0—indicated in bold in Fig. 1(b). Unfortunately, this value (which is 3.24) cannot be used to replace 3.26, because it would result in an overall aggregate value of 12.05, which does not meet or exceed the current value of cutoff = 12.07. Thus, the 3.24 is rejected. Next, the 5.13 is considered. Since the 5.13 would result in an overall aggregate value of 13.94 (which exceeds cutoff), the 5.13 is accepted. The procedure is repeated using the other two streams.
The 4.59 associated with mean.val = 4.0 is replaced with 4.55 (acceptable since this decreases the aggregate value to 13.90, which exceeds the cutoff of 12.07), and the 4.22 associated with mean.val = 5.0 is replaced with 4.29, resulting in the state of Fig. 1(c). DB version two is updated in the same way. The value 3.26 is replaced with 4.07 (acceptable, since the new aggregate value is 12.88). An attempt is made to replace the 4.59 associated with mean.val = 4.0 using 3.68. However, this would reduce the overall aggregate for DB version two to 11.97, which does not meet or exceed the current value for cutoff. So the 3.68 is rejected, and 5.16 (the next unassigned value associated with mean.val = 4.0) is chosen. Finally, the 4.22 is replaced with 5.77, resulting in Fig. 1(d). After DB versions three and four are updated, the result is Fig. 1(e). Then iteration i = 2 begins, and cloning the top two DB versions gives us Fig. 1(f). This whole process is repeated until the end of iteration i = m.

4.3 Implementing A Gibbs Looper
In practice, there are many possible ways to implement a Gibbs Looper in the presence of complicated database operations such as joins. The simplest would maintain n different random versions of the database. Then, in each iteration of the Gibbs looper (where an "iteration" is the process of perturbing all n DB versions) and for each DB version, we would undertake the following steps. For each stream of random data, we would (1) replace the current stream element that is being used in the DB version with a new stream element, (2) run the query over the modified DB version, (3) accept the new stream element if the new query result exceeds cutoff and otherwise try steps (1) and (2) again. The drawback of such an implementation is that it requires running the underlying query many, many times over many, many random databases. For example, imagine that we will maintain n = 100 DB versions, each requiring a modest one million streams of random data. Imagine that the Gibbs Looper needs ten iterations to complete and that we need to reject (on average) ten random values before one is accepted. In this case, we will need to run an entire query plan 100 × 10^6 × 10 × 10 = 10^10 different times to complete the TS procedure, which is obviously unacceptable. We therefore use MCDB's tuple-bundle trick of running the underlying query only once, no matter how many DB versions are needed. The extensions to MCDB's tuple-bundle mechanism are many and intricate; we now sketch the key ideas.

5. GIBBS TUPLE BUNDLES
As with MCDB, query execution in MCDB-R is organized around the idea of bundling together the different versions of a tuple over the generated DB instances. To execute a query, MCDB-R runs a plan that pushes a stream of instantiated tuple bundles into a special GibbsLooper operation. These bundles must have all of the information necessary to implement the process illustrated in Fig. 1, i.e., given a DB version and a stream of random data, the Gibbs Looper must be able to determine which random value from the stream is assigned to the DB version, and then efficiently determine the effect of an update to the value on the final query result. This requires that MCDB tuple bundles be augmented with a notion of "lineage" that links each random value in the database to the stream from which it came. Gibbs tuples therefore differ from MCDB tuple bundles. Instead of containing a PRNG seed, a Gibbs tuple contains a pointer to a "tail-sampling seed", or TS-seed for short.
This object, described in Section 6 below, augments the basic PRNG seed with "bookkeeping" information needed by the Gibbs Looper.

[Figure 1: Running the tail-sampling process to obtain four DB instances]

In an MCDB bundle, PRNG seeds may be discarded if they are no longer needed. In MCDB-R, a TS-seed handle can never be discarded, because each random value must always point to its associated PRNG (i.e., associated stream).
Another important difference is that, whereas an instantiated MCDB tuple bundle contains exactly n stream elements per random value—one for each Monte Carlo repetition—an instantiated Gibbs bundle contains many more elements, because the Gibbs Looper can use up many stream elements during the Gibbs perturbation process.³ Indeed, the Gibbs Looper may run out of data during processing and need to re-execute part of the query plan (see Section 9). The number of stream elements to instantiate in a Gibbs tuple is chosen to trade off the cost of carrying lots of data throughout the query plan with the cost of re-executing part of the query if the data runs out.

³ Because instantiated values are often needed to prepare tuple bundles for the Gibbs Looper—see the example below—they cannot be simply generated on the fly during Gibbs-Looper operation.

For an example of how Gibbs tuples are pushed through a query plan, consider the following query, which determines a company's total salary "inversion," resulting from certain employees who earn more than their managers:

    SELECT SUM(emp2.sal - emp1.sal)
    FROM emp AS emp1, emp AS emp2, sup
    WHERE sup.boss = emp1.eid AND emp1.sal < 90K
      AND sup.peon = emp2.eid AND emp2.sal > 25K
      AND emp2.sal > emp1.sal

Imagine that the attribute emp.sal is random, and we want three samples from the tail of the query-result distribution. The resulting query plan and its execution are illustrated in Fig. 2. This plan operates over Gibbs tuples rather than ordinary tuples. An instantiated Gibbs tuple can have one or more arrays of random values; every array of random values is associated with exactly one PRNG seed. A Gibbs tuple can also have one or more isPres attributes. An array of isPres values is created when a selection predicate is applied to a random attribute, and indicates for each DB instance whether or not the predicate is satisfied.

[Figure 2: Gibbs-tuple processing over a multi-table query plan]
If the predicate is not satisfied in any DB instance, then the entire Gibbs tuple is dropped. (Unlike MCDB, we cannot simply record isPres values for the entire tuple, since attribute values change individually during Gibbs sampling.) Finally, note that the query plan makes use of the MCDB-R Seed and Instantiate operations. The former operation attaches the handle for a TS-seed to each Gibbs tuple, and the latter operation uses a PRNG seed to attach a subsequence from a stream of random values to the Gibbs tuple.

6. TAIL-SAMPLING SEEDS
A TS-seed contains both the PRNG seed needed to instantiate a stream of random values and the information needed to map the current value of an attribute in a DB version to a position in the stream. This mapping is needed to compute a query-result value for the current DB version. E.g., in Fig. 2, the query plan produces a set of three Gibbs tuples that are piped into the GibbsLooper operation. Suppose that we have the assignment (seed1, 1), (seed2, 1), (seed3, 2), (seed4, 2) for a DB version. That is, Joe's salary is $28K (the first value in his stream), Ann's salary is $39K (the second value in her stream), and so forth. Then the query result for this DB version is ($28K - $24K) = $4K, since Joe makes $4K more than Sue and no other employee makes more than their boss. In MCDB-R, a TS-seed contains (1) a TS-seed identifier, (2) the actual PRNG seed used to produce a stream of random data, (3) the range of stream values currently materialized and present within the Gibbs tuples, (4) the last random value in that range that has previously been assigned to any DB version for this TS-seed, and (5) the random value currently assigned to each DB version for this TS-seed. Item (5) is the mapping described above, and item (4) is needed during Gibbs rejection sampling to obtain a new random value. Item (3) is needed to quickly determine whether the Gibbs Looper has run out of data and, if so, the position in the stream from which to start replenishing. A call to the Seed operator both adds a seed handle to a Gibbs tuple and creates the actual TS-seed data structure for the tuple, which is written to disk.

7. THE GIBBS LOOPER
The Appendix describes the GibbsLooper operator in detail. A key feature is that it switches the inner and outer for loops of Algorithm 3. Rather than perturb DB versions one at a time, looping through the random data values (i.e., the TS-seeds), the algorithm perturbs data values one at a time, looping through the DB versions, thereby amortizing expensive data scans. Specifically, GibbsLooper inserts the Gibbs tuples into a disk-based priority queue; the initial sort key for a Gibbs tuple is the smallest TS-seed handle in the tuple. GibbsLooper then iterates through the overall set of TS-seed handles in increasing order. For each TS-seed handle, it loads the TS-seed—and all "associated" Gibbs tuples that contain at least one handle for the TS-seed—into memory. Then, for the first DB version, it consults the TS-seed to determine the first unused stream value to try in the rejection algorithm. It uses the associated Gibbs tuples to compute the changed value of the query result for the DB version due to the new stream value, and then checks if the new query-result value still exceeds cutoff. If so, then GibbsLooper accepts the new stream value and moves on to update the next DB version. If not, it tries again until an appropriate stream value is found. Once all DB versions have been updated, it reinserts all of the Gibbs tuples into the priority queue—first updating the sort key for each reinserted tuple to be the next largest TS-seed handle in the tuple—and moves on to the next overall TS-seed handle. (Thus a Gibbs tuple will be processed multiple times if it contains multiple TS-seed handles.) GibbsLooper is efficient because it essentially merges Gibbs tuples in the disk-based priority queue with a sorted file containing all of the TS-seeds. When a TS-seed is loaded into memory, the priority queue is used to quickly obtain all of the Gibbs tuples that could possibly be affected by an update of the TS-seed.
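The in-memory sketch below (Python; names, fields, and parameter values are illustrative only) shows the TS-seed bookkeeping of Section 6 and the seed-at-a-time perturbation order of Section 7 for the simple three-stream SUM example of Section 4.2. The real operator works over disk-resident Gibbs tuples and a priority queue; here everything lives in memory.

```python
# Simplified, in-memory sketch of TS-seed bookkeeping and the seed-at-a-time
# perturbation loop. Not MCDB-R's actual data layout.
import random
from dataclasses import dataclass, field

@dataclass
class TSSeed:
    seed_id: int                  # (1) TS-seed identifier
    prng_seed: int                # (2) PRNG seed defining the value stream
    mean: float                   # model parameter for this toy example (not in a real TS-seed)
    materialized: int = 1000      # (3) number of stream values currently materialized
    last_used: int = -1           # (4) last stream position handed out to any DB version
    assigned: dict = field(default_factory=dict)   # (5) DB version -> stream position

    def stream(self, pos):
        """Deterministically reproduce the pos-th value of this seed's stream."""
        return random.Random(self.prng_seed * 1000003 + pos).gauss(self.mean, 1.0)

seeds = [TSSeed(i, 7000 + i, mu) for i, mu in enumerate([3.0, 4.0, 5.0])]
n_versions = 4

# Initial mapping: the i-th stream value is assigned to DB version i.
for s in seeds:
    s.assigned = {v: v for v in range(n_versions)}
    s.last_used = n_versions - 1

def query_result(v):
    return sum(s.stream(s.assigned[v]) for s in seeds)

# Purge/clone: overwrite the non-elite half of the versions by copying, inside
# each TS-seed, the assignment column of an elite version (cf. the Appendix).
ranked = sorted(range(n_versions), key=query_result, reverse=True)
elite, others = ranked[: n_versions // 2], ranked[n_versions // 2:]
cutoff = query_result(elite[-1])
for s in seeds:
    for j, v in enumerate(others):
        s.assigned[v] = s.assigned[elite[j % len(elite)]]

# One perturbation pass: outer loop over TS-seeds in handle order, inner loop
# over DB versions, rejecting proposals until the result stays at or above cutoff.
for s in sorted(seeds, key=lambda t: t.seed_id):
    for v in range(n_versions):
        rest = query_result(v) - s.stream(s.assigned[v])
        while True:
            s.last_used += 1                         # next unused stream value
            if s.last_used >= s.materialized:        # out of data: re-run the plan (Sec. 9)
                raise RuntimeError("stream exhausted; replenishment needed")
            if rest + s.stream(s.last_used) >= cutoff:
                s.assigned[v] = s.last_used          # accept; move to the next DB version
                break

print("cutoff:", round(cutoff, 2))
print("perturbed query results:", [round(query_result(v), 2) for v in range(n_versions)])
```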
8. JOINS ON RANDOM ATTRIBUTES
One issue that demands more explanation is how GibbsLooper handles join predicates on random attributes. E.g., suppose R1 joins with R2 on a random attribute, and R2 joins with R3 on a deterministic attribute. Suppose that some tuple t2 in R2 is modified when one of its random attributes is updated, so that it now joins with both a tuple t1 from R1 and a tuple t3 from R3; the result is that a new tuple "pops" into existence. Fortunately, this case is easily handled by MCDB-R. When the underlying query plan is run (before GibbsLooper is ever invoked), the original MCDB Split operation is run on any random attribute or attributes that are to be joined. This operation changes the attributes so that they are no longer random and transfers any non-determinism to the isPres attribute. E.g., consider a Gibbs tuple g with schema (fname, age) where attribute fname is constant, and attribute age is non-constant. Specifically, age = 20 for the first and third values in the stream of random data stored in the Gibbs tuple, and age = 21 for the second and fourth values: g = (Jane,(20,21,20,21),(Y,Y,Y,Y)), where the last nested vector contains the isPres values, and indicates that Jane appeared in all four Monte Carlo repetitions (though with varying ages). An application of the Split operation to g with Atts = {age} yields two Gibbs tuples: g1 = (Jane,20,(Y,N,Y,N)) and g2 = (Jane,21,(N,Y,N,Y)); see the sketch following Section 9. Then, any join on the age attribute is a join on a deterministic attribute, and we will not have the problem of new Gibbs tuples "popping" into existence, since two sets of Gibbs tuples (each associated with one of Jane's possible ages) would both be created and sent to GibbsLooper. Furthermore, only one of them can be valid at a time: if, e.g., Jane's age is updated to be 21, then the mapping information within the TS-seed associated with Jane's age is updated as well, causing any Gibbs tuple associated with an age of 20 to become invalid.

9. RE-RUNNING THE QUERY PLAN
Since each Gibbs tuple contains only a finite amount of data, the Gibbs Looper may run out of data while perturbing a DB version. If so, MCDB-R discards all existing Gibbs tuples, empties the GibbsLooper priority queue, and then runs the entire query plan again to produce the required data. The newly-created Gibbs tuples that contain the data are piped into the GibbsLooper operation. During such a "replenishing" run, query-plan execution differs from the first time around in a couple of ways. First, the result of each deterministic part of the query plan is materialized and saved during the first run, avoiding the need to repeat these computations during replenishment. Second, the Instantiate operation never adds stream values to a Gibbs tuple that have already been processed; it only adds new or currently assigned values.
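Returning to the Split operation of Section 8, the toy sketch below shows the transformation it performs on a Gibbs tuple. The representation (plain dicts and lists) and the function name are simplified illustrations, not MCDB-R's actual tuple-bundle layout.

```python
# Toy sketch of Split: move the non-determinism of a random attribute into the
# isPres array so that later joins on that attribute see only constant values.

def split(gibbs_tuple, att):
    """gibbs_tuple: dict with constant attributes, one array-valued random
    attribute `att`, and a parallel 'isPres' array. Returns one Gibbs tuple
    per distinct value of `att`, each with a constant `att`."""
    values = gibbs_tuple[att]
    is_pres = gibbs_tuple["isPres"]
    out = []
    for v in sorted(set(values)):
        t = dict(gibbs_tuple)
        t[att] = v                                   # attribute is now constant
        t["isPres"] = [p and (x == v) for x, p in zip(values, is_pres)]
        out.append(t)
    return out

g = {"fname": "Jane", "age": [20, 21, 20, 21], "isPres": [True, True, True, True]}
for t in split(g, "age"):
    print(t)
# -> {'fname': 'Jane', 'age': 20, 'isPres': [True, False, True, False]}
# -> {'fname': 'Jane', 'age': 21, 'isPres': [False, True, False, True]}
```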
10. CONCLUSION
MCDB-R extends the MCDB system to permit risk evaluation in the database. Space constraints have forced us to exclude some important material from the paper. Some very brief benchmarking results, given in the Appendix, provide evidence for the accuracy of the method and show a reduction in processing time from roughly 18 hours in MCDB to 11 minutes in MCDB-R.

Acknowledgements
This work was supported in part by the US DoE under grant no. DE-SC0001779 and the US NSF under grant no. 0915315, and by a Sloan Foundation Fellowship.

11. REFERENCES
[1] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer, 2007.
[2] Z. I. Botev and D. P. Kroese. An efficient algorithm for rare-event probability estimation, combinatorial optimization, and counting. Methodol. Comput. Appl. Prob., 10:471–505, 2008.
[3] R. C. Bradley. Basic properties of strong mixing conditions: A survey and some open questions. Probab. Surveys, 2:107–144, 2005.
[4] F. Cérou, P. D. Moral, T. Furon, and A. Guyader. Rare event simulation for a static distribution. INRIA Research Report 6792, Rennes, France, 2009.
[5] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, pages 281–288, 2006.
[6] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. PVLDB, 2(2):1481–1492, 2009.
[7] N. N. Dalvi, C. Ré, and D. Suciu. Probabilistic databases: diamonds in the dirt. Commun. ACM, 52(7):86–94, 2009.
[8] A. Das Sarma, O. Benjelloun, A. Y. Halevy, S. U. Nabar, and J. Widom. Representing uncertain data: models, properties, and algorithms. VLDB J., 18(5):989–1019, 2009.
[9] H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
[10] A. Deshpande and S. Madden. MauveDB: supporting model-based user views in database systems. In SIGMOD, pages 73–84, 2006.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intelligence, 6(6):721–741, 1984.
[12] S. Guha. RHIPE - R and Hadoop Integrated Processing Environment. http://ml.stat.purdue.edu/rhipe/.
[13] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. MCDB: a Monte Carlo approach to managing uncertain data. In ACM SIGMOD, pages 687–700, 2008.
[14] O. Kennedy and C. Koch. PIP: A database system for great and small expectations. In ICDE, pages 157–168, 2010.
[15] C. Koch and D. Olteanu. Conditioning probabilistic databases. PVLDB, 1(1):313–325, 2008.
[16] A. J. McNeil, R. Frey, and P. Embrechts. Quantitative Risk Management: Concepts, Techniques, and Tools. Princeton University Press, 2005.
[17] G. Rubino and B. Tuffin, editors. Rare Event Simulation Using Monte Carlo. Wiley, 2009.
[18] R. Rubinstein. The Gibbs cloner for combinatorial optimization, counting, and sampling. Methodol. Comput. Appl. Prob., 11(4):491–549, 2009.
[19] R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, 1980.
[20] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik. Requirements for science data bases and SciDB. In CIDR, page 26, 2009.
[21] A. Thiagarajan and S. Madden. Querying continuous functions in a database system. In SIGMOD, pages 791–804, 2008.

APPENDIX

A. THE GIBBS LOOPER IN DETAIL

A.1 Initialization
Aside from n, m, and p (see Section 3.3), GibbsLooper takes as input five principal arguments:
1. The pipeline producing the input stream of Gibbs tuples.
2. The aggregate operation from whose tail distribution GibbsLooper will sample, e.g., sum(sal2-sal1) in Fig. 2.
3. The final selection predicate (e.g., sal2 > sal1 in Fig. 2) that must be applied to each tuple prior to inclusion of the tuple's value in the aggregate. In particular, any selection predicate that references random attributes generated by more than one PRNG seed must be pulled up into GibbsLooper. Such a predicate cannot be applied lower in the query plan, because the Gibbs sampling will arbitrarily combine random values from the streams associated with different PRNG seeds, and it is not possible to know the outcome of the predicate before GibbsLooper runs its Gibbs sampler.
4. The grouping attributes (if any) for the query.⁴
5. A file containing all of the TS-seed objects, sorted on each object's handle.

⁴ Grouping is handled by, in effect, treating a GROUP BY query over g groups as g separate, simultaneous queries, each with a selection predicate that limits the query to a specific group.

Given these inputs, GibbsLooper initially processes the stream of incoming Gibbs tuples by computing the initial aggregate value for each of the n DB versions and inserting the tuples into a disk-based priority queue, sorted by the smallest TS-seed handle in each Gibbs tuple (as discussed in Section 7). The initial aggregate values are computed for the TS-seed mapping in which the ith value in each stream is mapped to the ith DB version for 1 ≤ i ≤ n. Next, GibbsLooper discards the "non-elite" DB versions whose aggregate totals are in the bottom 100(1 − p^{1/m})% by overwriting them with clones of the elite versions. The cutoff value is set equal to the smallest aggregate value among the elite versions. The overwriting of a non-elite DB version by a clone of an elite version is accomplished during a single read/write pass through the TS-seed file (see below).

A.2 Performing the Perturbation
At this point, the GibbsLooper is ready to begin perturbing the various DB versions using a Gibbs sampler. The manner in which the perturbation is accomplished is best illustrated using an example, given as Fig. 3. Fig. 3(a) begins with the three Gibbs tuples that were output from the query plan of Fig. 2, inserted into the GibbsLooper's priority queue. (For simplicity, we drop the isPres attributes and look at only two DB instances.) The four TS-seeds are also depicted, as are the two DB versions that correspond to the initial TS-seed mapping (described above). The goal at this point is to perturb these versions via Gibbs sampling, subject to the constraint that the total aggregate value must meet or exceed cutoff = $1K. Note that while the evolving DB versions are depicted in the figure, they are never actually materialized; they are completely determined, however, by the current state of the Gibbs tuples and the TS-seeds. To obtain two new DB versions with SUM values that meet or exceed cutoff, the first TS-seed handle, (seed1), is considered, and the lone associated Gibbs tuple is removed from the priority queue, as in Fig. 3(b). DB version one is updated by assigning the first unassigned random value associated with seed1 to the DB version, transforming (24K, Sue, 28K, Joe) into (24K, Sue, 25K, Joe).
This decreases the final aggregate result from $4K to $1K, since Joe's salary exceeds Sue's by $1K. Because the total salary inversion still meets the cutoff of $1K, the update is accepted. Next, we attempt to update DB version two. The next unassigned random value in the stream associated with seed1 is the fourth one, which transforms (25K, Sue, 26K, Joe) into (25K, Sue, 21K, Joe). This changes the final aggregate value to $0, and so this proposed update is rejected. Next we try the fifth random value in the stream. This gives us (25K, Sue, 27K, Joe), which increases the final aggregate value to $2K, and so it is accepted. We are now done with seed1, and so we reinsert the one Gibbs tuple associated with this seed back into the priority queue, after first updating its sort key to its next largest seed handle, (seed2); see Fig. 3(c). As shown in Fig. 3(d), the two Gibbs tuples associated with (seed2) are then removed from the priority queue. The assignments for seed2's stream are updated in much the same way as for seed1; note, however, that seed2 influences two tuples in each DB version. Once the update of seed2 is complete and both of the associated Gibbs tuples have been reinserted back into the priority queue, we are left with Fig. 3(e). Note that the keys for these reinserted Gibbs tuples are now (seed4) and infinity. The value infinity is used because the random values associated with seed1 and seed2 have each been updated—there is no reason to consider this Gibbs tuple again during this particular perturbation, so it is forced to the tail of the queue. As illustrated in Fig. 3(f), we now move to (seed3). The two DB instances are updated with respect to seed3; because seed3 controls Ann's salary and Ann has no impact on the query result, both updates are trivially accepted. The process continues with (seed4), which again has no effect on the final query result, at which point both DB versions are fully perturbed (Fig. 4(a)). Since all of the TS-seed handles have been processed, it is now time to clone the elite DB versions and overwrite the non-elite versions. This process is illustrated by the transition from Fig. 4(a) to Fig. 4(b). In a single read/write pass, the column in each TS-seed that records the assignment for DB version two is simply copied to the column for version one. Now the first iteration of GibbsLooper has completed, and iteration two can begin.

[Figure 3: The perturbation process for the "salary inversion" query]
[Figure 4: Perturbation example, continued]

B. WHEN WILL MCDB-R WORK BEST?
Algorithm 1—and hence Algorithm 3—will work best when, for each most-likely extreme database D, the query result Q(D) is relatively insensitive to changes in the value of a single data value x. This scenario fails, e.g., for a SUM query in which data values X_1, ..., X_r are i.i.d. according to a "subexponential" distribution such as lognormal or Pareto. Under such a heavy-tailed distribution, P(X_1 + ··· + X_r > c) ≈ rP(X_1 > c) for large c; i.e., with high probability, the query result for an extreme database D—viewed as a vector y of values—will lie in the upper tail because some individual y_i is extremely large. After generating a new candidate value u for y_i in Algorithm 2, the value of Q(u ⊕_i y) will likely drop sharply, and many candidates will be required prior to acceptance in Line 4. Our approach is thus unlikely to be universally applicable. That said, it is still useful for a wide class of risk-estimation problems, as evidenced by the extensive applied literature that pertains to extreme-value theory for light-tailed distributions and by the prevalence of aggregates, such as SUM and AVG, that have the insensitivity property under a light-tailed regime.
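The small experiment below, written for this discussion rather than taken from the paper, illustrates the sensitivity issue empirically: it measures how many proposals the rejection step typically needs per coordinate when perturbing an instance that lies in the upper 1% tail of a SUM, for a light-tailed (normal) versus a heavy-tailed (Pareto) value distribution. Exact numbers vary by run, but the heavy-tailed case generally needs far more proposals.

```python
# Illustrative experiment (assumptions: r i.i.d. values, SUM query, 1% tail):
# average proposals per coordinate update under the rejection step of Algorithm 2.
import random

random.seed(3)
r, n_inst, p, cap = 20, 20000, 0.01, 50000

def avg_proposals(sampler):
    data = [[sampler() for _ in range(r)] for _ in range(n_inst)]
    sums = sorted(map(sum, data), reverse=True)
    cutoff = sums[int(p * n_inst)]                 # estimated (1-p)-quantile of the SUM
    tail = [x for x in data if sum(x) >= cutoff][:50]
    total = count = 0
    for x in tail:
        for i in range(r):
            need = cutoff - (sum(x) - x[i])        # value coordinate i must reach
            proposals = 1
            while sampler() < need and proposals < cap:
                proposals += 1
            total, count = total + proposals, count + 1
    return total / count

print("normal     :", round(avg_proposals(lambda: random.gauss(10.0, 1.0)), 1))
print("Pareto(1.5):", round(avg_proposals(lambda: random.paretovariate(1.5)), 1))
```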
[Figure 3: The perturbation process for the "salary inversion" query. Panels (a)–(f) show, at each step, the priority queue of Gibbs tuples (head to tail), the TS-seed table (PRNG seed and low/high/max-used/V1/V2 iteration numbers), and the two evolving DB versions with their SUM values.]

C. SETTING PARAMETERS

In this section, we address the key question of how to choose the parameters in Algorithm 3. Because the cost of a Gibbs-updating iteration is proportional to the data size and hence extremely expensive, we choose the number of iterations k to be as small as possible, subject to the requirement that, after the Gibbs-update step in line 23, the n_{i+1} databases in S are approximately statistically independent. The MCMC literature provides a variety of procedures for determining k. Recall, however, that a sequence of Gibbs samples typically "forgets" its initial state exponentially fast, so that very small values of k suffice in practice; as mentioned previously, taking k = 1 worked well in experiments.

[Figure 4: Perturbation example, continued. Panels (a) and (b) show the fully perturbed DB versions and the subsequent overwriting of the non-elite version with a clone of the elite version.]

Now suppose that k has been fixed and is sufficiently large so that Gibbs updating yields samples that can be viewed as independent. Also suppose that the total number N of samples over all bootstrapping steps has been fixed. We consider the problem of selecting m, n1, . . . , nm and p1, . . . , pm to minimize statistical error. One approach is to minimize the relative difference between the desired upper-tail probability p and the actual probability of the upper tail defined by the quantile estimator γ̂m. That is, we solve the following optimization problem:

    minimize over m, n1, . . . , nm, p1, . . . , pm:    E[ ((F̄0(γ̂m) − p)/p)^2 ]                (1)

    such that    Σ_{i=1}^m ni = N    and    Π_{i=1}^m pi = p                                   (2)

                 pi ∈ [0, 1]  and  ni ∈ {0, 1, . . . , N},  1 ≤ i ≤ m                          (3)

Here F0(x) = P(Q(D) < x) and F̄0 = 1 − F0. Given a solution to the minimization problem (1)–(3), the overall parameter-selection task will reduce to selection of N. We refer to the minimand in (1) as the mean-squared relative error (MSRE).
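As a quick worked illustration of constraint (2) (the particular values of p, m, and N are our own choices for the example, though they echo the tail probability used in the benchmark of Appendix D), choosing equal per-stage probabilities gives:

    \[
      p = 0.001,\qquad m = 5,\qquad
      p_i = p^{1/m} = (0.001)^{1/5} \approx 0.251,\qquad
      \prod_{i=1}^{5} p_i = 0.001 = p,
    \]
    \[
      n_i = N/m = 200 \quad\text{for } N = 1000 .
    \]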
Note that an alternative goal might be to control the relative error of γ̂m directly. For light-tailed query-result distributions, which are the class of primary interest in this paper, minimizing MSRE rather than minimizing the relative error of γ̂m typically leads to more stringent error control. For example, suppose that, unknown to us, F0 is the standard normal distribution, and that our desired tail probability is p = 0.001, so that γ ≈ 3.090. If we allow the actual tail probability to deviate from p by no more than ±1%, then the corresponding estimate of γ will lie approximately in the range (3.087, 3.093), which represents a maximum error of about ±0.1%.

In the following, we fix k and N as described above, and show how to select values of m, n1, . . . , nm, and p1, . . . , pm that approximately solve the minimization problem (1)–(3). We assume that F0 is continuous and strictly increasing.

We first abstract the process by which Algorithm 3 computes γ̂m. For real numbers c1, c2, . . . , cm, set

    F1(x; c1) = (F0(x) − F0(c1)) / (1 − F0(c1)) = P(Q(D) < x | Q(D) ≥ c1),

and for 2 ≤ i ≤ m recursively define

    Fi(x; c1, . . . , ci) = (Fi−1(x; c1, . . . , ci−1) − Fi−1(ci; c1, . . . , ci−1)) / (1 − Fi−1(ci; c1, . . . , ci−1))

for any sequence c1, c2, . . . , cm. An inductive argument establishes the identity Fi(x; c1, . . . , ci) = F1(x; ci) for 2 ≤ i ≤ m. For a set X1, X2, . . . , Xn of i.i.d. samples from a strictly increasing continuous distribution function F, the order statistics are denoted as X(1) < X(2) < · · · < X(n). Setting ri = ni(1 − pi) (rounded to the nearest integer), and assuming that k is sufficiently large as discussed previously, we can view Algorithm 3 as follows. The algorithm proceeds by generating n1 i.i.d. samples X0,1, X0,2, . . . , X0,n1 from F0. Then, for i = 1, 2, . . . , m, the algorithm (1) sets γ̂i = X_{i−1,(ri)} and (2) generates n_{i+1} samples X_{i,1}, . . . , X_{i,n_{i+1}} from Fi( · ; γ̂1, . . . , γ̂i) = F1( · ; γ̂i).

To analyze the behavior of F̄0(γ̂m), and hence the behavior of the MSRE, we employ the common tactic of reducing the analysis to the setting of uniform random variables. Specifically, let { U_{i,j} : 0 ≤ i ≤ m, 1 ≤ j ≤ n_{i+1} } be a collection of i.i.d. random variables uniformly distributed on [0, 1]. Consider a procedure that sets V_{0,j} = U_{0,j} for 1 ≤ j ≤ n1, and then, for i = 1, 2, . . . , m, sets α̂i = V_{i−1,(ri)} and V_{i,j} = (1 − α̂i)U_{i,j} + α̂i for 1 ≤ j ≤ n_{i+1}. Using a standard result [1, p. 37] on uniform random variables together with an inductive argument, it can be shown that, for 1 ≤ i ≤ m, (1) F0^{−1}(V_{i−1,j}) has the same distribution as X_{i−1,j} for 1 ≤ j ≤ ni and (2) F0^{−1}(α̂i) has the same distribution as γ̂i. In particular, F̄0(γ̂m) is distributed as 1 − α̂m. An easy argument shows that 1 − α̂i = Zi(1 − α̂_{i−1}), where Zi = 1 − U_{i−1,(ri)}, which implies that

    1 − α̂m = Π_{i=1}^m Zi.

By construction, the random variables {U_{i−1,(ri)}}, and hence the random variables {Zi}, are mutually independent. Moreover, each U_{i−1,(ri)} has a Beta(ri, ni − ri + 1) distribution [9]. Fixing m ∈ {1, 2, . . . , N}, η = (n1, . . . , nm), and π = (p1, . . . , pm) satisfying (2)–(3), a calculation using both the independence of the Zi's and standard properties of the Beta distribution shows that MSRE = u(η, π, m), where

    u(η, π, m) = h1(η, π, m) [ h2(η, π, m) p^{−2} − 2p^{−1} ] + 1.

Here hc(η, π, m) = Π_{i=1}^m (ni pi + c)/(ni + c) for c = 1, 2.
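The following small numerical sketch (an illustration of the formula above, not MCDB-R code) evaluates the MSRE for the equal-allocation choice ni = N/m and pi = p^{1/m}; the particular values p = 0.001 and N = 1000 are assumptions made for the example.

    // Evaluate MSRE = h1*(h2/p^2 - 2/p) + 1 with h_c = prod_i (n_i*p_i + c)/(n_i + c),
    // using the (illustrative) equal allocation n_i = N/m, p_i = p^(1/m).
    #include <cmath>
    #include <cstdio>
    #include <vector>

    double h(const std::vector<double>& n, const std::vector<double>& pi, double c) {
        double prod = 1.0;
        for (std::size_t i = 0; i < n.size(); ++i)
            prod *= (n[i] * pi[i] + c) / (n[i] + c);
        return prod;
    }

    double msre(const std::vector<double>& n, const std::vector<double>& pi, double p) {
        double h1 = h(n, pi, 1.0), h2 = h(n, pi, 2.0);
        return h1 * (h2 / (p * p) - 2.0 / p) + 1.0;
    }

    int main() {
        const double p = 0.001, N = 1000.0;   // illustrative choices
        for (int m = 1; m <= 10; ++m) {
            std::vector<double> n(m, N / m), pi(m, std::pow(p, 1.0 / m));
            std::printf("m = %2d  MSRE ~= %g\n", m, msre(n, pi, p));
        }
        return 0;
    }

For these illustrative values, scanning m in this way already hints at the result below: the MSRE drops sharply as m grows away from 1 and then flattens or worsens, so an intermediate number of bootstrapping stages is best.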
We now consider how to choose η, π, and m so as to approximately minimize u, subject to (2)–(3). It follows from the results given below that p ≤ hc(η, π, m) ≤ 1 for c = 1, 2 and feasible choices of the parameters. Moreover, h1(η, π, m) ≈ h2(η, π, m) for m ∈ [1, N] and for the range of p and N values encountered in practice. Thus, to a good approximation, the problem of minimizing u looks like the problem of minimizing v(x) = x((x/p^2) − (2/p)) such that x ∈ [p, 1]. It is easy to see that v is strictly increasing on [p, 1], and so we want to make x as small as possible. That is, we want to make h1 and h2 as small as possible. To this end, set

    gm(N, p, c) = [ ((N/m)p^{1/m} + c) / ((N/m) + c) ]^m.

We can establish the following result.

THEOREM 1. For c ∈ {1, 2}, hc(η, π, m) is minimized by setting m*_c = min{ m ≥ 1 : gm(N, p, c) < g_{m+1}(N, p, c) }, η*_c = (N/m*_c, . . . , N/m*_c), and π*_c = (p^{1/m*_c}, . . . , p^{1/m*_c}).

To summarize, we can approximately minimize the MSRE by (1) computing m*_1 and m*_2 as indicated in the theorem, (2) choosing θ = arg min_{c∈{1,2}} u(η*_c, π*_c, m*_c) and setting m* = m*_θ, and (3) setting ni = N/m* and pi = p^{1/m*} for 1 ≤ i ≤ m*. In practice, it is often the case that m*_1 = m*_2. Thus we need only choose N. For an MSRE target value ε, we can select N by numerically finding min{ N : w(N) ≤ ε }, where

    w(N) = g_{m*}(N, p, 1) [ g_{m*}(N, p, 2) p^{−2} − 2p^{−1} ] + 1.

Note that lim_{N→∞} w(N) = 0, which implies that our quantile estimator converges in mean square to the true value as the total number of DB versions increases.

D. BENCHMARKING

We have implemented a C++ prototype of MCDB-R with the MCDB code base as a starting point, including all of the features described in the paper. As with our MCDB implementation, MCDB-R does not yet have an optimizer or SQL compiler; instead, we use an MCDB-specific language to specify a query plan directly. We now describe a small benchmark of our prototype. Consider the following random version of the TPC-H orders table:

    CREATE TABLE random_ord (o_orderkey, o_yr, o_tot) AS
      FOR EACH o IN (SELECT * FROM orders)
      WITH VAL AS Normal (VALUES(o_mean, o_var))
      SELECT o.o_orderkey, year(o.o_orderdate), v.value
      FROM VAL v

This version of orders attaches a normally distributed random "loss" value to each tuple (we use a mean and variance of one). Now, consider the following query:

    SELECT SUM(o_tot) AS totalLoss
    FROM random_ord, lineitem
    WHERE o_orderkey = l_orderkey
      AND (o_yr = '1994' OR o_yr = '1995')

Because the sum of a set of normal variables is also normal, we know that the result of this query must itself have a normal distribution, with mean and variance computed as follows:

    SELECT SUM(grpsize * o_mean) AS mean,
           SUM(grpsize * grpsize * o_var) AS var
    FROM (SELECT o_mean, o_var, COUNT(*) AS grpsize
          FROM orders, lineitem
          WHERE year(o_orderdate) IN ('1994', '1995')
            AND o_orderkey = l_orderkey
          GROUP BY o_orderkey, o_mean, o_var)

(The join replicates each order's single random value grpsize times, so that order contributes grpsize times its value to the sum; hence the grpsize^2 factor in the variance.) Thus, if we run the former query after adding the following:

    WITH RESULTDISTRIBUTION MONTECARLO(100)
    DOMAIN totalLoss >= QUANTILE(0.999)

then we can use the result of the latter query to test the accuracy of the resulting tail sample. Using the TPC-H database (scale factor = 10) on an 8-core server, we ran the former query using GibbsLooper parameters m = 5, p^{1/m} = 0.25, N = 500, and l = 100. This should (if everything is correct) compute a set of 100 samples from the 1.0 − (0.25)^5 = 0.99902 ≈ 0.999 quantile of the query-result distribution.
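For reference, the analytic tail quantile against which the tail samples are checked can be obtained from the mean and variance returned by the latter query with a short computation such as the following sketch (an illustration only; the mean and var values below are placeholders rather than the actual TPC-H results, and the inverse normal CDF is computed by simple bisection).

    // Compute the analytic 0.99902-quantile of a normal query-result
    // distribution from a given mean and variance (placeholder values).
    #include <cmath>
    #include <cstdio>

    // Standard normal CDF via the complementary error function.
    double normCdf(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

    // Invert the CDF by bisection (adequate for a one-off check).
    double normQuantile(double q) {
        double lo = -10.0, hi = 10.0;
        for (int i = 0; i < 200; ++i) {
            double mid = 0.5 * (lo + hi);
            (normCdf(mid) < q ? lo : hi) = mid;
        }
        return 0.5 * (lo + hi);
    }

    int main() {
        double mean = 5.0e5, var = 1.0e4;    // placeholders for the SQL output
        double q = 1.0 - std::pow(0.25, 5);  // 0.99902..., matching p^(1/m) = 0.25, m = 5
        double gamma = mean + std::sqrt(var) * normQuantile(q);
        std::printf("analytic %.5f-quantile = %g\n", q, gamma);
        return 0;
    }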
Each TS-seed is used to produce 1000 random values initially. The first three GibbsLooper iterations require 156, 124, and 134 seconds to complete. At that point, the first TS-seed runs out of random data, and the query plan is run again to fuel the last two iterations. These iterations take 122 seconds and 115 seconds to run, for a total time of around eleven minutes to complete the tail sampling process and obtain 100 database instances beyond the 0.999 quantile. By comparison, MCDB would require about 18 hours for this task.

Next we tested the accuracy of MCDB-R. We re-ran the above query, but changed how the underlying database is generated to make the example more interesting (and difficult). We modified random_ord so that it is generated by selecting 100,000 tuples from the orders parameter table. The mean (resp., variance) of each normal used to generate random_ord was itself generated by sampling from an inverse gamma distribution with shape 3 and scale 1 (resp., shape 3 and scale 0.5). The lineitem table was generated so that one million of the tuples join with some tuple from random_ord, and the rest find no mate. The probability that one of those tuples from lineitem will mate with the first of the 100,000 tuples in random_ord is 2 × 10^−5 − 10^−10. The probability that the tuple will mate with the ith tuple in random_ord is equal to the probability that it will mate with the (i − 1)th tuple, minus 2 × (10^−5 − 10^−10)/(10^5 − 1).

We ran MCDB-R 20 times on this query using GibbsLooper parameters m = 5, p^{1/m} = 0.25, N = 1000, and l = 100. Figure 5 shows the 20 empirical tail CDFs that we observed (each based on 100 samples).

[Figure 5: Observed empirical CDFs versus analytic CDF. The x-axis plots the query result (roughly 4.975E+05 to 5.075E+05) and the y-axis plots cumulative probability (0 to 1).]

The true CDF for the query-result distribution (calculated analytically as described above) appears as a thick black line on the left side of the plot. The true tail CDF corresponding to the 0.99902 quantile of the query-result distribution appears as a thick black line on the right. It is clear from the plot that the 20 empirical tail CDFs cluster closely around the true tail CDF. We also obtained 20 estimates of the 0.99902 quantile by recording the value of the minimum tail sample in each of the 20 runs. The mean value of these 20 estimates was 5.0728e5, whereas the true quantile value was 5.0738e5. The empirical standard error of the estimates was 265. To put this in perspective, the middle 99% of the query-result distribution has a width of approximately 2503; i.e., by generating only 1000 intermediate database instances (efficiently, using Gibbs tuples), MCDB-R is able to "walk" out to the upper 0.99902 quantile and incur a standard error that is only about 10% of this width.
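To make the skewed join distribution used in this experiment concrete, the following sketch (an illustration, not the actual data generator) draws a random_ord join partner for each joining lineitem tuple according to the linearly decreasing probabilities just described.

    // Illustrative generator for the linearly decreasing join probabilities:
    // order j (0-based) is chosen with weight a - j*d, where a = 2e-5 - 1e-10
    // and d = 2*(1e-5 - 1e-10)/(1e5 - 1), so the 100,000 weights sum to 1.
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const int numOrders = 100000;
        const double a = 2e-5 - 1e-10;
        const double d = 2.0 * (1e-5 - 1e-10) / (numOrders - 1);
        std::vector<double> w(numOrders);
        double sum = 0.0;
        for (int j = 0; j < numOrders; ++j) { w[j] = a - j * d; sum += w[j]; }
        std::printf("weights sum to %.12f (should be 1)\n", sum);

        std::mt19937_64 gen(7);
        std::discrete_distribution<int> pickOrder(w.begin(), w.end());
        // Assign a random_ord join partner to each joining lineitem tuple
        // (only a handful are printed here).
        for (int i = 0; i < 5; ++i)
            std::printf("lineitem %d joins order index %d\n", i, pickOrder(gen));
        return 0;
    }

The weights form an arithmetic sequence that starts at 2 × 10^−5 − 10^−10, ends at 10^−10, and sums to one, matching the description above.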