Stochastic Analytics in the Database Peter J. Haas Kevin Beyer Vuk Ercegovac Bo Shekita Chris Jermaine* Ravi Jampani Luis Perez Mingxi Wu Fei Xu IBM Almaden Research Ctr. U. Florida Gainesville *Rice University 1 August, 2010 Outline • • • • Motivation via examples MCDB: Monte Carlo Database System MC3: MCDB + map-reduce Future directions 2 August, 2010 Problem Setting ? ? • Large data sets ? • Missing or uncertain data • Stochastic models used to “guess” values – Model gives probability distribution on data values • Want to run BI queries over guessed values • Want to assess uncertainty in query answers – Risk assessment – Decisionmaking 3 August, 2010 Ex. 1: Portfolio Values Customer EuroCallOptions CustID OptionID NumShares … OptionID InitVal … StrikeP OVal John Smith 23 50 … 23 $2.35 … $4.00 ? … … … … … … … SELECT SUM (c.NumShares * o.Val) FROM Customer c, EuroCallOptions o WHERE c.OptionID = o.OptionID AND c.CustType = ‘Institutional’ … Option value one month from now (exercise date) Modified Black-Scholes model for European call option: ( ) OVal = max (V (t final ) − S,0 ) dV = rV dt + a V V dW Simulation approximation (Euler formula): ( ) V (t + Δt ) = V (t ) + rV (t )Δt + a V (t ) V (t ) ΔtZ j Sample from Normal dist’n 4 August, 2010 Ex. 2: Pricing Decisions CustID Unit Price Order Amount J. Smith $10.20 500 … … … • • • Data for one customer price Data for all customers price Bayes Theorem demand demand Global demand distribution (prior) Individual demand distribution (posterior) Can analyze arbitrary dynamic customer segments when determining effect of price increase Similar approach for web-click behavior (EBay, Websphere portal) Issues: Complex model, huge number of dynamic parameters 5 August, 2010 Ex. 3: Data-Warehouse Uncertainty Data Integration {John Smith, San Jose} {John Smith, San Jose} ETL {John Smith, Los Angeles} Name John Smith Name City John Smith (SJ, 0.66), (LA, 0.33) City LA Name Sales J. Smith $50K Information extraction A lovely thing to behold is Paris Hilton in the Springtime … 09/09/2007 Re: system crash -------------------------This morning, my ORACLE system on LINUX exploded in a spectacular fireball … Similarity Join Hotels System T Hotel Annotator Text Miner Sales LA $50K ? (0.92) (Michelakis et al., 2009) NY Marriott Paris Hilton City ? (0.20) Source Problem type Cust0385 (DBMS, 0.8), (OS, 0.2) 6 August, 2010 Risk Due to Data Uncertainty • Ex: Value of assets (for financial reporting, compliance, business-process monitoring) probability SELECT SUM (s.amount) FROM SALES s, CUST c WHERE s.ID = c.ID AND c.city = ‘Los Angeles’ Query-result distribution 5% VaR expected answer Total LA sales • Ex: ERP – # OS experts needed for help desk – Based on (uncertain) extracted text data from last year – Provide principled safety factor 7 August, 2010 Traditional Workflow Arena, R, Matlab,… Arena, R, Matlab,… Model Model Data reduction Analyst (PhD) Develops model Model fitting • • • Data extraction slow and bug-prone Only coarse-grained modeling No encapsulation for user Model application & querying • • • Hard to re-link model results to DB Hard to deal with data updates Sensitivity, what-if analysis are hard Goal: Integrate model with Database August, 2010 Model 8 Prior Work • MauveDB (Deshpande & Madden 06) – Continuous deterministic models only • Probabilistic databases (Trio, MayBMS, ORION, MystiQ, K-relations, et al.) – – – – – – For data-warehouse uncertainty Hard-wired, limited uncertainty model (deterministic skeleton) Limited queries (top-k) Complexity issues Independence assumptions What-if analysis is hard 9 August, 2010 Outline • • • • Motivation MCDB: Monte Carlo Database System MC3 Future directions 10 August, 2010 The MCDB System i.i.d. samples from possible-worlds dist’n Random DB = D Schema VG Functions Parameter Tables d1 Monte Carlo Generator Q d2 ... Q(D) = Select SUM(sales) AS t_sales Q Q(d1) Q(d2) : Q(dn) Q i.i.d. samples from query-result dist’n dn ˆ [ t_sales ] E ˆ [ t_sales ] Var qˆ .01 [ t_sales ] Histogram Error bounds Inference Monte Carlo Estimator 11 August, 2010 MCDB Example CID Region CID Shape Region Scale 102 NewEngland 102 1.2 NewEngland 7.0 226 Midwest 226 0.7 Midwest 2.1 CUST_ATTR Q: SELECT SUM(Amount) FROM SALES AS t_sales AMT_SHAPE CID SALES AMT_SCALE CID Shape Scale 102 1.2 7.0 226 0.7 2.1 Amount VG function Gamma(shape, scale) CID d1 Amount CID 102 $120.00 226 $60.00 Q(d1) = $180 d2 d3 Amount CID 102 $80.00 102 $80.00 226 $90.00 226 $130.00 Q(d2) = $170 Amount Q(d3) = $210 ˆ ˆ E[t_sales] = $186.67 STD[t_sales] = $20.82 August, 2010 12 Advantages of MCDB • Flexible and extensible uncertainty model – Can capture extended relational models (Trio, MayBMS, Mystiq,…) – Can capture huge range of stochastic models • Can bring complex stochastic models to data (no extraction needed) • Encapsulates complexity – Once expert has written VG function, naïve user can run queries • Can handle arbitrary SQL queries • What-if analysis, sensitivity analysis, data updates are easy 13 August, 2010 VG Functions Value Weight San Jose 0.66 San Francisco 0.34 Value DiscreteChoice() parameter table San Jose output table (instance) Pseudorandom # seed • Used to generate instances of values in random tables – Parameter tables are standard relational tables (can index, etc.) – Library of standard functions: DiscreteChoice, Normal, Poisson, … – Can define custom functions (similar to UDFs) 14 August, 2010 Pseudorandom Number Generators (PRNG) • Needed by VG function – E.g., to generate “random” sales values • Produces a deterministic sequence of seeds – Appears random – Cycles around • .. Typical PRNG recurrence: – Si+1 = M * Si mod m – Seed S = vector of k unsigned integers – M is a matrix • • Sn-1S0 S1 S 2 PRNG Cycle of Seeds Transform seeds to desired random samples Cycle usually “split” into disjoint segments – Skip factor • Keeping only initial seed, S0, is sufficient to regenerate sequence 15 August, 2010 . VG Functions and Correlation ID1 ID2 Cov 1 1 1.23 1 2 0.17 V1 V2 2 2 2.45 1.21 2.13 MDNormal() Correlated columns or ID Mean 1 3.68 ID Val 2 4.75 1 1.21 2 2.13 Correlated rows Pseudorandom # seed 16 August, 2010 Schema Syntax: Example CREATE TABLE RAND_CUST (CID, GENDER, MONEY, LIVES_IN) AS FOR EACH d in CUST WITH MONEY AS Gamma( (SELECT n.SHAPE FROM MONEY_SHAPE n WHERE n.CID = d.CID), (SELECT sc.SCALE FROM MONEY_SCALE sc WHERE sc.REGION = d.REGION), (SELECT SHIFT FROM MONEY_SHIFT) ) WITH LIVES_IN AS DiscreteChoice ( (SELECT c.NAME, c.PROB FROM CITIES c WHERE c.REGION = d.REGION) ) SELECT d.CID, d.GENDER, m.VALUE, l.VALUE FROM MONEY m, LIVES_IN l 17 August, 2010 Query Processing • Naïve approach – Repeatedly instantiate DB and run query – Has horrible performance • MCDB approach – Execute query plan once – Process tuple bundles instead of tuples • Represents tuple in all simulated possible worlds (MC reps) • Permits a variety of performance optimizations 18 August, 2010 Tuple Bundles (4 MC Repetitions) (Jane, Smith, 20) (Jane, Smith, 21) -(Jane, Smith, 21) Tuple bundle isPresent (Jane, Smith, (20,21,x,21), (T,T,F,T), Seed) (Jane, Smith, (T,T,F,T ), Seed) Representation Compressed representation Basic ideas: (a) Keep bundles in compressed form whenever possible (b) Apply selections early to compressed bundles (c) Amortize I/O, network costs, etc. over multiple reps 19 August, 2010 Operations on Tuple Bundles • Seed: (Jane, Smith, --, --) u (Jane, Smith, --, --, Seed) • Split: (Jane, Smith, (20,21,20,21), (T,T,T,T), Seed) u (Jane, Smith, 20, (T,F,T,F), Seed), (Jane, Smith, 21, (F,T,F,T), Seed) • Inference: (Jane, Smith, (20,21,20,21), (T,T,T,T), Seed) u (Jane, Smith, 20, 0.5), (Jane, Smith, 21, 0.5) August, 2010 Also: Aggregate 20 Standard Operations • Select (FNAME = ‘Jane’ AND AGE = 20) (Jane, Smith, (20,21,20,21), (F,T,T,T), Seed) (John, Jones, (32,31,20,30), (T,T,F,T), Seed) (Jane, Jones, (21,23,22,22), (T,T,T,T), Seed) u (Jane, Smith, (20,21,20,21), (F,F,T,F), Seed) • Join (equijoin on Department #) Uses SPLIT + sort-merge (Smith, (D1,D2,D2,D1), (F,T,T,T), Seed1) (Jones, (D1,D2,D2,D2), (T,T,F,T), Seed2) u (Smith, D2, Jones, D2, (F,T,F,F), Seed1, Seed2) 21 August, 2010 Estimation and Inference Distinct tuple values MCDB inference operator WITH Stats(Mu, Var) AS ( SELECT SUM(Val1*Frac), SUM(Val*Val1*Frac) - SUM(Val1*Frac)*SUM(Val1*Frac) FROM OutputTable) SELECT Mu AS Mean, SQRT(Var) AS Stdev, 1.96*SQRT(Var)/SQRT(1000) AS CIHW FROM Stats SQL queries TotSales Frac 20K 0.324 … … Frac. replications where value appears (vs bit vector) OutputTable WITH CumDistFn(TotSales, Cum, PrevCum) AS ( SELECT TotSales, SUM(Frac) OVER (ORDER BY TotSales ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), SUM(Frac) OVER (ORDER BY TotSales ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) FROM OutputTable) SELECT Val FROM CumDistFn WHERE Cum >= 0.5 AND PrevCum < 0.5 August, 2010 22 Experimental Queries • Q1: Next year’s revenue gain from Japanese products – Assuming current trends hold – Each order duplicated Poisson # of times – Poisson mean = (this year)/(last year) for customer • Q2: Order Delays – From placement to delivery – Fitted Gamma distribution for each delay type (for each part) • Q3: What if we had used cheapest supplier? – TPC-H only has current prices – Prior prices generated by backwards random walk with drift • Q4: Change in profits with 5% price increase – Bayesian model of customer demand – Based on all customers orders at current price 23 August, 2010 Results 1 (1000 Reps*) Q1 Q2 Long tail in Delivery times 60 Frequency Frequency 80 40 20 0 8.2 8.25 8.3 8.35 8.4 8.45 Revenue change Q3 40 20 0 200 8.5 250 300 350 400 450 Days until completion Q4 9 x 10 80 Frequency 40 Frequency 60 30 20 10 60 40 20 0 1.3375 1.338 1.3385 1.339 1.3395 1.34 1.3405 1.341 0 −8.842 −8.84 −8.838−8.836−8.834−8.832 −8.83 −8.828 Total supplier cost Additional profits 10 x 10 *Q3 histogram based on 350 reps 10 x 10 24 August, 2010 Results 2: Execution Times (Min) Query 1 rep 10 reps 100 reps 1000 reps Q1 25 25 25 28 Q2 36 35 36 36 Q3 37 42 87 222* Q4 42 45 60 214 vs 25000, 36000 *Based on 350 reps • Much faster than naïve method in all cases 25 August, 2010 Outline • • • • Motivation MCDB MC3: MCDB + map-reduce Future directions 26 August, 2010 Motivation • Exploit massive parallelism of MCDB computations – Extend domain of applicability • Faster path to market? – Forward-looking architecture • Handle semi-structured, nested data – E.g., web-click example: Petabytes of log file data • Local expertise/interest in map-reduce – Learning experience for interesting analytical problem – MCDB computations often CPU-intensive – Ease of prototyping 27 August, 2010 Technical Issues • • • • How to represent bundles? How to specify map-reduce jobs? How to parallelize? How to seed tuple bundles? 28 August, 2010 A Cluster-Computing Infrastructure www.jaql.org //code.google.com/p/jaql High-level query language for semi-structured JSON data Jaql Map-Reduce Parallel batch processing Hadoop HDFS Distributed File System Initial prototype built in a few weeks 29 August, 2010 Map-Reduce Overview Ex: parallel word counting (NULL, “This is a line of text”) Partitioned Input File: Partitioned Output File: [(“This”,1),…,(“text”,1)] [(K, V)] (“This”, [1,1,…,1]) (K, V) M1 • Programmer focus: – Map: (K,V) → [(Km,Vm)] – Reduce: (Km, [Vm]) → [(Kr,Vr)] • System provides: – Parallelism – Sorting – Synchronization – Fault tolerance – Resource allocation On commodity hardware [Vr] [(Km, Vm)] [(Kr, Vr)] (Km, [Vm]) M2 R1 M3 R2 M4 [(“This”,528),(“is”,2000),…] August, 2010 30 MCDB Example CID Region Region Scale CID Shape 102 NewEngland NewEngland 7.0 102 1.2 226 Midwest Midwest 2.1 226 0.7 CUST_ATTR Q: SELECT SUM(Amount) FROM SALES AS t_sales AMT_SCALE CID SALES AMT_SHAPE CID Shape Scale 102 1.2 7.0 226 0.7 2.1 Amount VG function Gamma(shape, scale) CID d1 Amount CID 102 $120.00 226 $60.00 Q(d1) = $180 d2 d3 Amount CID 102 $80.00 102 $80.00 226 $90.00 226 $130.00 Q(d2) = $170 Amount Q(d3) = $210 ˆ ˆ E[t_sales] = $186.67 STD[t_sales] = $20.82 August, 2010 31 JSON and MC3 [{cid: 102, region: NewEngland}, …] Join + Project [{cid: 102, shape: 1.2, scale: 7.0}, …] Seed [{cid: 102, shape: 1.2, scale: 7.0, seed: 306576301}, …] Instantiate [{cid: 102, shape: 1.2, scale: 7.0, amount: { seed: 306576301, samples: [$120.30, $65.00, … ] }, isPresent: [T, T, … ] }, …] August, 2010 32 JAQL and MC3: Example 1 $cust = READ(hdfs(‘cust_attr’)); $shape = READ(hdfs(‘amt_shape’)); $scale = READ(hdfs(‘amt_scale’)); 2 JOIN $shape, $cust, $scale WHERE $shape.cid == $cust.cid AND $cust.region == $scale.region INTO {$shape, $scale} //Seed 3 → TRANSFORM { $.*, seed: GetSeed() } //Instantiate: generate array of 1000 samples 4 → TRANSFORM GenAmounts($.seed, $.shape, $.scale, 1000) // Sum all sales tuple bundles 6 → GROUP INTO ArraySum($) // Compute the distribution 7 → TRANSFORM Distribution($) 8 → WRITE(hdfs(‘result’)); 33 August, 2010 Example of a Query Plan 1. 2. 3. Final ArraySum Distribution Write ‘result’ Job 3 Reduce Map 1. 2. 3. GetSeed GenAmounts Partial ArraySum Join (region) Job 2 Read ‘AMT_SCALE’ Read Join (cid) Read ‘CUST_ATTR’ Read ‘AMT_SHAPE’ Job 1 34 August, 2010 Parallelism Schemes • Inter-tuple parallelism Tuple 1: (r1,…,r1000) … – Partition tuple bundles among nodes Tuple 2: (r1,…,r1000) – Natural fit with Map-Reduce – Good when many bundles or cheap VG functions • Intra-tuple parallelism Tuple 1: (r1,…,r500) – Split up tuple bundles • Break Monte Carlo replications into chunks – Apply inter-tuple parallelism methods to chunks – Good when few tuples with Tuple 2: (r1,…,r500) Tuple 2: (r501,…,r1000) … • Expensive VG functions and/or • Many MC replications Tuple 1: (r501,…,r1000) 35 August, 2010 Distributed Seeding Tuple 1 Tuple 2 Tuple n … • Must avoid overlapping seed sequences • Maximize parallelization (tuples on different processors) • Minimize seed size stored in each tuple 36 August, 2010 Skip-Ahead Method • • • Well512a generator: period = 2512 Assume inter-tuple parallelism (for simplicity) Assume that we know (or have good upper bound for) – # of bundles seeded per node (= b) – # of seeds per VG function call (= c) – # MC reps (= n) Instantiation Tuple j at node i: Make m=b×i+j skips of length c × n to get to starting point Seeding Tuple j at node i: {cid: 102, shape: 1.2, scale: 7.0} {cid: 102, shape: 1.2, scale: 7.0, seed: [i, j] } Actually, only O(log m) skips needed: pre-compute Skip factors 37 August, 2010 Scale-up Results: Inter-Tuple Parallelism • Implemented two nontrivial queries from MCDB paper Running Time (s) – Jaql: Map-Reduce plan = original MCDB plan – Good scalability with inter-tuple parallelism 1850 1600 Q4 Q1 1350 1100 850 600 0 1 2 3 4 5 6 7 Number of Servers 8 9 10 38 August, 2010 Speed-up Results: Intra-Tuple Parallelism • Implemented two call-option queries (Euro and Asian) – Euro option: expensive VG function, good speed-up – Asian option: cheap VG function, speed-up curve flattens • Sequential merging of chunks starts to dominate – Moral: choose appropriate parallelization scheme 80 72 64 Ideal Speedup European Option Asian Option Speedup 56 48 40 32 24 16 8 4 0 0 4 8 16 24 32 40 48 Number of Cores August, 2010 56 64 72 80 39 Outline • • • • Motivation MCDB MC3 Future directions 40 August, 2010 An End-to-End ERP Scenario ProbIE Automobile problem reports (text) My S-Class slipped out of gear … Requirements for mechanics and parts (safety margin) My S-Class slipped out of gear … Tire Problem (0.2) Transmission problem (0.9) Probabilistic BI querying SELECT COUNT(REPORTS) WHERE P_TYPE = ‘transmission’ 41 August, 2010 Future Directions • Tail Sampling – Extreme-quantile (VAR) estimation and more – “Gibbs cloner” approach • Performance – Query optimization • E.g., push down inference & instantiation, choose parallelization scheme • Improve JAQL rewriter (MC3 aware)? • Re-use of partial results, multi-query optimizations? – Sequential and/or adaptive simulation? (MC3) – Combine with exact methods? Sampling? – Indexing, etc. • Functionality – User-defined precision – Semi- and unstructured data – Robust, full-featured re-implementation (underway) • Possible Applications – Automotive ERP – Health records 42 August, 2010 Related Projects • RAQA: Resolution-aware query answering for Business Intelligence [Sismanis et al., ICDE09] – – – – – • State Strict range Status San Francisco CA [$30,$230] guaranteed San Jose CA [$70,$200] non-guaranteed Sum(Sales) group by City,State State Strict range Status CA [$230,$230] guaranteed Sum(Sales) group by State ProbIE: Probabilistic info extraction [Michelakis et al., SIGMOD09] – – – – • Uncertainty due to entity resolution OLAP querying (roll-up, drill-down) Bounds on query answers Implemented via SQL queries Conservative approach City For rule-based IE system (e.g., SystemT) Provides confidence #’s for base/derived annotations Based on “rule history”, lower-level results MaxEnt-based learning approach probIE Learning phase Other Monte Carlo/Statistics analysis in Hadoop (XAP) – – – Operational risk calculations RICARDO: synthesis of R and Hadoop [Das et al., SIGMOD10] Recommender systems Statistical model Annotator rules Labeled training data Rule features Text Annotator Annotation + Rule history Annotation probability Deployment phase 43 August, 2010 Further Details: • • • • MCDB: SIGMOD 2008 MC3: SIGMOD 2009 ProbIE: SIGMOD 2009 MCDB-R: VLDB 2010 www.almaden.ibm.com/cs/people/peterh [email protected] Thank You! 44 August, 2010 Backup Slides 45 August, 2010 Clinic-Capacity Risk Medical data for all customers Stochastic dosage model Pharmacy data for all customers Cox hazard-rate disease model Clinic-resource demand model CustID Time period Resource needed Jane Smith June-Sept ? … … 46 August, 2010 Individual Click Behavior (EBay) Click data for all EBay customers p1 x13 p3 x34 x14 p4 x24 x32 p2 Data for one customer Global Markov model distribution (Dirichelet prior) • p1 y13 p3 y34 y14 p4 y24 y32 p2 Individual Markov model distribution (posterior) Can analyze arbitrary dynamic customer segments when determining effect of changing EBay pages 47 August, 2010 Logistics Under Uncertainty • • Retailer: ship from warehouses to outlets today or tomorrow? Deterministic tables In_Stock Shipment • Current_Price ITEM_ID QUANTITY ITEM_ID QUANTITY ITEM_ID Price curtains 50 curtains 20 curtains $120 … … … … … … Random tables Sales_W_Ship • Sales_WO_Ship CUST_ID ITEM_ID QUANTITY CUST_ID ITEM_ID QUANTITY Smith curtains ? Smith curtains ? … … … … Queries: SELECT SUM (c.price * s.quantity) FROM SALES_W_SHIP s, CUR_PRICE c WHERE c.ITEM_ID = s.ITEM_ID • SELECT SUM (c.price * s.quantity) FROM SALES_WO_SHIP s, CUR_PRICE c WHERE c.ITEM_ID = s.ITEM_ID Issues: – – Complicated statistical models for purchase quantity (how to integrate in DB?) Uncertainty (random tables) depend dynamically on huge number of parameters August, 2010 48 Data Uncertainty - Continued Anonymization {JohnSmith, age 42} Privacy Filter Name Age John Smith Between 40 and 50 Measurement Uncertainty Sensor Sensor_ID Temp (F) S23 78.32 f(t) 78.32 t f(t) System Monitor Event Time Buffer overflow 10/17/2007:18:20:02 t 49 August, 2010 VG Function Implementation • C++ class with four public methods – Initialize: set up data structures, seed RNG – TakeParams: read in “parameter vector” – OutputVals: return random value(s) for possible world • Return NULL when done If newRep: newRep = false uniform = myRanDGen() probSum = i = 0 while (uniform >= probSum) i++ probSum += L[i].wt / totWeight return L[i].val Else newRep = true return NULL OutputVals method For DiscreteChoice() – Finalize: clean up 50 August, 2010 Schema Syntax: Example 1 • Goal: generate random customer table – MONEY, LIVES_IN are uncertain attributes – MONEY has Gamma dist’n • shift, shape, scale parameters – Use DiscreteChoice for LIVES_IN value – Customers are mutually independent, given region • Parameter table schemas – CUST (CID, GENDER, REGION) – CITIES (NAME, REGION, PROB) • Probabilities sum to 1 in each region – MONEY_SHIFT (SHIFT) Normalized – MONEY_SCALE (REGION, SCALE) storage – MONEY_SHAPE (CID, SHAPE) 1 row, 1 column 51 August, 2010 Schema Syntax: Example 2 • Suppose MONEY and LIVES_IN are correlated CREATE TABLE RAND_CUST (CID, GENDER, MONEY, LIVES_IN) AS FOR EACH d in CUST WITH MLI AS MyJointDistribution(…) SELECT d.CID, d.GENDER, MLI.V1, MLI.V2 FROM MLI MLI has 1 row, 2 columns 52 August, 2010 Schema Syntax: Example 3 • Correlated sensors – Sensors in same “sensor group” are correlated (multivariate normal) • Parameter table schemas – S_PARAMS (ID, LAT, LONG, GID) – MEANS (ID, MEAN) – COVARS (ID1, ID2, COV) CREATE TABLE SENSORS (ID, LAT, LONG, TEMP) AS FOR EACH g in (SELECT DISTINCT GID FROM S_PARAMS) WITH TEMP AS MDNormal ( (SELECT m.ID, m.MEAN FROM MEANS m S_PARAMS ss WHERE m.ID = ss.ID AND ss.GID = g.GID), (SELECT c.ID1, c.ID2, c.COV FROM COVARS c, S_PARAMS ss WHERE c.ID1 = ss.ID AND ss.GID = g.GID) ) SELECT s.ID, s.LAT, s.LONG, t.VALUE FROM S_PARAMS s, TEMP t WHERE s.ID = t.ID August, 2010 53 Instantiate Operation output pipe M ergeseed π VG Atts∪ { seed} V G Function M ergeseed Sortseed π InAtts1 ∪ { seed} π InAtts ∪ { seed} π InAtts ∪ { seed} 2 3 π O utAtts∪ { seed} B3 B2 B1 pipe fork Q in,1 VG function args Q in,2 Q in,3 Q out “inner” inputpipes “outer” inputpipe For-each clause 54 August, 2010 Q4 Details • Effect on profits of 5% price increase – Want more accuracy than usual aggregated demand functions • E.g, exploit detailed point-of-sale data – For each part • • • • Fit “prior” demand-function distribution to all customers (MLE) Determine “posterior” distribution for each cust. (Bayes Thm) Generate random demand for each customer at new price Use rejection algorithm to sample from posterior P Gamma(a,b) { Q Gamma(c,d) August, 2010 55 Multi-PRNG Method • • • When # of seeds per VG function call is unknown When skip-ahead for huge PRNG is hard to implement Collisions possible, but probability < 10-17 Shared by All nodes s0 Instantiation of tuple j Seeding at node i bundle j 6 ints x [# bundles at nodes 0 to (i-1)] 16 G1 6 ints (small) int s G3 (medium) G2 (medium) 4 ts in G4 (huge) bundle j 56 August, 2010 Nested-Data Experiments • TPC-H schema is used • Two different ways to nest data – Nest lineitem table under orders table – Nest lineitem table under partsupp table • Modified version of Q4 from MCDB paper – Compare MC3 execution time to flat scheme – First nesting scheme: running time is slower – Second nesting scheme: running time is faster • Only uncertain “leaf attributes” are supported 57 August, 2010 Probabilistic Information Extraction in a Rule-Based System Motivation: System T Hand-crafted rules for specific domain: Annotator Person Base annotator Candidate-Generation Rules Rule Precision P1: <Salutation><CapitalizedWord><CapitalizedWord> P2: <First Name Dictionary><Last Name Dictionary> P3: <CapitalizedWord><CapitalizedWord> High High Low PhoneNumber Base annotator Ph1: <PhoneClue><\d{3}-\d{3}-\d{4}> Ph2: <\d{3}-\d{3}-\d{4}> Ph3: <\d{5} High Medium Low PersonPhone PP1: <Person><“can be reached at”><PhoneNumber> PP2: <“call”><Person><0-2 tokens><PhoneNumber> PP3: [<Person><PhoneNumber>]sentence High High Medium Derived annotator + Consolidation rule Consolidate(“Joe Smith”, “Mr. Joe Smith”) = “Mr. Joe Smith” 58 August, 2010 Annotations Goal: Attach probabilities to annotations in a principled, scalable manner 59 August, 2010 Quantifying this uncertainty is critical as • Extracted facts can then be queried using probabilistic databases • Confidence numbers can be used by information integration and search applications • It helps in improving the recall of annotators!! 60 August, 2010 Our approach • Propose a probabilistic framework for handling uncertainty in rule-based IE – Each annotation is associated with a confidence • the probability that the annotation is correct – Probability is obtained by augmenting each annotator with a statistical model • Design considerations – Applicable to grammar and declarative rule-based IE systems – Scale to annotators with a large number of (correlated) rules – Support incremental improvements in accuracy of probability estimates • as rules, data, or constraints are added 61 August, 2010 Rule Histories and Features • Rule history P1: <Salutation><CapitalizedWord><CapitalizedWord> P2: <First Name Dictionary><Last Name Dictionary> P3: <CapitalizedWord><CapitalizedWord> Please call Heather Choate at span P1 r = ( 0, P2 P3 1 , 1) Rule history • Rule features – Qualitative correlations and anti-correlations – Ex: “Rules P1 and P2 tend to occur together” 62 August, 2010 ProbIE Framework (Base Annotator) Annotator rules Labeled training data Rule features Text probIE Annotator Statistical model Learning phase Consolidated span + Rule history Annotation probability Extraction (deployment) phase 63 August, 2010 Probability Model of Uncertainty • Binary random variables associated with text and annotator – – – A(s) = 1 iff span s is actually a Person K(s) = 1 iff span s is annotated as a Person by consolidator R(s) = (R1(s),R2(s),…,Rk(s)) is stochastic rule history on span s • • Ri(s) = 1 iff ith rule holds at least once on span s Annotation probability: q(r) = P(A(s) = 1 | R(s) = r, K(s) = 1) • Indirect approach (estimate a prob dist’n rather than many small probs) – Estimate p0(r) = P(R(s) = r | A(s) = 0, K(s) = 1) p1(r) = P(R(s) = r | A(s) = 1, K(s) = 1) u q(r) = πp1(r) πp1(r) + (1 − π)p0 (r) = P(A(s) = 1 | K(s) = 1) – – – is easy to estimate empirically Serious data-sparsity problem for p0 and p1: 2k possible histories, little training data Solution: Fit a parametric model 64 August, 2010 A Parametric Model • Parametric exponential model for p1 (model for p0 is similar): – Recall: p1(r) = P(R(s) = r | A(s) = 1, K(s) = 1) with R(s) = (R1(s),…,Rk(s)) – From features to constraints P(R3(s) = 1 | A(s) = 1, K(s) = 1) = a3 (one marginal constraint per rule) P(R2(s) = 1 and R7(s) = 1 | A(s) = 1, K(s) = 1) = a2,7 (important correlations) – – where constants a3, a2,7, etc. computed from training data Approximate p1 by “simplest” (maximum entropy) distribution satisfying constraints Equivalent to maximum-likelihood fit of parameter vector for exponential distribution p1(r; θ) = – 1 exp Z(θ) {∑ c∈C } θc fc (r) fc = Indicator function for constraint c Use improved iterative scaling (IIS) to fit from training data • Model-decomposition methods for IIS scalability to many rules and constraints • Augment training data to handle constraints with 0 right-hand side • Methodology extends to derived annotators such as PersonPhone 65 August, 2010 Some Experimental Results (Pay-As-You-Go) Person annotator (No inter-rule constraints) Person annotator (4 inter-rule constraints) 66 August, 2010