talk

Stochastic Analytics
in the Database
Peter J. Haas
Kevin Beyer
Vuk Ercegovac
Bo Shekita
Chris Jermaine*
Ravi Jampani
Luis Perez
Mingxi Wu
Fei Xu
IBM Almaden Research Ctr.
U. Florida Gainesville
*Rice University
1
August, 2010
Outline
•
•
•
•
Motivation via examples
MCDB: Monte Carlo Database System
MC3: MCDB + map-reduce
Future directions
2
August, 2010
Problem Setting
?
?
• Large data sets
?
• Missing or uncertain data
• Stochastic models used to “guess” values
– Model gives probability distribution on data values
• Want to run BI queries over guessed values
• Want to assess uncertainty in query answers
– Risk assessment
– Decisionmaking
3
August, 2010
Ex. 1: Portfolio Values
Customer
EuroCallOptions
CustID
OptionID
NumShares
…
OptionID
InitVal
…
StrikeP
OVal
John Smith
23
50
…
23
$2.35
…
$4.00
?
…
…
…
…
…
…
…
SELECT SUM (c.NumShares * o.Val)
FROM Customer c, EuroCallOptions o
WHERE c.OptionID = o.OptionID
AND c.CustType = ‘Institutional’
…
Option value
one month from now
(exercise date)
Modified Black-Scholes model for European call option:
(
)
OVal = max (V (t final ) − S,0 )
dV = rV dt + a V V dW
Simulation approximation (Euler formula):
(
)
V (t + Δt ) = V (t ) + rV (t )Δt + a V (t ) V (t ) ΔtZ j
Sample from
Normal dist’n
4
August, 2010
Ex. 2: Pricing Decisions
CustID
Unit
Price
Order
Amount
J. Smith
$10.20
500
…
…
…
•
•
•
Data for one
customer
price
Data for all
customers
price
Bayes Theorem
demand
demand
Global demand
distribution (prior)
Individual demand
distribution (posterior)
Can analyze arbitrary dynamic customer segments when determining
effect of price increase
Similar approach for web-click behavior (EBay, Websphere portal)
Issues: Complex model, huge number of dynamic parameters
5
August, 2010
Ex. 3: Data-Warehouse Uncertainty
Data Integration
{John Smith, San Jose}
{John Smith, San Jose}
ETL
{John Smith, Los Angeles}
Name
John Smith
Name
City
John Smith
(SJ, 0.66), (LA, 0.33)
City
LA
Name
Sales
J. Smith
$50K
Information extraction
A lovely thing to behold
is Paris Hilton in the
Springtime …
09/09/2007
Re: system crash
-------------------------This morning, my ORACLE
system on LINUX exploded
in a spectacular fireball …
Similarity
Join
Hotels
System T
Hotel Annotator
Text Miner
Sales
LA
$50K
? (0.92)
(Michelakis et al., 2009)
NY Marriott
Paris Hilton
City
? (0.20)
Source
Problem type
Cust0385
(DBMS, 0.8), (OS, 0.2)
6
August, 2010
Risk Due to Data Uncertainty
• Ex: Value of assets (for financial reporting,
compliance, business-process monitoring)
probability
SELECT SUM (s.amount)
FROM SALES s, CUST c
WHERE s.ID = c.ID
AND c.city = ‘Los Angeles’
Query-result
distribution
5%
VaR
expected
answer
Total LA sales
• Ex: ERP
– # OS experts needed for help desk
– Based on (uncertain) extracted text data from last year
– Provide principled safety factor
7
August, 2010
Traditional Workflow
Arena, R, Matlab,…
Arena, R, Matlab,…
Model
Model
Data
reduction
Analyst (PhD)
Develops model
Model fitting
•
•
•
Data extraction slow and bug-prone
Only coarse-grained modeling
No encapsulation for user
Model application
& querying
•
•
•
Hard to re-link model results to DB
Hard to deal with data updates
Sensitivity, what-if analysis are hard
Goal: Integrate model with Database
August, 2010
Model
8
Prior Work
• MauveDB (Deshpande & Madden 06)
– Continuous deterministic models only
• Probabilistic databases
(Trio, MayBMS, ORION, MystiQ, K-relations, et al.)
–
–
–
–
–
–
For data-warehouse uncertainty
Hard-wired, limited uncertainty model (deterministic skeleton)
Limited queries (top-k)
Complexity issues
Independence assumptions
What-if analysis is hard
9
August, 2010
Outline
•
•
•
•
Motivation
MCDB: Monte Carlo Database System
MC3
Future directions
10
August, 2010
The MCDB System
i.i.d. samples from
possible-worlds
dist’n
Random DB = D
Schema
VG Functions
Parameter
Tables
d1
Monte Carlo
Generator
Q
d2
...
Q(D) =
Select SUM(sales)
AS t_sales
Q
Q(d1)
Q(d2)
:
Q(dn)
Q
i.i.d. samples from
query-result
dist’n
dn
ˆ [ t_sales ]
E
ˆ [ t_sales ]
Var
qˆ .01 [ t_sales ]
Histogram
Error bounds
Inference
Monte
Carlo
Estimator
11
August, 2010
MCDB Example
CID
Region
CID
Shape
Region
Scale
102
NewEngland
102
1.2
NewEngland
7.0
226
Midwest
226
0.7
Midwest
2.1
CUST_ATTR
Q: SELECT SUM(Amount)
FROM SALES
AS t_sales
AMT_SHAPE
CID
SALES
AMT_SCALE
CID
Shape
Scale
102
1.2
7.0
226
0.7
2.1
Amount
VG function
Gamma(shape, scale)
CID
d1
Amount
CID
102
$120.00
226
$60.00
Q(d1) = $180
d2
d3
Amount
CID
102
$80.00
102
$80.00
226
$90.00
226
$130.00
Q(d2) = $170
Amount
Q(d3) = $210
ˆ
ˆ
E[t_sales]
= $186.67 STD[t_sales]
= $20.82
August, 2010
12
Advantages of MCDB
•
Flexible and extensible uncertainty model
– Can capture extended relational models (Trio, MayBMS, Mystiq,…)
– Can capture huge range of stochastic models
•
Can bring complex stochastic models to data (no extraction needed)
•
Encapsulates complexity
– Once expert has written VG function, naïve user can run queries
•
Can handle arbitrary SQL queries
•
What-if analysis, sensitivity analysis, data updates are easy
13
August, 2010
VG Functions
Value
Weight
San Jose
0.66
San Francisco
0.34
Value
DiscreteChoice()
parameter table
San Jose
output table
(instance)
Pseudorandom #
seed
•
Used to generate instances of values in random tables
– Parameter tables are standard relational tables (can index, etc.)
– Library of standard functions: DiscreteChoice, Normal, Poisson, …
– Can define custom functions (similar to UDFs)
14
August, 2010
Pseudorandom Number Generators
(PRNG)
•
Needed by VG function
– E.g., to generate “random” sales values
•
Produces a deterministic sequence of seeds
– Appears random
– Cycles around
•
..
Typical PRNG recurrence:
– Si+1 = M * Si mod m
– Seed S = vector of k unsigned integers
– M is a matrix
•
•
Sn-1S0 S1 S
2
PRNG Cycle of Seeds
Transform seeds to desired random samples
Cycle usually “split” into disjoint segments
– Skip factor
•
Keeping only initial seed, S0, is sufficient to
regenerate sequence
15
August, 2010
.
VG Functions and Correlation
ID1
ID2
Cov
1
1
1.23
1
2
0.17
V1
V2
2
2
2.45
1.21
2.13
MDNormal()
Correlated
columns
or
ID
Mean
1
3.68
ID
Val
2
4.75
1
1.21
2
2.13
Correlated
rows
Pseudorandom #
seed
16
August, 2010
Schema Syntax: Example
CREATE TABLE RAND_CUST (CID, GENDER, MONEY, LIVES_IN) AS
FOR EACH d in CUST
WITH MONEY AS Gamma(
(SELECT n.SHAPE FROM MONEY_SHAPE n WHERE n.CID = d.CID),
(SELECT sc.SCALE FROM MONEY_SCALE sc WHERE sc.REGION =
d.REGION),
(SELECT SHIFT FROM MONEY_SHIFT)
)
WITH LIVES_IN AS DiscreteChoice (
(SELECT c.NAME, c.PROB
FROM CITIES c
WHERE c.REGION = d.REGION)
)
SELECT d.CID, d.GENDER, m.VALUE, l.VALUE
FROM MONEY m, LIVES_IN l
17
August, 2010
Query Processing
• Naïve approach
– Repeatedly instantiate DB and run query
– Has horrible performance
• MCDB approach
– Execute query plan once
– Process tuple bundles instead of tuples
• Represents tuple in all simulated possible worlds (MC reps)
• Permits a variety of performance optimizations
18
August, 2010
Tuple Bundles (4 MC Repetitions)
(Jane, Smith, 20)
(Jane, Smith, 21)
-(Jane, Smith, 21)
Tuple bundle
isPresent
(Jane, Smith, (20,21,x,21), (T,T,F,T), Seed)
(Jane, Smith, (T,T,F,T ), Seed)
Representation
Compressed representation
Basic ideas: (a) Keep bundles in compressed form whenever possible
(b) Apply selections early to compressed bundles
(c) Amortize I/O, network costs, etc. over multiple reps
19
August, 2010
Operations on Tuple Bundles
• Seed:
(Jane, Smith, --, --) u
(Jane, Smith, --, --, Seed)
• Split:
(Jane, Smith, (20,21,20,21), (T,T,T,T), Seed) u
(Jane, Smith, 20, (T,F,T,F), Seed),
(Jane, Smith, 21, (F,T,F,T), Seed)
• Inference:
(Jane, Smith, (20,21,20,21), (T,T,T,T), Seed) u
(Jane, Smith, 20, 0.5),
(Jane, Smith, 21, 0.5)
August, 2010
Also: Aggregate
20
Standard Operations
• Select (FNAME = ‘Jane’ AND AGE = 20)
(Jane, Smith, (20,21,20,21), (F,T,T,T), Seed)
(John, Jones, (32,31,20,30), (T,T,F,T), Seed)
(Jane, Jones, (21,23,22,22), (T,T,T,T), Seed) u
(Jane, Smith, (20,21,20,21), (F,F,T,F), Seed)
• Join (equijoin on Department #)
Uses SPLIT
+
sort-merge
(Smith, (D1,D2,D2,D1), (F,T,T,T), Seed1)
(Jones, (D1,D2,D2,D2), (T,T,F,T), Seed2) u
(Smith, D2, Jones, D2, (F,T,F,F), Seed1, Seed2)
21
August, 2010
Estimation and Inference
Distinct
tuple values
MCDB inference
operator
WITH Stats(Mu, Var) AS (
SELECT SUM(Val1*Frac),
SUM(Val*Val1*Frac)
- SUM(Val1*Frac)*SUM(Val1*Frac)
FROM OutputTable)
SELECT Mu AS Mean, SQRT(Var) AS Stdev,
1.96*SQRT(Var)/SQRT(1000) AS CIHW
FROM Stats
SQL queries
TotSales
Frac
20K
0.324
…
…
Frac. replications
where
value appears
(vs bit vector)
OutputTable
WITH CumDistFn(TotSales, Cum, PrevCum) AS (
SELECT TotSales,
SUM(Frac) OVER (ORDER BY TotSales
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW),
SUM(Frac) OVER (ORDER BY TotSales
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING)
FROM OutputTable)
SELECT Val FROM CumDistFn
WHERE Cum >= 0.5 AND PrevCum < 0.5
August, 2010
22
Experimental Queries
•
Q1: Next year’s revenue gain from Japanese products
– Assuming current trends hold
– Each order duplicated Poisson # of times
– Poisson mean = (this year)/(last year) for customer
•
Q2: Order Delays
– From placement to delivery
– Fitted Gamma distribution for each delay type (for each part)
•
Q3: What if we had used cheapest supplier?
– TPC-H only has current prices
– Prior prices generated by backwards random walk with drift
•
Q4: Change in profits with 5% price increase
– Bayesian model of customer demand
– Based on all customers orders at current price
23
August, 2010
Results 1 (1000 Reps*)
Q1
Q2
Long tail in
Delivery times
60
Frequency
Frequency
80
40
20
0
8.2
8.25
8.3
8.35
8.4
8.45
Revenue change
Q3
40
20
0
200
8.5
250
300
350
400
450
Days until completion
Q4
9
x 10
80
Frequency
40
Frequency
60
30
20
10
60
40
20
0
1.3375 1.338 1.3385 1.339 1.3395 1.34 1.3405 1.341
0
−8.842 −8.84 −8.838−8.836−8.834−8.832 −8.83 −8.828
Total supplier cost
Additional profits
10
x 10
*Q3 histogram based on 350 reps
10
x 10
24
August, 2010
Results 2: Execution Times (Min)
Query
1 rep
10 reps
100 reps
1000 reps
Q1
25
25
25
28
Q2
36
35
36
36
Q3
37
42
87
222*
Q4
42
45
60
214
vs 25000,
36000
*Based on 350 reps
•
Much faster than naïve method in all cases
25
August, 2010
Outline
•
•
•
•
Motivation
MCDB
MC3: MCDB + map-reduce
Future directions
26
August, 2010
Motivation
• Exploit massive parallelism of MCDB computations
– Extend domain of applicability
• Faster path to market?
– Forward-looking architecture
• Handle semi-structured, nested data
– E.g., web-click example: Petabytes of log file data
• Local expertise/interest in map-reduce
– Learning experience for interesting analytical problem
– MCDB computations often CPU-intensive
– Ease of prototyping
27
August, 2010
Technical Issues
•
•
•
•
How to represent bundles?
How to specify map-reduce jobs?
How to parallelize?
How to seed tuple bundles?
28
August, 2010
A Cluster-Computing Infrastructure
www.jaql.org
//code.google.com/p/jaql
High-level query
language for
semi-structured JSON
data
Jaql
Map-Reduce
Parallel batch
processing
Hadoop
HDFS
Distributed File
System
Initial prototype built in a few weeks
29
August, 2010
Map-Reduce Overview
Ex: parallel word counting
(NULL, “This is a line of text”)
Partitioned
Input File:
Partitioned
Output File:
[(“This”,1),…,(“text”,1)]
[(K, V)]
(“This”, [1,1,…,1])
(K, V)
M1
•
Programmer focus:
– Map: (K,V) → [(Km,Vm)]
– Reduce:
(Km, [Vm]) → [(Kr,Vr)]
•
System provides:
– Parallelism
– Sorting
– Synchronization
– Fault tolerance
– Resource allocation
On commodity hardware
[Vr]
[(Km, Vm)]
[(Kr, Vr)]
(Km, [Vm])
M2
R1
M3
R2
M4
[(“This”,528),(“is”,2000),…]
August, 2010
30
MCDB Example
CID
Region
Region
Scale
CID
Shape
102
NewEngland
NewEngland
7.0
102
1.2
226
Midwest
Midwest
2.1
226
0.7
CUST_ATTR
Q: SELECT SUM(Amount)
FROM SALES
AS t_sales
AMT_SCALE
CID
SALES
AMT_SHAPE
CID
Shape
Scale
102
1.2
7.0
226
0.7
2.1
Amount
VG function
Gamma(shape, scale)
CID
d1
Amount
CID
102
$120.00
226
$60.00
Q(d1) = $180
d2
d3
Amount
CID
102
$80.00
102
$80.00
226
$90.00
226
$130.00
Q(d2) = $170
Amount
Q(d3) = $210
ˆ
ˆ
E[t_sales]
= $186.67 STD[t_sales]
= $20.82
August, 2010
31
JSON and MC3
[{cid: 102, region: NewEngland}, …]
Join + Project
[{cid: 102, shape: 1.2, scale: 7.0}, …]
Seed
[{cid: 102, shape: 1.2, scale: 7.0, seed: 306576301}, …]
Instantiate
[{cid: 102, shape: 1.2, scale: 7.0,
amount: { seed: 306576301,
samples: [$120.30, $65.00, … ] },
isPresent: [T, T, … ]
}, …]
August, 2010
32
JAQL and MC3: Example
1 $cust = READ(hdfs(‘cust_attr’));
$shape = READ(hdfs(‘amt_shape’));
$scale = READ(hdfs(‘amt_scale’));
2 JOIN $shape, $cust, $scale
WHERE $shape.cid == $cust.cid
AND $cust.region == $scale.region
INTO {$shape, $scale}
//Seed
3 → TRANSFORM { $.*, seed: GetSeed() }
//Instantiate: generate array of 1000 samples
4 → TRANSFORM GenAmounts($.seed, $.shape, $.scale, 1000)
// Sum all sales tuple bundles
6 → GROUP INTO ArraySum($)
// Compute the distribution
7 → TRANSFORM Distribution($)
8 → WRITE(hdfs(‘result’));
33
August, 2010
Example of a Query Plan
1.
2.
3.
Final ArraySum
Distribution
Write ‘result’
Job 3
Reduce
Map
1.
2.
3.
GetSeed
GenAmounts
Partial ArraySum
Join (region)
Job 2
Read
‘AMT_SCALE’
Read
Join (cid)
Read
‘CUST_ATTR’
Read
‘AMT_SHAPE’
Job 1
34
August, 2010
Parallelism Schemes
• Inter-tuple parallelism
Tuple 1: (r1,…,r1000)
…
– Partition tuple bundles among nodes
Tuple 2: (r1,…,r1000)
– Natural fit with Map-Reduce
– Good when many bundles or cheap VG functions
• Intra-tuple parallelism
Tuple 1: (r1,…,r500)
– Split up tuple bundles
• Break Monte Carlo replications into chunks
– Apply inter-tuple parallelism methods to chunks
– Good when few tuples with
Tuple 2: (r1,…,r500)
Tuple 2: (r501,…,r1000)
…
• Expensive VG functions and/or
• Many MC replications
Tuple 1: (r501,…,r1000)
35
August, 2010
Distributed Seeding
Tuple 1
Tuple 2
Tuple n
…
• Must avoid overlapping seed sequences
• Maximize parallelization (tuples on different processors)
• Minimize seed size stored in each tuple
36
August, 2010
Skip-Ahead Method
•
•
•
Well512a generator: period = 2512
Assume inter-tuple parallelism (for simplicity)
Assume that we know (or have good upper bound for)
– # of bundles seeded per node (= b)
– # of seeds per VG function call (= c)
– # MC reps (= n)
Instantiation
Tuple j at node i: Make
m=b×i+j
skips of length c × n
to get to starting point
Seeding
Tuple j at node i:
{cid: 102, shape: 1.2, scale: 7.0}
{cid: 102, shape: 1.2, scale: 7.0, seed: [i, j] }
Actually, only O(log m)
skips needed:
pre-compute
Skip factors
37
August, 2010
Scale-up Results:
Inter-Tuple Parallelism
• Implemented two nontrivial queries from MCDB paper
Running Time (s)
– Jaql: Map-Reduce plan = original MCDB plan
– Good scalability with inter-tuple parallelism
1850
1600
Q4
Q1
1350
1100
850
600
0
1
2
3
4
5
6
7
Number of Servers
8
9
10
38
August, 2010
Speed-up Results:
Intra-Tuple Parallelism
• Implemented two call-option queries (Euro and Asian)
– Euro option: expensive VG function, good speed-up
– Asian option: cheap VG function, speed-up curve flattens
• Sequential merging of chunks starts to dominate
– Moral: choose appropriate parallelization scheme
80
72
64
Ideal Speedup
European Option
Asian Option
Speedup
56
48
40
32
24
16
8
4
0
0 4 8
16
24
32
40
48
Number of Cores
August, 2010
56
64
72
80
39
Outline
•
•
•
•
Motivation
MCDB
MC3
Future directions
40
August, 2010
An End-to-End ERP Scenario
ProbIE
Automobile problem
reports (text)
My S-Class slipped out
of gear …
Requirements for
mechanics and parts
(safety margin)
My S-Class slipped out
of gear …
Tire
Problem (0.2)
Transmission
problem (0.9)
Probabilistic
BI querying
SELECT COUNT(REPORTS)
WHERE P_TYPE = ‘transmission’
41
August, 2010
Future Directions
•
Tail Sampling
– Extreme-quantile (VAR) estimation and more
– “Gibbs cloner” approach
•
Performance
– Query optimization
• E.g., push down inference & instantiation,
choose parallelization scheme
• Improve JAQL rewriter (MC3 aware)?
• Re-use of partial results, multi-query optimizations?
– Sequential and/or adaptive simulation? (MC3)
– Combine with exact methods? Sampling?
– Indexing, etc.
•
Functionality
– User-defined precision
– Semi- and unstructured data
– Robust, full-featured re-implementation (underway)
•
Possible Applications
– Automotive ERP
– Health records
42
August, 2010
Related Projects
•
RAQA: Resolution-aware query
answering for Business Intelligence
[Sismanis et al., ICDE09]
–
–
–
–
–
•
State
Strict range
Status
San Francisco
CA
[$30,$230]
guaranteed
San Jose
CA
[$70,$200]
non-guaranteed
Sum(Sales) group by City,State
State
Strict range
Status
CA
[$230,$230]
guaranteed
Sum(Sales) group by State
ProbIE: Probabilistic info extraction
[Michelakis et al., SIGMOD09]
–
–
–
–
•
Uncertainty due to entity resolution
OLAP querying (roll-up, drill-down)
Bounds on query answers
Implemented via SQL queries
Conservative approach
City
For rule-based IE system (e.g., SystemT)
Provides confidence #’s for base/derived
annotations
Based on “rule history”, lower-level results
MaxEnt-based learning approach
probIE
Learning
phase
Other Monte Carlo/Statistics analysis in
Hadoop (XAP)
–
–
–
Operational risk calculations
RICARDO: synthesis of R and Hadoop
[Das et al., SIGMOD10]
Recommender systems
Statistical
model
Annotator rules
Labeled training data
Rule features
Text
Annotator
Annotation +
Rule history
Annotation
probability
Deployment
phase
43
August, 2010
Further Details:
•
•
•
•
MCDB: SIGMOD 2008
MC3: SIGMOD 2009
ProbIE: SIGMOD 2009
MCDB-R: VLDB 2010
www.almaden.ibm.com/cs/people/peterh
[email protected]
Thank You!
44
August, 2010
Backup Slides
45
August, 2010
Clinic-Capacity Risk
Medical data for all
customers
Stochastic
dosage model
Pharmacy data for all
customers
Cox hazard-rate
disease model
Clinic-resource
demand model
CustID
Time period
Resource
needed
Jane Smith
June-Sept
?
…
…
46
August, 2010
Individual Click Behavior (EBay)
Click data for all
EBay customers
p1
x13
p3
x34
x14
p4
x24
x32
p2
Data for one
customer
Global Markov model
distribution (Dirichelet prior)
•
p1
y13
p3
y34
y14
p4
y24
y32
p2
Individual Markov model
distribution (posterior)
Can analyze arbitrary dynamic customer segments when
determining effect of changing EBay pages
47
August, 2010
Logistics Under Uncertainty
•
•
Retailer: ship from warehouses to outlets today or tomorrow?
Deterministic tables
In_Stock
Shipment
•
Current_Price
ITEM_ID
QUANTITY
ITEM_ID
QUANTITY
ITEM_ID
Price
curtains
50
curtains
20
curtains
$120
…
…
…
…
…
…
Random tables
Sales_W_Ship
•
Sales_WO_Ship
CUST_ID
ITEM_ID
QUANTITY
CUST_ID
ITEM_ID
QUANTITY
Smith
curtains
?
Smith
curtains
?
…
…
…
…
Queries:
SELECT SUM (c.price * s.quantity)
FROM SALES_W_SHIP s,
CUR_PRICE c
WHERE c.ITEM_ID = s.ITEM_ID
•
SELECT SUM (c.price * s.quantity)
FROM SALES_WO_SHIP s,
CUR_PRICE c
WHERE c.ITEM_ID = s.ITEM_ID
Issues:
–
–
Complicated statistical models for purchase quantity (how to integrate in DB?)
Uncertainty (random tables) depend dynamically on huge number of parameters
August, 2010
48
Data Uncertainty - Continued
Anonymization
{JohnSmith, age 42}
Privacy Filter
Name
Age
John Smith
Between 40 and 50
Measurement Uncertainty
Sensor
Sensor_ID
Temp (F)
S23
78.32
f(t)
78.32
t
f(t)
System Monitor
Event
Time
Buffer overflow
10/17/2007:18:20:02
t
49
August, 2010
VG Function Implementation
• C++ class with four
public methods
– Initialize: set up data
structures, seed RNG
– TakeParams: read in
“parameter vector”
– OutputVals: return
random value(s) for
possible world
• Return NULL when
done
If newRep:
newRep = false
uniform = myRanDGen()
probSum = i = 0
while (uniform >= probSum)
i++
probSum += L[i].wt / totWeight
return L[i].val
Else
newRep = true
return NULL
OutputVals method
For DiscreteChoice()
– Finalize: clean up
50
August, 2010
Schema Syntax: Example 1
• Goal: generate random customer table
– MONEY, LIVES_IN are uncertain attributes
– MONEY has Gamma dist’n
• shift, shape, scale parameters
– Use DiscreteChoice for LIVES_IN value
– Customers are mutually independent, given region
• Parameter table schemas
– CUST (CID, GENDER, REGION)
– CITIES (NAME, REGION, PROB)
• Probabilities sum to 1 in each region
– MONEY_SHIFT (SHIFT)
Normalized
– MONEY_SCALE (REGION, SCALE)
storage
– MONEY_SHAPE (CID, SHAPE)
1 row, 1 column
51
August, 2010
Schema Syntax: Example 2
• Suppose MONEY and LIVES_IN are correlated
CREATE TABLE RAND_CUST (CID, GENDER, MONEY, LIVES_IN) AS
FOR EACH d in CUST
WITH MLI AS MyJointDistribution(…)
SELECT d.CID, d.GENDER, MLI.V1, MLI.V2
FROM MLI
MLI has 1 row, 2 columns
52
August, 2010
Schema Syntax: Example 3
•
Correlated sensors
– Sensors in same “sensor group” are correlated (multivariate normal)
•
Parameter table schemas
– S_PARAMS (ID, LAT, LONG, GID)
– MEANS (ID, MEAN)
– COVARS (ID1, ID2, COV)
CREATE TABLE SENSORS (ID, LAT, LONG, TEMP) AS
FOR EACH g in (SELECT DISTINCT GID FROM S_PARAMS)
WITH TEMP AS MDNormal (
(SELECT m.ID, m.MEAN FROM MEANS m S_PARAMS ss
WHERE m.ID = ss.ID AND ss.GID = g.GID),
(SELECT c.ID1, c.ID2, c.COV FROM COVARS c, S_PARAMS ss
WHERE c.ID1 = ss.ID AND ss.GID = g.GID)
)
SELECT s.ID, s.LAT, s.LONG, t.VALUE
FROM S_PARAMS s, TEMP t
WHERE s.ID = t.ID
August, 2010
53
Instantiate Operation
output
pipe
M ergeseed
π VG Atts∪ { seed}
V G Function
M ergeseed
Sortseed
π InAtts1 ∪ { seed} π InAtts ∪ { seed} π InAtts ∪ { seed}
2
3
π O utAtts∪ { seed}
B3
B2
B1
pipe fork
Q in,1
VG function
args
Q in,2
Q in,3
Q out
“inner” inputpipes
“outer”
inputpipe
For-each
clause
54
August, 2010
Q4 Details
• Effect on profits of 5% price increase
– Want more accuracy than usual aggregated demand functions
• E.g, exploit detailed point-of-sale data
– For each part
•
•
•
•
Fit “prior” demand-function distribution to all customers (MLE)
Determine “posterior” distribution for each cust. (Bayes Thm)
Generate random demand for each customer at new price
Use rejection algorithm to sample from posterior
P
Gamma(a,b)
{
Q
Gamma(c,d)
August, 2010
55
Multi-PRNG Method
•
•
•
When # of seeds per VG function call is unknown
When skip-ahead for huge PRNG is hard to implement
Collisions possible, but probability < 10-17
Shared by
All nodes
s0
Instantiation of tuple j
Seeding at node i
bundle j
6 ints x [# bundles
at nodes 0 to (i-1)]
16
G1 6 ints
(small)
int
s
G3
(medium)
G2
(medium)
4
ts
in
G4
(huge)
bundle j
56
August, 2010
Nested-Data Experiments
• TPC-H schema is used
• Two different ways to nest data
– Nest lineitem table under orders table
– Nest lineitem table under partsupp table
• Modified version of Q4 from MCDB paper
– Compare MC3 execution time to flat scheme
– First nesting scheme: running time is slower
– Second nesting scheme: running time is faster
• Only uncertain “leaf attributes” are supported
57
August, 2010
Probabilistic Information Extraction in
a Rule-Based System
Motivation: System T
Hand-crafted rules for specific domain:
Annotator
Person
Base
annotator
Candidate-Generation Rules
Rule Precision
P1: <Salutation><CapitalizedWord><CapitalizedWord>
P2: <First Name Dictionary><Last Name Dictionary>
P3: <CapitalizedWord><CapitalizedWord>
High
High
Low
PhoneNumber
Base
annotator
Ph1: <PhoneClue><\d{3}-\d{3}-\d{4}>
Ph2: <\d{3}-\d{3}-\d{4}>
Ph3: <\d{5}
High
Medium
Low
PersonPhone
PP1: <Person><“can be reached at”><PhoneNumber>
PP2: <“call”><Person><0-2 tokens><PhoneNumber>
PP3: [<Person><PhoneNumber>]sentence
High
High
Medium
Derived
annotator
+ Consolidation rule
Consolidate(“Joe Smith”, “Mr. Joe Smith”) = “Mr. Joe Smith”
58
August, 2010
Annotations
Goal: Attach probabilities to annotations
in a principled, scalable manner
59
August, 2010
Quantifying this uncertainty is
critical as
• Extracted facts can then be queried using
probabilistic databases
• Confidence numbers can be used by
information integration and search
applications
• It helps in improving the recall of
annotators!!
60
August, 2010
Our approach
• Propose a probabilistic framework for handling
uncertainty in rule-based IE
– Each annotation is associated with a confidence
• the probability that the annotation is correct
– Probability is obtained by augmenting each annotator with
a statistical model
• Design considerations
– Applicable to grammar and declarative rule-based IE
systems
– Scale to annotators with a large number of (correlated)
rules
– Support incremental improvements in accuracy of
probability estimates
• as rules, data, or constraints are added
61
August, 2010
Rule Histories and Features
• Rule history
P1: <Salutation><CapitalizedWord><CapitalizedWord>
P2: <First Name Dictionary><Last Name Dictionary>
P3: <CapitalizedWord><CapitalizedWord>
Please call Heather Choate at
span
P1
r = ( 0,
P2 P3
1 , 1)
Rule history
• Rule features
– Qualitative correlations and anti-correlations
– Ex: “Rules P1 and P2 tend to occur together”
62
August, 2010
ProbIE Framework
(Base Annotator)
Annotator rules
Labeled training data
Rule features
Text
probIE
Annotator
Statistical
model
Learning phase
Consolidated span +
Rule history
Annotation
probability
Extraction (deployment) phase
63
August, 2010
Probability Model of Uncertainty
•
Binary random variables associated with text and annotator
–
–
–
A(s) = 1 iff span s is actually a Person
K(s) = 1 iff span s is annotated as a Person by consolidator
R(s) = (R1(s),R2(s),…,Rk(s)) is stochastic rule history on span s
•
•
Ri(s) = 1 iff ith rule holds at least once on span s
Annotation probability:
q(r) = P(A(s) = 1 | R(s) = r, K(s) = 1)
•
Indirect approach (estimate a prob dist’n rather than many small probs)
–
Estimate
p0(r) = P(R(s) = r | A(s) = 0, K(s) = 1)
p1(r) = P(R(s) = r | A(s) = 1, K(s) = 1)
u
q(r) =
πp1(r)
πp1(r) + (1 − π)p0 (r)
= P(A(s) = 1 | K(s) = 1)
–
–
–
is easy to estimate empirically
Serious data-sparsity problem for p0 and p1: 2k possible histories, little training data
Solution: Fit a parametric model
64
August, 2010
A Parametric Model
•
Parametric exponential model for p1 (model for p0 is similar):
– Recall: p1(r) = P(R(s) = r | A(s) = 1, K(s) = 1) with R(s) = (R1(s),…,Rk(s))
–
From features to constraints
P(R3(s) = 1 | A(s) = 1, K(s) = 1) = a3
(one marginal constraint per rule)
P(R2(s) = 1 and R7(s) = 1 | A(s) = 1, K(s) = 1) = a2,7 (important correlations)
–
–
where constants a3, a2,7, etc. computed from training data
Approximate p1 by “simplest” (maximum entropy) distribution satisfying constraints
Equivalent to maximum-likelihood fit of parameter vector for exponential distribution
p1(r; θ) =
–
1
exp
Z(θ)
{∑
c∈C
}
θc fc (r)
fc = Indicator function
for constraint c
Use improved iterative scaling (IIS) to fit from training data
•
Model-decomposition methods for IIS scalability to many rules and constraints
•
Augment training data to handle constraints with 0 right-hand side
•
Methodology extends to derived annotators such as PersonPhone
65
August, 2010
Some Experimental Results
(Pay-As-You-Go)
Person annotator
(No inter-rule constraints)
Person annotator
(4 inter-rule constraints)
66
August, 2010