Splash presentation slides (June 2015, Bay Area ACM)

Collaborative Modeling, Simulation, and Analytics
with Splash
IBM Research
Nicole Barberis, Peter J. Haas, Cheryl Kieliszewski, Yinan Li, Paul Maglio,
Piyaphol Phoungphol, Pat Selinger, Yannis Sismanis, Wang-Chiew Tan,
Ignacio Terrizzano, Haidong Xue, SJSU CAMCOS
IBM Research – Almaden
Splash
Smarter Planet Platform
for Analysis and
Simulation of Health
http://researcher.watson.ibm.com/researcher/view_project.php?id=3931
© 2011
2012 IBM Corporation
Some Context: Model-Data Ecosystems
My Two Communities
Modeling and Simulation
Information Management & Analytics
Opportunities for Innovation at the Intersection
MCDB & SimSQL
Splash
Modeling and Simulation
Information Management & Analytics
Some Further Thoughts and Examples [PODS 2014 Tutorial]
(In addition to large-scale scientific environments)
▪ Data-intensive simulation
– Simulations within databases
– Databases within simulations
– Data harmonization at scale
▪ Information integration
– Simulation as an information-integration tool
– Combining real and simulated data
▪ And more!
Motivation for Splash
The Setting: Analytics for Decision Support
“Analytics is…a complete [enterprise] problem solving and decision making process”
Descriptive analytics: finding patterns and relationships in historical and existing data
Predictive analytics: predict future probabilities and trends to allow what-if analysis [Splash]
Prescriptive analytics: deterministic and stochastic optimization to support better decision making
Shallow Versus Deep Predictive Analytics
Extrapolation
[Figure: United States House Prices. Axes: Year (1970 to 2010) vs. Price ($0 to $275,000). Series: actual prices; extrapolation of 1970-2006 median U.S. housing prices]
[Figure: NCAR Community Atmosphere Model (CAM)]
Big, Difficult, Important Problems Span Many Disciplines
Need collaborative cross-disciplinary modeling and simulation
[Figure: Global system-of-systems, $54 Trillion (100% of WW 2008 GDP)]
Communication: $3.96 Tn
Transportation: $6.95 Tn
Education: $1.36 Tn
Water: $0.13 Tn
Leisure / Recreation / Clothing: $7.80 Tn
Electricity: $2.94 Tn
Healthcare: $4.27 Tn
Infrastructure: $12.54 Tn
Finance: $4.58 Tn
Food: $4.89 Tn
Govt. & Safety: $5.21 Tn
Legend for system inputs: Same Industry, Business Support, IT Systems, Energy Resources, Machinery, Materials, Trade
IBM analysis based on OECD data.
The food system is complex, and interventions often have unintended and deleterious effects on food security, or have major consequences that affect GHG emissions. Agricultural, economic, and climate modelers must compare their models more systematically, share results, and integrate their work to meet the needs of policy-makers.
Health is a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity.
Example: Unintended Outcomes in Healthcare Optimization
[Figure: Avg. Patient Delay vs. Time (months; ticks at 0, 6, 12, 18), with the re-design point marked. Curves: simulation model of Calgary Lab Services; system-dynamics social model of lab use]
Moral: Combine models across disciplines for more robust decision making
T. R. Rohleder, D. P. Bischak & L. B. Baskin (2007). Modeling patient service centers with simulation and system dynamics. Health Care Manage. Sci., 10:1–12.
Combining Models Across Disciplines is HARD
▪ Domain experts have different worldviews
▪ Use different vocabularies
▪ Sit in different organizations
▪ Develop models on different platforms
▪ Don’t want to rewrite existing models!
Huang, T. T., Drewnowski, A., Kumanyika, S. K., & Glass, T. A., 2009, “A Systems-Oriented Multilevel Framework for Addressing Obesity in the 21st Century,” Preventing Chronic Disease, 6(3)
Prior Approaches to Combining Models
Monolithic models
▪ Create a monolithic model that encompasses all relevant domains
Integrated models
▪ Create modules that can be compiled into one
▪ SpatioTemporal Epidemiological Modeler (STEM)
▪ Community Atmosphere Model (CAM)
Tightly-coupled models
▪ Create modules that understand standard interfaces
▪ DOD High Level Architecture (HLA)
▪ Discrete-Event System Specification (DEVS)
▪ Open Modeling Interface (OpenMI)
Splash: An Alternative Approach
Loosely couple models and data via data exchange
Splash = data integration + workflow management + simulation
Re-use heterogeneous models and heterogeneous data that are
curated by different domain experts
Some Benefits of Loose Coupling
Facilitates cross-disciplinary modeling, analytics, and simulation
for robust decision making under uncertainty
Enables re-use of models and datasets
Encourages comprehensive documentation and curation of models via metadata
Allows model flexibility:
– Upgrading to state-of-the-art
– Customizing for different users
Splash
A prototype platform and service for integrating existing data, models, and
simulations to gain insight needed for complex decision making related to
policy, planning, and investment.
Splash Platform
[Figure: Splash Platform components] Model and Data Curation; Model and Data Discovery; Data, SADL, and Models; Model Composition; Analysis; Visualization; Composite-Model Execution; Experiment Management
DBMS, Hadoop, Visualization Tools, Information Integration Tools, Stats Packages
Model and Data Curation
Splash Actor Description Language (SADL)
▪ SADL provides “schemas and constraints” for models, transformations, and data, enabling interoperability
▪ SADL file for data (can exploit XSD)
– Attribute names, semantics, units
– Constraints
– How to access
– Security
– Experiment-management info
▪ SADL file for a model:
– Inputs and outputs (pointers to SADL files for data sources and sinks)
– How to execute (info needed to synthesize command line)
– Semantics and assumptions
– Provenance (papers, ratings, ownership, security, change history, …)
– RNG info
<Actor name="BMI Model" type="model" model_type="simulation"
       sim_type="continuous-deterministic" owner="Jane Modeler">
  <Description>
    Predict weight change over time based on an individual’s energy and food
    intake. Implemented in C. Reference: http://csel/asu.edu/?q=Weight
  </Description>
  <Environment>
    <Variable name="EXEC_DIR" default="/Splash" description="executable directory path"/>
    <Variable name="SADL_DIR" default="/Splash/SADL" description="schema directory path"/>
  </Environment>
  <Execution>
    <Command>$EXEC_DIR/Models/BMIcalc.out</Command>
    <Title>Run BMI model</Title>
  </Execution>
  <Arguments>
    <Input name="demographics" sadl="$SADL_DIR/BMIInput.sadl"
           description="demographics data"/>
    <Output name="people" sadl="$SADL_DIR/BMIOutput.sadl"
            description="people’s daily calculated BMI"/>
  </Arguments>
</Actor>
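A minimal sketch of how a tool might read such an actor description and synthesize the command line. It assumes the SADL file is well-formed XML with the element names shown on the slide; `expand` and `describe_actor` are hypothetical helpers, not part of Splash.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the BMI Model actor description (straight quotes only).
SADL_TEXT = """
<Actor name="BMI Model" type="model" model_type="simulation"
       sim_type="continuous-deterministic" owner="Jane Modeler">
  <Environment>
    <Variable name="EXEC_DIR" default="/Splash" description="executable directory path"/>
    <Variable name="SADL_DIR" default="/Splash/SADL" description="schema directory path"/>
  </Environment>
  <Execution>
    <Command>$EXEC_DIR/Models/BMIcalc.out</Command>
    <Title>Run BMI model</Title>
  </Execution>
  <Arguments>
    <Input name="demographics" sadl="$SADL_DIR/BMIInput.sadl" description="demographics data"/>
    <Output name="people" sadl="$SADL_DIR/BMIOutput.sadl" description="daily calculated BMI"/>
  </Arguments>
</Actor>
"""

def expand(text, env):
    """Substitute $NAME references using the actor's environment defaults."""
    for name, value in env.items():
        text = text.replace("$" + name, value)
    return text

def describe_actor(sadl_text):
    """Hypothetical helper: return (command, inputs, outputs) for one actor."""
    actor = ET.fromstring(sadl_text)
    env = {v.get("name"): v.get("default") for v in actor.iter("Variable")}
    command = expand(actor.findtext("Execution/Command"), env)
    inputs = {i.get("name"): expand(i.get("sadl"), env) for i in actor.iter("Input")}
    outputs = {o.get("name"): expand(o.get("sadl"), env) for o in actor.iter("Output")}
    return command, inputs, outputs

command, inputs, outputs = describe_actor(SADL_TEXT)
```

Here the environment defaults stand in for values that a real deployment would presumably override per site.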
Registration: Use Wizards to Create Model and Data SADL Files
Model Wizard offers step-by-step guidance to generate the model’s SADL and the command line for invocation
Data Wizard generates SADL for model input and output files
Model Composition
Obesity Example
[Figure: composite-model dataflow. Legend: data source, dataflow, simulation model, data transformation]
Simulation models: Transportation (VISUM simulation model); Buying and eating (agent-based simulation model); Exercise (stochastic discrete-event simulation); BMI Model (differential equation model)
Data sources: GIS data; Demographic data; Facility data
Data transformations: Geospatial alignment; Time alignment and data merging
Output: Results
Sample Results
If we open a new “healthy” food store in a “bad” neighborhood…
[Figure: two panels of BMI by rich/poor (poor vs. rich), one without the traffic model and one including the traffic model]
* Many assumptions, sample only, your mileage may vary …
Implemented Obesity Example
Models and data can reside at different locations
[Figure: implemented workflow. Legend: model actor, mapping actor, data actor, visualization actor]
▪ Data actors: input and output files, databases, web services, etc.
▪ Model actors: simulation, optimization, statistical models
▪ Mapping actors: data transformations, time and space alignment
▪ Visualization actors: graphs, reports, etc.
Data Transformations Between Models
▪ Transformation design tools for structural (schema) and time alignments
▪ SADL metadata used to automatically detect mismatches
▪ Splash generates code for massive-scale transformation on Hadoop at simulation time
Clio++: Schema mapping & unit corrections
Time Aligner: Time-series harmonization
Composite-Model Execution
Executing a Composite Model: The Need for Runtime Efficiency
A huge parameter space to explore (many model runs)
▪ Ex: 3 models, each with 10 params at 2 vals/param: $2^{3 \times 10} = 2^{30}$, over 1 billion model runs
▪ Even worse for stochastic models (multiple Monte Carlo replications)
▪ Experimental design can help
Each model run can be extremely time consuming
▪ Large-scale, high-resolution models produce and consume massive amounts of time-series and other data
▪ CPU-intensive computations
▪ Composing models (with data transformations) intensifies the problem
[Figures: T-cell biology model; regional traffic model; NCAR Community Atmosphere Model (CAM); agent-based social model]
Time alignment with MapReduce
[Figure: animation of a sliding window moving along the irregular source series (starting at s0) to compute each point of the regular target series (starting at t0)]
Irregular source time series
Sliding window by size (n=4)
Regular target time series to be calculated.
Interpolation, nearest neighbor, aggregation (since-last, since-start)
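A minimal single-machine sketch of the windowed alignment idea, using linear interpolation. It assumes a window of the two bracketing source points (not the n=4 window on the slide) and does not model the MapReduce distribution; `align_linear` is a hypothetical name.

```python
from bisect import bisect_right

def align_linear(source, target_times):
    """Linearly interpolate an irregular (time, value) series onto target_times.

    source must be sorted by time; each target time is served by the window of
    the two source points that bracket it.
    """
    times = [s for s, _ in source]
    aligned = []
    for t in target_times:
        if t < times[0] or t > times[-1]:
            raise ValueError("target time outside the source range")
        # Index of the left bracketing point, clamped so that j + 1 is valid.
        j = min(bisect_right(times, t) - 1, len(times) - 2)
        (s0, d0), (s1, d1) = source[j], source[j + 1]
        w = (t - s0) / (s1 - s0)
        aligned.append((t, d0 + w * (d1 - d0)))
    return aligned

source = [(0.0, 10.0), (1.5, 13.0), (4.0, 8.0), (6.0, 12.0)]
result = align_linear(source, [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Because each target point depends only on its own window, windows can in principle be processed independently, which is what makes the operation a natural fit for a map-style job.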
Cubic-Spline Interpolation in MapReduce
▪ Recall: Source outputs one tick per two days; target needs one tick per day
▪ (Natural) cubic spline widely used
– Uniformly approximates $f$ and $f'$
– Error of $O(h^4)$ as knot spacing $h \to 0$
– Default method in SAS
▪ Given source and target time series:
\[ S = (s_0, d_0), (s_1, d_1), \ldots, (s_m, d_m) \quad\text{and}\quad T = (t_0, \tilde d_0), (t_1, \tilde d_1), \ldots, (t_n, \tilde d_n) \]
▪ Given window $W_i$ for $t_i$: $W_i = (s_j, d_j, \sigma_j), (s_{j+1}, d_{j+1}, \sigma_{j+1})$, where $[s_j, s_{j+1})$ contains $t_i$
\[ \tilde d_i = f(W_i) = \frac{\sigma_j}{6h_j}(s_{j+1} - t_i)^3 + \frac{\sigma_{j+1}}{6h_j}(t_i - s_j)^3 + \Bigl(\frac{d_{j+1}}{h_j} - \frac{\sigma_{j+1}h_j}{6}\Bigr)(t_i - s_j) + \Bigl(\frac{d_j}{h_j} - \frac{\sigma_j h_j}{6}\Bigr)(s_{j+1} - t_i) \]
\[ h_j = s_{j+1} - s_j \]
Question: How to Compute Spline Constants?
▪ Must solve $Ax = b$ ($m-1$ rows and columns):
\[
A = \begin{pmatrix}
\frac{h_0 + h_1}{3} & \frac{h_1}{6} & 0 & \cdots & 0 & 0 \\
\frac{h_1}{6} & \frac{h_1 + h_2}{3} & \frac{h_2}{6} & \cdots & 0 & 0 \\
\vdots & & \ddots & & & \vdots \\
0 & 0 & \cdots & \frac{h_{m-3}}{6} & \frac{h_{m-3} + h_{m-2}}{3} & \frac{h_{m-2}}{6} \\
0 & 0 & \cdots & 0 & \frac{h_{m-2}}{6} & \frac{h_{m-2} + h_{m-1}}{3}
\end{pmatrix},
\qquad
b = \begin{pmatrix}
\frac{d_2 - d_1}{h_1} - \frac{d_1 - d_0}{h_0} \\
\frac{d_3 - d_2}{h_2} - \frac{d_2 - d_1}{h_1} \\
\vdots \\
\frac{d_m - d_{m-1}}{h_{m-1}} - \frac{d_{m-1} - d_{m-2}}{h_{m-2}}
\end{pmatrix}
\]
▪ Prior work
– Some solutions require evenly spaced source points
– Some solutions require precomputation (somehow) of $A^{-1}$
– Other solutions for vector machines, MPI architectures, GPUs
• Require a lot of data shuffling (reduce steps) in Hadoop adaptation
• Example: Parallel Cyclic Reduction (PCR) uses $\log_2 m$ map-reduce jobs
▪ Our approach: minimize $L(x) = \lVert Ax - b \rVert_2^2 = \sum_i (A_i x - b_i)^2$, where $L_i(x) = (A_i x - b_i)^2$ and $A_i$ is the $i$th row of $A$
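For reference, a small sequential baseline for the same tridiagonal system. It uses the classical Thomas algorithm, which is not the approach proposed in these slides, and then evaluates the natural cubic spline between knots. The names `spline_constants`, `spline_value`, and `sigma` are hypothetical; natural boundary conditions (zero constants at both ends) are assumed.

```python
def spline_constants(s, d):
    """Solve the tridiagonal system A x = b for the m-1 interior spline constants.

    Sequential Thomas algorithm, used here only as a baseline.
    Returns the constants padded with zeros at both ends (natural spline).
    """
    m = len(s) - 1
    h = [s[j + 1] - s[j] for j in range(m)]
    # Row r (0-based) of A and b corresponds to interior knot j = r + 1.
    diag = [(h[j - 1] + h[j]) / 3.0 for j in range(1, m)]
    off = [h[j] / 6.0 for j in range(1, m - 1)]  # couples knots j and j + 1
    b = [(d[j + 1] - d[j]) / h[j] - (d[j] - d[j - 1]) / h[j - 1] for j in range(1, m)]
    # Forward elimination.
    for r in range(1, m - 1):
        w = off[r - 1] / diag[r - 1]
        diag[r] -= w * off[r - 1]
        b[r] -= w * b[r - 1]
    # Back substitution.
    x = [0.0] * (m - 1)
    x[-1] = b[-1] / diag[-1]
    for r in range(m - 3, -1, -1):
        x[r] = (b[r] - off[r] * x[r + 1]) / diag[r]
    return [0.0] + x + [0.0]

def spline_value(s, d, sigma, t):
    """Evaluate the spline at t using the last window whose left knot is <= t."""
    j = max(k for k in range(len(s) - 1) if s[k] <= t)
    h = s[j + 1] - s[j]
    return (sigma[j] / (6 * h) * (s[j + 1] - t) ** 3
            + sigma[j + 1] / (6 * h) * (t - s[j]) ** 3
            + (d[j + 1] / h - sigma[j + 1] * h / 6) * (t - s[j])
            + (d[j] / h - sigma[j] * h / 6) * (s[j + 1] - t))

s = [0.0, 1.0, 2.0, 3.0, 4.0]
d = [0.0, 1.0, 0.0, 1.0, 0.0]
sigma = spline_constants(s, d)
midpoint = spline_value(s, d, sigma, 1.5)
```

A sequential solve like this is fine for small series; the slides are concerned with the case where the series is too large for a single node.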
Our Solution: Distributed Stochastic Gradient Descent (DSGD)
▪ Originally for matrix completion, e.g., Netflix ratings problem [GHS KDD11]
▪ Uses stochastic gradient descent (SGD) to minimize $L$
– Deterministic gradient descent (DGD): $x^{(n+1)} = x^{(n)} - \epsilon_n L'(x^{(n)})$, where $L'(x^{(n)}) = \sum_{i=1}^{m-1} L_i'(x^{(n)})$
– Stochastic gradient descent: $x^{(n+1)} = x^{(n)} - \epsilon_n \hat L'(x^{(n)})$, where $\hat L'(x^{(n)}) = (m-1) L_I'(x^{(n)})$ and $I$ is randomly chosen from $[1..m-1]$
– Avoids getting stuck at local minima
– Problem: SGD is not a parallel algorithm
▪ Idea: run SGD on subsets (strata) of rows, randomly switch strata; choose “sparse” strata that allow parallel execution of SGD
– Converges to overall solution with probability 1 under mild conditions
Choosing Strata
Goal: Permit parallel execution of SGD within each stratum
Key observation: $L_i'(x) = (0 \cdots 0 \;\; u_{i,i-1} \;\; u_{i,i} \;\; u_{i,i+1} \;\; 0 \cdots 0)$, where $u_{i,j} = 2a_{i,j}(a_{i,i-1}x_{i-1} + a_{i,i}x_i + a_{i,i+1}x_{i+1} - b_i)$
Updating $x_i$ only affects (and is affected by) $x_{i-1}$ and $x_{i+1}$
Stratum choice:
▪ Can implement as map-only Hadoop job (almost no data shuffling)
▪ Exploit discrepancy between logical splits and physical blocks
Empirical study:
▪ 2x-3x faster than best-of-breed PCR algorithm
▪ 10 scans vs. $\log m$ for PCR
▪ PCR requires extra sort
▪ PCR requires massive data shuffling (network bottleneck)
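A rough single-process sketch of stratified SGD on a small tridiagonal test system. Several things here are assumptions made for illustration: stratum s holds the rows with index congruent to s modulo 3, the step size is constant rather than a decaying sequence, the (m-1) scale factor is dropped, and the rows of a stratum are processed in a loop instead of in parallel. All function names are hypothetical.

```python
import random

def tridiagonal_rows(m1):
    """Rows of a test system with unit knot spacing: diagonal 2/3, off-diagonals 1/6."""
    rows = []
    for i in range(m1):
        row = {i: 2.0 / 3.0}
        if i > 0:
            row[i - 1] = 1.0 / 6.0
        if i < m1 - 1:
            row[i + 1] = 1.0 / 6.0
        rows.append(row)
    return rows

def stratified_sgd(rows, b, steps=3000, eps=1.0, seed=0):
    """Minimize sum_i (A_i x - b_i)^2, processing one stratum of rows per step."""
    rng = random.Random(seed)
    x = [0.0] * len(rows)
    for _ in range(steps):
        stratum = rng.randrange(3)  # random switch between the three strata
        # Rows i, i+3, i+6, ... read and write disjoint entries of x, so this
        # loop could run in parallel; it is sequential here for simplicity.
        for i in range(stratum, len(rows), 3):
            residual = sum(a * x[j] for j, a in rows[i].items()) - b[i]
            for j, a in rows[i].items():
                x[j] -= eps * 2.0 * a * residual  # gradient entry u_{i,j}
    return x

rows = tridiagonal_rows(9)
x_true = [float(k) for k in range(1, 10)]
b = [sum(a * x_true[j] for j, a in row.items()) for row in rows]
x = stratified_sgd(rows, b)
max_error = max(abs(u - v) for u, v in zip(x, x_true))
```

The key property being illustrated is the disjointness within a stratum: row i touches only entries i-1, i, and i+1, so rows three apart never conflict.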
Speeding up Composite Simulations: Result Caching
Motivating example: Two models in series, 100 reps
[Figure: Model 1 (deterministic) feeding Model 2 (stochastic)]
▪ Naïve approach: execute composite model (i.e., Models 1 & 2) 100 times
▪ A better approach:
– Execute Model 1 once and cache result
– Read from cache when executing Model 2
Question: Can result-caching idea be generalized?
General Method for Two Stochastic Models in Series
[Figure: Model 1 (stochastic) feeding Model 2 (stochastic)]
Goal: Estimate $\theta = E[Y_2]$ based on $n$ replications
Result-caching approach:
1. Set $m_n = \lceil \alpha n \rceil$ for some $\alpha \in (0,1]$ (the re-use factor). Ex: $n = 10$, $m_n = 4$
2. Generate $m_n$ outputs from Model 1 and cache them
3. To execute Model 2, cycle through Model 1 outputs
4. Estimate $\theta$ by $\theta_n = \sum_{i=1}^{n} Y_{2;i} / n$
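The four steps above can be sketched as follows. The two toy models and the choice of 0.4 for the re-use factor are hypothetical and serve only to make the example runnable.

```python
import math
import random

def estimate_with_caching(model1, model2, n, alpha, seed=0):
    """Estimate E[Y2] from n Model 2 runs that re-use ceil(alpha * n) Model 1 outputs."""
    rng = random.Random(seed)
    m_n = math.ceil(alpha * n)                                   # step 1
    cache = [model1(rng) for _ in range(m_n)]                    # step 2
    outputs = [model2(cache[i % m_n], rng) for i in range(n)]    # step 3: cycle
    return sum(outputs) / n, m_n                                 # step 4

# Hypothetical toy models, for illustration only.
def model1(rng):
    return rng.gauss(10.0, 1.0)

def model2(y1, rng):
    return 2.0 * y1 + rng.gauss(0.0, 0.5)

theta_hat, model1_runs = estimate_with_caching(model1, model2, n=10, alpha=0.4)
```

With n = 10 and a re-use factor of 0.4, Model 1 runs 4 times while Model 2 runs 10 times, matching the example on the slide.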
Optimizing the Re-Use Factor for Maximum Efficiency
Q: How to trade off cost and precision?
▪ Assume a (large) fixed computational budget $c$
▪ Random cost model: correlated pair $(\tau_i, Y_i)$
– $\tau_i$ = (random) cost of producing an observation $Y_i$
– $N(c)$ = # of observations of $Y_2$ generated under $c$
– $\hat\theta(c) = \sum_{j=1}^{N(c)} Y_{2;j} / N(c)$
▪ Approx. distribution of $\hat\theta(c)$: variance $\approx g(\alpha)/c$, with $r = \lfloor 1/\alpha \rfloor$
[Figure: approximate distribution of $\hat\theta(c)$]
The Optimal Re-Use Factor
Optimal solution
▪ Assume that $\mathrm{Cov}[Y_2, \tilde Y_2] > 0$
▪ Optimal value of $\alpha$:
\[ \alpha^* = \left( \frac{E[\tau_2] / E[\tau_1]}{\mathrm{Var}[Y_2] / \mathrm{Cov}[Y_2, \tilde Y_2] - 1} \right)^{1/2} \quad \text{(truncate at } 1/n \text{ or } 1\text{)} \]
Observations
– If E[Model 1 cost] >> E[Model 2 cost], then high re-use of output
– If Model 2 insensitive to Model 1 (Cov << Var), then high re-use
– If Model 1 is deterministic (Cov = 0), then total re-use
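A small sketch of the optimal re-use factor with its truncation, useful for checking the three observations numerically. The function name and argument names are hypothetical, and the handling of the degenerate cases (zero covariance, or covariance equal to variance) is an assumption that follows the limiting behavior of the formula.

```python
import math

def optimal_reuse_factor(cost1, cost2, var_y2, cov_y2, n):
    """Return the optimal re-use factor, truncated to [1/n, 1].

    cost1, cost2: expected cost of one Model 1 / Model 2 run.
    var_y2: Var[Y2]; cov_y2: covariance of two Model 2 outputs that share
    the same Model 1 input.
    """
    if cov_y2 <= 0.0:
        return 1.0 / n      # limiting case Cov = 0: total re-use
    ratio = var_y2 / cov_y2 - 1.0
    if ratio <= 0.0:
        return 1.0          # limiting case Cov = Var: no re-use
    alpha = math.sqrt((cost2 / cost1) / ratio)
    return min(1.0, max(1.0 / n, alpha))

alpha_star = optimal_reuse_factor(cost1=4.0, cost2=1.0, var_y2=2.0, cov_y2=1.0, n=100)
```

In this example Model 1 is four times as expensive as Model 2 and half of the output variance is shared, which gives a re-use factor of 0.5.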
Experiment Management (and Optimization)
Experiment Design and Efficiency
Trades off execution cost versus level of detail that can be estimated
Coarse resolution is OK for sensitivity analysis etc.
[Figure: Resolution III design]
Example: 1st-order polynomial metamodel for scaled data (7 factors)
\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_7 x_7 + \beta_{1;2} x_1 x_2 + \cdots + \beta_{6;7} x_6 x_7 + \beta_{1;2;3} x_1 x_2 x_3 + \cdots + \text{noise} \]
$x_1, \ldots, x_7 \in \{-1, 1\}$ (full factorial = 128 runs)
Fractional-factorial experimental designs:
Resolution   # runs   To estimate                          If you can ignore
III          8        Main effects                         All high-order effects
IV           16       Main effects                         3rd-order and higher
V            64       Main effects + 2-way interactions    3rd-order and higher
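As an illustration of the 8-run Resolution III row, here is a sketch that builds such a design for 7 two-level factors and estimates main effects from it. The generators used are a standard textbook choice and are an assumption, not taken from the slides; the function names are hypothetical.

```python
from itertools import product

def resolution_iii_design():
    """8-run design for 7 two-level factors.

    Assumed generators: x4 = x1*x2, x5 = x1*x3, x6 = x2*x3, x7 = x1*x2*x3.
    """
    runs = []
    for x1, x2, x3 in product((-1, 1), repeat=3):
        runs.append((x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3))
    return runs

def main_effects(runs, responses):
    """Estimate each main effect as mean(Y | x = +1) - mean(Y | x = -1)."""
    half = len(runs) / 2
    return [sum(x[k] * y for x, y in zip(runs, responses)) / half for k in range(7)]

design = resolution_iii_design()
# Hypothetical noise-free response with two active factors (x1 and x5).
responses = [3.0 + 2.0 * x[0] - 1.0 * x[4] for x in design]
effects = main_effects(design, responses)
```

Each effect estimate equals twice the corresponding coefficient because the factor levels are -1 and +1. The estimates are only trustworthy when interactions are negligible, which is the "if you can ignore" condition in the table.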
Running experiments in Splash
Goal
▪ Provide a facility that gives the illusion of executing one coherent simulation model
Main Challenges
▪ Automate the coordination between experiment conditions and inputs to different submodels.
▪ Automate the combination of different replications of different submodels.
Example: Healthcare Payer Model
Composition of two models
▪ Emory/Georgia Tech Predictive Health Institute model [Park et al. 2012]
– Simple agent-based model of prevention and wellness program
– For investigation of payment systems (capitated vs outcome-based)
▪ Simple logarithmic random walk model of interest & inflation rates
Experiment Manager (Specifying Experimental Factors)
SADL:
<attribute name="paymentModel"
           measurement_type="numerical"
           missing_data="0"
           experiment_default_values=""
           experiment_factor="true"
           datatype="double"
           random_seed="false" />
GUI collects simulation parameters from all component models (experiment_factor = TRUE in SADL file)
User selects subset of parameters as experiment factors
User selects values for each experiment factor
Experiment Design in Splash
Design Persistence (EML):
<model name="PHI">…
<factor name="Tage">
<values>"65"</values>
<values>"85"</values>
</factor>…
<rep n="10">…
</experiment>
Editable design (factor values and # of Monte Carlo reps for each condition)
Execution Engine
Experiment Manager (Running an Experiment)
Experiment Manager invokes Splash execution engine to run experiments
Technical challenges include:
▪ Routing parameter values to models
– Different sources: command line args, parameter files, stdin, …
– Synthesizing the parameter files that a model expects (templating)
▪ Managing PRNG seeds
– Avoiding cycle overlaps
– PRNG info in SADL file
– Diagnostics (future work)
Intermediate and final outputs can be saved in a file tree for
- Provenance tracking
- Traceability
- Drill down
Template-Based Data File Generation Process
Needed by Experiment Manager for file synthesis
Data SADL File:
…
<attributes>
<attribute name=temperature Datatype=numeric…/>
<attribute name=pressure Datatype=numeric…/>
…
Template File:
Input data for city of Detroit
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
Input data for city of Chicago
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
Input Values:
50.2, 25.1
48.7, 32.1
…
Data File Generator output (Input Data File):
Input data for city of Detroit
Temperature=50.2
Pressure=25.1
Input data for city of Chicago
Temperature=48.7
Pressure=32.1
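A minimal sketch of template-based file generation. The slot syntax is read off the slide; how Splash actually interprets it is not shown, so two assumptions are made: the `%0.1` marker is treated as the printf format `%0.1f`, and input rows are consumed in attribute order. `fill_template` is a hypothetical name.

```python
import re

# $$name$$ marks a value slot; &&fmt&& carries a printf-style precision.
SLOT = re.compile(r"\$\$(\w+)\$\$&&(%[\d.]+)&&")

TEMPLATE = """Input data for city of Detroit
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
Input data for city of Chicago
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
"""

def fill_template(template, attributes, rows):
    """attributes: names in SADL order; rows: one tuple of values per record."""
    flat = iter([(name, value) for row in rows for name, value in zip(attributes, row)])

    def substitute(match):
        name, value = next(flat)
        if name != match.group(1):
            raise ValueError("slot %r does not match attribute %r" % (match.group(1), name))
        # Assumption: "%0.1" abbreviates the printf format "%0.1f".
        return (match.group(2) + "f") % value

    return SLOT.sub(substitute, template)

data_file = fill_template(TEMPLATE, ["temperature", "pressure"],
                          [(50.2, 25.1), (48.7, 32.1)])
```

Keeping the format inside the template means the model's expected file layout is described in one place, separate from the values supplied by an experiment.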
Template-Based Data Extraction Process
Needed to extract performance measures of interest for optimization, visualization, etc.
Template File:
Input data for city of Detroit
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
Input data for city of Chicago
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
Unstructured Data File:
Input data for city of Detroit
Temperature=50.2
Pressure=25.1
Input data for city of Chicago
Temperature=48.7
Pressure=32.1
Data Extractor output (Extracted Values):
50.2, 25.1
48.7, 32.1
…
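The reverse direction can be sketched by turning the template into a regular expression in which every slot becomes a capture group. This is an illustrative approach under the assumption that all slot values are plain decimal numbers; `extract_values` is a hypothetical name.

```python
import re

SLOT = re.compile(r"\$\$(\w+)\$\$&&%[\d.]+&&")

TEMPLATE = """Input data for city of Detroit
Temperature=$$temperature$$&&%0.1&&
Pressure=$$pressure$$&&%0.1&&
"""

DATA = """Input data for city of Detroit
Temperature=50.2
Pressure=25.1
"""

def extract_values(template, data):
    """Recover (name, value) pairs from a data file that follows the template."""
    parts, names, pos = [], [], 0
    for match in SLOT.finditer(template):
        parts.append(re.escape(template[pos:match.start()]))  # literal text
        parts.append(r"([-+]?\d+(?:\.\d+)?)")                 # one numeric slot
        names.append(match.group(1))
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    found = re.fullmatch("".join(parts), data)
    if found is None:
        raise ValueError("data file does not match the template")
    return [(name, float(text)) for name, text in zip(names, found.groups())]

values = extract_values(TEMPLATE, DATA)
```

Using the same template for generation and extraction keeps the two directions consistent by construction.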
Efficient Sensitivity Analysis
▪ Main-effects plots:
– High/low values
– Orthogonal fractional factorial experiment design (160 vs 2560 runs)
PHI healthcare payer model + interest-rate model (Park et al., Service Science, 2012)
Identify the most important profit drivers (CapAmt & Tage)
Check statistical significance of graphical results
Optimization Functionality: Ranking and Selection
▪ Rinott procedure for finding best among small number of designs
▪ Executes min. # of runs needed to distinguish systems
▪ Selects design within $\delta$ of optimum with probability $\ge C$
Procedure (slide annotations in brackets):
For $i = 1$ to $k$:
  Execute $n_0$ stage-1 replications of model $i$ to obtain $Y_{i,1}, \ldots, Y_{i,n_0}$ [equal number of stage-1 replications per design]
  Set $\bar X_i = (1/n_0) \sum_{j=1}^{n_0} Y_{i,j}$ and $V_i = (1/(n_0 - 1)) \sum_{j=1}^{n_0} (Y_{i,j} - \bar X_i)^2$
  Set $N_i = \max\bigl(n_0, \lceil h^2 V_i / \delta^2 \rceil\bigr)$, where $h$ is a tabulated constant that depends on $C$ [system determines number of stage-2 replications]
  Execute $N_i - n_0$ additional replications of model $i$ to obtain $Y_{i,n_0+1}, \ldots, Y_{i,N_i}$
  Compute $\bar Y_i = (1/N_i) \sum_{j=1}^{N_i} Y_{i,j}$
Select system with largest value of $\bar Y_i$ as the best system [results are combined and ranked]
Compute MCB intervals for $i = 1, 2, \ldots, k$ [simultaneous 100C% MCB conf. intervals]:
\[ a_i = \min\bigl(0, \max_{j \ne i} \bar Y_j - \bar Y_i - \delta\bigr) \quad\text{and}\quad b_i = \max\bigl(0, \max_{j \ne i} \bar Y_j - \bar Y_i + \delta\bigr) \]
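A compact sketch of the two-stage procedure and the MCB intervals. The constant `h` must come from published tables; the value 3.0 used below is an arbitrary placeholder, not a tabulated value, and the three Gaussian simulators are hypothetical stand-ins for composite models.

```python
import math
import random

def rinott_select(simulators, n0, h, delta, seed=0):
    """Two-stage selection of the system with the largest mean.

    simulators: list of functions rng -> one replication output Y.
    h: Rinott's constant, supplied by the caller from published tables.
    delta: indifference-zone width.
    Returns (best_index, overall_means, total_runs_per_system).
    """
    rng = random.Random(seed)
    means, totals = [], []
    for simulate in simulators:
        ys = [simulate(rng) for _ in range(n0)]               # stage 1
        x_bar = sum(ys) / n0
        v = sum((y - x_bar) ** 2 for y in ys) / (n0 - 1)      # sample variance
        n_i = max(n0, math.ceil(h * h * v / (delta * delta)))
        ys += [simulate(rng) for _ in range(n_i - n0)]        # stage 2
        means.append(sum(ys) / n_i)
        totals.append(n_i)
    best = max(range(len(means)), key=means.__getitem__)
    return best, means, totals

def mcb_intervals(means, delta):
    """MCB intervals [a_i, b_i] on (best of the others) minus (system i)."""
    intervals = []
    for i, y_i in enumerate(means):
        diff = max(y for j, y in enumerate(means) if j != i) - y_i
        intervals.append((min(0.0, diff - delta), max(0.0, diff + delta)))
    return intervals

# Hypothetical systems with true means 1, 2, and 5 and unit variance.
simulators = [lambda rng, mu=mu: rng.gauss(mu, 1.0) for mu in (1.0, 2.0, 5.0)]
best, means, totals = rinott_select(simulators, n0=10, h=3.0, delta=0.5)
intervals = mcb_intervals(means, delta=0.5)
```

Systems with noisier stage-1 output receive more stage-2 replications, which is how the procedure limits the total number of runs.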
Results for PHI Profitability: Estimated Best System
“Conditions” = payment schemes for wellness program (0 = full capitation, 1 = pay-for-outcome)
Look at weighted schemes: 0.1, 0.2, … , 0.9
PHI healthcare payer model + interest-rate model (Park et al., Service Science, 2012)
With prob = 95%, C5 = 0.5 is the “best system” (within indifference zone = $250K)
Results Continued: Multiple Comparisons with the Best
Identifies set of best solutions
Simultaneous 95% confidence intervals on difference between each system and best of others
Simulation Metamodeling (Joint Work with SJSU CAMCOS)
“Simulation on demand”
1. Run simulations in advance to get values at multiple “design points”
2. Fit a (stochastic) response surface
3. Decision maker can explore surface in real time
4. Can apply stochastic optimization techniques to find peaks and valleys
5. Can use for factor screening
Technique: Stochastic Kriging (Ankenman et al., Oper. Res., 2010)
▪ Robust, global fit
▪ Gives approximate model response + uncertainty estimates (MSE)
▪ Efficient allocation of runs to minimize integrated mean-square error (IMSE)
▪ Metamodel added to Splash repository
Models uncertainty due to both interpolation and simulation variability
[Image: SJSU CAMCOS]
Assessment of PHI metamodel
Metamodel gives good approximation to real results (1.6% error in this example)
Faster by over two orders of magnitude
Factor screening (Joint with SJSU CAMCOS)
Goal: identify most important subset of drivers
▪ Drivers captured in metamodel parameters
Ex: Linear models $Y(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_7 x_7 + \epsilon$
▪ Main effects used for screening
▪ For Gaussian noise, positive effects: sequential bifurcation
Ex: Gaussian process models $Y_j(x) = \beta_0 + M(x) + \epsilon_j(x)$
▪ Special case of stochastic kriging
▪ $\epsilon_j(x)$ = simulation noise
▪ $M(x)$ = interpolation uncertainty, modeled as Gaussian field
– For any $x_1, x_2, \ldots, x_r$, vector $V = (M(x_1), \ldots, M(x_r))$ is multivariate normal
– $\mathrm{Cov}[M(x_i), M(x_j)] = \tau^2 \prod_{k=1}^{n} \exp\bigl(-\theta_k (x_{i,k} - x_{j,k})^2\bigr)$
▪ Small $\theta_k$ ⇒ small effect of $k$th factor
▪ Bayesian “posterior quantiles” method for screening
Some Potential Splash Applications
Multi-level, End-to-End Modeling
[Figure: multi-level view of healthcare delivery with numbered elements 1 to 6 and a Policy “Flight Simulator” acting through Lever1, Lever2, and Lever3]
Levels: Healthcare Ecosystem (Society); System Structure (Organizations); Delivery Operations (Processes); Clinical Practices (People)
Models: Socio-Economic Models; Business Models; Careflow Models (Flow of Patients, Money, Information); Disease Progression Models; Personalized Medicine (Targeted interventions)
Rouse, W. B. & Cortese, D. A. (2010). Introduction, in W. B. Rouse & D. A. Cortese (Eds.), Engineering the System of Healthcare Delivery. IOS Press.
Cross-domain, Syndemic Modeling
Richard Rothenberg et al., Georgia State University, 2011
Composite model for traffic safety
[Figure: composite-model dataflow. Legend: component model, data source, transformations, data flow]
Component models: IBM Deep Thunder weather model; IBM Megaffic traffic simulation model; Emergency Response Model (Client); Geographic Model (ESRI)
Data sources: Weather Data; Collision Data; Volume Data; Emergency Data; GIS Data; Demographic Data
Outputs: Collision Heatmaps; Impact of… x …on collisions (Heavy Snow, Game Day, Pay Day, Combined)
Intervention Scenarios
A: Roadway design changes
B: Placement of variable speed limits
C: Enforcement
Open Research Questions
How to Determine User Requirements?
Common to Analysts and Scientists
▪ Examine schemas (data) and variables (models) prior to selection
▪ Compare output of simulation results to examine tradeoffs and simulation selection
▪ Dashboard with summary of models and data sources used to run a simulation
Specific to Analysts
▪ Guidance and recommendations
▪ Pre-defined templates for simulation set-up and analyzing simulation output
▪ Recommendations for what template to use and the steps to run a simulation
▪ Recommended output visualization: suggest when one chart style would explain relationships in the data better than another
Specific to Scientists
▪ Feature to assess the veracity and provenance of model and data sources
▪ Ability to upload their own sources to supplement the existing sources
▪ High levels of interaction with the models & data when previewing search results prior to running the simulation
Database Research++
▪ Data search → model-and-data search
– Find compatible models, data, and mappings (using metadata)
– Involves semantic search technologies, repository management, privacy and security
▪ Data integration → model integration
– Simulation-oriented data mapping
– Geospatial alignment [e.g., Howe & Maier 2005]
– Hierarchical models with different resolutions
– Complex data transformations (e.g., raw simulation output to histogram)
▪ Query optimization → simulation-experiment optimization
– Optimally configure workflow among distributed data and models
– Factoring common operations across different mappings in the workflow
– Avoiding redundant computations across experiments (e.g., result caching)
– Statistical issues: managing pseudorandom numbers and Monte Carlo replications
Some Deep Problems
▪ Causality approximation
– Fixed-point + perturbation approaches
– System support
– Theoretical support
[Figure: coupled Transportation Model and Buying & Eating Model]
\[ f_n(t) = \phi_1\bigl(f_n(t), g_{n-1}(t)\bigr), \qquad g_n(t) = \phi_2\bigl(f_{n-1}(t), g_n(t)\bigr) \]
\[ f(t) = \phi_1\bigl(f(t), g(n\Delta t)\bigr), \qquad g(t) = \phi_2\bigl(f(n\Delta t), g(t)\bigr) \qquad \text{for } t \in [n\Delta t, (n+1)\Delta t) \]
▪ Deep collaborative analytics
– Visualizing and mining the results
– Understanding and explaining results:
• Provenance [e.g., J. Freire et al.]
• Root-cause analysis
– Trusting results
• Model validation
• ManyEyes++, Swivel++
Conclusion
▪ Splash:
– Composition of heterogeneous models and data to support cross-disciplinary decision making in complex systems
– Loose coupling of models through data exchange
– Combines data-integration, simulation, and workflow technologies
▪ Key features
– SADL metadata language for curation and functionality
– Automated detection of data mismatches
– Semi-automated design of scalable data transformations (schema and time alignment)
– Runtime accelerators
• MapReduce framework for scalable data transformations
• Map-only Hadoop method for cubic-spline interpolation
• Result-caching to minimize # of model executions
– Experiment-manager allows sensitivity analysis, factor screening and optimization
– Simulation metamodeling for real-time model exploration
▪ Many open research questions!
Questions?
Splash project page:
http://researcher.watson.ibm.com/researcher/view_project.php?id=3931
Backup Slides
Splash Technology for Loose Coupling via Data Exchange
▪ SADL metadata language
▪ Design-time components:
– Kepler adapted for model composition
▪ Run-time components:
– Kepler adapted for model execution
– Experiment Manager (sensitivity analysis, metamodeling, optimization)
▪ Data transformation tools:
– Clio++
– Time Aligner (MapReduce algorithms)
– Templating mechanism
Distributed SGD, Continued
▪ Divide the m-1 rows into three strata: $U^1, U^2, U^3$
▪ Decompose loss function:
$L(x) = \tfrac{1}{3} L^1(x) + \tfrac{1}{3} L^2(x) + \tfrac{1}{3} L^3(x)$, where $L^s(x) = 3 \sum_{i \in U^s} L_i(x)$
▪ Define (random) stratum sequence $\gamma_1, \gamma_2, \ldots$
▪ Execute SGD w.r.t. $L^{\gamma_k}$ at the $k$th step in parallel
▪ Theorem: Suppose that $x^* = A^{-1} b$ exists and
- $\epsilon_n = O(n^{-\alpha})$ for some $\alpha \in (0.5, 1)$
- $(\epsilon_n - \epsilon_{n+1}) / \epsilon_n = O(\epsilon_n)$
- $\{\gamma_n : n \ge 0\}$ is regenerative with $E[\tau_1^{1/\alpha}] < \infty$ and $E[X_1(s)] = 0$
Then $x^{(n)} \to x^*$ with probability 1
- Stratum sequence occasionally restarts probabilistically
- Time $\tau$ between restarts has finite $1/\alpha$ moment
- Sequence spends ≈1/3 of its time on each stratum
▪ Proof: [GHS11] + Liapunov-function argument
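The stratified scheme above can be sketched sequentially in a few lines. This is a minimal illustration, not the Splash implementation. Assumptions: the system is tridiagonal, the row loss is $L_i(x) = (a_i \cdot x - b_i)^2$, stratum s holds the rows with index i mod 3 = s, and the stratum sequence is a fresh random order of the three strata in every round. The names `stratified_sgd` and `tridiagonal_rhs` are hypothetical.

```python
import random


def tridiagonal_rhs(lower, diag, upper, x):
    """Compute b = A x for a tridiagonal A stored as three lists."""
    m = len(x)
    b = []
    for i in range(m):
        value = diag[i] * x[i]
        if i > 0:
            value += lower[i] * x[i - 1]
        if i < m - 1:
            value += upper[i] * x[i + 1]
        b.append(value)
    return b


def stratified_sgd(lower, diag, upper, b, rounds=7000, eps0=0.01, alpha=0.6, seed=0):
    """Sequential simulation of stratified SGD for a tridiagonal A x = b.

    Row i of A holds lower[i] (column i-1), diag[i] (column i), upper[i] (column i+1).
    Assumed row loss: L_i(x) = (a_i . x - b_i)^2.
    """
    rng = random.Random(seed)
    m = len(b)
    # Assumed strata: rows i with i % 3 == s. Rows within one stratum touch
    # disjoint unknowns, so a real system could update them in parallel.
    strata = [list(range(s, m, 3)) for s in range(3)]
    # Initial guess: ignore the off-diagonal elements.
    x = [b[i] / diag[i] for i in range(m)]
    n = 0
    for _ in range(rounds):
        order = [0, 1, 2]
        rng.shuffle(order)  # the stratum sequence restarts every round
        for s in order:
            n += 1
            eps = eps0 * n ** (-alpha)  # step size eps_n = O(n^-alpha)
            rows = strata[s][:]
            rng.shuffle(rows)  # random visiting order within the stratum
            for i in rows:
                residual = diag[i] * x[i] - b[i]
                if i > 0:
                    residual += lower[i] * x[i - 1]
                if i < m - 1:
                    residual += upper[i] * x[i + 1]
                # Gradient of L^s = 3 * sum L_i, restricted to row i.
                scale = eps * 3.0 * 2.0 * residual
                x[i] -= scale * diag[i]
                if i > 0:
                    x[i - 1] -= scale * lower[i]
                if i < m - 1:
                    x[i + 1] -= scale * upper[i]
    return x
```

With a diagonally dominant matrix (for example diagonal 4 and off-diagonals 1) the iterates approach the exact solution; the default `eps0` is chosen for that scale and would need retuning for other matrices.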
Hadoop Implementation
▪ Physical blocks and logical splits
– InputFormat operator creates splits (one split per mapper)
– A split is mostly on one block
– Splits are usually disjoint
– Map job: each mapper first obtains all split data (small amount of data movement)
– Reduce job: massive shuffling of data over network
▪ We allow splits to overlap by two rows
▪ DSGD is implemented as a map-only job (no data shuffling!)
[Figure: rows 1–13 of the linear system (coefficients a_{i,j}, right-hand sides b_i, unknowns x_i) divided between split 1 and split 2; stratum s = 1 shown; mapper 2 modifies x7]
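The overlapping-split idea can be illustrated with a small helper that computes row ranges. This is a sketch under assumptions, not the actual InputFormat code: the function name `overlapping_splits` is hypothetical, and which rows are shared (here, each split is extended past its disjoint boundary by `overlap` rows) is a guess at the layout in the figure.

```python
def overlapping_splits(num_rows, num_splits, overlap=2):
    """Return half-open row ranges [start, end), one per mapper.

    Assumption: every split is extended past its disjoint boundary by
    `overlap` rows (clamped to the table size), so neighbouring splits
    share exactly those rows.
    """
    size = -(-num_rows // num_splits)  # ceiling division
    splits = []
    for k in range(num_splits):
        start = k * size
        end = min(num_rows, (k + 1) * size + overlap)
        splits.append((start, end))
    return splits


def shared_rows(first, second):
    """Rows that appear in both half-open ranges."""
    return list(range(max(first[0], second[0]), min(first[1], second[1])))
```

For 13 rows and two mappers this gives two ranges that share two rows, so a mapper can read the neighbouring unknowns it needs without a shuffle phase.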
Hadoop Implementation
[Figure: rows 1–13 of the linear system (coefficients a_{i,j}, right-hand sides b_i, unknowns x_i) divided between split 1 and split 2; stratum s = 2 shown; mapper 2 modifies x7]
Hadoop Implementation
[Figure: rows 1–13 of the linear system (coefficients a_{i,j}, right-hand sides b_i, unknowns x_i) divided between split 1 and split 2; stratum s = 3 shown; x7 affects mapper 1]
Other Implementation Details
▪ Initial guess
– Ignore off-diagonal elements
– Works well due to “diagonal dominance”
▪ Stratum sequence as in [GHS11]
– Meander in a stratum for a while, then jump to next stratum
– Tension between thorough exploration of stratum and randomness
– Visit all k rows in stratum: at each “sub-epoch” select one of the k! orders at random
– Similar strategy for jumping between strata
– Convergence Theorem still applies
▪ Step-size sequence
– Constant during sub-epoch
– “Bold driver” heuristic
– Experiment with initial step size (in parallel on small subsequences)
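The “bold driver” heuristic named above is commonly implemented as: grow the step size slightly after a sub-epoch that lowered the loss, and cut it sharply after one that raised it. The sketch below shows that rule on a toy one-dimensional problem. The growth and shrink factors (5% and 50%) and the function names are illustrative assumptions, not values taken from the slides.

```python
def bold_driver_update(step, prev_loss, new_loss, grow=1.05, shrink=0.5):
    """Return the step size for the next sub-epoch.

    The step stays constant within a sub-epoch and is adapted only between
    sub-epochs: a small increase after progress, a large cut otherwise.
    """
    if new_loss < prev_loss:
        return step * grow
    return step * shrink


def minimize_quadratic(x0=10.0, step=0.9, sub_epochs=50):
    """Toy usage: gradient descent on f(x) = x^2 with bold-driver steps."""
    x = x0
    prev_loss = x * x
    for _ in range(sub_epochs):
        x = x - step * 2.0 * x  # one "sub-epoch" with a constant step
        new_loss = x * x
        step = bold_driver_update(step, prev_loss, new_loss)
        prev_loss = new_loss
    return x, step
```

Starting from a deliberately large step, the rule overshoots a few times, halves the step once the loss goes up, and then converges quickly.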
Optimizing the Re-Use Factor for Maximum Efficiency
To define (asymptotic) efficiency, consider budget-constrained setting [Fox & Glynn 1990; Glynn & Whitt 1992]
▪ Cost of producing n outputs from Model 2:
$C_n = \sum_{j=1}^{m_n} \tau_{1;j} + \sum_{j=1}^{n} \tau_{2;j}$, where $\tau_{i;j}$ = (random) cost of producing the $j$th observation of $Y_i$
▪ Under (large) fixed computational budget c
– Number of Model 2 outputs produced: $N(c) = \max\{n \ge 0 : C_n \le c\}$
– Estimator: $U(c) = \theta_{N(c)} = N(c)^{-1} \sum_{j=1}^{N(c)} Y_{2;j}$
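The budget-constrained quantities can be made concrete with a small sketch. Assumptions, none of which are stated on the slide: m_n = ceil(alpha * n) Model 1 runs are performed for n Model 2 runs, Model 2 cycles through the cached Model 1 outputs in order, and the per-run costs are supplied as pre-drawn lists. All function and parameter names are hypothetical.

```python
import math


def num_outputs_within_budget(c, alpha, tau1, tau2):
    """N(c) = max{n >= 0 : C_n <= c}.

    C_n = sum of the first m_n Model 1 costs plus the first n Model 2 costs,
    with m_n = ceil(alpha * n) (an assumption of this sketch).
    """
    best = 0
    for n in range(1, len(tau2) + 1):
        m_n = math.ceil(alpha * n)
        if m_n > len(tau1):
            break
        cost = sum(tau1[:m_n]) + sum(tau2[:n])
        if cost > c:
            break  # C_n is nondecreasing in n, so no larger n can fit
        best = n
    return best


def budget_constrained_estimate(c, alpha, model1, model2, tau1, tau2):
    """Return (U(c), N(c)): the average of the N(c) Model 2 outputs.

    model1() produces one Model 1 output; model2(y1) consumes one.
    """
    n = num_outputs_within_budget(c, alpha, tau1, tau2)
    if n == 0:
        return None, 0
    m_n = math.ceil(alpha * n)
    cached = [model1() for _ in range(m_n)]
    # Cycle through the cached Model 1 outputs (result re-use).
    outputs = [model2(cached[i % m_n]) for i in range(n)]
    return sum(outputs) / n, n
```

With alpha = 1 every Model 2 run gets a fresh Model 1 output; smaller alpha trades Model 1 cost for correlation among the Model 2 outputs.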
Optimizing the Re-Use Factor II
The key limit theorem as budget increases to infinity
Suppose that $E[\tau_1 + \tau_2 + Y_2^2] < \infty$. Then $U(c)$ is asymptotically $N(\theta,\, g(\alpha)/c)$, where $r = \lfloor 1/\alpha \rfloor$ and
$g(\alpha) = \bigl(\alpha E[\tau_1] + E[\tau_2]\bigr)\,\bigl(\mathrm{Var}[Y_2] + [2r - \alpha r(r+1)]\,\mathrm{Cov}[Y_2, Y_2']\bigr)$
= (cost per obs.) x (contributed variance per obs.)
$\mathrm{Cov}[Y_2, Y_2']$ = covariance of two Model 2 outputs that share a Model 1 input
▪ Thus, minimize $g(\alpha)$ [or maximize asymptotic efficiency = $1/g(\alpha)$]
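As a rough illustration, g(α) can be evaluated directly and minimized over a grid of re-use factors in (0, 1]. The function names, the grid search, and the numbers in the usage note are assumptions for the example; in practice the means, variance, and covariance would have to be estimated.

```python
import math


def g(alpha, mean_cost1, mean_cost2, var_y2, cov_y2):
    """g(alpha) = (cost per observation) x (contributed variance per observation)."""
    r = math.floor(1.0 / alpha)
    cost_per_obs = alpha * mean_cost1 + mean_cost2
    var_per_obs = var_y2 + (2 * r - alpha * r * (r + 1)) * cov_y2
    return cost_per_obs * var_per_obs


def best_reuse_factor(mean_cost1, mean_cost2, var_y2, cov_y2, grid_size=1000):
    """Grid search for the alpha in (0, 1] that minimizes g(alpha)."""
    candidates = [k / grid_size for k in range(1, grid_size + 1)]
    return min(candidates, key=lambda a: g(a, mean_cost1, mean_cost2, var_y2, cov_y2))
```

For example, with E[τ1] = 10, E[τ2] = 1, Var[Y2] = 1 and Cov[Y2, Y2'] = 0.1, no re-use (α = 1) gives g = 11, while α = 0.5 gives g = 6.6, so re-using each expensive Model 1 output twice already improves the asymptotic efficiency.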
Proof Outline
▪ Set $W_{n,j} = \sum_{i=1}^{n} Y_{2;i}\, I[\text{input for } i\text{th run of Model 2 is } Y_{1;j}]$
▪ Thus $\theta_n = \frac{1}{n} \sum_{j=1}^{m_n} W_{n,j} = \frac{m_n}{n} \cdot \frac{1}{m_n} \sum_{j=1}^{m_n} W_{n,j}$
▪ By Theorem 1 in [Glynn & Whitt 1992], it suffices to show that
– $C_n / n \to \alpha c_1 + c_2$ a.s., with $c_i = E[\tau_i]$ (straightforward to show)
– $W_{n,1}, W_{n,2}, \ldots, W_{n,m_n}$ obeys a “Lindeberg-Feller” FCLT
▪ Can establish standard “Lindeberg condition” which suffices for FCLT (Billingsley 1999)
▪ Some additional fussy details due to the cycling through Model 1 outputs
[Figure: Model 2 outputs grouped by shared Model 1 input into $W_{n,1}, W_{n,2}, W_{n,3}, W_{n,4}$; $W_{n,j}$ and $W_{n,j'}$ are independent for $j \ne j'$]
Point and Interval Estimates
Typical scenarios:
▪ Compute 100(1 − δ)% confidence interval for θ under fixed budget c
▪ Estimate θ to within ±100ε% with probability 100(1 − δ)%
Issue: n is unknown a priori (so can’t compute $m_n$)
▪ Solution: estimate n from $n_0$ pilot (or prior) runs
▪ Can show:
$\sqrt{n}\,(\theta_n - \theta) / h_n^{1/2}(\alpha) \Rightarrow N(0,1)$, where $h_n(\alpha) = n^{-1} \sum_{j=1}^{m_n} \bigl(W_{n,j}^{(c)}\bigr)^2$ and $W_{n,j}^{(c)}$ is the "centered" version of $W_{n,j}$,
so that the CI from n runs is $\bigl[\theta_n - z_\delta\, h_n^{1/2}(\alpha) / n^{1/2},\ \theta_n + z_\delta\, h_n^{1/2}(\alpha) / n^{1/2}\bigr]$,
where $z_\delta$ is the $1 - \delta/2$ normal quantile
▪ Can set
– $n = c / (\alpha c_1 + c_2)$ for fixed budget
– $n = h_{n_0}(\alpha)\,\bigl(z_\delta / (\epsilon\, \theta_{n_0})\bigr)^2$ for fixed precision
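The interval and run-length formulas can be turned into a few helper functions. This is a sketch with hypothetical names. It assumes the variance term h has already been computed from the centered W values, that z is the 1 − δ/2 standard normal quantile for a 100(1 − δ)% interval, and that the run counts are rounded to integers (down for a budget, up for a precision target).

```python
import math
from statistics import NormalDist


def z_quantile(delta):
    """Standard normal quantile for a two-sided 100(1 - delta)% interval."""
    return NormalDist().inv_cdf(1.0 - delta / 2.0)


def confidence_interval(theta_n, h_n, n, delta):
    """CI from n runs: theta_n +/- z * sqrt(h_n / n)."""
    half_width = z_quantile(delta) * math.sqrt(h_n / n)
    return theta_n - half_width, theta_n + half_width


def runs_for_budget(c, alpha, c1, c2):
    """Fixed budget: n = c / (alpha * c1 + c2), rounded down."""
    return math.floor(c / (alpha * c1 + c2))


def runs_for_precision(h_pilot, theta_pilot, eps, delta):
    """Fixed precision: n = h * (z / (eps * theta))^2, rounded up.

    h_pilot and theta_pilot come from the n0 pilot runs.
    """
    return math.ceil(h_pilot * (z_quantile(delta) / (eps * theta_pilot)) ** 2)
```

For instance, pilot values h = 4 and θ = 10 with a ±1% target at 95% confidence call for roughly 1,500 runs.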
Interface to R system for experimental design
| Method | Provider | Notes |
| --- | --- | --- |
| Full Factorial Design | Experiment Manager | Simple, fast design generation; exhaustive factor combinations -> slow execution |
| Planor Fractional Factorial Design | R – planor package | Supports arbitrary factor levels; leverages R design generation; checks statistical feasibility of user’s proposed design; slow design generation, fast experiment execution |
| Auto Planor Fractional Factorial Design | R – planor package | Supports arbitrary factor levels; leverages R design generation; automatically finds smallest feasible experiment; slower design generation, fast experiment execution |
| FRF2 Fractional Factorial Design | R – FrF2 package | Only supports 2-level factors; fast generation; fast execution; http://cran.r-project.org/web/packages/FrF2/FrF2.pdf |
| Custom | User Specified | Any design above may be used as basis |
planor documentation:
http://cran.r-project.org/web/packages/planor/vignettes/PlanorInRmanual.pdf
http://cran.r-project.org/web/packages/planor/vignettes/planorVignette.pdf
As new designs are introduced in R, the interface is in place to take advantage of these.
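To see why a full factorial design is quick to generate but slow to execute, it helps to enumerate one. The sketch below is plain Python, not the Experiment Manager or the R interface; the function name and the factor names in the usage note are hypothetical.

```python
from itertools import product


def full_factorial(factors):
    """Enumerate every combination of factor levels.

    `factors` maps a factor name to its list of levels; the result has one
    dict per experimental run, so its length is the product of the level counts.
    """
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]
```

For example, `full_factorial({"traffic": ["low", "high"], "price": [1, 2, 3]})` yields 2 x 3 = 6 runs; ten factors with three levels each would already require 59,049 runs, which is the motivation for the fractional designs in the table.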
Standard Kriging
[Figure: standard kriging fit, with labels Y and M; M = extrinsic uncertainty. Images: SJSU]
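Stochastic Kriging
[Figure: stochastic kriging fit, with labels Y and M plus an additional term for intrinsic uncertainty; the MLE estimate is given in terms of Y and covariance matrices Σ. Images: SJSU]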
Optimization Process Flow
- Optimizer is R code
- Orchestration via Python scripts
[Figure legend: symbol = template-based data extraction]