Collaborative Modeling, Simulation, and Analytics with Splash

Nicole Barberis, Peter J. Haas, Cheryl Kieliszewski, Yinan Li, Paul Maglio, Piyaphol Phoungphol, Pat Selinger, Yannis Sismanis, Wang-Chiew Tan, Ignacio Terrizzano, Haidong Xue, and the SJSU CAMCOS team
IBM Research – Almaden

Splash: Smarter Planet Platform for Analysis and Simulation of Health
http://researcher.watson.ibm.com/researcher/view_project.php?id=3931

© 2011, 2012 IBM Corporation

Some Context: Model-Data Ecosystems

My Two Communities
– Modeling and Simulation
– Information Management & Analytics

Opportunities for Innovation at the Intersection
– MCDB & SimSQL
– Splash

Some Further Thoughts and Examples [PODS 2014 Tutorial]
(In addition to large-scale scientific environments)
– Data-intensive simulation
  • Simulations within databases
  • Databases within simulations
  • Data harmonization at scale
– Information integration
  • Simulation as an information-integration tool
  • Combining real and simulated data
– And more!
Motivation for Splash

The Setting: Analytics for Decision Support
"Analytics is … a complete [enterprise] problem solving and decision making process"
– Descriptive analytics: finding patterns and relationships in historical and existing data
– Predictive analytics: predicting future probabilities and trends to allow what-if analysis (the focus of Splash)
– Prescriptive analytics: deterministic and stochastic optimization to support better decision making

Shallow Versus Deep Predictive Analytics
[Figure: extrapolation of 1970–2006 median U.S. house prices ($0–$275,000) versus actual prices through 2010, contrasted with the NCAR Community Atmosphere Model (CAM) as an example of deep, mechanism-based prediction]

Big, Difficult, Important Problems Span Many Disciplines
Need collaborative cross-disciplinary modeling and simulation.
[Figure: global system-of-systems totaling $54 trillion (100% of worldwide 2008 GDP) — Infrastructure $12.54 Tn, Leisure/Recreation/Clothing $7.80 Tn, Transportation $6.95 Tn, Govt. & Safety $5.21 Tn, Food $4.89 Tn, Finance $4.58 Tn, Healthcare $4.27 Tn, Communication $3.96 Tn, Electricity $2.94 Tn, Education $1.36 Tn, Water $0.13 Tn; system inputs: same industry, business support, IT systems, energy, resources, machinery, materials, trade. IBM analysis based on OECD data.]

The food system is complex, and interventions often have unintended and deleterious effects on food security, or have major consequences for GHG emissions. Agricultural, economic, and climate modelers must compare their models more systematically, share results, and integrate their work to meet the needs of policy-makers.
Health is a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity.

Example: Unintended Outcomes in Healthcare Optimization
[Figure: average patient delay over 0–18 months after a re-design at Calgary Lab Services; a simulation model of the service centers is combined with a system-dynamics social model of lab use]
Moral: combine models across disciplines for more robust decision making.
T. R. Rohleder, D. P. Bischak & L. B. Baskin (2007). Modeling patient service centers with simulation and system dynamics. Health Care Manage. Sci., 10:1–12.

Combining Models Across Disciplines is HARD
– Domain experts have different worldviews
– They use different vocabularies
– They sit in different organizations
– They develop models on different platforms
– They don't want to rewrite existing models!
Huang, T. T., Drewnowski, A., Kumanyika, S. K., & Glass, T. A. (2009). A systems-oriented multilevel framework for addressing obesity in the 21st century. Preventing Chronic Disease, 6(3).

Prior Approaches to Combining Models
– Monolithic models: create a single model that encompasses all relevant domains
– Integrated models: create modules that can be compiled into one
  • Spatiotemporal Epidemiological Modeler (STEM)
  • Community Atmosphere Model (CAM)
– Tightly-coupled models: create modules that understand standard interfaces
  • DoD High Level Architecture (HLA)
  • Discrete-Event System Specification (DEVS)
  • Open Modeling Interface (OpenMI)

Splash: An Alternative Approach
– Loosely couple models and data via data exchange
– Splash = data integration + workflow management + simulation
– Re-use heterogeneous models and heterogeneous data that are curated by different domain experts

Some Benefits of Loose Coupling
– Facilitates cross-disciplinary modeling, analytics, and simulation for robust decision making under uncertainty
– Enables re-use of models and datasets
– Encourages comprehensive documentation and curation of models via metadata
– Allows model flexibility:
  • Upgrading to the state of the art
  • Customizing for different users

Splash
A prototype platform and service for integrating existing data, models, and simulations to gain the insight needed for complex decision making related to policy, planning, and investment.
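To make loose coupling via data exchange concrete, here is a minimal, hypothetical sketch (the toy models, field names, and fixed height are my own; this is not the Splash API): two models that never call each other directly, composed purely through data flow, with a mapping step reconciling field names and units between them.

```python
# Hypothetical sketch of loose coupling via data exchange (not the actual
# Splash API): each "model" reads and writes flat records, and a mapping
# actor reconciles schema differences (here, renaming a field and
# converting pounds to kilograms) before the downstream model runs.

def demographic_model():
    # Upstream model: emits weight in pounds.
    return [{"person_id": "p1", "weight_lb": 154.0},
            {"person_id": "p2", "weight_lb": 198.0}]

def lb_to_kg_mapping(records):
    # Mapping actor: schema + unit alignment between the two models.
    return [{"id": r["person_id"], "weight_kg": r["weight_lb"] * 0.45359237}
            for r in records]

def bmi_model(records, height_m=1.75):
    # Downstream model: consumes kilograms; height fixed for illustration.
    return {r["id"]: round(r["weight_kg"] / height_m**2, 1) for r in records}

# Composite run: the models never call each other; only data flows between them.
result = bmi_model(lb_to_kg_mapping(demographic_model()))
print(result)
```

The point of the design is that either model can be swapped for a different implementation, running at a different location, as long as the exchanged records keep the agreed schema.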
Splash Platform
– Components: model and data curation, model and data discovery, model composition, composite-model execution, experiment management, analysis and visualization
– Repository of SADL descriptions for models and data
– Built on DBMS, Hadoop, visualization tools, information-integration tools, and stats packages

Model and Data Curation

Splash Actor Description Language (SADL)
SADL provides "schemas and constraints" for models, transformations, and data, enabling interoperability.
SADL file for data (can exploit XSD):
– Attribute names, semantics, units
– Constraints
– How to access
– Security
– Experiment-management info
SADL file for a model:
– Inputs and outputs (pointers to SADL files for data sources and sinks)
– How to execute (info needed to synthesize the command line)
– Semantics and assumptions
– Provenance (papers, ratings, ownership, security, change history, …)
– RNG info

Example model SADL file:

<Actor name="BMI Model" type="model" model_type="simulation"
       sim_type="continuous-deterministic" owner="Jane Modeler">
  <Description>
    Predict weight change over time based on an individual's energy and food intake.
    Implemented in C. Reference: http://csel/asu.edu/?q=Weight
  </Description>
  <Environment>
    <Variable name="EXEC_DIR" default="/Splash" description="executable directory path"/>
    <Variable name="SADL_DIR" default="/Splash/SADL" description="schema directory path"/>
  </Environment>
  <Execution>
    <Command>$EXEC_DIR/Models/BMIcalc.out</Command>
    <Title>Run BMI model</Title>
  </Execution>
  <Arguments>
    <Input name="demographics" sadl="$SADL_DIR/BMIInput.sadl" description="demographics data"/>
    <Output name="people" sadl="$SADL_DIR/BMIOutput.sadl" description="people's daily calculated BMI"/>
  </Arguments>
</Actor>

Registration: Use Wizards to Create Model and Data SADL Files
– The Model Wizard offers step-by-step guidance to generate a model's SADL file and the command line for invocation
– The Data Wizard generates SADL files for model input and output files

Model Composition

Obesity Example
A composite model built from data sources, simulation models, and data transformations:
– Transportation (VISUM simulation model), fed by GIS data
– Geospatial alignment (transformation)
– Buying and eating (agent-based simulation model), fed by demographic data
– Exercise (stochastic discrete-event simulation), fed by facility data
– Time alignment and data merging (transformation)
– BMI model (differential-equation model), which produces the results

Sample Results
If we open a new "healthy" food store in a "bad" neighborhood…
[Figure: BMI by rich/poor, without the traffic model versus including the traffic model]
* Many assumptions, sample only, your mileage may vary …

Implemented Obesity Example
Models and data can reside at different locations.
– Data actors: input and output files, databases, web services, etc.
– Model actors: simulation, optimization, statistical models
– Mapping actors: data transformations, time and space alignment
– Visualization actors: graphs, reports, etc.

Data Transformations Between Models
– Transformation design tools for structural (schema) and time alignments
– SADL metadata is used to automatically detect mismatches
– Splash generates code for massive-scale transformation on Hadoop at simulation time
– Clio++: schema mapping & unit corrections
– Time Aligner: time-series harmonization

Composite-Model Execution

Executing a Composite Model: The Need for Runtime Efficiency
– A huge parameter space to explore (many model runs)
  • Ex: 3 models × 10 params/model × 2 values/param ⇒ 2^30 (over 1 billion) possible runs
  • Even worse for stochastic models (multiple Monte Carlo replications)
  • Experimental design can help
– Each model run can be extremely time-consuming
  • Large-scale, high-resolution models (e.g., a T-cell biology model, a regional traffic model, the NCAR Community Atmosphere Model, agent-based social models) produce and consume massive amounts of time-series and other data
  • CPU-intensive computations
– Composing models (with data transformations) intensifies the problem

Time Alignment with MapReduce
[Figure, animated over several slides: an irregular source time series starting at s_0 and a regular target time series starting at t_0 whose values must be calculated; a sliding window of size n = 4 moves along the source series]
Supported alignments: interpolation, nearest neighbor, aggregation (since-last, since-start).
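A toy, single-machine sketch of the sliding-window alignment semantics described above (nearest-neighbor lookup and "since-last" aggregation; the function names and window handling are illustrative, not Splash's MapReduce implementation):

```python
import bisect

# Map an irregular source series onto a regular target grid.
# src: sorted list of (time, value); targets: sorted list of target times.

def nearest_neighbor(src, targets):
    times = [t for t, _ in src]
    out = []
    for t in targets:
        i = bisect.bisect_left(times, t)
        # pick whichever adjacent source tick is closer in time
        cands = [j for j in (i - 1, i) if 0 <= j < len(src)]
        j = min(cands, key=lambda j: abs(times[j] - t))
        out.append((t, src[j][1]))
    return out

def since_last(src, targets):
    # Aggregate (here, sum) all source values since the previous target tick.
    out, prev = [], float("-inf")
    for t in targets:
        out.append((t, sum(v for s, v in src if prev < s <= t)))
        prev = t
    return out

src = [(0.0, 1.0), (0.9, 2.0), (2.5, 4.0), (2.7, 1.0)]   # irregular source
grid = [0.0, 1.0, 2.0, 3.0]                               # regular target
print(nearest_neighbor(src, grid))
print(since_last(src, grid))
```

In the actual system, this per-window computation is what gets distributed across mappers so that massive source series can be aligned in parallel.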
Cubic-Spline Interpolation in MapReduce
Recall: the source outputs one tick per two days; the target needs one tick per day.
(Natural) cubic splines are widely used:
– Uniformly approximate f and f'
– Error of O(h^4) as the knot spacing h → 0
– The default method in SAS
Given source and target time series

  S = (s_0, d_0), (s_1, d_1), \ldots, (s_m, d_m)   and   T = (t_0, \hat d_0), (t_1, \hat d_1), \ldots, (t_n, \hat d_n),

the window for t_i is W_i = ((s_j, d_j, \sigma_j), (s_{j+1}, d_{j+1}, \sigma_{j+1})), where [s_j, s_{j+1}) contains t_i, and

  \hat d_i = f(W_i) = \frac{\sigma_j (s_{j+1} - t_i)^3}{6 h_j} + \frac{\sigma_{j+1} (t_i - s_j)^3}{6 h_j}
             + \Big(\frac{d_j}{h_j} - \frac{\sigma_j h_j}{6}\Big)(s_{j+1} - t_i)
             + \Big(\frac{d_{j+1}}{h_j} - \frac{\sigma_{j+1} h_j}{6}\Big)(t_i - s_j),

where h_j = s_{j+1} - s_j and the \sigma_j are the spline constants.

Question: How to Compute the Spline Constants?
Must solve Ax = b, a tridiagonal system with m−1 rows and columns:

  A = \begin{pmatrix}
        \frac{h_0+h_1}{3} & \frac{h_1}{6} & & & \\
        \frac{h_1}{6} & \frac{h_1+h_2}{3} & \frac{h_2}{6} & & \\
        & \ddots & \ddots & \ddots & \\
        & & \frac{h_{m-3}}{6} & \frac{h_{m-3}+h_{m-2}}{3} & \frac{h_{m-2}}{6} \\
        & & & \frac{h_{m-2}}{6} & \frac{h_{m-2}+h_{m-1}}{3}
      \end{pmatrix},
  \qquad
  b = \begin{pmatrix}
        \frac{d_2-d_1}{h_1} - \frac{d_1-d_0}{h_0} \\
        \frac{d_3-d_2}{h_2} - \frac{d_2-d_1}{h_1} \\
        \vdots \\
        \frac{d_m-d_{m-1}}{h_{m-1}} - \frac{d_{m-1}-d_{m-2}}{h_{m-2}}
      \end{pmatrix}

Prior work:
– Some solutions require evenly spaced source points
– Some solutions require precomputation (somehow) of A^{-1}
– Other solutions target vector machines, MPI architectures, or GPUs
  • These require a lot of data shuffling (reduce steps) when adapted to Hadoop
  • Example: Parallel Cyclic Reduction (PCR) uses \log_2 m map-reduce jobs
Our approach: minimize L(x) = \|Ax - b\|_2^2 = \sum_i (A_i x - b_i)^2 = \sum_i L_i(x)

Our Solution: Distributed Stochastic Gradient Descent (DSGD)
– Originally developed for matrix completion, e.g., the Netflix ratings problem [GHS KDD '11]
– Uses stochastic gradient descent (SGD) to minimize L:
  • Deterministic gradient descent (DGD): x^{(n+1)} = x^{(n)} - \epsilon_n L'(x^{(n)}), where L'(x) = \sum_{i=1}^{m-1} L_i'(x)
  • Stochastic gradient descent: x^{(n+1)} = x^{(n)} - \epsilon_n \hat L'(x^{(n)}), where \hat L'(x) = (m-1) L_I'(x) and I is chosen randomly from [1 .. m-1]
  • Avoids getting stuck at local minima
  • Problem: SGD is not a parallel algorithm
– Idea: run SGD on subsets (strata) of the rows and randomly switch strata; choose "sparse" strata that allow parallel execution of SGD within a stratum
  • Converges to the overall solution with probability 1 under mild conditions

Choosing Strata
Goal: permit parallel execution of SGD within each stratum.
Key observation: L_i'(x) = (0, \ldots, 0, u_{i,i-1}, u_{i,i}, u_{i,i+1}, 0, \ldots, 0), where

  u_{i,j} = 2 a_{i,j} (a_{i,i-1} x_{i-1} + a_{i,i} x_i + a_{i,i+1} x_{i+1} - b_i),

so updating x_i only affects (and is affected by) x_{i-1} and x_{i+1}.
Stratum choice: partition the rows so that no two rows in a stratum share variables.
– Can be implemented as a map-only Hadoop job (almost no data shuffling)
– Exploits the discrepancy between logical splits and physical blocks
Empirical study: 2×–3× faster than the best-of-breed PCR algorithm
– 10 scans, versus \log m for PCR
– PCR requires an extra sort
– PCR requires massive data shuffling (a network bottleneck)

Speeding up Composite Simulations: Result Caching
Motivating example: a deterministic model feeding a stochastic model in series, 100 replications.
– Naïve approach: execute the composite model (i.e., Models 1 & 2) 100 times
– A better approach: execute Model 1 once and cache the result; read from the cache when executing Model 2
Question: can the result-caching idea be generalized?

General Method for Two Stochastic Models in Series
Goal: estimate \mu = E[Y_2] based on n replications.
Result-caching approach:
1. Set m_n = \lceil \theta n \rceil for some \theta \in (0, 1] (the re-use factor). Ex: n = 10, \theta = 0.4 gives m_n = 4
2. Generate m_n outputs from Model 1 and cache them
3. To execute Model 2, cycle through the cached Model 1 outputs
4. Estimate \mu by \hat\mu_n = \sum_{i=1}^{n} Y_{2;i} / n

Optimizing the Re-Use Factor for Maximum Efficiency
Q: How to trade off cost against precision?
– Assume a (large) fixed computational budget c
– Random cost model: a correlated pair (\tau_i, Y_i), where \tau_i is the (random) cost of producing an observation Y_i
– N(c): the number of observations of Y_2 generated under budget c
– \hat\mu(c) = \sum_{j=1}^{N(c)} Y_{2;j} / N(c)
– Approximate distribution of \hat\mu(c): normal with mean \mu and variance g(\theta)/c, where r = \lceil 1/\theta \rceil

The Optimal Re-Use Factor
Assume that \mathrm{Cov}[Y_2, \tilde Y_2] > 0, where \tilde Y_2 is a second Model 2 output that shares the same Model 1 input. The optimal value of \theta is

  \theta^* = \Big( \frac{E[\tau_2] / E[\tau_1]}{\mathrm{Var}[Y_2] / \mathrm{Cov}[Y_2, \tilde Y_2] - 1} \Big)^{1/2}   (truncated at 1/n or 1).

Observations:
– If E[Model 1 cost] >> E[Model 2 cost], then high re-use of Model 1 output
– If Model 2 is insensitive to Model 1 (Cov << Var), then high re-use
– If Model 1 is deterministic (Cov = 0), then total re-use

Experiment Management (and Optimization)

Experiment Design and Efficiency
– Trades off execution cost against the level of detail that can be estimated
– Coarse resolution is OK for sensitivity analysis and the like
Example: a 1st-order polynomial metamodel for scaled data (7 factors),

  Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_7 x_7 + \beta_{1,2} x_1 x_2 + \cdots + \beta_{6,7} x_6 x_7 + \beta_{1,2,3} x_1 x_2 x_3 + \cdots + \text{noise}, \quad x_1, \ldots, x_7 \in \{-1, 1\}

(a full factorial design = 128 runs). Fractional-factorial experimental designs reduce the number of runs:

  To estimate                          If you can ignore         Resolution   # runs
  Main effects                         All high-order effects    III          8
  Main effects                         3rd-order and higher      IV           16
  Main effects + 2-way interactions    3rd-order and higher      V            64

Running Experiments in Splash
Goal: provide a facility that gives the illusion of executing one coherent simulation model.
Main challenges:
– Automate the coordination between experiment conditions and the inputs to different submodels
– Automate the combination of different replications of different submodels

Example: Healthcare Payer Model
A composition of two models:
– The Emory/Georgia Tech Predictive Health Institute (PHI) model [Park et al. 2012]
  • A simple agent-based model of a prevention and wellness program
  • For investigating payment systems (capitated vs. outcome-based)
– A simple logarithmic random-walk model of interest and inflation rates

Experiment Manager (Specifying Experimental Factors)
– A GUI collects simulation parameters from all component models; a parameter is eligible if its SADL attribute has experiment_factor="true"
– The user selects a subset of the parameters as experiment factors, and values for each factor
Example SADL attribute:

  <attribute name="paymentModel" measurement_type="numerical" missing_data="0"
             experiment_default_values="" experiment_factor="true"
             datatype="double" random_seed="false" />

Experiment Design in Splash
– Editable designs: factor values and the number of Monte Carlo replications for each condition
– Designs are persisted in EML and passed to the execution engine, e.g.:

  <model name="PHI"> …
    <factor name="Tage">
      <values>"65"</values>
      <values>"85"</values>
    </factor> …
    <rep n="10"> …
  </experiment>

Experiment Manager (Running an Experiment)
Technical challenges include:
– Routing parameter values to models
  • Different sources: command-line arguments, parameter files, stdin, …
  • Synthesizing the parameter files that a model expects (templating)
– Managing PRNG seeds
  • Avoiding cycle overlaps
  • PRNG info in the SADL file
  • Diagnostics (future work)
The Experiment Manager invokes the Splash execution engine to run the experiments. Intermediate and final outputs can be saved in a file tree for provenance tracking, traceability, and drill-down.

Template-Based Data File Generation Process
The Data File Generator combines a data SADL file, a template file, and the input values to synthesize the input data file that a model expects; this is needed by the Experiment Manager for file synthesis. For example, the template

  Input data for city of Detroit
  Temperature=$$temperature$$&&%0.1&&
  Pressure=$$pressure$$&&%0.1&&
  Input data for city of Chicago
  Temperature=$$temperature$$&&%0.1&&
  Pressure=$$pressure$$&&%0.1&&

together with the input values (50.2, 25.1) and (48.7, 32.1) yields the data file

  Input data for city of Detroit
  Temperature=50.2
  Pressure=25.1
  Input data for city of Chicago
  Temperature=48.7
  Pressure=32.1

Template-Based Data Extraction Process
The Data Extractor runs the same template in reverse: given the template file and an unstructured data file, it extracts the placeholder values (50.2, 25.1, 48.7, 32.1, …). This is needed to extract the performance measures of interest for optimization, visualization, etc.
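A minimal sketch of the template-substitution step (the $$name$$&&%fmt&& placeholder syntax is taken from the slides; interpreting &&%0.1&& as a one-decimal float format, and the parsing code itself, are my assumptions, not Splash's implementation):

```python
import re

# Replace each $$name$$&&%fmt&& placeholder with the corresponding input
# value, formatted as %<fmt>f (assumed interpretation of the &&...&& part).
PLACEHOLDER = re.compile(r"\$\$(\w+)\$\$&&%([0-9.]+)&&")

def fill_template(template, values):
    def sub(m):
        name, fmt = m.group(1), m.group(2)
        return ("%" + fmt + "f") % values[name]
    return PLACEHOLDER.sub(sub, template)

template = ("Input data for city of Detroit\n"
            "Temperature=$$temperature$$&&%0.1&&\n"
            "Pressure=$$pressure$$&&%0.1&&\n")
print(fill_template(template, {"temperature": 50.2, "pressure": 25.1}))
```

Extraction would run the same template in the other direction: turn the literal text into a regex and the placeholders into capture groups, then read the values back out of a filled-in data file.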
Efficient Sensitivity Analysis
– Main-effects plots: high/low values
– Orthogonal fractional-factorial experiment design (160 runs vs. 2560)
– PHI healthcare payer model + interest-rate model (Park et al., Service Science, 2012)
– Identifies the most important profit drivers (CapAmt & Tage)
– Statistical significance of the graphical results is checked

Optimization Functionality: Ranking and Selection
The Rinott procedure finds the best among a small number of designs, executing the minimum number of runs needed to distinguish the systems.
Stage 1 (an equal number of replications per design): for i = 1 to k,
– Execute n_0 stage-1 replications of model i to obtain Y_{i,1}, \ldots, Y_{i,n_0}
– Set \bar X_i = \frac{1}{n_0} \sum_{j=1}^{n_0} Y_{i,j} and V_i = \frac{1}{n_0-1} \sum_{j=1}^{n_0} (Y_{i,j} - \bar X_i)^2
Stage 2 (the procedure determines the number of stage-2 replications):
– Set N_i = \max\{ n_0, \lceil h^2 V_i / \delta^2 \rceil \}, where h is a tabulated constant depending on the confidence level C
– Execute N_i - n_0 additional replications of model i to obtain Y_{i,n_0+1}, \ldots, Y_{i,N_i}
The results are combined and ranked:
– Compute \bar Y_i = \frac{1}{N_i} \sum_{j=1}^{N_i} Y_{i,j} and select the system with the largest \bar Y_i as the best
– Compute MCB intervals [a_i, b_i] for i = 1, 2, \ldots, k:

  a_i = \min(0, \bar Y_i - \max_{j \ne i} \bar Y_j - \delta), \qquad b_i = \max(0, \bar Y_i - \max_{j \ne i} \bar Y_j + \delta)

These are simultaneous 100C% multiple-comparisons-with-the-best (MCB) confidence intervals.
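A toy sketch of the two-stage procedure above (the candidate models are stand-ins, and h = 3.0 is an illustrative constant rather than a value looked up in the Rinott tables):

```python
import math
import random
import statistics

def rinott_select(models, n0=20, delta=0.5, h=3.0, seed=1):
    # Two-stage Rinott-style ranking and selection over k candidate designs.
    rng = random.Random(seed)
    means = []
    for run in models:
        ys = [run(rng) for _ in range(n0)]                # stage 1
        v = statistics.variance(ys)                       # stage-1 sample variance
        n_i = max(n0, math.ceil(h * h * v / delta**2))    # total sample size N_i
        ys += [run(rng) for _ in range(n_i - n0)]         # stage 2
        means.append(sum(ys) / len(ys))
    best = max(range(len(models)), key=lambda i: means[i])
    return best, means

# Three candidate "designs" with true means 1.0, 3.0, 1.5 (design 1 is best):
models = [lambda r: r.gauss(1.0, 1.0),
          lambda r: r.gauss(3.0, 1.0),
          lambda r: r.gauss(1.5, 1.0)]
best, means = rinott_select(models)
print(best)
```

With a tabulated h, the chosen design is within delta of the true best with the stated probability; here the wide separation of the true means makes the selection essentially certain.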
The procedure selects a design within \delta of the optimum with probability C.

Results for PHI Profitability: Estimated Best System
– "Conditions" = payment schemes for the wellness program (0 = full capitation, 1 = pay-for-outcome)
– Weighted schemes examined: 0.1, 0.2, …, 0.9
– PHI healthcare payer model + interest-rate model (Park et al., Service Science, 2012)
– With probability 95%, C5 = 0.5 is the "best system" (within an indifference zone of $250K)

Results Continued: Multiple Comparisons with the Best
– Identifies the set of best solutions
– Simultaneous 95% confidence intervals on the difference between each system and the best of the others

Simulation Metamodeling (Joint Work with SJSU CAMCOS)
"Simulation on demand":
1. Run simulations in advance to get values at multiple "design points"
2. Fit a (stochastic) response surface
3. The decision maker can then explore the surface in real time
4. Stochastic-optimization techniques can be applied to find peaks and valleys
5. Can be used for factor screening
Technique: stochastic kriging (Ankenman et al., Oper. Res., 2010)
– Robust, global fit
– Gives the approximate model response plus uncertainty estimates (MSE), modeling uncertainty due to both interpolation and simulation variability
– Efficient allocation of runs to minimize the integrated mean-squared error (IMSE)
– The metamodel is added to the Splash repository
[Image: SJSU CAMCOS]

Assessment of the PHI Metamodel
– The metamodel gives a good approximation to the real results (1.6% error in this example)
– Faster by over two orders of magnitude

Factor Screening (Joint with SJSU CAMCOS)
Goal: identify the most important subset of drivers; the drivers are captured in metamodel parameters.
Ex: linear models, Y(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_7 x_7
– Main effects are used for screening
– For Gaussian noise and positive effects: sequential bifurcation
Ex: Gaussian-process models, Y_j(x) = \beta_0 + M(x) + \epsilon_j(x), a special case of stochastic kriging
– \epsilon_j(x): simulation noise
– M(x): interpolation uncertainty, modeled as a Gaussian field
  • For any x_1, x_2, \ldots, x_r, the vector V = (M(x_1), \ldots, M(x_r)) is multivariate normal
  • \mathrm{Cov}[M(x_i), M(x_j)] = \sigma^2 \exp\big( -\sum_{k=1}^{n} \theta_k (x_{i,k} - x_{j,k})^2 \big)
  • A small \theta_k implies a small effect of the kth factor
– Bayesian "posterior quantiles" method for screening

Some Potential Splash Applications

Multi-Level, End-to-End Modeling
[Figure: a policy "flight simulator" spanning the healthcare ecosystem (society), system structure (organizations), delivery operations (processes), and clinical practices (people), linked by socio-economic models, business models, careflow models (flow of patients, money, information), disease-progression models, and personalized medicine, with intervention levers at each level]
Rouse, W. B. & Cortese, D. A. (2010). Introduction. In W. B. Rouse & D. A. Cortese (Eds.), Engineering the System of Healthcare Delivery. IOS Press.
Cross-Domain, Syndemic Modeling
Richard Rothenberg et al., Georgia State University, 2011

Composite Model for Traffic Safety
– Component models: the IBM Deep Thunder weather model, the IBM Megaffic traffic simulation model, an emergency-response model (client), and a geographic model (ESRI)
– Data sources: weather data, collision data, volume data, GIS data, demographic data, emergency data
– Analyses: collision heatmaps; the impact of heavy snow, game days, or pay days on collisions
– Intervention scenarios: (A) roadway design changes, (B) placement of variable speed limits, (C) enforcement

Open Research Questions

How to Determine User Requirements?
Common to analysts and scientists:
– Examine schemas (data) and variables (models) prior to selection
– Compare the outputs of simulation runs to examine tradeoffs and guide simulation selection
– A dashboard summarizing the models and data sources used to run a simulation
Specific to analysts (guidance and recommendations):
– Pre-defined templates for setting up a simulation and analyzing its output
– Recommendations for which template to use and the steps needed to run a simulation
– Recommended output visualizations — suggesting when one chart style explains the relationships in the data better than another
Specific to scientists:
– Features to assess the veracity and provenance of model and data sources
– The ability to upload their own sources to supplement the existing ones
– High levels of interaction with the models and data when previewing search results, prior to running the simulation

Database Research++
Data search → model-and-data search
– Find compatible models, data, and mappings (using metadata)
– Involves semantic-search technologies, repository management, privacy, and security
Data integration → model integration
– Simulation-oriented data mapping
– Geospatial alignment [e.g., Howe & Maier 2005]
– Hierarchical models with different resolutions
– Complex data transformations (e.g., raw simulation output to histogram)
Query optimization → simulation-experiment optimization
– Optimally configure the workflow among distributed data and models
– Factor common operations across different mappings in the workflow
– Avoid redundant computations across experiments (e.g., result caching)
– Statistical issues: managing pseudorandom numbers and Monte Carlo replications

Some Deep Problems
Causality approximation (e.g., between the transportation model and the buying-and-eating model):
– Fixed-point + perturbation approaches, e.g.,

  \dot f_n(t) = \Phi_1(f_n(t), g_{n-1}(t)), \qquad \dot g_n(t) = \Phi_2(f_{n-1}(t), g_n(t))

– Time-stepped approximation:

  \dot f(t) = \Phi_1(f(t), g(n \Delta t)), \qquad \dot g(t) = \Phi_2(f(n \Delta t), g(t)) \quad \text{for } t \in [n \Delta t, (n+1) \Delta t)

– System support
– Theoretical support
Deep collaborative analytics:
– Visualizing and mining the results
– Understanding and explaining results:
  • Provenance [e.g., J. Freire et al.]
  • Root-cause analysis
– Trusting results:
  • Model validation
  • ManyEyes++, Swivel++

Conclusion
Splash:
– Composes heterogeneous models and data to support cross-disciplinary decision making in complex systems
– Loosely couples models through data exchange
– Combines data-integration, simulation, and workflow technologies
Key features:
– The SADL metadata language for curation and functionality
– Automated detection of data mismatches
– Semi-automated design of scalable data transformations (schema and time alignment)
– Runtime accelerators:
  • A MapReduce framework for scalable data transformations
  • A map-only Hadoop method for cubic-spline interpolation
  • Result caching to minimize the number of model executions
– An experiment manager supporting sensitivity analysis, factor screening, and optimization
– Simulation metamodeling for real-time model exploration
Many open research questions!

Questions?
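To illustrate the time-stepped causality approximation from the "Some Deep Problems" slide: the toy coupled system below (df/dt = −g, dg/dt = f, with an illustrative Euler integrator; both models and all step sizes are my own choices) advances each model while holding the other's state frozen at the last exchange point nΔt.

```python
# Time-stepped co-simulation sketch: two coupled ODE "models" are advanced
# independently over each exchange interval, each seeing only the other's
# state as of the start of the interval (the Phi_1/Phi_2 pattern above).

def cosimulate(f0, g0, dt_exchange, dt_inner, t_end):
    f, g = f0, g0
    t = 0.0
    while t < t_end - 1e-12:
        f_frozen, g_frozen = f, g          # states at the exchange point n*dt
        s = 0.0
        while s < dt_exchange - 1e-12:     # both models step independently
            f += dt_inner * (-g_frozen)    # Model 1 sees g frozen at n*dt
            g += dt_inner * (f_frozen)     # Model 2 sees f frozen at n*dt
            s += dt_inner
        t += dt_exchange
    return f, g

# The true solution is a rotation: (f, g)(t) = (cos t, sin t) from (1, 0),
# so after roughly half a period the state should be near (-1, 0).
f, g = cosimulate(1.0, 0.0, dt_exchange=0.01, dt_inner=0.001, t_end=3.14159)
print(f, g)
```

Shrinking the exchange interval Δt reduces the causality error, which is exactly the trade-off the fixed-point and perturbation analyses aim to quantify.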
Splash project page: http://researcher.watson.ibm.com/researcher/view_project.php?id=3931 © 2012 IBM Corporation IBM Research Backup Slides 76 © 2012 IBM Corporation Splash Technology for Loose Coupling via Data Exchange SADL metadata language IBM Research Kepler adapted for model composition Design-time components Run-time components: - Kepler adapted for model execution - Experiment Manager (sensitivity analysis, metamodeling, optimization) 77 Data transformation tools: - Clio++ - Time Aligner (MapReduce algorithms) - Templating mechanism © 2012 IBM Corporation IBM Research Distributed SGD, Continued Divide the m-1 rows into three strata: U 1, U 2, U 3 Decompose loss function: L( x ) 1 3 L1( x ) 13 L2 ( x) 13 L3 (x ) where Ls (x ) 3 iU s Li (x ) Define (random) stratum sequence 1 , 2 , Execute SGD w.r.t. Lk at k th step in parallel Theorem: Suppose that x* = A-1b exists and - n O(n ) for some (0.5,1) - ( n n1 ) / n O( n ) - { n : n 0} is regenerative with E[11 / ] and E[X1(s)] 0 Then x (n) x * with probability 1 - Stratum sequence occasionally restarts probabilistically - Time between restarts has finite 1/ moment - Sequence spends ≈1/3 of its time on each stratum Proof: [GHS11] + Liapunov-function argument © 2012 IBM Corporation IBM Research Hadoop Implementation Physical blocks and logical splits – InputFormat operator creates splits (one split per mapper) – A split is mostly on one block – Splits are usually disjoint – Map job: each mapper first obtains all split data (small amount of data movement) – Reduce job: massive shuffling of data over network We allow splits to overlap by two rows 1 a1,1 2 a2,1 a1,2 b1 x1 a2,3 b2 3 a3,2 4 a4,3 a3,4 b3 x2 x3 a4,5 b4 5 a5,4 6 a6,5 a5,6 b5 x4 x5 a6,7 b6 x6 7 a7,6 8 a8,7 a7,8 b7 a8,9 b8 x7 x8 9 a9,8 10 a10,9 a9,10 b9 x9 a10,11 b10 x 10 11 a11,10 a11,12 b11 x11 12 a12,11 a12,13 b12 x12 13 a13,12 a13,14 b13 x13 split 1 split 2 stratum s = 1 DSGD is implemented as a map-only job (no data shuffling!) 
(mapper 2 modifies x7)

Hadoop Implementation (stratum s = 2)
[Same slide as above, with rows of stratum s = 2 highlighted (mapper 2 modifies x7)]

Hadoop Implementation (stratum s = 3)
[Same slide as above, with rows of stratum s = 3 highlighted (x7 affects mapper 1)]

Other Implementation Details

Initial guess:
– Ignore off-diagonal elements
– Works well due to "diagonal dominance"

Stratum sequence as in [GHS11]:
– Meander in a stratum for a while, then jump to the next stratum
– Tension between thorough exploration of a stratum and randomness
– Visit all k rows in a stratum: at each "sub-epoch" select one of the k!
orders at random
– Similar strategy for jumping between strata
– Convergence theorem still applies

Step-size sequence:
– Constant during each sub-epoch
– "Bold driver" heuristic
– Experiment with the initial step size (in parallel, on small subsequences)

Optimizing the Re-Use Factor for Maximum Efficiency

To define (asymptotic) efficiency, consider a budget-constrained setting [Fox & Glynn 1990; Glynn & Whitt 1992]

Cost of producing n outputs from Model 2:
  C_n = Σ_{j=1}^{m_n} c_{1;j} + Σ_{j=1}^{n} c_{2;j}
where c_{i;j} = (random) cost of producing the jth observation of Y_i, and m_n = ⌈n/τ⌉ is the number of Model 1 runs under re-use factor τ

Under a (large) fixed computational budget c:
– Number of Model 2 outputs produced: N(c) = max{ n ≥ 0 : C_n ≤ c }
– Estimator: Ū(c) = N(c)^{−1} Σ_{j=1}^{N(c)} Y_{2;j}

Optimizing the Re-Use Factor II

The key limit theorem as the budget increases to infinity:

Suppose that E[c_1], E[c_2], and E[Y_2²] are finite. Then Ū(c) is asymptotically N(θ, g(τ)/c), where r = 1/τ and
  g(τ) = ( r E[c_1] + E[c_2] ) × ( Var[Y_2] + (τ − 1) Cov[Y_2, Y_2′] )
       = (cost per obs.) × (contributed variance per obs.)

Cov[Y_2, Y_2′] = covariance of two Model 2 outputs that share a Model 1 input

Thus, minimize g(τ) [or, equivalently, maximize the asymptotic efficiency 1/g(τ)]

Proof Outline

Set W_{n,j} = Σ_{i=1}^{n} Y_{2;i} · I[input for the ith run of Model 2 is Y_{1;j}]

Thus Σ_{j=1}^{m_n} W_{n,j} = Σ_{i=1}^{n} Y_{2;i}

By Theorem 1 in [Glynn & Whitt 1992], it suffices to show that a.s.
– C_n / n → E[c_1]/τ + E[c_2] (straightforward to show)
– (W_{n,1}, W_{n,2}, …, W_{n,m_n}) obeys a "Lindeberg-Feller" FCLT
  • W_{n,j} and W_{n,j′} are independent for j ≠ j′
  • Can establish the standard "Lindeberg condition," which suffices for the FCLT (Billingsley 1999)
  • Some additional fussy details due to the cycling through Model 1 outputs

Point and Interval Estimates

Typical scenarios:
– Compute a 100(1 − δ)% confidence interval for θ under a fixed budget c
– Estimate θ to within ±100ε% with probability 100(1 − δ)%

Issue: n is unknown a priori (so we can't compute m_n)
Solution: estimate n from n₀ pilot (or prior) runs

Can show:
  √n (θ̄_n − θ) / h_n(τ) ⇒ N(0, 1)
where h_n(τ) = ( n^{−1} Σ_{j=1}^{m_n} W̃_{n,j}² )^{1/2} and W̃_{n,j} is the "centered" version of W_{n,j},
so that the CI from n runs is
  ( θ̄_n − z_δ h_n(τ)/√n , θ̄_n + z_δ h_n(τ)/√n )
where z_δ is the 1 − δ/2 normal quantile

Can set:
– n ≈ c / ( E[c_1]/τ + E[c_2] ) for a fixed budget
– n ≈ ( h_{n₀}(τ) z_δ / (ε θ̄_{n₀}) )² for fixed precision

Interface to the R System for Experimental Design

Method | Provider | Notes
Full Factorial Design | Experiment Manager | Simple, fast design generation; exhaustive factor combinations → slow execution
Planor Fractional Factorial Design | R – planor package | Supports arbitrary factor levels; leverages R design generation; checks statistical feasibility of the user's proposed design; slow design generation, fast experiment execution
Auto Planor Fractional Factorial Design | R – planor package | Supports arbitrary factor levels; leverages R design generation; automatically finds the smallest feasible experiment; slower design generation, fast experiment execution
FrF2 Fractional Factorial Design | R – FrF2 package | Only supports 2-level factors; fast generation; fast execution
Custom | User specified | Any design above may be used as a basis

Package documentation:
– http://cran.r-project.org/web/packages/FrF2/FrF2.pdf
– http://cran.r-project.org/web/packages/planor/vignettes/PlanorInRmanual.pdf
– http://cran.r-project.org/web/packages/planor/vignettes/planorVignette.pdf

As new designs are introduced in R, the interface is in
place to take advantage of these.

Standard Kriging

[Figure: standard kriging: response Y modeled by a Gaussian process M; extrinsic uncertainty only. Images: SJSU]

Stochastic Kriging

[Figure: stochastic kriging: Gaussian process M plus simulation noise (intrinsic uncertainty); covariance parameters estimated by MLE; predictor combines Σ(x₀, ·) with (Σ + Σ_ε)^{−1} and the sample means Ȳ. Images: SJSU]

Optimization Process Flow

– Optimizer is R code
– Orchestration via Python scripts
– Legend: [icon] = template-based data extraction
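The kriging predictors above can be illustrated with a tiny Gaussian-process sketch. The squared-exponential covariance, the helper names, and the data below are all illustrative choices, not the SJSU/Splash metamodeling code; the `noise` parameter mimics stochastic kriging's intrinsic (simulation) variance added to the diagonal:

```python
import math

# Illustrative kriging-style predictor in one dimension.

def cov(x1, x2, sigma2=1.0, length=1.0):
    # squared-exponential covariance (an assumed kernel, for illustration)
    return sigma2 * math.exp(-((x1 - x2) ** 2) / (2 * length ** 2))

def solve(M, v):
    # Gaussian elimination with partial pivoting (fine for tiny systems)
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def krige(xs, ys, x0, noise=0.0):
    # noise > 0 adds an intrinsic-variance term to the diagonal, as in
    # stochastic kriging; noise = 0 gives interpolating (standard) kriging
    n = len(xs)
    K = [[cov(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    w = solve(K, [cov(x0, xi) for xi in xs])  # weights: Sigma(x0,.) Sigma^{-1}
    return sum(wi * yi for wi, yi in zip(w, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]     # noiseless samples of y = x^2
pred = krige(xs, ys, 1.0)     # with noise = 0, design points are interpolated
```

With `noise = 0` the predictor reproduces the observed value at a design point exactly; with `noise > 0` it smooths instead, which is the behavioral difference between the two slides.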