STAC-A2™ Benchmark on POWER8®
Bishop Brock, Frank Liu, Karthick Rajamani
IBM Research, Austin, Texas
{bcbrock,frankliu,karthick}@us.ibm.com
Workshop for High-Performance Computational Finance (WHPCF'15), November 20, 2015
With thanks to Kenneth Hill of the University of Florida and Julian Demouth of NVIDIA
© 2015 IBM Corporation

Summary
● Workload characterization and performance analysis demonstrate that STAC-A2™ is a well-rounded single-system HPC benchmark for risk analytics
● POWER8-based Linux® systems demonstrate consistently high performance at every scale (and we explain why)
→ Current POWER8-based systems are capable and competitive platforms for computational finance

Simplified POWER8 Core: SMT2, SMT4, SMT8
[Diagram: simplified core with two unified issue queues (UniQ0, UniQ1); each side has 128-bit DP SIMD floating-point pipes (DP FP pairs), an FXU, an LSU and an LU; BR, CR and Crypto units; and per-thread-set register files (Regs0, Regs1) for Thread Set 0 and Thread Set 1 (SMT2 and above).]
● Execution units are split between the UniQs (thread sets)
● Thread sets only dispatch to 'their' UniQ
● Register files/renames are shared equally by each thread set
● Up to 8 threads in 2 thread sets; dispatch/complete up to 3 non-branch + 1 branch per cycle per thread set

STAC® and the STAC-A2™ Benchmark
● Securities Technology Analysis Center (STAC)
  – Coordinates the STAC Benchmark Council™
  – Publishes and audits several benchmarks covering technologies important to capital markets
● STAC-A2 Benchmark: Risk Management
  – Computes numerous sensitivities (Greeks) of a multi-asset, exotic call option (lookback, best-of) with early exercise
  – Modeled by Monte Carlo simulation of the Heston stochastic volatility model
  – Priced using the Longstaff-Schwartz algorithm
  – Greeks computed using finite differences
The STAC-A2 benchmark suite measures the performance, scalability, quality and energy efficiency of any hardware/software system that is capable of implementing the benchmark specification. In this presentation we focus only on the performance and scalability benchmarks.

STAC-A2 Benchmark Dimensions and Challenges
● Performance/Scalability Dimensions
  – Assets: the number of correlated assets: O(assets²)
  – Timesteps: granularity of discretization: O(timesteps)
  – Paths: number of Monte Carlo paths: O(paths)
● Challenges
  – Unit-normal random number generation
  – Dense matrix operations (custom correlation routine, plenty of ILP)
  – SQRT/DIVIDE-intensive Monte Carlo kernel (little ILP)
  – Cache-efficient data management
  – Efficient numerical methods (custom SVD routine)
  – Bandwidth-limited performance in certain benchmarks
  – Single-system parallel load balancing
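To make the Heston Monte Carlo kernel mentioned above concrete, the following is a minimal sketch of one discretization step for a single path. It is illustrative only: the discretization scheme (Euler with full truncation) and every name in it (HestonParams, hestonEulerStep, kappa, theta, xi, rho) are assumptions for this sketch, not the code behind the audited results.

    // Minimal sketch (not the benchmark implementation): one Euler full-truncation
    // step of the Heston model for a single path, driven by two correlated
    // unit-normal draws z1 and z2. All names and the scheme itself are assumptions.
    #include <algorithm>
    #include <cmath>

    struct HestonParams {
        double r;      // risk-free rate
        double kappa;  // mean-reversion speed of the variance
        double theta;  // long-run variance
        double xi;     // volatility of variance
        double rho;    // correlation between asset and variance shocks
    };

    // Advance spot S and variance v by one step of size dt.
    inline void hestonEulerStep(double& S, double& v, double dt,
                                const HestonParams& p, double z1, double z2) {
        const double zv    = p.rho * z1 + std::sqrt(1.0 - p.rho * p.rho) * z2;
        const double vplus = std::max(v, 0.0);     // full truncation of negative variance
        const double sdt   = std::sqrt(dt);
        S *= std::exp((p.r - 0.5 * vplus) * dt + std::sqrt(vplus) * sdt * z1);
        v += p.kappa * (p.theta - vplus) * dt + p.xi * std::sqrt(vplus) * sdt * zv;
    }

The benchmark applies this kind of update across many correlated assets and across antithetic path pairs, as described in the solution-structure slide that follows.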
IBM STAC-A2 Solution Structure
[Diagram: solution pipeline – RNG, (complex) scenario generation, Monte Carlo simulation (b), stored scenario arrays, and Longstaff-Schwartz pricing (c).]
Monte Carlo simulation produces a number of paths × timesteps arrays (scenarios) which must be stored for later analysis.
A master thread (a) spawns multiple worker threads that perform Monte Carlo simulation in parallel (b). Simulation is partitioned by paths; each thread creates the same path-partition of every array. Scenarios are (generated and) priced by cohorts of 1 or more threads (c). Finally the master thread computes the finite differences (d).
Finite difference example: Θ = (y_Δt − y) / Δt, where y is the unmodified option value scenario and y_Δt is the scenario with modified expiration. (A code sketch of this step appears after the SMT-mode results below.)

Key “Greeks” Workloads + Characterization

              Baseline   Large Problem   Asset Scaling   Path Scaling
  Goal        Speed      Speed           Max Assets      Max Paths
  Assets      5          10              *               5
  Timesteps   252        1260            252             252
  Paths       25K        100K            25K             *
  Result      0.317 s    28.9 s          78              28M
  Scenarios   93         308             15642*          93
  Memory*     2.5 GiB    143 GiB         309 GiB         2.8 TiB
* Values unique to our solution

[Chart: approximate profile breakdown by application phase (RNG + correlation, Monte Carlo (Heston model), array transpose, Longstaff-Schwartz pricing) for each of the four workloads, POWER8 S824*.]
* 2 x POWER8 @ 3.52 GHz (nominal) / 3.925 GHz (turbo); 24 total cores; 1 TiB memory; IBM XL C/C++; RHEL 7 big-endian

Memory Bandwidth Sensitivity
[Chart: POWER8 S824 performance relative to full available memory bandwidth (Full 192 GB/s, Half 96 GB/s, Quarter 48 GB/s per socket) for Baseline, Large Problem, Asset Scaling (65 assets) and Path Scaling (8M paths/STC). Baseline and Large Problem are essentially insensitive; Asset Scaling and especially Path Scaling lose performance as bandwidth is reduced, with Path Scaling falling to roughly 0.44 of its full-bandwidth performance at quarter bandwidth.]
*Not a STAC-A2 benchmark

Performance Comparisons – STAC-A2 Benchmarks
[Charts: relative workload performance for Baseline, Large Problem, Asset Scaling and Path Scaling, plus maximum assets and maximum paths, for the six systems listed below.]
STAC SUT*    Description
IBM150305    2 x IBM POWER8 @ 3.52 GHz; 24 total cores; 1 TiB memory
INTC151028   2 x Intel Xeon E5-2697 v3 @ 2.6 GHz + 2 x Intel Xeon PHI 7120P @ 1.24 GHz; 28 Haswell cores + 122 PHI cores; 256 GiB memory
INTC150811   4 x Intel Xeon E7-8890 v3 @ 2.50 GHz; 72 total cores; 1 TiB memory
NVDA141116   1 x NVIDIA Tesla K80 GPU; Supermicro SYS-2027GR-TRFH host; 128 GiB memory
INTC140815   2 x Intel Xeon E5-2699 v3 @ 2.3 GHz + 1 x Intel Xeon PHI 7120A @ 1.24 GHz; 36 Haswell cores + 61 PHI cores; 256 GiB memory
INTC140814   2 x Intel Xeon E5-2699 v3 @ 2.3 GHz; 36 total cores; 256 GiB memory
*For details see www.stacresearch.com/<STAC SUT>
Intel, Xeon, PHI, NVIDIA, Tesla and Supermicro are trademarks or registered trademarks of their respective owners and are used for identification purposes only.

Performance by SMT Mode
[Chart: SMT-mode performance relative to SMT1 for the STAC-A2 Baseline Greeks workload*, measured on 1 core, 1 socket (12 cores) and 2 sockets (24 cores); speedups of SMT2, SMT4 and SMT8 over SMT1 range from about 1.6x to 2.1x.]
*Not a STAC-A2 benchmark
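As promised above, here is a small sketch of the master thread's finite-difference step. The Θ formula matches the example in the solution-structure slide; the Δ and Γ formulas, the struct, and all names (GreekInputs, deltaT, dS) are illustrative assumptions rather than the audited code.

    // Illustrative finite-difference combinations of scenario values (not the
    // benchmark's implementation). y is the unmodified option value, yDt the
    // value with modified expiration, yUp/yDown the values with the spot bumped.
    struct GreekInputs {
        double y, yDt, yUp, yDown;
        double deltaT;   // expiration bump (assumed name)
        double dS;       // spot bump (assumed name)
    };

    inline double thetaFD(const GreekInputs& g) {   // Θ ≈ (yΔt − y) / Δt
        return (g.yDt - g.y) / g.deltaT;
    }
    inline double deltaFD(const GreekInputs& g) {   // central difference in the spot
        return (g.yUp - g.yDown) / (2.0 * g.dS);
    }
    inline double gammaFD(const GreekInputs& g) {   // second central difference
        return (g.yUp - 2.0 * g.y + g.yDown) / (g.dS * g.dS);
    }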
System-Level Scalability Comparisons
[Chart: fraction of theoretical speedup (based on number of cores; higher is better, 1.0 = ideal) when scaling single-core performance to half-system and full-system performance: 2 x POWER8 (half = 12 cores, full = 24 cores, SMT4), 4 x Haswell-EX (half = 36 cores, full = 72 cores, HyperThreading), 2 x Haswell-EP (half = 18 cores, full = 36 cores, HyperThreading).]
*Data is from the official STAC-A2 audit reports for these systems; however, these are not considered STAC-A2 benchmarks. Please see the Performance Comparisons slide for full details of the systems being compared here.

Optimizing Transpose for Monte Carlo Simulation
Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage.
[Diagram: data is generated time-major by simulation (t0 … t(timesteps−1) for paths 0 … N−1) and analyzed path-major by pricing; path pairs n+0 … n+7 are regrouped block by block.]
Our solution: a blocked, in-place matrix transpose using no additional storage, taking advantage of the fact that (post-processed) simulation data is created one path (pair) at a time (sketched below). It is simple and relatively efficient despite:
• Moderately poor locality
• Requiring a second pass over the data
Simple heuristics based on working-set size improve performance for large working sets. The transpose accounts for 2% - 9% of total benchmark run time.
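A minimal sketch of the blocked-insertion idea just described: each simulated path is written into a padded, 16 x 16-blocked path-major array so that the blocks land already transposed, and each block of a block-row is later transposed in place. The block size, the padding assumption (timesteps and paths rounded up to multiples of 16) and the function names are illustrative; this is not the production routine.

    // Sketch of blocked insertion with a deferred in-place block transpose.
    // Assumes the destination is padded so that the leading dimension ld and the
    // number of timesteps nT are multiples of B. Not the benchmark's exact code.
    #include <cstddef>
    #include <utility>

    constexpr std::size_t B = 16;   // block edge

    // Write one time-major path (src[0..nT)) for path index `path` into dst so
    // that, within each B x B block, the path and time roles are swapped.
    void insertPathBlocked(double* dst, std::size_t ld, std::size_t path,
                           const double* src, std::size_t nT) {
        const std::size_t r = path % B;         // row within the block
        const std::size_t R = path - r;         // first path of this block-row
        for (std::size_t t = 0; t < nT; ++t) {
            const std::size_t C = (t / B) * B;  // first timestep of this block-column
            dst[(R + t % B) * ld + C + r] = src[t];
        }
    }

    // Deferred step: transpose every B x B block of one block-row in place,
    // restoring the natural path-major orientation.
    void transposeBlockRow(double* dst, std::size_t ld, std::size_t R, std::size_t nT) {
        for (std::size_t C = 0; C < nT; C += B)
            for (std::size_t i = 0; i < B; ++i)
                for (std::size_t j = i + 1; j < B; ++j)
                    std::swap(dst[(R + i) * ld + C + j], dst[(R + j) * ld + C + i]);
    }

Whether transposeBlockRow runs as soon as a block-row is full or is deferred until the whole array is filled is exactly the working-set heuristic discussed with the backup transpose slides.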
Fast, Parallel Longstaff-Schwartz Algorithm
• Least-Squares Monte Carlo (LSMC, or Longstaff-Schwartz) is an algorithm for valuing early-exercise options.
• LSMC operates serially, backwards in time.
• At each time step, LSMC estimates the optimal exercise value by least-squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.
• We describe a fast, easily parallelizable LSMC method based on a QR factorization algorithm specific to row Vandermonde matrices.
• This approach is used for "microbenchmarks" and for Path Scaling, where utilizing every thread optimizes performance.
• This approach was inspired by NVIDIA's description of their LSMC algorithm for STAC-A2.
[Diagram: paths path0 … pathN−1 partitioned across Thread0 … Thread3 over timesteps t0 … t(timesteps−1).]
• Paths are partitioned by thread, similar to Monte Carlo simulation
• Threads synchronize (twice per timestep here) and complete each timestep in lock-step
• Scalability is limited by synchronization and communication overheads
• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario

Future Work
● Exploiting the high-bandwidth CPU ↔ GPU NVLink™ unique to future OpenPOWER systems
[Figure: logos of the current OpenPOWER Foundation Gold Members*]
* http://www.smartercomputingblog.com/power-systems/openpower-hpc-big-data-analytics/

Special Notices
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.
Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. Other company, product and service names may be trademarks or service marks of others and are used for identification purposes only.

Backup

POWER8 Simultaneous Multi-threading (SMT)
● SMT modes represent different partitionings of core resources: fetch/dispatch; register files; execution pipes
● SMT mode is determined by the number of active hardware threads per core, regardless of the number of threads configured per core
  – The SMT mode switch is automatic as threads enter/exit idle states
  – An enhancement over POWER7
● SMT modes SMT1, SMT2, SMT4 and SMT8 are entered when 1, 2, 3 – 4, and 5 – 8 threads are active, respectively
● Most STAC-A2 benchmarks run in SMT4; some in SMT2
  – SMT4 here means we bind 4 software threads to 4 of the 8 configured HW threads per core (see the affinity sketch below)
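On Linux, the thread binding mentioned above can be expressed with CPU affinity masks. The sketch below assumes that core c exposes its eight hardware threads as logical CPUs 8c through 8c+7 and picks four of them; both the numbering and the choice of slots are assumptions to verify on the target system (e.g., with ppc64_cpu --info or lscpu), and this is not the benchmark's affinity code.

    // Sketch: bind a software thread to one of four hardware-thread slots of a
    // core, for running the core in SMT4. CPU numbering (8 logical CPUs per core,
    // consecutively numbered) and the chosen slots are assumptions.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>

    int bindToSmt4Slot(pthread_t thread, int core, int slot /* 0..3 */) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(8 * core + 2 * (slot % 4), &set);   // use every other HW thread: 0, 2, 4, 6
        return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
    }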
Simplified POWER8 Core: SMT1
Execution units: pairs of double-precision floating-point pipes implement 2-way DP SIMD; 1 FiXed-point Unit (FXU), 1 Load-Store Unit (LSU), and 1 Load Unit (LU) per side. Note that each DP FP pipe can also function as 2 x single-precision FP.
[Diagram: the same simplified core in single-thread mode, with unified issue queues UniQ0 and UniQ1 and a single (mirrored) register file.]
● Unified issue queues; BRanch and Condition Register instructions have private queues; Crypto is issued from either UniQ
● Register files/renames are mirrored in SMT1, allowing use of either UniQ
● Dispatch/complete up to 6 non-branch + 2 branch per cycle

Optimizing Data Transpose for Monte Carlo Simulation
Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage. A traditional out-of-place rectangular transpose requires too much memory; a traditional in-place transpose is too slow.
[Diagram: data generated time-major by simulation vs. data analyzed path-major by pricing.]
Simple insertion of time-major data into a path-major array is a disaster:
• Cache lines are only partially populated when touched
• A line may be evicted to memory before being touched again
• Each line is in a unique virtual page
Instead we use the blocked approach illustrated on the next slide.

Illustration of In-Place Blocked Transpose for a 16x16 Block-Row
[Diagram: path pairs n+0 … n+7 inserted into a 16x16-blocked destination block-row (timestep ranges t0–t15, t16–t31, t32–t47, …), followed by a (deferred) transpose of each block.]
Path pairs (original + antithetic) are simulated together and inserted into the blocked (16 x 16) and padded destination array block-row such that the blocks are transposed. (Eventually) transposing the blocks then brings the data into the correct orientation.

Transpose Performance
[Chart: performance of Monte Carlo simulation by transpose method – Blocked (Immediate Transpose), Blocked (Deferred Transpose), Simple Insertion, Buffered Insertion (discussed in the paper) – relative to Blocked Insertion with Immediate Transpose†, for Baseline (6.4), Large Problem (93.5), Asset Scaling (797), Path Scaling (6.4), 5:126:25000 (3.2) and 5:1260:25000 (32.0); workloads are labeled with working-set size in MiB.]
† Not a STAC-A2 benchmark
Any technique other than Simple Insertion gives good performance, and all techniques benefit from manual prefetching.
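The manual prefetching mentioned above can be as simple as issuing cache-touch hints a fixed distance ahead of the insertion stream. The sketch below uses the GCC-style __builtin_prefetch intrinsic (IBM XL provides comparable data-cache-touch builtins); the prefetch distance is a tuning assumption, not a value measured in this work.

    // Sketch of software prefetching ahead of a streaming loop. The distance
    // (in elements) is an assumed tuning parameter.
    #include <cstddef>

    void copyWithPrefetch(double* dst, const double* src, std::size_t n) {
        constexpr std::size_t dist = 64;                // 64 elements (512 bytes) ahead, assumed
        for (std::size_t i = 0; i < n; ++i) {
            if (i + dist < n) {
                __builtin_prefetch(src + i + dist, 0);  // hint: will read
                __builtin_prefetch(dst + i + dist, 1);  // hint: will write
            }
            dst[i] = src[i];
        }
    }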
2% - 9% of total benchmark run time is currently spent doing this transpose (the fraction varies by workload). If the working set (16 paths per simultaneously generated array, per thread, per core) appears to fit in the 8 MiB L3, it is advantageous to transpose each block-row as soon as it is filled; otherwise it is better to defer the final block transpose until the entire array is filled. Our latest results also suggest that Blocked Insertion with Immediate Transpose is more efficient than using a library out-of-place transposition routine (DGETMO) for L3-contained working sets. [Each thread transposes the section of the array it created.]

Parallel Longstaff-Schwartz Algorithm
● After simulation, each (complex) scenario is reduced to a single value using the Longstaff-Schwartz algorithm: scenario array (paths x timesteps) → Longstaff-Schwartz → a single value (e.g., 3.276517)
● Multi-threaded parallelization is needed in cases where the number of threads exceeds the number of scenarios
  – "Microbenchmarks", e.g., Delta, where 96 threads price 10 scenarios
  – Path Scaling: 93 scenarios are divided into 5 partitions due to limited memory; partitions contain 10 to 40 scenarios
● We may also use parallelization heuristically to improve load balancing, even when scenarios > threads

Longstaff-Schwartz (Least-Squares Monte Carlo [LSMC]) in a Nutshell
• Prior to expiration, the holder will exercise an in-the-money option if the future discounted cash flow from holding the option is expected to be less than the current value.
• The value of the option is maximized if the exercise happens as soon as this is true.
• LSMC estimates the future value by least-squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.
• LSMC is robust to the choice of basis functions; STAC-A2 specifies the use of polynomial basis functions.
[Chart: example polynomial (least-squares) curve fit of future payoff vs. current payoff (exercise now), with the "Current = Future" line separating the "Exercise Now" and "Defer Exercise" regions.]
The analysis is performed for each timestep. The Singular Value Decomposition (SVD) approach to least squares is preferred for numerical stability.

Longstaff-Schwartz Parallelization Schemes
By Timestep
[Diagram: threads leapfrog over timesteps t0 … t(timesteps−1), each timestep covering all paths.]
• Threads "leapfrog" backwards in time, computing SVDs in parallel
• Final evaluation is still serial: a thread must wait for the next timestep to complete before evaluating the regression
• Scalability is severely limited by the ratio T_SVD : T_eval
By Path-Partition
[Diagram: paths path0 … pathN−1 partitioned across Thread0 … Thread3.]
• Paths are partitioned by thread, similar to Monte Carlo simulation
• Threads synchronize (twice per timestep here) and complete each timestep in lock-step
• Scalability is limited by synchronization and communication overheads
• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario
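Below is a control-flow sketch of the path-partition scheme just described, with its two synchronization points per timestep. It is OpenMP-flavored for brevity (the production code need not use OpenMP), and the helper names in the comments (accumulatePowerSums, partialCoefficients, evaluateRegression) are hypothetical placeholders, not the benchmark's API.

    // Control-flow sketch only: two barriers per timestep, as described above.
    #include <omp.h>

    void lsmcPathParallel(int nTimesteps, int nThreads) {
        #pragma omp parallel num_threads(nThreads)
        {
            const int j = omp_get_thread_num();  // this thread's path partition
            (void)j;                             // used only by the placeholder helpers
            for (int t = nTimesteps - 1; t > 0; --t) {   // serial, backwards in time
                // accumulatePowerSums(j, t);   sums of powers of in-the-money prices (placeholder)
                #pragma omp barrier             // join 1: all partial sums available
                // partialCoefficients(j, t);   build R, SVD(R), then x_i(j) = b^T(j) W(j)_i (placeholder)
                #pragma omp barrier             // join 2: regression coefficients complete
                // evaluateRegression(j, t);    update the future value on partition j (placeholder)
            }
        }
    }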
Path-Parallel Least Squares Solution from QR Factorization and SVD
• Design matrix A_{paths×n}: row i is [1, p_i, p_i², …, p_i^{n−1}]. STAC-A2 specifies polynomial basis functions, hence A is a row Vandermonde matrix.
• The QR factorization of A: A_{paths×n} = Q_{paths×n} R_{n×n}. This is path-parallelizable in general for any basis functions; however, we use a fast QR factorization for Vandermonde matrices requiring trivial parallel communication.
• The SVD of R: R = U Σ V^T
• The SVD of A: A = (QU) Σ V^T
• Solution for the coefficients x^T minimizing ||Ax − b||₂, where b is the discounted future cash flow:
    x^T = b^T (QU) Σ^{-1} V^T = b^T_{1×paths} W_{paths×n}
• Partitioning b^T and W by paths,
    x^T = [ b^T(1)  b^T(2)  …  b^T(N) ] × [ W(1) ; W(2) ; … ; W(N) ]
  where W(j) has columns W(j)_1 … W(j)_n. Both b^T and W can be path-partitioned between N threads; each thread j computes n partial coefficients x_i(j) = b^T(j) W(j)_i for i = 1, …, n.

High-Level Sketch – Per-Thread Operations
Loop over timesteps:
    M_{i,j} = Σ_{k = firstPath_j}^{lastPath_j} p_k^{i−1}   for i = 1, …, 2n−1
  • Each thread j computes the sums of the first 2n−1 powers of its share of prices for in-the-money paths.
Thread join
    R = f(M_1, …, M_{2n−1})
    W(j) = g(A(j), SVD(R))
    x_i(j) = b^T(j) W(j)_i   for i = 1, …, n
  • Each thread j combines the partial sums to construct R and its SVD, which is then used to create the partial coefficients.
Thread join
    Use the regression A(j) x to compute the new future value.
• In total, 3 passes are made over the price data at each timestep, stressing caches and memory bandwidth as the number of paths per partition increases.
• The working set here consists of 3 column vectors: the price, the future value, and an index vector recording in-the-money paths.

All-Threads Cohorts vs. Small-Threads Cohorts
[Chart: Greeks throughput (paths per second*) by parallelization method and available memory bandwidth (Full 192 GB/s, Half 96 GB/s, Quarter 48 GB/s per socket) for 1M, 8M and 16M paths, each priced in ATC and STC mode.]
• All-Threads-Cohort (ATC) mode – using all threads to price each scenario – significantly improves performance and reduces memory-bandwidth dependency versus using cohorts with smaller numbers of threads (Small-Threads-Cohort, STC) for large numbers of paths.
• Speedups are primarily due to better load balancing.
• Memory effects are due to smaller cache footprints and ideal NUMA locality.
* Not a STAC-A2 benchmark

Least Squares Fitting
Given observed pairs (x_i, y_i), find a polynomial model which describes the relationship:
    y = a_0 + a_1 x + a_2 x² + … + a_n x^n    (1)
Expanding into matrix format:
    [ 1  x_1  …  x_1^n ]   [ a_0 ]   [ y_1 ]
    [ 1  x_2  …  x_2^n ] · [ a_1 ] = [ y_2 ]    (2)
    [ ⋮    ⋮        ⋮  ]   [  ⋮  ]   [  ⋮  ]
    [ 1  x_M  …  x_M^n ]   [ a_n ]   [ y_M ]
Or in matrix form:
    A · x = b    (3)
The least squares solution minimizes the residual:
    argmin_x ||r||₂² = argmin_x ||A · x − b||₂²    (4)

Solutions to Least Squares Problems
This requires more observed data points than the order of the polynomial, M ≫ n: an overdetermined problem. (Existence and uniqueness are skipped here.) Several methods exist to solve LS problems.
For example, the pseudo-inverse (normal equations) method:
    A^T A · x = A^T b    (5)
A^T A is square and of full rank, so we can invert it to compute x:
    x = (A^T A)^{-1} A^T b    (6)
This works fine for small n. Issues arise when n gets large: lots of data movement, and a bad condition number if not properly scaled.
Better: the SVD (singular value decomposition)
    A = U Σ V^T    (7)
Both U and V are unitary, and Σ is diagonal. Much more stable:
    x = V Σ^{-1} U^T b    (8)

Singular Value Decomposition via QR Decomposition
One path to the SVD is through the QR decomposition:
    A = Q [ R ]    (9)
          [ 0 ]
Q is orthogonal and R is upper triangular.
Classic methods: Householder transformations or Givens rotations. These are similar to Gaussian elimination for matrix factorization, but use orthogonal transformations. They are very much serial operations:
    [ a_11  a_12  …  a_1n ]      [ a*_11  a*_12  …  a*_1n ]
    [ a_21  a_22  …  a_2n ]  →   [   0    a*_22  …  a*_2n ]    (10)
    [  ⋮     ⋮          ⋮ ]      [   ⋮      ⋮          ⋮  ]
    [ a_n1  a_n2  …  a_nn ]      [   0    a*_n2  …  a*_nn ]

Demeure's QR Factorization Method
Take advantage of the special structure of A to speed up the QR factorization: for LS problems, A is a Vandermonde matrix.
Sketch of the strategy: the QR factorization is
    A = Q Σ B^T    (11)
with Q orthogonal, B upper triangular with 1 on the diagonal, and Σ diagonal with real values. From Eqn. (11), if E is the inverse of B^T:
    A E = Q Σ    (12)
If H = A^T A:
    H E = B Σ²    (13)
Also:
    A^T Q = B Σ    (14)
From Eqn. (13), recursively compute E in the column space of H; use E to compute Q from Eqn. (12); use Q to compute B from Eqn. (14).

Flow
The matrix is represented by the vector p:
    A = [ 1  p_1  p_1²  …  p_1^{n+1} ]
        [ 1  p_2  p_2²  …  p_2^{n+1} ]  = [ 1  p  p²  …  p^{n+1} ]    (15)
        [ ⋮    ⋮    ⋮          ⋮     ]
        [ 1  p_M  p_M²  …  p_M^{n+1} ]

procedure VandermondeQR
    // initialization
    σ₁ ← M                                  (the number of rows of A)
    for j in 2, …, 2n−1 do
        B_{j,1} ← ||p^{j−1}||₁ / σ₁
    μ₁ ← B_{2,1}
    ν₁ ← B_{3,1}
    σ₂ ← σ₁ (ν₁ − μ₁²)
    for j in 3, …, 2n−2 do
        B_{j,2} ← (σ₁/σ₂) (B_{j+1,1} − μ₁ B_{j,1})
    Q_{:,1} ← 1 / √σ₁
    Q_{:,2} ← (p − μ₁) / √σ₂
    // main recursion loop
    for k in 3, …, n do
        μ_{k−1} ← B_{k,k−1}
        ν_{k−1} ← B_{k+1,k−1}
        σ_k ← σ_{k−1} (ν_{k−1} − ν_{k−2} + μ_{k−1} (μ_{k−2} − μ_{k−1}))
        for j in k+1, …, 2n−k do
            B_{j,k} ← (σ_{k−1}/σ_k) (B_{j+1,k−1} − B_{j,k−2} + (μ_{k−2} − μ_{k−1}) B_{j,k−1})
        Q_{:,k} ← √(σ_{k−1}/σ_k) · ((p + μ_{k−2} − μ_{k−1}) · Q_{:,k−1} − √(σ_{k−1}/σ_{k−2}) · Q_{:,k−2})
    Σ ← diag(√σ₁, √σ₂, …, √σ_n)
    B ← B_{1:n,:}

Characteristics
Many operations are local, with no large data movements. The method can be parallelized easily.
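To tie the appendix back to the per-thread power sums used earlier: because A is a Vandermonde matrix, the normal-equation matrix A^T A has entries that are just sums of powers of p, and A^T b needs sums of p^i · b, so both reduce to local accumulations over the data. The sketch below follows the simple pseudo-inverse route of Eqns. (5)-(6), solved with plain Gaussian elimination, so it is only appropriate for small polynomial order (the conditioning caveat noted above); it is an illustration, not the Demeure-based code used in our solution, and all names are assumptions.

    // Illustrative polynomial least-squares fit via the normal equations.
    // s[i] = sum_m p_m^i and rhs[i] = sum_m p_m^i * b_m are local accumulations;
    // (A^T A)_{i,k} = s[i+k]. Gaussian elimination without pivoting: sketch only.
    #include <cstddef>
    #include <vector>

    std::vector<double> polyFitNormalEq(const std::vector<double>& p,
                                        const std::vector<double>& b, int n) {
        std::vector<double> s(2 * n + 1, 0.0), rhs(n + 1, 0.0);
        for (std::size_t m = 0; m < p.size(); ++m) {     // assumes b.size() == p.size()
            double pw = 1.0;
            for (int i = 0; i <= 2 * n; ++i) {
                s[i] += pw;
                if (i <= n) rhs[i] += pw * b[m];
                pw *= p[m];
            }
        }
        std::vector<std::vector<double>> H(n + 1, std::vector<double>(n + 1));
        for (int i = 0; i <= n; ++i)
            for (int k = 0; k <= n; ++k) H[i][k] = s[i + k];
        std::vector<double> x = rhs;
        for (int i = 0; i <= n; ++i)                     // forward elimination
            for (int r = i + 1; r <= n; ++r) {
                const double f = H[r][i] / H[i][i];
                for (int k = i; k <= n; ++k) H[r][k] -= f * H[i][k];
                x[r] -= f * x[i];
            }
        for (int i = n; i >= 0; --i) {                   // back substitution
            for (int k = i + 1; k <= n; ++k) x[i] -= H[i][k] * x[k];
            x[i] /= H[i][i];
        }
        return x;                                        // coefficients a_0 .. a_n
    }

In a path-partitioned setting, each thread would accumulate its own s and rhs over its share of paths and the per-thread results would be summed at a barrier, which is the same communication pattern as the M sums in the per-thread sketch earlier.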