
IBM Research
STAC-A2™ Benchmark on POWER8®
Bishop Brock, Frank Liu, Karthick Rajamani
IBM Research, Austin, Texas
{bcbrock,frankliu,karthick}@us.ibm.com
Workshop for High-Performance Computational Finance
WHPCF'15
November 20, 2015
With thanks to Kenneth Hill of the University of Florida and Julian Demouth of NVIDIA
© 2015 IBM Corporation
Summary
● Workload characterization and performance analysis demonstrate that STAC-A2™ is a well-rounded single-system HPC benchmark for risk analytics
● POWER8-based Linux® systems demonstrate consistently high performance at every scale (and we explain why)
→ Current POWER8-based systems are capable and competitive platforms for computational finance
Simplified POWER8 Core: SMT2, SMT4, SMT8
[Core block diagram, SMT2 and above: the execution units (128-bit DP SIMD floating-point pairs, FXU, LSU, LU per side, plus BR, CR, and Crypto), two unified issue queues (UniQ0/UniQ1), register files Regs0/Regs1, and Thread Set 0 / Thread Set 1 dispatch slots (non-branch + branch).]
● Execution units are split between the UniQs (thread sets)
● Thread sets only dispatch to 'their' UniQ
● Register files/renames are shared equally by thread set
● Up to 8 threads in 2 thread sets; each thread set can dispatch/complete up to 3 non-branch + 1 branch per cycle
STAC® and the STAC-A2™ Benchmark
● Securities Technology Analysis Center (STAC)
  – Coordinates the STAC Benchmark Council™
  – Publishes and audits several benchmarks covering technologies important to capital markets
● STAC-A2 Benchmark: Risk Management
  – Computes numerous sensitivities (Greeks) of a multi-asset, exotic call option (lookback, best-of) with early exercise
  – Modeled by Monte Carlo simulation of the Heston stochastic volatility model
  – Priced using the Longstaff-Schwartz algorithm
  – Greeks computed using finite differences
The STAC-A2 benchmark suite measures the performance, scalability, quality and energy
efficiency of any hardware/software system that is capable of implementing the benchmark
specification. In this presentation we focus only on the performance and scalability benchmarks.
STAC-A2 Benchmark Dimensions and Challenges
● Performance/Scalability Dimensions
  – Assets: the number of correlated assets: O(assets²)
  – Timesteps: granularity of discretization: O(timesteps)
  – Paths: number of Monte Carlo paths: O(paths)
● Challenges
  – Unit-normal random number generation
  – Dense matrix operations (custom correlation routine, abundant ILP)
  – SQRT/DIVIDE-intensive Monte Carlo kernel (little ILP; see the sketch below)
  – Cache-efficient data management
  – Efficient numerical methods (custom SVD routine)
  – Bandwidth-limited performance in certain benchmarks
  – Single-system parallel load balancing
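To make the SQRT pressure in the Monte Carlo kernel concrete, the following is a minimal sketch of one full-truncation Euler step of the Heston model for a single path; the parameter names, the discretization choice, and how the correlated normals are supplied are illustrative assumptions, not the STAC-A2 reference or our implementation.

```cpp
#include <algorithm>
#include <cmath>

// One full-truncation Euler step of the Heston model for a single path.
// Illustrative assumptions only: names and discretization are not the
// STAC-A2 reference or the benchmark implementation.
struct HestonParams {
    double kappa, theta, xi;   // mean-reversion speed, long-run variance, vol-of-vol
    double rho;                // correlation between asset and variance shocks
    double r;                  // risk-free rate
};

inline void heston_euler_step(double& s, double& v, double dt,
                              const HestonParams& p, double z1, double z2)
{
    // Correlate the two unit-normal draws (z1, z2).
    double zs = z1;
    double zv = p.rho * z1 + std::sqrt(1.0 - p.rho * p.rho) * z2;

    double v_pos   = std::max(v, 0.0);   // full truncation of the variance
    double sqrt_v  = std::sqrt(v_pos);
    double sqrt_dt = std::sqrt(dt);

    // Log-Euler asset update and Euler variance update: a dependent chain
    // through sqrt and exp with little surrounding work to hide it.
    s *= std::exp((p.r - 0.5 * v_pos) * dt + sqrt_v * sqrt_dt * zs);
    v += p.kappa * (p.theta - v_pos) * dt + p.xi * sqrt_v * sqrt_dt * zv;
}
```

Each path step issues square roots on a dependent chain with little other work around them, which is the "little ILP" problem called out above.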
IBM STAC-A2 Solution Structure
[Pipeline diagram: RNG → Complex Scenario Generation → Monte Carlo Simulation → scenario arrays → Longstaff-Schwartz Pricing]
Monte Carlo simulation produces a number of paths × timesteps arrays (scenarios) which must be stored for later analysis.
Finite difference example:   Θ = (y_Δt − y) / Δt
where y is the unmodified option value scenario and y_Δt is the modified-expiration scenario.
A master thread (a) spawns multiple worker threads that perform Monte Carlo simulation in parallel (b). Simulation is partitioned by paths; each thread creates the same path-partition of every array. Scenarios are (generated and) priced by cohorts of 1 or more threads (c). Finally, the master thread computes the finite differences (d).
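As a concrete illustration of step (d), each Greek is a small combination of scenario prices. The sketch below is an assumption for illustration (the names and the central-difference Delta are not taken from the benchmark specification); only the Θ formula comes from the slide above.

```cpp
// Illustrative assumptions only: Greeks formed as finite differences of the
// scenario prices produced by step (c).
struct GreekInputs {
    double y;       // unmodified option value scenario
    double y_dt;    // scenario with the expiration moved by dt
    double y_up;    // scenario with one asset price bumped up by h
    double y_down;  // scenario with the same asset bumped down by h
};

inline double theta(const GreekInputs& g, double dt) {
    return (g.y_dt - g.y) / dt;              // Θ = (y_Δt − y) / Δt
}

inline double delta_central(const GreekInputs& g, double h) {
    return (g.y_up - g.y_down) / (2.0 * h);  // standard central difference
}
```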
Key “Greeks” Workloads + Characterization
Workload      | Goal       | Assets | Timesteps | Paths | Result  | Scenarios | Memory*
Baseline      | Speed      | 5      | 252       | 25K   | 0.317 s | 93        | 2.5 GiB
Large Problem | Speed      | 10     | 1260      | 100K  | 28.9 s  | 308       | 143 GiB
Asset Scaling | Max Assets | *      | 252       | 25K   | 78      | 15642*    | 309 GiB
Path Scaling  | Max Paths  | 5      | 252       | *     | 28M     | 93        | 2.8 TiB

* Values unique to our solution
[Chart: Approximate profile breakdown by application phase (RNG + Correlation, Monte Carlo (Heston Model), Array Transpose, Longstaff-Schwartz Pricing) for the Baseline, Large Problem, Asset Scaling, and Path Scaling workloads, POWER8 S824*]
* 2 x POWER8 @ 3.52 GHz (nominal)/ 3.925 GHz (turbo); 24 total cores; 1 TiB memory; IBM XL C/C++; RHEL 7 Big-endian
Memory Bandwidth Sensitivity
[Chart: POWER8 S824 relative performance vs. available memory bandwidth* (Full = 192 GB/s per socket, Half = 96 GB/s, Quarter = 48 GB/s) for the Baseline, Large Problem, Asset Scaling (65 Assets), and Path Scaling (8M Paths/STC) workloads. Relative to full bandwidth, performance stays between 0.93 and 1.0 at half bandwidth, and falls to between 0.82 and 0.44 at quarter bandwidth, depending on workload.]
*Not a STAC-A2 benchmark
Performance Comparisons – STAC-A2 Benchmarks
[Charts: relative performance on the Baseline and Large Problem workloads, plus Max Assets (Asset Scaling) and Max Paths (Path Scaling), for 2 x POWER8, 2 x Haswell-EP + 2 x Xeon PHI, 4 x Haswell-EX, NVIDIA K80, 2 x Haswell-EP + Xeon PHI, and 2 x Haswell-EP systems; see the table below for system details.]
STAC SUT*  | Description
IBM150305  | 2 x IBM POWER8 @ 3.52 GHz; 24 total cores; 1 TiB Memory
INTC151028 | 2 x Intel Xeon E5-2697 v3 @ 2.6 GHz + 2 x Intel Xeon PHI 7120P @ 1.24 GHz; 28 Haswell cores + 122 PHI cores; 256 GiB Memory
INTC150811 | 4 x Intel Xeon E7-8890 v3 @ 2.50 GHz; 72 total cores; 1 TiB Memory
NVDA141116 | 1 x NVIDIA Tesla K80 GPU; Supermicro SYS-2027GR-TRFH host; 128 GiB Memory
INTC140815 | 2 x Intel Xeon E5-2699 v3 @ 2.3 GHz + 1 x Intel Xeon PHI 7120A @ 1.24 GHz; 36 Haswell cores + 61 PHI cores; 256 GiB Memory
INTC140814 | 2 x Intel Xeon E5-2699 v3 @ 2.3 GHz; 36 total cores; 256 GiB Memory
*For details see www.stacresearch.com/<STAC SUT>
Intel, Xeon, PHI, NVIDIA, Tesla and Supermicro are trademarks or registered trademarks of their respective owners and are used for identification purposes only
Performance by SMT Mode
[Chart: SMT mode performance relative to SMT1 for the STAC-A2 Baseline Greeks workload*, on 1 core, 1 socket (12 cores), and 2 sockets (24 cores). SMT2 delivers roughly 1.6x SMT1 performance; SMT4 and SMT8 deliver roughly 2.0x-2.1x, at each scale.]
*Not a STAC-A2 benchmark
System-Level Scalability Comparisons
[Chart: scaling single-core performance to half- and full-system performance*, expressed as the fraction of the theoretical, core-count-based speedup (higher is better, 1.0 = ideal), for the Core → Half System and Core → Full System experiments. Systems: 2 x POWER8 (half = 12 cores, full = 24 cores, SMT4); 4 x Haswell-EX (half = 36 cores, full = 72 cores, HyperThreading); 2 x Haswell-EP (half = 18 cores, full = 36 cores, HyperThreading).]
*Data is from the official STAC-A2 audit reports for these systems; however, these are not considered STAC-A2 benchmarks. Please see the Performance Comparisons table above for full details of the systems being compared here.
Optimizing Transpose for Monte Carlo Simulation
Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz
pricing effectively requires transposing time-major data into path-major storage.
[Diagram: Simulation generates the data time-major (t_0 … t_timesteps-1 for paths 0 … N-1); Pricing analyzes the same data path-major.]
Our solution: a blocked, in-place matrix transpose using no additional storage, taking advantage of the fact that (post-processed) simulation data is created one path (pair) at a time. Simple and relatively efficient despite:
• Moderately poor locality
• Requiring a second pass over the data
[Diagram: block-rows of 8 path pairs (Path Pair n+0 … n+7) laid out in the blocked destination array.]
Simple heuristics based on working-set size improve performance for large working sets. The transpose accounts for 2%-9% of total benchmark run time.
Fast, Parallel Longstaff-Schwartz Algorithm
• Least-Squares Monte Carlo (LSMC, or Longstaff-Schwartz) is an algorithm for valuing early-exercise options.
• LSMC operates serially, backwards in time.
• At each time step, LSMC estimates the optimal exercise value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.
• We describe a fast, easily parallelizable LSMC method based on a QR factorization algorithm specific to row Vandermonde matrices.
• This approach is used for "microbenchmarks" and for Path Scaling, where utilizing every thread optimizes performance.
• This approach was inspired by NVIDIA's description of their LSMC algorithm for STAC-A2.
[Diagram: the paths (path_0 … path_N-1, over t_0 … t_timesteps-1) are partitioned among Thread0 … Thread3.]
• Paths are partitioned by thread, similar to Monte Carlo simulation
• Threads synchronize (twice per timestep here) and complete each timestep in lock-step
• Scalability is limited by synchronization and communication overheads
• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario
Future Work
● Exploiting the high-bandwidth CPU ↔ GPU NVLink™ unique to future OpenPOWER systems
[Image: logos of current OpenPOWER Foundation Gold Members*]
* http://www.smartercomputingblog.com/power-systems/openpower-hpc-big-data-analytics/
Special Notices
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public
sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document
does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation,
New Castle Drive, Armonk, NY 10504-1785 USA.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals
and objectives only.
The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no
warranties or guarantees either expressed or implied.
All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be
used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on
individual client configurations and conditions.
Any performance data contained in this document was determined in a controlled environment. Actual results may vary
significantly and are dependent on many factors including system hardware configuration and software design and configuration.
Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these
measurements will be the same on generally-available systems. Some measurements quoted in this document may have been
estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.
Other company, product and service names may be trademarks or service marks of others and are used for identification purposes
only.
Backup
POWER8 Simultaneous Multi-threading (SMT)
● SMT modes represent different partitionings of core resources: fetch/dispatch, register files, execution pipes
● SMT mode is determined by the number of active hardware threads per core, regardless of the number of threads configured per core
  – The SMT mode switch is automatic as threads enter/exit idle states
  – An enhancement over POWER7
● SMT modes SMT1, SMT2, SMT4 and SMT8 are entered when 1, 2, 3-4, and 5-8 threads are active, respectively
● Most STAC-A2 benchmarks run in SMT4; some in SMT2
  – SMT4 here means we bind 4 software threads to 4 of the 8 configured HW threads per core, as in the sketch below
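A minimal sketch of what such binding could look like on Linux; the logical-CPU numbering (core*8 + hw_thread) and the choice of which 4 of the 8 hardware threads to use are assumptions for illustration, and production code would discover the topology from sysfs instead.

```cpp
#define _GNU_SOURCE            // for CPU_ZERO/CPU_SET and pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>

// Bind the calling software thread to one hardware thread (logical CPU).
// The core*8 + hw_thread numbering is an assumption; real code should read
// the topology (e.g., /sys/devices/system/cpu) instead of hard-coding it.
static int bind_to_hw_thread(int core, int hw_thread)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core * 8 + hw_thread, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```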
Simplified POWER8 Core: SMT1
[Core block diagram, single thread (SMT1): the same execution units, unified issue queues UniQ0/UniQ1, and dispatch slots, with the register file mirrored.]
● Execution units: pairs of Double-Precision Floating-Point pipes implement 2-way DP SIMD; 1 FiXed-point Unit, 1 Load-Store Unit, and 1 Load Unit per side. Each DP FP pipe can also function as 2 x Single-Precision FP.
● Unified issue queues; BRanch and Condition Register have private queues; Crypto is issued from either UniQ.
● Register files/renames are mirrored in SMT1, allowing use of either UniQ.
● Dispatch/Complete up to 6 non-branch + 2 branch per cycle.
Optimizing Data Transpose for Monte Carlo Sim.
Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage. A traditional out-of-place rectangular transpose requires too much memory; a traditional in-place transpose is too slow.
[Diagram: data generated time-major by Simulation vs. data analyzed path-major by Pricing, as on the earlier transpose slide.]
Simple insertion of time-major data into a path-major array is a disaster:
• Cache lines are only partially populated when touched
• A line may be evicted to memory before being touched again
• Each line is in a unique virtual page
[Diagram: under simple insertion, each generated path pair is inserted element by element into the path-major destination.]
Instead we use the blocked approach illustrated on the next slide.
Illustration of In-Place Blocked Transpose for a 16x16 Block-Row
[Diagram: one block-row holding 16 paths (path pairs n+0 … n+7) by all timesteps, stored as 16x16 blocks (t_0-t_15, t_16-t_31, t_32-t_47, …). Each generated path pair is inserted so that its values fill whole block rows, leaving each block transposed; a (possibly deferred) transpose of every 16x16 block then completes the reordering.]
Path pairs (original + antithetic) are simulated together and inserted into the blocked (16 x 16) and padded destination array block-row
such that the blocks are transposed. (Eventually) transposing the blocks then brings the data into the correct orientation.
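A minimal sketch of this scheme, under simplifying assumptions (double data, 16x16 blocks so that 16 doubles fill one 128-byte cache line, path and timestep counts that are multiples of 16, and a destination stored block by block); the names and layout details are illustrative, not the benchmark code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hedged sketch of the blocked, in-place transpose.  dst.size() must equal
// paths * timesteps; it is organized as (paths/B) block-rows, each holding
// (timesteps/B) contiguous B*B blocks.
constexpr std::size_t B = 16;

inline double* block(std::vector<double>& dst, std::size_t timesteps,
                     std::size_t path_block, std::size_t time_block)
{
    return dst.data() + (path_block * (timesteps / B) + time_block) * B * B;
}

// Insert one generated path (its full time series) into its block-row.
// Each group of B timestep values is written as one contiguous B-element
// row of a block, so every store fills a whole cache line -- but this
// leaves each block transposed (row = local path, column = local time).
void blocked_insert(std::vector<double>& dst, std::size_t timesteps,
                    std::size_t path, const double* series)
{
    const std::size_t pb = path / B, j = path % B;
    for (std::size_t tb = 0; tb < timesteps / B; ++tb) {
        double* blk = block(dst, timesteps, pb, tb);
        for (std::size_t i = 0; i < B; ++i)
            blk[j * B + i] = series[tb * B + i];
    }
}

// In-place transpose of every B x B block: afterwards, row = local time and
// column = local path, so pricing reads one timestep across 16 paths from a
// single contiguous cache line.
void transpose_blocks(std::vector<double>& dst, std::size_t timesteps,
                      std::size_t paths)
{
    for (std::size_t pb = 0; pb < paths / B; ++pb)
        for (std::size_t tb = 0; tb < timesteps / B; ++tb) {
            double* blk = block(dst, timesteps, pb, tb);
            for (std::size_t i = 0; i < B; ++i)
                for (std::size_t k = i + 1; k < B; ++k)
                    std::swap(blk[i * B + k], blk[k * B + i]);
        }
}
```

Whether transpose_blocks runs as soon as a block-row fills or is deferred until the entire array is filled is the immediate-vs-deferred heuristic evaluated on the next slide.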
Transpose Performance
[Chart: performance of Monte Carlo simulation by transpose method (Blocked Insertion with Immediate Transpose = 1.00 reference†, Blocked Insertion with Deferred Transpose, Simple Insertion, Buffered Insertion*) for six workloads and working sets: Baseline (6.4 MiB), Large Problem (93.5 MiB), Asset Scaling (797 MiB), Path Scaling (6.4 MiB), 5:126:25000 (3.2 MiB), and 5:1260:25000 (32.0 MiB). Not STAC-A2 benchmarks.]
Any technique other than Simple Insertion gives good performance, and all techniques benefit from manual prefetching. 2%-9% of total benchmark run time is currently spent doing this transpose (varies by workload).
If the working set (16 paths per simultaneously generated array, per thread, per core) appears to fit in the 8 MiB L3, it is advantageous to transpose each block-row as soon as it is filled; otherwise it is better to defer the final block transpose until the entire array is filled.
Our latest results also suggest that Blocked Insertion with Immediate Transpose is more efficient than using a library out-of-place transposition routine (DGETMO) for L3-contained working sets. [Each thread transposes the section of the array it created.]†
† Not a STAC-A2 Benchmark
* Discussed in the paper
Parallel Longstaff-Schwartz Algorithm
● After simulation, each (complex) scenario is reduced to a single value using the Longstaff-Schwartz algorithm
  [Diagram: Scenario Array (paths x timesteps) → Longstaff-Schwartz → 3.276517]
● Multi-threaded parallelization is needed in cases where the number of threads exceeds the number of scenarios
  – "Microbenchmarks", e.g., Delta, where 96 threads price 10 scenarios
  – Path Scaling: 93 scenarios are divided into 5 partitions due to limited memory; partitions contain 10 to 40 scenarios
● We may also use parallelization heuristically to improve load balancing, even when scenarios > threads
Longstaff-Schwartz (Least-Squares Monte Carlo [LSMC]) in a Nutshell
• Prior to expiration the holder will exercise an in-the-money option if the future discounted cash flow from holding the option is expected to be less than the current value.
• The value of the option is maximized if the exercise happens as soon as this is true.
• LSMC estimates the future value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.
• LSMC is robust to the choice of basis functions; STAC-A2 specifies the use of polynomial basis functions.
[Figure: Longstaff-Schwartz example, a least-squares polynomial fit of future payoff vs. current payoff (exercise now), with the Current = Future line separating the "Exercise Now" and "Defer Exercise" regions.]
Analysis is performed for each timestep. The Singular Value Decomposition (SVD) approach to least-squares is preferred for numerical stability.
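A hedged sketch of the per-timestep decision, assuming the continuation value has already been fitted as a polynomial in the current price (coefficients coeff, lowest degree first); the function names and the cash-flow bookkeeping are simplified assumptions, not the benchmark code.

```cpp
#include <cstddef>
#include <vector>

// Horner evaluation of the fitted continuation-value polynomial.
inline double poly_eval(const std::vector<double>& coeff, double p)
{
    double v = 0.0;
    for (std::size_t i = coeff.size(); i-- > 0;)
        v = v * p + coeff[i];
    return v;
}

// Exercise decision for one timestep (illustrative names and bookkeeping):
//  price    : current asset price per path
//  payoff   : immediate exercise value per path
//  cashflow : future value per path, already discounted to this timestep
void apply_exercise_decision(const std::vector<double>& price,
                             const std::vector<double>& payoff,
                             const std::vector<double>& coeff,
                             std::vector<double>& cashflow)
{
    for (std::size_t k = 0; k < price.size(); ++k) {
        if (payoff[k] <= 0.0)
            continue;                              // out of the money: never exercise here
        double continuation = poly_eval(coeff, price[k]);
        if (payoff[k] > continuation)              // exercise as soon as it pays
            cashflow[k] = payoff[k];               // replace future value with exercise value
    }
}
```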
Longstaff-Schwartz Parallelization Schemes
By Timestep
[Diagram: each thread handles a different timestep of the whole path set.]
• Threads "leapfrog" backwards in time, computing SVDs in parallel
• Final evaluation is still serial: a thread must wait for the next timestep to complete before evaluating the regression
• Scalability is severely limited by the ratio T_SVD : T_eval

By Path-Partition
[Diagram: the paths are partitioned among Thread0 … Thread3; all threads work on the same timestep.]
• Paths are partitioned by thread, similar to Monte Carlo simulation
• Threads synchronize (twice per timestep here) and complete each timestep in lock-step
• Scalability is limited by synchronization and communication overheads
• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario
Path-Parallel Least Squares Solution from QR Factorization and SVD
A_{paths×n}:  row i = [ 1   p_i   p_i^2   …   p_i^(n−1) ]

• Design matrix A: STAC-A2 specifies polynomial basis functions, hence A is a row Vandermonde matrix

A_{paths×n} = Q_{paths×n} R_{n×n}

• The QR factorization of A. This is path-parallelizable in general for any basis functions; however, we use a fast QR factorization for Vandermonde matrices requiring trivial parallel communication

R = U Σ V^T

• The SVD of R

A = (QU) Σ V^T

• The SVD of A

x^T = b^T [ (QU) Σ^(−1) V^T ] = b^T W_{paths×n}

• Solution for the coefficients x^T minimizing ‖Ax − b‖_2, where b is the discounted future cash flow

x^T = [ b^T(1)  b^T(2)  …  b^T(N) ] × [ W(1) ; W(2) ; … ; W(N) ],   where W(j) has columns W(j)_1 … W(j)_n

• Both b^T and W can be path-partitioned between N threads. Each thread j computes n partial coefficients x_i(j) = b^T(j) W(j)_i for i = 1, …, n
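The final coefficients are then just the sum of the per-thread partials, which is the only inter-thread communication the regression needs; in the notation above:

```latex
x_i \;=\; \sum_{j=1}^{N} x_i(j) \;=\; \sum_{j=1}^{N} b^{T}(j)\, W(j)_{i},
\qquad i = 1, \dots, n
```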
High-Level Sketch – Per-thread Operations
Loop over timesteps:

    M_{i,j} = Σ_{k = firstPath_j}^{lastPath_j} p_k^(i−1)   for i = 1, …, 2n−1

    • Each thread j computes the sums of the first 2n−1 powers of its share of prices for the in-the-money paths

    Thread Join

    R = f(M_1, …, M_{2n−1})
    W(j) = g(A(j), SVD(R))
    x_i(j) = b^T(j) W(j)_i   for i = 1, …, n

    • Each thread j combines the partial sums to construct R and its SVD, which is then used to create partial coefficients

    Thread Join

    Use the regression A(j)·x to compute the new future value

• In total, 3 passes are made over the price data at each timestep, stressing caches and memory bandwidth as the number of paths per partition increases.
• The working set here consists of 3 column vectors: the price, the future value, and an index vector recording the in-the-money paths.
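A hedged sketch of the per-thread structure of one timestep is shown below; the helper types and functions standing in for f, g, and the SVD (RFactor, build_R_and_svd, partial_W) are placeholder assumptions rather than real APIs, and the thread joins/reductions are assumed to be provided by the surrounding runtime.

```cpp
#include <cstddef>
#include <vector>

// Placeholders for the math on the slide (definitions omitted in this sketch).
struct RFactor { /* R (n x n) and its SVD; contents elided */ };
RFactor build_R_and_svd(const std::vector<double>& M);   // R = f(M_1..M_{2n-1}), then SVD(R)
std::vector<std::vector<double>> partial_W(const RFactor& R,
                                           const std::vector<double>& prices,
                                           const std::vector<std::size_t>& itm,
                                           std::size_t n); // columns W(j)_1..W(j)_n

// Pass 1 (thread j): partial sums of the first 2n-1 powers of this thread's
// in-the-money prices.  Summed across threads at the first join.
std::vector<double> power_sums(const std::vector<double>& prices,
                               const std::vector<std::size_t>& itm, std::size_t n)
{
    std::vector<double> M(2 * n - 1, 0.0);
    for (std::size_t k : itm) {
        double pw = 1.0;
        for (std::size_t i = 0; i < M.size(); ++i) { M[i] += pw; pw *= prices[k]; }
    }
    return M;
}

// Pass 2 (thread j, after the first join): build R and its SVD from the
// combined sums, then form partial coefficients x_i(j) = b^T(j) W(j)_i.
// The partials are summed across threads at the second join.
std::vector<double> partial_coefficients(const std::vector<double>& M_total,
                                         const std::vector<double>& prices,
                                         const std::vector<double>& future,   // b(j)
                                         const std::vector<std::size_t>& itm,
                                         std::size_t n)
{
    RFactor R = build_R_and_svd(M_total);
    auto W = partial_W(R, prices, itm, n);
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t idx = 0; idx < itm.size(); ++idx)
            x[i] += future[itm[idx]] * W[i][idx];
    return x;
}

// Pass 3 (thread j, after the second join): use the regression A(j)·x to
// compute the new future value for this thread's paths (exercise test omitted).
void update_future(const std::vector<double>& x, const std::vector<double>& prices,
                   const std::vector<std::size_t>& itm, std::vector<double>& future)
{
    for (std::size_t k : itm) {
        double pw = 1.0, cont = 0.0;
        for (double xi : x) { cont += xi * pw; pw *= prices[k]; }
        future[k] = cont;
    }
}
```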
All-Threads Cohorts vs. Small-Threads Cohorts
[Chart: Greeks throughput (paths per second*) by parallelization method and available memory bandwidth (Full = 192 GB/s per socket, Half = 96 GB/s, Quarter = 48 GB/s) for 1M, 8M, and 16M paths, ATC vs. STC.]
• All-Threads-Cohort (ATC) mode – using all threads to price each scenario – significantly improves performance and reduces memory bandwidth dependency vs. using cohorts with smaller numbers of threads (Small-Threads-Cohort, STC) for large numbers of paths.
• Speedups are primarily due to better load balancing.
• Memory effects are due to smaller cache footprints and ideal NUMA locality.
[Diagram: in ATC mode the paths of one scenario (path_0 … path_N-1, over t_0 … t_timesteps-1) are partitioned across all threads (Thread0 … Thread3).]
* Not a STAC-A2 Benchmark
Least Squares Fitting
Given observed pairs (x_i, y_i), find a polynomial model which describes the relationship:

    y = a_0 + a_1·x + a_2·x^2 + … + a_n·x^n                            (1)

Expanding into matrix format:

    [ 1  x_1  …  x_1^n ]   [ a_0 ]   [ y_1 ]
    [ 1  x_2  …  x_2^n ] · [ a_1 ] = [ y_2 ]                           (2)
    [ ⋮   ⋮        ⋮   ]   [  ⋮  ]   [  ⋮  ]
    [ 1  x_M  …  x_M^n ]   [ a_n ]   [ y_M ]

Or in matrix form:

    A·x = b                                                            (3)

The least squares solution minimizes the residue:

    arg min_x ‖r‖_2^2 = arg min_x ‖A·x − b‖_2^2                        (4)
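For concreteness (an added illustration, not from the deck): fitting a quadratic to M points makes AᵀA a matrix of power sums of the x_i, which is why the per-thread regression work earlier in this deck reduces to summing powers of the prices (there n counts basis functions, so 2n−1 power sums suffice):

```latex
A^{T}A \;=\;
\begin{pmatrix}
M              & \sum_i x_i     & \sum_i x_i^{2} \\
\sum_i x_i     & \sum_i x_i^{2} & \sum_i x_i^{3} \\
\sum_i x_i^{2} & \sum_i x_i^{3} & \sum_i x_i^{4}
\end{pmatrix},
\qquad
A^{T}b \;=\;
\begin{pmatrix}
\sum_i y_i \\ \sum_i x_i y_i \\ \sum_i x_i^{2} y_i
\end{pmatrix}
```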
Solutions to Least Squares Problems
Requires more observed data points than the order of the polynomial: M > n, an overdetermined problem.
Skipped existence and uniqueness ...
There are several methods to solve LS problems. For example, the pseudo-inverse (normal equations) method:

    A^T A · x = A^T b                                                  (5)

A^T A is square and full rank, so we can invert it to compute x:

    x = (A^T A)^(−1) A^T b                                             (6)

This works fine for small n. Issues arise when n gets large: lots of data movement, and a bad condition number if not properly scaled.

Better: the SVD (singular value decomposition)

    A = U Σ V^T                                                        (7)

Both U and V are unitary, and Σ is diagonal. Much more stable:

    x = V Σ^(−1) U^T · b                                               (8)
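One way to see why the normal equations are fragile (a standard fact stated here for context): forming AᵀA squares the condition number, while the SVD route works with A directly:

```latex
\kappa_2\!\left(A^{T}A\right) \;=\; \kappa_2(A)^{2},
\qquad
\kappa_2(A) \;=\; \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}
```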
Singular Value Decomposition via QR decomposition
One path to the SVD is through the QR decomposition:

    A = Q [ R ]                                                        (9)
          [ 0 ]

Q is orthogonal and R is upper triangular.

Classic methods: Householder transformations or Givens rotations. Similar to Gaussian elimination for matrix factorization, but using orthogonal transformations. Very much a serial operation:

    [ a_11  a_12  …  a_1n ]       [ a*_11  a*_12  …  a*_1n ]
    [ a_21  a_22  …  a_2n ]   →   [ 0      a*_22  …  a*_2n ]           (10)
    [  ⋮     ⋮         ⋮  ]       [  ⋮      ⋮          ⋮   ]
    [ a_n1  a_n2  …  a_nn ]       [ 0      a*_n2  …  a*_nn ]
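For reference (standard material, not from the deck): one Householder step applies the orthogonal reflector below, chosen so that it zeroes the first column under the diagonal as sketched in (10):

```latex
H \;=\; I \;-\; 2\,\frac{v\,v^{T}}{v^{T}v},
\qquad
v \;=\; a_{1} \;-\; \lVert a_{1}\rVert_{2}\, e_{1},
\qquad
H\,a_{1} \;=\; \lVert a_{1}\rVert_{2}\, e_{1}
```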
Demeure’s QR Factorization Method
Take advantage of the special structure of A to speed up the QR factorization.

For LS problems, the A matrix is a Vandermonde matrix.

Sketch of the strategy: the QR factorization is written as

    A = Q Σ B^T                                                        (11)

with Q orthogonal, B upper triangular with 1 on the diagonal, and Σ diagonal with real values.

From Eqn. (11), if E is the inverse of B^T:

    A E = Q Σ                                                          (12)

If H = A^T A:

    H E = B Σ^2                                                        (13)

Also

    A^T Q = B Σ                                                        (14)

From Eqn. (13), recursively compute E as the column space of H; use E to compute Q from Eqn. (12); use Q to compute B from Eqn. (14).
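A one-line check of (13) and (14) from (11), making the implicit step explicit: since QᵀQ = I and BᵀE = I,

```latex
H E \;=\; A^{T}A\,E \;=\; \bigl(B\Sigma Q^{T}\bigr)\bigl(Q\Sigma B^{T}\bigr)E
      \;=\; B\Sigma^{2}B^{T}E \;=\; B\Sigma^{2},
\qquad
A^{T}Q \;=\; B\Sigma Q^{T}Q \;=\; B\Sigma
```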
Flow
The Vandermonde matrix is represented by the vector p:

    A = [ 1   p_1   p_1^2   …   p_1^(n+1) ]
        [ 1   p_2   p_2^2   …   p_2^(n+1) ]   =   [ 1   p   p^2   …   p^(n+1) ]        (15)
        [ ⋮    ⋮     ⋮            ⋮       ]
        [ 1   p_M   p_M^2   …   p_M^(n+1) ]

procedure VandermondeQR                                  ▷ initialization
    δ_1 ← M
    for j in 2, …, 2n−1 do
        B_{j,1} ← ‖p^(j−1)‖_1 / δ_1
    µ_1 ← B_{2,1}
    ν_1 ← B_{3,1}
    δ_2 ← δ_1 (ν_1 − µ_1^2)
Flow (cont’d)
procedure VandermondeQR, continued                       ▷ initialization, cont'd
    for j in 3, …, 2n−2 do
        B_{j,2} ← (δ_1 / δ_2) (B_{j+1,1} − µ_1 B_{j,1})
    Q_{:,1} ← 1 / √δ_1
    Q_{:,2} ← (p − µ_1) / √δ_2
                                                         ▷ main recursion loop
    for k in 3, …, n do
        µ_{k−1} ← B_{k,k−1}
        ν_{k−1} ← B_{k+1,k−1}
        δ_k ← δ_{k−1} (ν_{k−1} − ν_{k−2} + µ_{k−1} (µ_{k−2} − µ_{k−1}))
        for j in k+1, …, 2n−k do
            B_{j,k} ← (δ_{k−1} / δ_k) (B_{j+1,k−1} − B_{j,k−2} + (µ_{k−2} − µ_{k−1}) B_{j,k−1})
        Q_{:,k} ← √(δ_{k−1} / δ_k) · ((p + µ_{k−2} − µ_{k−1}) · Q_{:,k−1} − (δ_{k−1} / δ_{k−2}) · Q_{:,k−2})
    Σ ← diag(√δ_1, √δ_2, …, √δ_n)
    B ← B_{1:n,:}
Characteristics
Many operations are local, with no large data movements.
Can be parallelized easily.