Building Workload Optimized Solutions for Business Analytics

René Müller, IBM Research – Almaden
[email protected]

GPU hash joins with Tim Kaldewey and John McPherson
BLU work with Vijayshankar Raman, Ronald Barber, Ippokratis Pandis, Richard Sidle, and Guy Lohman (IBM Research – Almaden)

Talk at FastPath 2014, Monterey, CA, 23 March 2014
Spectrum of Workload Optimized Solutions

Increasing specialization, from specialized software on general-purpose hardware toward custom silicon:
Specialized software on general-purpose hardware (BLU for DB2) → GPU → FPGA → ASIC,
with appliances such as the IBM PureData System for Analytics (IBM Netezza 100) toward the specialized end of the spectrum.
Agenda
§ Workload: Business Intelligence, OLAP
§ BLU Acceleration for DB2
§ Study 1: Acceleration of BLU-style Predicate Evaluation: SIMD CPU vs. FPGA vs. GPU
§ Study 2: Hash Joins on GPUs
Workload: Business Intelligence Queries
A data warehousing query in multiple languages

■ English: Show me the annual development of revenue from US sales of US products for the last 5 years, by city.

■ SQL:
SELECT c.city, s.city, d.year, SUM(lo.revenue)
FROM lineorder lo, customer c, supplier s, date d
WHERE lo.custkey = c.custkey
  AND lo.suppkey = s.suppkey
  AND lo.orderdate = d.datekey
  AND c.nation = 'UNITED STATES'
  AND s.nation = 'UNITED STATES'
  AND d.year >= 1998 AND d.year <= 2012
GROUP BY c.city, s.city, d.year
ORDER BY d.year asc, revenue desc;
Workload: Business Intelligence Queries
Star Schema – typical for data warehouses

Fact table:
■ Lineorder: ORDERKEY, LINENUMBER, CUSTKEY, PARTKEY, SUPPKEY, ORDERDATE, ORDPRIORITY, ..., COMMITDATE, SHIPMODE, ...

Dimension tables:
■ Customer: CUSTKEY, NAME, ADDRESS, CITY, ...
■ Part: PARTKEY, NAME, MFGR, CATEGORY, BRAND, ...
■ Supplier: SUPPKEY, NAME, ADDRESS, CITY, ...
■ Date: DATEKEY, DATE, DAYOFWEEK, MONTH, YEAR, ...

Query:
SELECT c.city, s.city, d.year, SUM(lo.revenue)
FROM lineorder lo, customer c, supplier s, date d
WHERE lo.custkey = c.custkey AND lo.suppkey = s.suppkey AND lo.orderdate = d.datekey
  AND c.nation = 'UNITED STATES' AND s.nation = 'UNITED STATES'
  AND d.year >= 1998 AND d.year <= 2012
GROUP BY c.city, s.city, d.year ORDER BY d.year asc, revenue desc;
Workload Optimized Systems

My definition:
Workload Optimized System: a system whose architecture and design are tailored to a specific and narrow range of applications.

§ Optimized for performance (in our case)
§ Primary performance metrics
  – Response time [seconds] → minimize
  – Query throughput [queries/hour] → maximize
§ Derived metrics
  – Performance/Watt: queries/hour/Watt (= queries/Wh)
  – Performance/$: queries/hour/$
§ Cost of ownership itself is usually a secondary goal
BLU Acceleration for DB2
Specialized Software on General-Purpose Hardware
BLU Acceleration for DB2
What is DB2 with BLU Acceleration?

§ Novel engine for analytic queries
  – Columnar storage, single copy of the data
  – New run-time engine with SIMD processing
  – Deep multi-core optimizations and cache-aware memory management
  – "Active compression": unique encoding for storage reduction and run-time processing without decompression
§ "Revolution by Evolution"
  – Built directly into the DB2 kernel
  – BLU tables can coexist with row tables
  – Query any combination of BLU and row data
  – Memory-optimized (not "in-memory")
§ Value: order-of-magnitude benefits in performance, storage savings, and simplicity
§ Available since June 2013 in DB2 v10.5 LUW

Architecture: the DB2 compiler feeds a common run-time that contains both the traditional row-wise run-time and the BLU run-time; both share the DB2 bufferpool and storage, so traditional row tables and BLU encoded columnar tables coexist side by side.
BLU Acceleration for DB2
BLU's Columnar Store Engine

§ Reduce I/O by reading only the columns that are referenced by the query; traditional row stores read pages with complete rows.
§ Columns are horizontally divided into strides of rows.
§ BLU keeps a synopsis (= summary) of the min/max values in each stride.
§ Skip strides without matches (sketch below)
  → further reduces I/O
  → skip-sequential access pattern

Example:
SELECT SUM(revenue)
FROM lineorder
WHERE quantity BETWEEN 1 AND 10
Strides whose synopsis shows no quantity in [1, 10] are skipped.
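A minimal sketch of the synopsis-based stride skipping for the query above, written as plain host-side C++ (compilable as CUDA host code). The stride layout and the StrideSynopsis struct are illustrative assumptions, not BLU's actual data structures:

// Illustrative sketch of synopsis-based stride skipping (not BLU's actual code).
#include <cstdint>
#include <vector>

struct StrideSynopsis {            // hypothetical per-stride summary
    int32_t min_quantity;
    int32_t max_quantity;
};

// SELECT SUM(revenue) FROM lineorder WHERE quantity BETWEEN lo AND hi
int64_t sum_revenue(const std::vector<StrideSynopsis>& synopses,
                    const std::vector<std::vector<int32_t>>& quantity,  // column values, per stride
                    const std::vector<std::vector<int64_t>>& revenue,
                    int32_t lo, int32_t hi) {
    int64_t sum = 0;
    for (size_t s = 0; s < synopses.size(); ++s) {
        // Skip the whole stride if the synopsis proves no row in it can match.
        if (synopses[s].max_quantity < lo || synopses[s].min_quantity > hi) continue;
        for (size_t i = 0; i < quantity[s].size(); ++i)
            if (quantity[s][i] >= lo && quantity[s][i] <= hi) sum += revenue[s][i];
    }
    return sum;
}

Only strides that survive the synopsis check are read, which is what produces the skip-sequential access pattern.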
BLU Acceleration for DB2
Order-preserving frequency encoding

§ The most frequent values are given the shortest codes.
§ Column values are partitioned into regions based on code length, each with its own dictionary (sketch below).
§ Values of the same code length are stored bit-aligned.
§ Example: encoding of the column StateName. The most frequent values, California and New York, get the 1-bit codes 0 and 1 (Dict. 0); the next most frequent, Florida, Illinois, Michigan, and Texas, get the 2-bit codes 00, 01, 10, and 11 (Dict. 1); rare values such as Alaska fall into a further dictionary (Dict. 2). On a page, codes are stored in regions of equal code length, and a tuple map records which region each row belongs to. Within a region, code order follows value order, so the encoding is order-preserving.

V. Raman et al.: DB2 with BLU Acceleration: So Much More than Just a Column Store. PVLDB 6(11): 1080–1091 (2013)
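For illustration, a much-simplified sketch of how such codes could be assigned (host-side C++). The fixed code-length classes, the greedy fill by frequency, and the within-class-only order preservation are simplifying assumptions relative to the scheme described by Raman et al.:

// Simplified sketch of order-preserving frequency coding (not BLU's actual algorithm).
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns value -> (code length in bits, code). Frequent values land in the shortest
// code-length class; within a class, codes are assigned in value order (order-preserving).
std::map<std::string, std::pair<int, uint32_t>>
build_dictionary(const std::map<std::string, uint64_t>& freq,
                 const std::vector<int>& class_bits) {        // e.g. {1, 2, 4}
    // Sort values by descending frequency so the shortest classes are filled first.
    std::vector<std::pair<uint64_t, std::string>> by_freq;
    for (const auto& kv : freq) by_freq.push_back({kv.second, kv.first});
    std::sort(by_freq.begin(), by_freq.end(), std::greater<>());

    std::map<std::string, std::pair<int, uint32_t>> dict;
    size_t next = 0;
    for (int bits : class_bits) {
        size_t capacity = size_t(1) << bits;
        // Take the next chunk of most-frequent values for this class ...
        std::vector<std::string> members;
        for (size_t i = 0; i < capacity && next < by_freq.size(); ++i, ++next)
            members.push_back(by_freq[next].second);
        // ... but assign codes in *value* order so order is preserved within the class.
        std::sort(members.begin(), members.end());
        for (uint32_t code = 0; code < members.size(); ++code)
            dict[members[code]] = {bits, code};
    }
    return dict;
}

With class_bits = {1, 2} and California and New York as the two most frequent values, this reproduces the example above: California → 0, New York → 1, and Florida/Illinois/Michigan/Texas → 00/01/10/11.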
Operations on Compressed Data in BLU
Software and Hardware SIMD

Software SIMD
§ Process multiple encoded tuplets in parallel inside a single machine word, comparing each against an encoded literal (=, <>, >, <, >=, <=); see the sketch below.
§ Bit-manipulation operations using carry, borrow, mask, and shift.
§ Exploits instruction-level parallelism (ILP).
§ Exploits specialized instructions:
  – BPERMD on the POWER architecture
  – PEXT (BMI2) on Intel® Haswell™

Hardware SIMD
§ Process more than one machine word at once in a SIMD register.
§ SSE 4.2 on Intel (Nehalem or later)
§ VMX on POWER (P7 or later)

R. Johnson et al.: Row-wise parallel predicate evaluation. PVLDB 1(1): 622–634 (2008)
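A sketch of the word-parallel (software SIMD) equality test using plain 64-bit arithmetic, in the spirit of Johnson et al. It assumes the tuplet width k divides 64 with no padding bits, which differs from BLU's actual bank layout:

#include <cstdint>

// Broadcast a k-bit literal into every k-bit lane of a 64-bit word (assumes 64 % k == 0).
static uint64_t broadcast(uint64_t lit, int k) {
    uint64_t w = 0;
    for (int s = 0; s < 64; s += k) w |= lit << s;
    return w;
}

// Per-lane equality of packed k-bit tuplets against a literal. Returns a word in which the
// most significant bit of each k-bit lane is set iff that lane equals the literal.
static uint64_t eq_lanes(uint64_t word, uint64_t lit, int k) {
    uint64_t H = broadcast(uint64_t(1) << (k - 1), k);   // MSB of each lane
    uint64_t d = word ^ broadcast(lit, k);               // lane is all-zero iff it matches
    // Adding 2^(k-1)-1 (= ~H per lane) to the low k-1 bits of each lane sets the lane MSB
    // iff those bits are nonzero, and never carries across lane boundaries; OR-ing in d
    // also catches a set lane MSB.
    uint64_t nonzero = (((d & ~H) + ~H) | d) & H;
    return ~nonzero & H;                                  // MSB set iff lane == literal
}

For example, eq_lanes(word, literal, 8) marks every matching 8-bit tuplet in the word; a population count of the result gives the number of matches in that machine word.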
BLU Acceleration for DB2
Terabyte-class results

[Chart: speedup over DB2 v10.1 (row store) on four customer/partner workloads (Wall Street, Cognos Dynamic Cubes, a processor vendor benchmark, a large European ISV), with speedups of 18×, 25×, 44×, and 133×.]

"It was amazing to see the faster query times compared to the performance results with our row-organized tables. The performance of four of our queries improved by over 100-fold! The best outcome was a query that finished 137x faster by using BLU Acceleration."
– Kent Collins, Database Solutions Architect, BNSF Railway
BLU with Specialized Hardware
FPGA vs. GPU vs. BLU on Multicore SIMD CPU
Accelerating BLU with Specialized Hardware
What portions should be accelerated? Where does the time go?

Example: Star Schema Benchmark Query 3.2
SELECT c.city, s.city, d.year, SUM(lo.revenue)
FROM lineorder lo, customer c, supplier s, date d
WHERE lo.custkey = c.custkey
  AND lo.suppkey = s.suppkey
  AND lo.orderdate = d.datekey
  AND c.nation = 'UNITED STATES'
  AND s.nation = 'UNITED STATES'
  AND d.year >= 1998 AND d.year <= 2012
GROUP BY c.city, s.city, d.year
ORDER BY d.year asc, revenue desc;

§ No I/O (= warm buffer pool); all columns are in memory.
§ Which is the heavy hitter? What is the acceleration potential?

Time breakdown: Load Global Code 23%, Hashing 21%, Predicate 14%, Join Filter 14%, Join 11%, Load Value 9%, other 8%
Accelerating BLU with Specialized Hardware
Watch out for Amdahl's Law

§ What speedup is achievable in the best case when accelerating Hashing (21% of the time)?
§ Assume the theoretically ideal scenario: Hashing 21% → 0% (processing time → 0, zero-cost transfers)

Amdahl's Law with offloaded fraction p and accelerator speedup s: Speedup = 1 / ((1 - p) + p/s).
In the ideal case s → ∞, so Speedup = 1 / (1 - p) = 1 / (1 - 0.21) ≈ 1.27.

§ Even in the ideal case the overall speedup is only 27%.
§ For such a small gain, just buy a faster CPU, e.g., Intel® Xeon® E5-2650 v2 (2.6 GHz) → E5-2667 v2 (3.3 GHz).
§ HW accelerators become interesting for speedups of more than half an order of magnitude.
→ Offload "larger portions" of the query!

(Time breakdown as on the previous slide.)
BLU-like Predicate Evaluation
Can FPGAs or GPUs do better than HW/SW SIMD code?
(a) What is the inherent algorithmic complexity?
(b) What is the end-to-end performance including data transfers?
BLU Predicate Evaluation on CPUs
CPU, multi-threaded, on 4-socket systems: Intel X7560 vs. POWER7

[Charts: aggregate bandwidth (GiB/s) vs. tuplet width (1–33 bits), at constant data size.
Left: IBM System x3850 (Intel, 32 cores, 64 threads), scalar vs. SSE.
Right: IBM p750 (POWER7, 32 cores, 128 threads), scalar vs. VMX vs. VMX+bpermd.]
BLU Predicate Evaluation on FPGAs
(Equality) Predicate Core on FPGA

The predicate evaluator takes an input word containing N tuplets, the tuplet_width, and the predicate (encoded literal), and outputs an N-bit bit-vector (one bit per tuplet in the input word) plus an out_valid signal. Comparators are instantiated for all supported tuplet widths.
BLU Predicate Evaluation on FPGAs
Test setup on chip, for area and performance without data transfers

LFSRs generate the input word (bank), the tuplet_width (log2(N) bits), and the predicate on-chip. The N-bit bit-vector produced by the predicate core is reduced to a single bit (parity/sum) behind an enable signal, so no wide data paths leave the chip.
BLU Predicate Evaluation on FPGAs
Setup on the test chip

Now, instantiate as many predicate cores as possible: the single-bit outputs are combined by a 1-bit reduction tree and driven to a chip pin with a slower output (read) clock.
BLU Predicate Evaluation on FPGAs
Implementation on Xilinx Virtex-7 485T

#cores/chip   max. clock *)   chip utilization   estimated power   agg. throughput
          1        388 MHz               0.3%             0.3 W          1.5 GiB/s
         16        264 MHz                 5%             0.6 W           16 GiB/s
         64        250 MHz                23%             1.9 W           60 GiB/s
        100        250 MHz                30%             2.7 W           93 GiB/s
        256        113 MHz                67%             6.6 W          108 GiB/s

The GTX transceivers cap ingest at ~65 GiB/s.
→ Throughput is independent of the tuplet width, by design.
*) Given by the maximum delay on the longest signal path in post-place-and-route timing analysis.
GPU-based Implementation
Data Parallelism in CUDA
BLU Predicate Evaluation on GPUs
CUDA Implementation

Platform: GTX TITAN ($1000)
■ 14 Streaming Multiprocessors (SMX)
■ 192 SIMT cores (SPs) per SMX
■ Shared memory: 48 kB per SMX
■ Device memory: 6 GB (off-chip), up to 220 GiB/s
■ PCIe link to the host: ~11 GiB/s; host memory: ~47 GiB/s
BLU Predicate Evaluation on GPUs
One GPU thread per word, updating device memory

§ Example grid: 4 blocks, 4 threads per block; 16-bit tuplets → 4 tuplets per bank (machine word).
§ Each thread evaluates the predicate on one word and merges its result bits into the bit vector in device memory with atomicOr() (CUDA sketch below).
§ Relies on the fast atomics of the Kepler architecture; the bit vector lives in device memory.
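A minimal CUDA sketch of this scheme: one thread per 32-bit word, an equality predicate, and atomicOr() to merge the result bits into the bit vector in device memory. The kernel and its packing assumptions (tuplet width k < 32 that divides 32, no padding bits) are illustrative, not the talk's actual code:

#include <cstdint>

// One thread per 32-bit input word. Each word packs (32 / k) k-bit tuplets.
// Each thread produces (32 / k) result bits and merges them into the bit vector with atomicOr().
__global__ void eq_predicate(const uint32_t* __restrict__ words, int n_words,
                             int k, uint32_t literal, uint32_t* bitvector) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= n_words) return;

    int tuplets_per_word = 32 / k;        // k divides 32, so a word's result bits never
    uint32_t mask = (1u << k) - 1u;       // straddle two words of the bit vector
    uint32_t word = words[w];
    uint32_t bits = 0;
    for (int t = 0; t < tuplets_per_word; ++t)
        bits |= uint32_t(((word >> (t * k)) & mask) == literal) << t;

    int bit_off = w * tuplets_per_word;   // global position of this word's result bits
    atomicOr(&bitvector[bit_off / 32], bits << (bit_off % 32));
}

A launch such as eq_predicate<<<(n_words + 255) / 256, 256>>>(d_words, n_words, 16, literal, d_bitvector) mirrors the 16-bit-tuplet example on the slide.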
BLU Predicate Evaluation on GPUs
Device memory to device memory: ignoring PCI Express transfers

[Chart: bandwidth (GiB/s) vs. tuplet width (2–33 bits), comparing the GPU without transfers against the p750 (VMX+bpermd, 128 threads). Speedup = 1.61.]
BLU Predicate Evaluation on GPUs
With transfers: zero-copy access to/from host memory

[Chart: bandwidth (GiB/s) vs. tuplet width (2–33 bits), comparing the GPU with transfers against the p750 (VMX+bpermd, 128 threads). PCI Express becomes the bottleneck at ~11 GiB/s.]
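A sketch of the zero-copy setup in CUDA: the input column lives in pinned, mapped host memory, and the kernel reads it directly over PCI Express. It reuses the hypothetical eq_predicate kernel from the earlier sketch; sizes and names are illustrative:

#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Declared in the earlier sketch; assumed to be defined in the same file.
__global__ void eq_predicate(const uint32_t*, int, int, uint32_t, uint32_t*);

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapped (zero-copy) host memory

    const int n_words = 1 << 20;                      // illustrative input size
    const int k = 16;                                 // tuplet width (bits)
    const int n_bits = n_words * (32 / k);            // size of the result bit vector

    // Pinned host buffer, mapped into the GPU's address space: the kernel reads it over PCIe.
    uint32_t *h_words, *d_words;
    cudaHostAlloc((void**)&h_words, n_words * sizeof(uint32_t), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_words, h_words, 0);
    for (int i = 0; i < n_words; ++i) h_words[i] = i; // fill with some data

    uint32_t* d_bitvector;                            // result bit vector stays in device memory
    size_t bv_bytes = ((n_bits + 31) / 32) * sizeof(uint32_t);
    cudaMalloc((void**)&d_bitvector, bv_bytes);
    cudaMemset(d_bitvector, 0, bv_bytes);

    // Input words are pulled across PCIe on demand (zero-copy), so throughput is bounded
    // by the PCIe link (~11 GiB/s in the talk's measurements).
    eq_predicate<<<(n_words + 255) / 256, 256>>>(d_words, n_words, k, 42u, d_bitvector);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_bitvector);
    cudaFreeHost(h_words);
    return 0;
}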
BLU Predicate Evaluation
Summary: Predicate Evaluation

→ Never underestimate optimized code on a general-purpose CPU.
→ ILP, register-register operations, and sufficient memory bandwidth go a long way.

[Chart: bandwidth (GiB/s) vs. tuplet width (2–33 bits), overlaying the GPU with transfers, the p750 (VMX+bpermd, 128 threads), FPGA configurations with 1, 100, and 256 cores, and the FPGA's aggregate GTX transceiver bandwidth.]
Hash Joins on GPUs
Taking advantage of fast device memory
Hash Joins on GPUs
Hash Joins

■ Primary data access patterns:
  – Scan the input table(s) for hash table creation and probe
  – Compare-and-swap when inserting data into the hash table → upper bound for the build (sketch below)
  – Random reads when probing the hash table → upper bound for the probe

■ Data (memory) access, GPU (GTX580) vs. CPU (i7-2600):

                                      GTX580     i7-2600
Peak memory bandwidth [spec] 1)       179 GB/s   21 GB/s
Peak memory bandwidth [measured] 2)   153 GB/s   18 GB/s
Random access [measured] 2)           6.6 GB/s   0.8 GB/s
Compare-and-swap [measured] 3)        4.6 GB/s   0.4 GB/s

1) Nvidia: 192.4 × 10⁹ B/s ≈ 179.2 GB/s
2) 64-bit accesses over 1 GB of device memory
3) 64-bit compare-and-swap to random locations over 1 GB of device memory
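A minimal CUDA sketch of the build-side pattern: inserting into an open-addressing hash table in device memory with compare-and-swap. Linear probing, the empty-key sentinel, and the FNV-1a hash are illustrative assumptions, not the talk's implementation:

#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFu               // sentinel; assumes real keys never take this value

__device__ uint32_t fnv1a(uint32_t key) {   // FNV-1a over the 4 key bytes
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) {
        h ^= (key >> (8 * i)) & 0xFFu;
        h *= 16777619u;
    }
    return h;
}

// One thread per input tuple: insert (key, value) with atomicCAS, linear probing on conflict.
// ht_keys must be pre-initialized to EMPTY_KEY; the probe runs as a separate kernel launch,
// so the ht_values writes are visible before any probe starts.
__global__ void build_hash_table(const uint32_t* keys, const uint32_t* values, int n,
                                 uint32_t* ht_keys, uint32_t* ht_values, uint32_t n_slots) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint32_t key = keys[i];
    uint32_t slot = fnv1a(key) % n_slots;
    for (;;) {
        // Claim the slot only if it is still empty: this is the compare-and-swap pattern
        // whose random-access cost the table above bounds.
        uint32_t prev = atomicCAS(&ht_keys[slot], EMPTY_KEY, key);
        if (prev == EMPTY_KEY) {            // we won the slot
            ht_values[slot] = values[i];
            break;
        }
        slot = (slot + 1) % n_slots;        // occupied: linear probing to the next slot
    }
}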
Drill Down: Hash Tables on GPUs
Computing hash functions on GTX580 – no reads, 32-bit keys, 32-bit hashes

Hash function         Key ingest [GB/s]
LSB                   338
Fowler-Noll-Vo 1a     129
Jenkins Lookup3       79
Murmur3               111
One-at-a-time         85
CRC32                 78
MD5                   4.5
SHA1                  0.81

MD5 and SHA1 are cryptographic message digests.
§ Threads generate sequential keys (no memory reads).
§ Hashes are XOR-summed locally.
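A hedged CUDA sketch of this microbenchmark pattern: each thread hashes a range of sequential keys and XOR-sums the results locally, with a single output word per thread so the work is not optimized away. FNV-1a stands in for whichever hash is being measured; grid sizing and names are illustrative:

#include <cstdint>

__device__ uint32_t fnv1a_hash(uint32_t key) {   // stand-in hash function under test
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) {
        h ^= (key >> (8 * i)) & 0xFFu;
        h *= 16777619u;
    }
    return h;
}

// out must have one element per thread; keys are generated on the fly (no memory reads).
__global__ void hash_throughput(uint32_t keys_per_thread, uint32_t* out) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t key = tid * keys_per_thread;        // sequential keys for this thread
    uint32_t acc = 0;
    for (uint32_t i = 0; i < keys_per_thread; ++i)
        acc ^= fnv1a_hash(key + i);              // XOR-sum the hashes locally
    out[tid] = acc;                              // single write keeps the result alive
}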
Drill Down: Hash Tables on GPUs
Hash table probe: keys and values from/to device memory
32-bit hashes, 32-bit values; 1 GB hash table in device memory (load factor = 0.33)

Hash function         Seq. keys + hash [GB/s]   HT probe, keys: dev, values: dev [GB/s]
LSB                   338                       2.4
Fowler-Noll-Vo 1a     129                       2.5
Jenkins Lookup3       79                        2.4
Murmur3               111                       2.4
One-at-a-time         85                        2.4
CRC32                 78                        2.4
MD5                   4.5                       1.8
SHA1                  0.81                      0.6

§ Keys are read from device memory.
§ 20% of the probed keys find a match in the hash table.
§ Values are written back to device memory.
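A matching probe-side CUDA sketch, with keys read from device memory and matched values written back to device memory. It assumes the same illustrative open-addressing layout and hash as the build sketch above, with the table built by a separate kernel launch:

#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFu

__device__ uint32_t probe_hash(uint32_t key) {   // same illustrative FNV-1a as in the build sketch
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) { h ^= (key >> (8 * i)) & 0xFFu; h *= 16777619u; }
    return h;
}

// One thread per probe key: follow the linear-probing chain until the key or an empty slot is found.
__global__ void probe_hash_table(const uint32_t* probe_keys, int n,
                                 const uint32_t* ht_keys, const uint32_t* ht_values,
                                 uint32_t n_slots, uint32_t* out_values, int* match_flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint32_t key = probe_keys[i];
    uint32_t slot = probe_hash(key) % n_slots;
    match_flags[i] = 0;
    while (ht_keys[slot] != EMPTY_KEY) {         // random reads into the 1 GB table dominate
        if (ht_keys[slot] == key) {
            out_values[i] = ht_values[slot];     // write the joined value to device memory
            match_flags[i] = 1;
            break;
        }
        slot = (slot + 1) % n_slots;
    }
}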
Drill Down: Hash Tables on GPUs
Probe with result cache: keys and values from/to host memory
32-bit hashes, 32-bit values; 1 GB hash table in device memory (load factor = 0.33)

Hash function         Seq. keys + hash   HT probe, keys/values: dev   HT probe, keys/values: host
                      [GB/s]             [GB/s]                       [GB/s]
LSB                   338                2.4                          2.3
Fowler-Noll-Vo 1a     129                2.5                          2.4
Jenkins Lookup3       79                 2.4                          2.3
Murmur3               111                2.4                          2.3
One-at-a-time         85                 2.4                          2.3
CRC32                 78                 2.4                          2.3
MD5                   4.5                1.8                          1.8
SHA1                  0.81               0.6                          0.6

§ Keys are read from host memory (zero-copy access).
§ 20% of the probed keys find a match in the hash table.
§ Individual values are written to a result buffer (cache) in shared memory and then coalesced to host memory (zero-copy access).
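A hedged CUDA sketch of the result-cache idea: each block probes a chunk of keys read from mapped host memory, appends matching values to a shared-memory cache, and finally drains the cache with coalesced zero-copy writes to host memory. The buffer size, the per-block output layout, and the hash/table layout are assumptions carried over from the earlier sketches:

#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFu

__device__ uint32_t rc_hash(uint32_t key) {                  // illustrative FNV-1a again
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) { h ^= (key >> (8 * i)) & 0xFFu; h *= 16777619u; }
    return h;
}

// Each block probes a contiguous range of keys (read from mapped host memory), collects its
// matches in a shared-memory result cache, and writes the cache out in one coalesced pass.
__global__ void probe_with_result_cache(const uint32_t* __restrict__ probe_keys,  // host-mapped
                                        int keys_per_block, int n_keys,
                                        const uint32_t* __restrict__ ht_keys,     // device memory
                                        const uint32_t* __restrict__ ht_values,
                                        uint32_t n_slots,
                                        uint32_t* out_values,                     // host-mapped,
                                        int* out_counts) {                        //   keys_per_block entries per block
    __shared__ uint32_t cache[1024];          // result cache (assumes keys_per_block <= 1024)
    __shared__ int n_hits;
    if (threadIdx.x == 0) n_hits = 0;
    __syncthreads();

    int begin = blockIdx.x * keys_per_block;
    int end = min(begin + keys_per_block, n_keys);

    for (int i = begin + threadIdx.x; i < end; i += blockDim.x) {
        uint32_t key = probe_keys[i];                         // zero-copy read over PCIe
        uint32_t slot = rc_hash(key) % n_slots;
        while (ht_keys[slot] != EMPTY_KEY) {
            if (ht_keys[slot] == key) {
                int pos = atomicAdd(&n_hits, 1);              // append to the shared-memory cache
                cache[pos] = ht_values[slot];
                break;
            }
            slot = (slot + 1) % n_slots;
        }
    }
    __syncthreads();

    // Drain the cache: consecutive threads write consecutive words, so the zero-copy writes
    // to host memory are coalesced into few PCIe transactions.
    for (int i = threadIdx.x; i < n_hits; i += blockDim.x)
        out_values[begin + i] = cache[i];
    if (threadIdx.x == 0) out_counts[blockIdx.x] = n_hits;
}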
Drill Down: Hash Tables on GPUs
End-to-end comparison of hash table probe: GPU vs. CPU
32-bit hashes, 32-bit values; 1 GB hash table (load factor = 0.33); key ingest in GB/s

Hash function         GTX580 (keys/values: host)   i7-2600 (4 cores, 8 threads)   Speedup
LSB                   2.3                          0.48                           4.8×
Fowler-Noll-Vo 1a     2.4                          0.47                           5.1×
Jenkins Lookup3       2.3                          0.46                           5.0×
Murmur3               2.3                          0.46                           5.0×
One-at-a-time         2.3                          0.43                           5.3×
CRC32                 2.3                          0.48 1)                        4.8×
MD5                   1.8                          0.11                           16×
SHA1                  0.6                          0.06                           10×

§ A result cache is used in both implementations.
§ GPU: keys from host memory, values back to host memory.
§ CPU: software prefetching instructions for hash table loads.
1) Uses the CRC32 instruction in SSE 4.2.
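For the CPU side, a hedged sketch of the software-prefetching trick mentioned above, as plain C++ host code using __builtin_prefetch (GCC/Clang): the slots for a small group of keys are prefetched before they are probed. The hash, the slot layout, and the simplification of checking only the home slot are illustrative assumptions:

#include <cstddef>
#include <cstdint>

struct Slot { uint32_t key; uint32_t value; };   // illustrative slot layout

static inline uint32_t cpu_hash(uint32_t key) {  // illustrative FNV-1a
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) { h ^= (key >> (8 * i)) & 0xFFu; h *= 16777619u; }
    return h;
}

// Probe in small groups: phase 1 computes slots and issues prefetches, phase 2 touches the
// (hopefully now cached) slots. This hides part of the random-access latency into the table.
size_t probe_prefetch(const uint32_t* keys, size_t n, const Slot* table, size_t n_slots,
                      uint32_t* out_values) {
    constexpr size_t G = 16;                     // prefetch group size (tuning parameter)
    size_t idx[G];
    size_t matches = 0;
    for (size_t base = 0; base < n; base += G) {
        size_t cnt = (base + G <= n) ? G : n - base;
        for (size_t j = 0; j < cnt; ++j) {       // phase 1: hash + prefetch
            idx[j] = cpu_hash(keys[base + j]) % n_slots;
            __builtin_prefetch(&table[idx[j]]);
        }
        for (size_t j = 0; j < cnt; ++j)         // phase 2: probe (home slot only, for brevity)
            if (table[idx[j]].key == keys[base + j])
                out_values[matches++] = table[idx[j]].value;
    }
    return matches;
}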
From Hash Tables to Relational Joins
Processing hundreds of gigabytes in seconds

§ Combine the GPU with fast storage: build the hash table from one input table, then probe it with the other.
§ How about reading the input tables on the fly from flash?
§ Storage solution delivering data at GPU join speed (>5.7 GB/s):
  – 3× 900 GB IBM Texas Memory Systems RamSan-70 SSDs
  – IBM General Parallel File System (GPFS)

DEMO: shown at IBM Information on Demand 2012 and at SIGMOD 2013.
Summary and Lessons Learned

§ Accelerators are not necessarily faster than well-tuned CPU code.
§ Don't underestimate the compute power and the aggregate memory bandwidth of general-purpose multi-socket systems.
§ Don't forget Amdahl's Law; it will bite you.
  – Offload larger portions: not just the hashing, but the complete hash table operation.
§ Take advantage of platform-specific strengths:
  – FPGA: customized data paths, pipeline parallelism
  – GPU: fast device memory, latency hiding through a large SMT degree
  – CPU: out-of-order architecture; SIMD and caches are extremely fast if used correctly