Building Workload Optimized Solutions for Business Analytics
René Müller, IBM Research – Almaden ([email protected])
GPU hash joins with Tim Kaldewey and John McPherson
BLU work with Vijayshankar Raman, Ronald Barber, Ippokratis Pandis, Richard Sidle, and Guy Lohman
Talk at FastPath 2014, Monterey, CA, 23 March 2014
© 2014 IBM Corporation

Spectrum of Workload Optimized Solutions
[Figure: spectrum of increasing specialization — specialized software on general-purpose hardware (BLU for DB2), GPU, FPGA, IBM PureData System for Analytics / IBM Netezza 100, and full ASIC.]

Agenda
§ Workload: Business Intelligence, OLAP
§ BLU Acceleration for DB2
§ Study 1: Acceleration of BLU-style predicate evaluation — SIMD CPU vs. FPGA vs. GPU
§ Study 2: Hash joins on GPUs

Workload: Business Intelligence Queries
A data warehousing query in multiple languages
■ English: Show me the annual development of revenue from US sales of US products for the last 5 years, by city.
■ SQL:
  SELECT c.city, s.city, d.year, SUM(lo.revenue)
  FROM lineorder lo, customer c, supplier s, date d
  WHERE lo.custkey = c.custkey
    AND lo.suppkey = s.suppkey
    AND lo.orderdate = d.datekey
    AND c.nation = 'UNITED STATES'
    AND s.nation = 'UNITED STATES'
    AND d.year >= 1998 AND d.year <= 2012
  GROUP BY c.city, s.city, d.year
  ORDER BY d.year asc, revenue desc;

Workload: Business Intelligence Queries
Star schema – typical for data warehouses
[Figure: star schema with fact table Lineorder (ORDERKEY, LINENUMBER, CUSTKEY, PARTKEY, SUPPKEY, ORDERDATE, ORDPRIORITY, COMMITDATE, SHIPMODE, ...) and dimension tables Customer (CUSTKEY, NAME, ADDRESS, CITY, ...), Part (PARTKEY, NAME, MFGR, CATEGORY, BRAND, ...), Supplier (SUPPKEY, NAME, ADDRESS, CITY, ...), and Date (DATEKEY, DATE, DAYOFWEEK, MONTH, YEAR, ...). The query above joins Lineorder with Customer, Supplier, and Date.]

Workload Optimized Systems
My definition: a Workload Optimized System is a system whose architecture and design are optimized for a specific and narrow range of applications.
§ Optimized for performance (in our case)
§ Primary performance metrics
  – Response time [seconds] → minimize
  – Query throughput [queries/hour] → maximize
§ Derived metrics
  – Performance/Watt: queries/hour/Watt (queries/Wh)
  – Performance/$: queries/hour/$
§ Cost of ownership itself is usually a secondary goal
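To make the later per-operator breakdowns (predicate evaluation, hashing, join, aggregation) concrete, the following single-threaded CUDA C++ host sketch evaluates the star-schema query above as three filtered dimension hash-table builds followed by one fact-table probe with grouped aggregation. The struct layouts and names are hypothetical simplifications for illustration, not BLU's actual columnar data structures or query plan.

```cpp
// Minimal, single-threaded reference for the star-schema query above.
// Hypothetical row layouts; a real engine uses columnar, encoded storage.
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <unordered_map>
#include <vector>

struct Customer  { int32_t custkey; std::string city, nation; };
struct Supplier  { int32_t suppkey; std::string city, nation; };
struct DateDim   { int32_t datekey; int32_t year; };
struct Lineorder { int32_t custkey, suppkey, orderdate; int64_t revenue; };

using GroupKey = std::tuple<std::string, std::string, int32_t>;  // c.city, s.city, d.year

std::map<GroupKey, int64_t> runQuery(const std::vector<Customer>& customer,
                                     const std::vector<Supplier>& supplier,
                                     const std::vector<DateDim>& date,
                                     const std::vector<Lineorder>& lineorder) {
  // 1. Build hash tables on the filtered dimension tables.
  std::unordered_map<int32_t, std::string> custHT;   // custkey -> city
  for (const auto& c : customer)
    if (c.nation == "UNITED STATES") custHT.emplace(c.custkey, c.city);

  std::unordered_map<int32_t, std::string> suppHT;   // suppkey -> city
  for (const auto& s : supplier)
    if (s.nation == "UNITED STATES") suppHT.emplace(s.suppkey, s.city);

  std::unordered_map<int32_t, int32_t> dateHT;       // datekey -> year
  for (const auto& d : date)
    if (d.year >= 1998 && d.year <= 2012) dateHT.emplace(d.datekey, d.year);

  // 2. Scan the fact table once, probe all three hash tables, aggregate.
  std::map<GroupKey, int64_t> revenueByGroup;
  for (const auto& lo : lineorder) {
    auto c = custHT.find(lo.custkey);   if (c == custHT.end()) continue;
    auto s = suppHT.find(lo.suppkey);   if (s == suppHT.end()) continue;
    auto d = dateHT.find(lo.orderdate); if (d == dateHT.end()) continue;
    revenueByGroup[{c->second, s->second, d->second}] += lo.revenue;
  }
  return revenueByGroup;  // the ORDER BY is a final sort over these groups
}
```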
BLU Acceleration for DB2
Specialized Software on General-Purpose Hardware

BLU Acceleration for DB2
What is DB2 with BLU Acceleration?
§ Novel engine for analytic queries
  – Columnar storage, single copy of the data
  – New run-time engine with SIMD processing
  – Deep multi-core optimizations and cache-aware memory management
  – "Active compression" – unique encoding for storage reduction and run-time processing without decompression
§ "Revolution by Evolution"
  – Built directly into the DB2 kernel
  – BLU tables can coexist with row tables
  – Query any combination of BLU or row data
  – Memory-optimized (not "in-memory")
§ Value: order-of-magnitude benefits in ...
  – Performance
  – Storage savings
  – Simplicity!
§ Available since June 2013 in DB2 v10.5 LUW
[Figure: DB2 with BLU architecture — the DB2 compiler feeds both the traditional row-wise run-time and the BLU run-time; both sit on the DB2 bufferpool and storage, so traditional row tables (C1...C8) and BLU encoded columnar tables (C1...C8) coexist.]

BLU Acceleration for DB2
BLU's Columnar Store Engine
§ Reduce I/O by reading only the columns that are referenced by the query; traditional row stores read pages with complete rows.
§ Columns are horizontally divided into strides of rows.
§ BLU keeps a synopsis (= summary) of the min/max values in each stride.
§ Strides without matches are skipped → further reduces I/O → skip-sequential access pattern.
  Example: SELECT SUM(revenue) FROM lineorder WHERE quantity BETWEEN 1 AND 10 — skip strides whose synopsis shows no quantity in [1, 10].

BLU Acceleration for DB2
Order-Preserving Frequency Encoding
§ The most frequent values are given the shortest codes.
§ Column values are partitioned based on code length.
§ Values of the same code length are stored bit-aligned.
§ Example: encoding of state names.
[Figure: example encoding of a StateName column — the most frequent values (e.g., California, New York) receive the shortest codes (dictionary 0), less frequent values (e.g., Florida, Illinois, Michigan, Texas) receive longer codes (dictionary 1), and rare values (e.g., Alaska) the longest (dictionary 2); a page stores the column partitioned into regions by code length, plus a tuple map.]
V. Raman et al.: DB2 with BLU Acceleration: So Much More than Just a Column Store. PVLDB 6(11): 1080–1091 (2013)

Operations on Compressed Data in BLU
Software and Hardware SIMD
Software SIMD
§ Process multiple tuples in parallel inside a machine word, e.g., compare each encoded column value in the word against an encoded literal (=, <>, >, <, >=, <=).
§ Bit-manipulation operations using carry, borrow, mask, and shift operations.
§ Exploits instruction-level parallelism (ILP).
§ Exploits specialized instructions:
  – BPERMD on the POWER architecture
  – PEXT (BMI2) on Intel Haswell
Hardware SIMD
§ Process more than one machine word at once in a SIMD register.
§ SSE 4.2 on Intel (Nehalem or later).
§ VMX on POWER (POWER7 or later).
→ A simplified bit-parallel sketch of the software-SIMD technique appears at the end of this section.
R. Johnson et al.: Row-wise parallel predicate evaluation. PVLDB 1(1): 622–634 (2008)

BLU Acceleration for DB2
Terabyte-Class Results
[Figure: bar chart of speedups over DB2 v10.1 (row store) — 133x, 44x, 25x, and 18x — across four customer workloads: a Wall Street customer, Cognos Dynamic Cubes, a processor vendor, and a large European ISV benchmark.]
"It was amazing to see the faster query times compared to the performance results with our row-organized tables. The performance of four of our queries improved by over 100-fold! The best outcome was a query that finished 137x faster by using BLU Acceleration."
– Kent Collins, Database Solutions Architect, BNSF Railway
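As the sketch promised above, here is a small CUDA C++ illustration of the bit-parallel (software-SIMD) predicate evaluation idea: equality and less-than-or-equal predicates evaluated on tuplets packed into a 64-bit word. For simplicity it reserves one spare result bit per tuplet so that carries and borrows cannot cross tuplet boundaries; BLU's actual bank layout and the formulas of Johnson et al. differ in detail, so treat this as an assumption-laden sketch of the technique, not BLU's implementation.

```cpp
// Bit-parallel (SWAR) predicate evaluation on packed tuplets — simplified sketch.
// Layout assumption: each field is (K+1) bits wide: K code bits plus one spare
// result bit (the field's top bit), which is zero in stored data.
#include <cstdint>
#include <cstdio>

template <unsigned K>                      // K = tuplet (code) width in bits
struct PackedPredicates {
  static constexpr unsigned FIELD  = K + 1;
  static constexpr unsigned NFIELD = 64 / FIELD;

  // LOWS: bit 0 of every field set; HIGHS: spare (top) bit of every field set.
  static constexpr uint64_t lows() {
    uint64_t m = 0;
    for (unsigned i = 0; i < NFIELD; ++i) m |= 1ULL << (i * FIELD);
    return m;
  }
  static constexpr uint64_t highs() { return lows() << K; }

  // Replicate a K-bit constant into every field (no carries: fields are disjoint).
  static constexpr uint64_t repeat(uint64_t c) { return lows() * c; }

  // Per-field "code == c": spare bit of field i is set iff field i equals c.
  static uint64_t eq(uint64_t bank, uint64_t c) {
    uint64_t diff = bank ^ repeat(c);          // zero code bits where equal
    return (highs() - diff) & highs();         // 2^K - 0 keeps the spare bit set
  }

  // Per-field "code <= c": spare bit of field i is set iff field i <= c.
  static uint64_t le(uint64_t bank, uint64_t c) {
    return ((repeat(c) | highs()) - bank) & highs();
  }
};

int main() {
  using P = PackedPredicates<7>;             // 7-bit codes, 8 fields per word
  // Pack the codes 5, 42, 42, 3 into the lowest four fields.
  uint64_t bank = 5ULL | (42ULL << 8) | (42ULL << 16) | (3ULL << 24);
  printf("eq 42: %016llx\n", (unsigned long long)P::eq(bank, 42)); // fields 1, 2
  printf("le  5: %016llx\n", (unsigned long long)P::le(bank, 5));  // fields 0, 3
                                              // (and the empty fields 4-7, code 0)
  return 0;
}
```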
BLU with Specialized Hardware
FPGA vs. GPU vs. BLU on Multicore SIMD CPU

Accelerating BLU with Specialized Hardware
What portions to accelerate? Where does time go?
Example: Star Schema Benchmark Query 3.2
  SELECT c.city, s.city, d.year, SUM(lo.revenue)
  FROM lineorder lo, customer c, supplier s, date d
  WHERE lo.custkey = c.custkey
    AND lo.suppkey = s.suppkey
    AND lo.orderdate = d.datekey
    AND c.nation = 'UNITED STATES'
    AND s.nation = 'UNITED STATES'
    AND d.year >= 1998 AND d.year <= 2012
  GROUP BY c.city, s.city, d.year
  ORDER BY d.year asc, revenue desc;
§ No I/O (= warm buffer pool): all columns in memory
§ Which is the heavy hitter? What is the acceleration potential?
Time breakdown: Load Global Code 23%, Hashing 21%, Predicate 14%, Join Filter 14%, Join 11%, Load Value 9%, other 8%

Accelerating BLU with Specialized Hardware
Watch Out for Amdahl's Law
§ What speedup is achievable in the best case when accelerating hashing?
§ Assume the theoretically ideal scenario: hashing goes from 21% to 0% of the time (processing time → 0, zero-cost transfers).
§ Amdahl's Law: Speedup = 1 / (1 - 0.21) ≈ 1.27
§ Even in the ideal case the speedup is only 27%.
§ For such a small speedup, just buy a faster CPU, e.g., Intel Xeon E5-2650 v2 (2.6 GHz) → E5-2667 v2 (3.3 GHz).
§ Hardware accelerators become interesting for speedups of more than half an order of magnitude.
→ Offload "larger portions" of the query!

BLU-like Predicate Evaluation
Can FPGAs or GPUs do better than HW/SW SIMD code?
(a) What is the inherent algorithmic complexity?
(b) What is the end-to-end performance including data transfers?

BLU Predicate Evaluation on CPUs
CPU Multi-Threaded on 4-Socket Systems: Intel X7560 vs. POWER7
[Figure: aggregate bandwidth (GiB/s) vs. tuplet width (1–33 bits) at constant data size — IBM System x3850 (Intel X7560, 32 cores, 64 threads) with scalar and SSE code, and IBM p750 (POWER7, 32 cores, 128 threads) with scalar, VMX, and VMX+bpermd code.]

BLU Predicate Evaluation on FPGAs
(Equality) Predicate Core on FPGA
[Figure: predicate evaluator core — inputs tuplet_width, input_word (a word with N tuplets), and predicate; outputs out_valid and an N-bit bit vector (one result bit per tuplet). Comparators are instantiated for all tuplet widths.]

BLU Predicate Evaluation on FPGAs
Test Setup on Chip for Area and Performance without Data Transfers
[Figure: LFSRs generate the tuplet_width (log2(N) bits), the N-bit input bank, and the N-bit predicate; the N-bit bit-vector output is reduced by a parity/sum to a single enable-gated bit so that only one result bit leaves the predicate core.]

BLU Predicate Evaluation on FPGAs
Setup on Test Chip
Now, instantiate as many cores as possible...
[Figure: array of predicate cores feeding a 1-bit reduction tree to a chip pin, read out with a slower output (read) clock.]

BLU Predicate Evaluation on FPGAs
Implementation on Xilinx Virtex-7 485T
  #cores/chip   max. clock*)   chip utilization   estimated power   agg. throughput
        1        388 MHz            0.3%              0.3 W            1.5 GiB/s
       16        264 MHz            5%                0.6 W             16 GiB/s
       64        250 MHz           23%                1.9 W             60 GiB/s
      100        250 MHz           30%                2.7 W             93 GiB/s
      256        113 MHz           67%                6.6 W            108 GiB/s
  *) Given by the maximum delay on the longest signal path in the post-place-and-route timing analysis.
GTX transceivers cap ingest at ~65 GiB/s.
→ Throughput is independent of tuplet width, by design.
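As a sanity check on the table above — assuming each predicate core consumes one 32-bit input word per clock cycle, an assumption that is consistent with all five rows but not stated on the slide — the aggregate throughput follows directly from the core count and clock:

  throughput ≈ #cores × 4 B/cycle × f_clock
  e.g., 256 × 4 B × 113 MHz ≈ 115.7 GB/s ≈ 108 GiB/s, and 100 × 4 B × 250 MHz ≈ 100 GB/s ≈ 93 GiB/s.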
GPU-Based Implementation
Data Parallelism in CUDA

BLU Predicate Evaluation on GPUs
CUDA Implementation
[Figure: system architecture — GPU with streaming multiprocessors (SMX), each holding SIMT cores and shared memory, attached to device memory at up to 220 GiB/s; CPU cores and memory controller reach host memory at ~47 GiB/s; GPU and CPU are connected via PCI Express at ~11 GiB/s.]
■ GTX TITAN ($1000)
■ 14 streaming multiprocessors (SMX)
■ 192 SIMT cores (SPs) per SMX
■ Shared memory: 48 KB per SMX
■ Device memory: 6 GB (off-chip)

BLU Predicate Evaluation on GPUs
One GPU Thread per Word, Updates to Device Memory
§ Example grid: 4 blocks, 4 threads/block; 16-bit tuplets → 4 tuplets per bank.
§ Each thread evaluates the predicate on its word and sets its result bits in the bit vector in device memory with atomicOr(), using the fast device-memory atomics of the Kepler architecture.

BLU Predicate Evaluation on GPUs
Device Memory to Device Memory: Ignoring PCI Express Transfers
[Figure: bandwidth (GiB/s) vs. tuplet width (2–33 bits) — GPU without transfers vs. p750 (VMX+bpermd, 128 threads); the GPU reaches a speedup of 1.61.]

BLU Predicate Evaluation on GPUs
With Transfers: Zero-Copy Access to/from Host Memory
[Figure: bandwidth (GiB/s) vs. tuplet width (2–33 bits) — GPU with transfers vs. p750 (VMX+bpermd, 128 threads); PCI Express becomes the bottleneck at ~11 GiB/s.]

BLU Predicate Evaluation — Summary
Predicate evaluation:
→ Never underestimate optimized code on a general-purpose CPU.
→ ILP, register-register operations, and sufficient memory bandwidth.
[Figure: bandwidth (GiB/s) vs. tuplet width (2–33 bits) — GPU with transfers, p750 (VMX+bpermd, 128 threads), and FPGA configurations with 1, 100, and 256 cores, against the aggregate GTX transceiver bandwidth.]

Hash Joins on GPUs
Taking advantage of fast device memory

Hash Joins on GPUs
Hash Joins
■ Primary data access patterns:
  – Scan the input table(s) for hash table creation and probe
  – Compare-and-swap when inserting data into the hash table (upper bound for build)
  – Random reads when probing the hash table (upper bound for probe)
■ Data (memory) access, GPU (GTX580) vs. CPU (i7-2600):
    Peak memory bandwidth [spec] (1):       179 GB/s  vs.  21 GB/s
    Peak memory bandwidth [measured] (2):   153 GB/s  vs.  18 GB/s
    Random access [measured] (2):           6.6 GB/s  vs.  0.8 GB/s
    Compare-and-swap [measured] (3):        4.6 GB/s  vs.  0.4 GB/s
  (1) NVIDIA specification: 192.4 × 10^9 B/s ≈ 179.2 GiB/s
  (2) 64-bit accesses over 1 GB of device memory
  (3) 64-bit compare-and-swap to random locations over 1 GB of device memory
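The compare-and-swap bound above is what a hash-table build on the GPU runs into. The following CUDA sketch shows a minimal build kernel for an open-addressing (linear-probing) hash table of 32-bit keys and values in device memory, using atomicCAS to claim slots. The hash function, table layout, and the absence of overflow handling are simplifying assumptions, not the implementation measured in the talk.

```cuda
// Minimal open-addressing hash-table build on the GPU (simplified sketch).
// Keys/values are 32-bit; empty slots are marked with EMPTY_KEY.
#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFu   // assumes this key value never occurs in the data

// Simple multiplicative hash — a stand-in for the hash functions compared in
// the talk (LSB, FNV-1a, Murmur3, ...).
__device__ __forceinline__ uint32_t hash32(uint32_t key, uint32_t mask) {
  return (key * 2654435761u) & mask;          // mask = numSlots - 1, power of two
}

__global__ void buildHashTable(const uint32_t* __restrict__ keys,
                               const uint32_t* __restrict__ values,
                               int n,
                               uint32_t* tableKeys,    // device memory, init to EMPTY_KEY
                               uint32_t* tableVals,    // device memory
                               uint32_t slotMask) {    // numSlots - 1
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  uint32_t key  = keys[i];
  uint32_t slot = hash32(key, slotMask);
  // Linear probing: atomicCAS claims the first empty slot. This generates the
  // random-access + compare-and-swap traffic bounded by the 4.6 GB/s above.
  while (true) {
    uint32_t prev = atomicCAS(&tableKeys[slot], EMPTY_KEY, key);
    if (prev == EMPTY_KEY) {       // slot claimed: store the payload
      tableVals[slot] = values[i];
      return;
    }
    slot = (slot + 1) & slotMask;  // collision: try the next slot
  }
}

// Host-side launch (error checking omitted; keep the load factor well below 1):
//   int threads = 256, blocks = (n + threads - 1) / threads;
//   buildHashTable<<<blocks, threads>>>(dKeys, dVals, n, dTabKeys, dTabVals, numSlots - 1);
```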
Drill Down: Hash Tables on GPUs
Computing Hash Functions on GTX580 – No Reads
32-bit keys, 32-bit hashes
  Hash function                 Key ingest [GB/s] (seq. keys + hash)
  LSB                                338
  Fowler-Noll-Vo 1a                  129
  Jenkins Lookup3                     79
  Murmur3                            111
  One-at-a-time                       85
  CRC32                               78
  MD5 (cryptographic digest)           4.5
  SHA-1 (cryptographic digest)         0.81
§ Threads generate sequential keys.
§ Hashes are XOR-summed locally (each thread: sequential keys → h(x) → XOR → sum).

Drill Down: Hash Tables on GPUs
Hash Table Probe: Keys and Values from/to Device Memory
32-bit hashes, 32-bit values, 1 GB hash table in device memory (load factor = 0.33)
  Hash function        Seq. keys + hash [GB/s]   HT probe, keys: dev, values: dev [GB/s]
  LSB                        338                        2.4
  Fowler-Noll-Vo 1a          129                        2.5
  Jenkins Lookup3             79                        2.4
  Murmur3                    111                        2.4
  One-at-a-time               85                        2.4
  CRC32                       78                        2.4
  MD5                          4.5                      1.8
  SHA-1                        0.81                     0.6
§ Keys are read from device memory.
§ 20% of the probed keys find a match in the hash table.
§ Values are written back to device memory.

Drill Down: Hash Tables on GPUs
Probe with Result Cache: Keys and Values from/to Host Memory
32-bit hashes, 32-bit values, 1 GB hash table in device memory (load factor = 0.33)
  Hash function        Seq. keys + hash [GB/s]   Probe dev/dev [GB/s]   Probe host/host [GB/s]
  LSB                        338                        2.4                    2.3
  Fowler-Noll-Vo 1a          129                        2.5                    2.4
  Jenkins Lookup3             79                        2.4                    2.3
  Murmur3                    111                        2.4                    2.3
  One-at-a-time               85                        2.4                    2.3
  CRC32                       78                        2.4                    2.3
  MD5                          4.5                      1.8                    1.8
  SHA-1                        0.81                     0.6                    0.6
§ Keys are read from host memory (zero-copy access).
§ 20% of the probed keys find a match in the hash table.
§ Individual values are written to a buffer in shared memory and then written back, coalesced, to host memory (zero-copy access).

Drill Down: Hash Tables on GPUs
End-to-End Comparison of Hash Table Probe: GPU vs. CPU
32-bit hashes, 32-bit values, 1 GB hash table (load factor = 0.33)
  Hash function        GTX580, keys/values: host [GB/s]   i7-2600, 4 cores / 8 threads [GB/s]   Speedup
  LSB                        2.3                                0.48                              4.8x
  Fowler-Noll-Vo 1a          2.4                                0.47                              5.1x
  Jenkins Lookup3            2.3                                0.46                              5.0x
  Murmur3                    2.3                                0.46                              5.0x
  One-at-a-time              2.3                                0.43                              5.3x
  CRC32                      2.3                                0.48 (1)                          4.8x
  MD5                        1.8                                0.11                               16x
  SHA-1                      0.6                                0.06                               10x
  (1) Uses the CRC32 instruction in SSE 4.2
§ The result cache is used in both implementations.
§ GPU: keys from host memory, values back to host memory.
§ CPU: software prefetching instructions for hash table loads.

From Hash Tables to Relational Joins
Processing Hundreds of Gigabytes in Seconds
§ Combining GPUs with fast storage: read the input tables on the fly from flash while creating and probing the hash table.
§ Storage solution delivering data at GPU join speed (>5.7 GB/s):
  – 3x 900 GB IBM Texas Memory Systems RamSan-70 SSDs
  – IBM General Parallel File System (GPFS)
DEMO: shown at IBM Information on Demand 2012 and SIGMOD 2013

Summary and Lessons Learned
§ Accelerators are not necessarily faster than well-tuned CPU code.
§ Don't underestimate the compute power and the aggregate memory bandwidth of general-purpose multi-socket systems.
§ Don't forget Amdahl's Law; it will bite you.
§ Offload larger portions: not just the hashing, but the complete hash table.
§ Take advantage of platform-specific strengths:
  – FPGA: customized data paths, pipeline parallelism
  – GPU: fast device memory, latency hiding through a large SMT degree
  – CPU: out-of-order architecture; SIMD and caches are extremely fast if used correctly
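To make the zero-copy probe configuration above concrete, the following CUDA sketch reads keys from pinned, mapped host memory, probes a linear-probing hash table in device memory, and writes matching key/value pairs back to host memory through a per-block result buffer in shared memory. The buffering scheme, hash function, and table layout are simplified assumptions and not the exact result-cache implementation measured in the talk.

```cuda
// Hash-table probe with zero-copy host I/O and a shared-memory result cache
// (simplified sketch). Table: open addressing, linear probing, in device memory.
#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFu

__device__ __forceinline__ uint32_t hash32(uint32_t key, uint32_t mask) {
  return (key * 2654435761u) & mask;
}

// probeKeys : keys to look up, in pinned, mapped host memory (zero-copy reads)
// outPairs  : (key, value) matches, written to pinned, mapped host memory
// outCount  : global counter used to reserve space in outPairs
__global__ void probeHashTable(const uint32_t* __restrict__ probeKeys, int n,
                               const uint32_t* __restrict__ tableKeys,
                               const uint32_t* __restrict__ tableVals,
                               uint32_t slotMask,
                               uint2* outPairs, unsigned int* outCount) {
  extern __shared__ uint2 cache[];            // one entry per thread in the block
  __shared__ unsigned int cacheCount, cacheBase;

  // Process the input in tiles of (gridDim.x * blockDim.x) keys.
  for (int base = blockIdx.x * blockDim.x; base < n; base += gridDim.x * blockDim.x) {
    if (threadIdx.x == 0) cacheCount = 0;
    __syncthreads();

    int i = base + threadIdx.x;
    if (i < n) {
      uint32_t key  = probeKeys[i];           // zero-copy read over PCIe
      uint32_t slot = hash32(key, slotMask);
      while (true) {                          // linear probing in device memory
        uint32_t k = tableKeys[slot];
        if (k == key) {                       // match: stage result in shared memory
          unsigned int pos = atomicAdd(&cacheCount, 1u);
          cache[pos] = make_uint2(key, tableVals[slot]);
          break;
        }
        if (k == EMPTY_KEY) break;            // key is not in the table
        slot = (slot + 1) & slotMask;
      }
    }
    __syncthreads();

    // Flush the result cache: reserve space once, then write coalesced to host.
    if (threadIdx.x == 0) cacheBase = atomicAdd(outCount, cacheCount);
    __syncthreads();
    for (unsigned int j = threadIdx.x; j < cacheCount; j += blockDim.x)
      outPairs[cacheBase + j] = cache[j];     // zero-copy write over PCIe
    __syncthreads();
  }
}

// Host-side setup (error checking omitted): allocate probeKeys and outPairs with
// cudaHostAlloc(..., cudaHostAllocMapped), get device pointers with
// cudaHostGetDevicePointer(), and launch with blockDim.x * sizeof(uint2) bytes of
// dynamic shared memory, e.g.:
//   probeHashTable<<<blocks, 256, 256 * sizeof(uint2)>>>(dProbeKeys, n,
//       dTabKeys, dTabVals, numSlots - 1, dOutPairs, dOutCount);
```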