Optimizing Hadoop – Is Bigger Better?
[email protected]
Exar Corporation, 48720 Kato Road, Fremont, CA
510-668-7000
March 2013
www.exar.com

Agenda
• Section I: Exar Introduction
  – Exar Corporate Overview
• Section II: Big Data Pain Points
  – Debunking the Top 5 Hadoop Myths
  – 3 Main System Constraints
• Section III: Hadoop Optimization Solution
  – Exar Hadoop Acceleration Solutions
• Section IV: Benchmarking Results
  – OEM 1 Results
  – OEM 2 Results
  – OEM 3 Results
• Section V: Summary

Exar At-A-Glance
A global leader in data management solutions and mixed-signal components:
• Well-established fabless IC company
  – 42 years of history in Silicon Valley
  – ~300 employees worldwide
  – Healthy balance sheet: $229M in assets
• Broad-based component and solution supplier
  – Specialty SoCs, FPGA/ASIC boards and software: DCS (Data Compression & Security)
  – Analog mixed-signal components: Interface, Power

Is Bigger Always Better?
It is not about the size of the Big Data deployment: return on investment is defined by optimal utilization of resources.

Debunking the Top 5 Hadoop Myths
1. More CPUs or more storage does not mean better analytics.
   • Myth: Increasing the number of jobs per node, or improving job processing time, requires more powerful nodes.
   • Reality: No! The solution is maximizing rack density and effectively utilizing resources (CPU, storage and memory).
2. Operational expenditure is a significant component of the 3-5 year TCO.
   • Myth: Capital expenditure is the primary contributor to the 3- or 5-year TCO.
   • Reality: No! Operational expenditure is a significant contributor to the TCO.
3. Storage scaling is significantly constrained by size and space.
   • Myth: Storage can scale easily.
   • Reality: No! Size, space and connectivity constrain scaling capacity.
4. Data node costs are driven by storage rather than CPUs.
   • Myth: Compute defines the data node cost.
   • Reality: No! Storage defines the node cost, and the ratio is often as high as 10:1 (storage to CPU).
5. For larger Hadoop clusters, reducing network (shuffle) traffic is key.
   • Myth: Network traffic reduction is not relevant to Hadoop TCO.
   • Reality: No! 10G WAN links are expensive; it is preferable to optimize traffic on 1G WAN links and to avoid or minimize 10G links.

Summary of Hadoop Cluster Constraints
Hadoop clusters can be optimized for storage, network bandwidth and compute resources.
• Storage Capacity – server OEMs are struggling to provide enough capacity to keep up with ever-growing data needs. E.g., a leading server OEM's latest configuration supports only 30 disks per server.
• Disk IOPS Bottleneck – the biggest bottleneck for data analytics is the disk IOPS limitation. E.g., even an optimally configured Hadoop system struggles to exceed 80% CPU utilization because disk I/O bandwidth cannot keep up, especially at high CPU-core-to-HDD ratios.
• Network Bandwidth – data is often replicated 3 times, and large clusters are distributed globally; minimizing bandwidth across the WAN and switch/hardware cost across the LAN is key. E.g., a leading eCommerce company runs 6 globally distributed clusters, each with 2,000-3,000 data nodes.

Exar Hadoop Optimization Solutions
Can Hadoop cluster TCO be reduced without impacting job execution time? Yes: by optimizing CPU, storage, memory and network bandwidth, Exar Hadoop Acceleration Solutions can lower cluster TCO by 20-40%.
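The storage arithmetic behind these constraints can be sketched with a simple sizing model. All parameters below (disks per node, disk size, dataset size) are illustrative assumptions, not Exar benchmark data:

```python
import math

def data_nodes_required(dataset_tb, replication=3, disks_per_node=12,
                        disk_tb=2.0, compression_ratio=1.0):
    """Data nodes needed to hold a dataset, given HDFS replication
    and an average compression ratio (3.0 means 3:1)."""
    stored_tb = dataset_tb * replication / compression_ratio
    return math.ceil(stored_tb / (disks_per_node * disk_tb))

# A 300 TB dataset, replicated 3x, on nodes with 12 x 2 TB disks:
baseline = data_nodes_required(300)                           # no compression
compressed = data_nodes_required(300, compression_ratio=3.0)  # 3:1, low end of 3x-6x
print(baseline, compressed)   # 38 vs. 13 nodes
```

Because replication multiplies every stored byte, even a modest compression ratio shrinks the node count dramatically in the storage-bound case; the deck's smaller 20-40% TCO figure reflects that compute, network and operational costs do not shrink proportionally.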
Exar Hadoop Acceleration Solution Overview
The Exar solution addresses all of the Hadoop cluster constraints listed earlier. Solution highlights:
• Storage Optimization – advanced data compression compresses input and output data, drastically reducing the storage requirement in each data node
• CPU Optimization – compression/decompression is offloaded from the CPU, freeing additional CPU cycles for data analytics
• Memory Management – advanced memory management optimizes system memory usage
• Network Bandwidth Optimization – compressing intermediate (shuffle) traffic optimizes network bandwidth

Exar offers a certified, plug-and-play Hadoop acceleration solution:
• No Code Change – the filter-layer software sits below HDFS; no APIs are required, and the software installs in minutes
• Standard HW – the offload card supports PCIe Gen 1 and Gen 2
• Linux Compatible – supports 6.x-generation Linux and works across RHEL, Ubuntu and SUSE
• Certified by Cloudera – certified on both CDH3 and CDH4
• OEM Tested – evaluated and benchmarked on leading OEM hardware, including IBM, HP, Dell and SuperMicro

Big Data (Hadoop) Optimization Solution
Exar solutions reduce the storage requirement and optimize system resource utilization. A Hadoop cluster accelerated with AltraSTAR consists of:
• CeDeFS filter-layer software, sitting below Hadoop MapReduce and the Hadoop file system
• Exar hardware accelerator
CeDeFS is a transparent filter-layer software that sits below HDFS.
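The "no code change" claim rests on transparency: applications read and write plain bytes while the layer beneath stores them compressed. A toy software sketch of that idea (purely illustrative; CeDeFS is a kernel-level filter with hardware offload, and zlib here merely stands in for the accelerator):

```python
import zlib

class TransparentStore:
    """Toy transparent-compression layer: callers see plain bytes,
    while the layer keeps them compressed underneath. Illustrative
    only; this is not Exar's CeDeFS implementation."""

    def __init__(self):
        self._on_disk = b""                        # what actually gets stored

    def write(self, data: bytes) -> None:
        self._on_disk = zlib.compress(data)        # compress on the way down

    def read(self) -> bytes:
        return zlib.decompress(self._on_disk)      # decompress on the way up

store = TransparentStore()
payload = b"hadoop intermediate shuffle data " * 1000
store.write(payload)
assert store.read() == payload                     # caller sees identical bytes
ratio = len(payload) / len(store._on_disk)         # effective capacity multiplier
print(round(ratio, 1))
```

Because the interface above the layer is unchanged, the same principle lets Hadoop jobs run unmodified while storage and shuffle traffic shrink.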
No code changes are required and the workflow remains the same. The Exar accelerator is an FPGA-based PCIe hardware accelerator. Benefits:
• 3x-6x increase in storage capacity in each node
• Enhanced CPU utilization and reduced runtime through I/O reduction and optimization; significantly benefits I/O-bound tasks
• Increased data density, which reduces shuffle traffic
• Reduced power, per node and per cluster
Stack: Linux system > CeDeFS + CeDeFN > Exar driver > storage volume / Exar offload card

Test Procedure
Validate the Exar acceleration solutions on typical Hadoop clusters:
1. Configure the system to the default Hadoop settings
2. Establish a benchmark for the native configuration (with LZO)
3. Rerun the tests with the Exar acceleration solution: disk reduction, network link optimization, large-file optimization
4. Quantify the results and calculate the ROI

Exar Hadoop Acceleration – OEM 1 Results
Exar's GX1745-based acceleration test results (cluster configuration, job execution and resource requirements, 300 TB dataset): with the Exar Hadoop accelerated solution, end users could reduce their capital expenditure by up to 40%.

Exar Hadoop Acceleration – OEM 2 Results
The OEM sorted 1 TB in an industry-leading time; Exar reduced the cost by 30%.
• Native configuration: Servers = 10, Expansion Units = 10
• Exar solution: Servers = 10, Expansion Units = 5

Exar Hadoop Acceleration – OEM 3 Results
The solution gave the flexibility to increase storage/CPU density per rack. TeraSort test on an AppSystem cluster (job execution and resource requirements):

  Test                              Native LZO   AltraSTAR + LZO   Performance Gain
  12 disks: Single Job (512 GB)     14m 15s      8m 9s             70%
  12 disks: Single Job (1 TB)       33m 32s      16m 0s            101%
  12 disks: Multiple Jobs (Job 2)   33m 36s      19m 3s            76%
  6 disks:  Single Job (512 GB)     21m 34s      12m 07s           77%

Reduce cost and improve performance through:
1. Improved performance
2. Removing disks, or using lower-capacity disks
3. Increased capacity

Exar Hadoop Acceleration – OEM 3 Results (continued)
Depending on whether the configuration is performance-maximized or cost-minimized, the Exar solution improved analytics performance by up to 70%, or reduced storage cost by up to 50%.

Acceleration Benchmarks
Exar Hadoop accelerated solutions outperformed CPU-only solutions. The implied or calculated results, with system resource optimization, shed light on 4 of the 5 Hadoop implementation myths:

  No.  Efficiency Parameter            Definition                           Native   Exar Acceleration   AltraSTAR Accel Gain
  1    Storage Density                 Effective storage per 40U rack       261      430                 61%
  2    CPU Cores to Hard Disks         Ratio of CPU cores to hard disks     1:2      1:1                 100%
  3    Cap-Ex Efficiency               $ capital investment per 1 GB sort   N/A      N/A                 27%
  4    Op-Ex Efficiency                kWh consumed per 1 GB sort           N/A      N/A                 20%

Exar Hadoop Acceleration Solution
The Exar acceleration solution addresses all of the Hadoop constraints:
• Significant ROI – highest rack density, lowest $/GB sort, most power-efficient, optimized network bandwidth
• Flexibility – caters to both disk-I/O-bound and CPU-bound solutions
• Certified – certified on all Cloudera releases, and tested on most major OEM hardware

Conclusion
• Hardware-accelerated compression provides meaningful acceleration as well as added capacity
• Acceleration plus added capacity means bigger jobs executed in less time
• Very significant savings in both CAPEX and OPEX

Ramana Jampala
Vice-President – Business Development
[email protected]
(732) 440-1280 x238
www.exar.com