SALSA: Flash-Optimized Software-Defined Storage
Nikolas Ioannou, Ioannis Koltsidas, Roman Pletka, Sasa Tomic, Thomas Weigold
IBM Research – Zurich
Flash Memory Summit 2015, Santa Clara, CA
© 2015 International Business Machines Corporation

New Market Category of Big Data Flash
§ Multiple workloads don’t really need the write performance and endurance of “good” Flash
  – In certain environments data is actually immutable
§ What matters is high density, low cost, and good read performance
  – Current Flash architectures are not a good fit
eBay: “We could live with 1/3rd the number of writes that normal flash supports as long as we could get it for 1/4th the price.”
§ IDC just introduced a new market category of Big Data Flash (March 2015)
§ Content repositories, media and streaming services, Big Data and analytics, NoSQL, Object storage, Web infrastructure
At <1$/GB for raw Flash, total acquisition cost becomes the same as an HDD-based solution, with much lower TCO. - IDC
Low-cost Flash technology (c-MLC, TLC)

Can’t we just use low-cost SSDs?
§ Low-cost Flash suffers from high write latency and low endurance
  - E.g., TLC, 3D-NAND, c-MLC
Raw low-cost SSDs are practically unusable in a real datacenter
§ Low-cost SSDs have limited resources and simple controllers to keep the cost as low as possible (~$0.4/GB!)
§ Therefore, they only employ simple Flash management
  - Sufficiently good read performance
  - But limited write endurance and terrible write performance
[Figure: Read Performance, latency (usec) vs. kIOPS for 4kB random reads. Annotation: > 50k IOPS @ 300 usec]
[Figure: Write Latency (usec) for 4kB random writes. Bar labels: 1800, 2200. Annotation: Almost as slow as an HDD!]
[Figure, continued: bar label 53. Devices compared: MLC PCI-e Card, SATA TLC SSD, 15k RPM HDD]

The characteristics of write performance
[Figure: Write Bandwidth with varying block sizes, Write Bandwidth (MB/s). Workloads: 256MB seq, 4kB rnd, 1MB rnd, 64MB rnd, 256MB rnd, 512MB rnd, 1024MB rnd, 1536MB rnd. Annotations: Sequential I/O on newly formatted drive; Multiple overwrites of the drive with Random I/O]

SoftwAre Log-Structured Array
What? A Flash-optimized I/O stack that elevates the performance and endurance of consumer-level SSDs to enterprise standards.
Why? Offer cost-effective all-Flash storage in public and private clouds, mainly for read-dominated workloads, complementing our high-end FlashSystem offerings.
How?
1. Use high-density, low-cost, off-the-shelf Flash SSDs
2. Move complexity from hardware to software to reduce cost
3. Optimize end-to-end for low Write Amplification
4. Employ aggressive Data Reduction
5. Natively support Object Storage
Squeeze the most capacity out of Flash
✓ Implements the state-of-the-art Flash Management in software
✓ Runs on Linux, exposes standard interfaces
  - File systems and applications run unmodified on top of SALSA
✓ Is ideal for cost-optimized scale-out storage systems like GPFS, CEPH
SALSA enables SDS on low-cost SSDs, offering high performance and endurance

SALSA Overview
SALSA Software Stack
Interfaces
  - Block
  - Object
  - I/O memory (RDMA)
Logical Layer
  - Workload Isolation
  - Thin Provisioning
  - Compression
  - De-duplication
  - Recurring Pattern Detection
  - Storage Virtualization
  - Quality of Service
  - Data Reduction
Physical Layer
  - Log-structured organization
  - Flash-friendly access patterns
  - State-of-the-art Garbage Collection
  - Zero Read-Modify-Writes
  - RAID5-equivalent protection
  - Small footprint
[Diagram labels: Log-Structured Array, Capacity Management, Traffic Shaping, Load Balancing, I/O handling]
Low-cost, high-density consumer SSDs (SATA; TLC, 3D NAND, c-MLC)
  - Limited resources (FPGA, CPU, RAM)
  - Light Flash Management, simple GC
Runs on Linux, Intel x86 and Power8

SALSA Stack
[Diagram: Logical Layer: interfaces (Block, I/O Memory, Object); Volume 1, Volume 2, Volume 3; Thin-provisioned space; De-duplication; Recurring Pattern Detection. Physical Layer: Global Garbage Collection; Segment; Grain; Write Destage Buffer; Parity Generation; Parallel writes to SSDs; Globally Shared Overprovisioning Space; Heat Segregation; Write Stream Separation]

SALSA Stack in Linux
[Diagram: User space: Device Mapper Frontend, SALSA Configuration & Tooling. Kernel: Device Mapper Kernel; Linux Block Layer; SALSA Logical Layer (Linux Device Mapper devices: /dev/mapper/vol0, /dev/mapper/vol1, /dev/mapper/vol2); SALSA Physical Layer (Linux Device Mapper device: /dev/mapper/array0); Configuration & RAS; Garbage Collection; I/O; Linux Block Layer; SSD Device Driver. Hardware.]

Experiments – Block Storage
§ Using SALSA in a commodity Linux server to create an array out of 5 SSDs
  - With RAID5-equivalent parity protection
§ Comparing against RAID0, RAID5 on the same SSDs
[Figure: Random (100/0 R/W) - Reads, Read Latency (msec) vs. Read Throughput (kIOPS). Series: RAID0, RAID5, SALSA]
[Figure: Random (80/20 R/W) – Total IOPS, Latency (msec) vs. Throughput (kIOPS). Series: RAID0, RAID5, SALSA. Annotations: 41x, 13x]
SALSA dramatically improves performance in the presence of writes

CEPH on SALSA
§ 3-node x86 cluster
§ 10 Gbit Ethernet network
§ 2 x 1TB TLC SSDs per node
§ Replication factor of 3
§ Mixed read/write random I/O
[Diagram: Node 1, Node 2, Node 3, each running CEPH on XFS; baseline on raw SSDs, SALSA configuration with SALSA between XFS and the SSDs]
[Figure: Throughput (KB/s) vs. Time (seconds). Series: Baseline (CEPH on raw SSDs), CEPH on SALSA. Annotation: 39x at steady state]
SALSA can enable CEPH on Flash with high performance at a low cost!

Performance – Virtualized TPC-E
SALSA vs. RAID5 using 5 x 1TB TLC SSDs
TPC-E
§ OLTP benchmark that simulates the workload of a brokerage firm
§ Running against DB2 in KVM guest
§ 90% Reads / 10% Writes
[Diagram: Linux Guest running DB2; Host running SALSA over 5 SSDs]
[Figure: TPC-E Transactional Throughput, Transactions Per Second (tps) vs. Time (sec). Series: RAID5, SALSA. Annotation: 3.8x higher transactional throughput]
[Figure: TPC-E Transactional Latency, Transaction Latency (msec) vs. Time (sec). Series: RAID5, SALSA. Annotation: 6.4x lower transactional latency]

Endurance
§ Test using an off-the-shelf low-cost SSD (0.4 $/GB).
§ We measured the wear of the device, as reported by vendor-specific S.M.A.R.T. attributes.
§ Comparing the wear incurred by SALSA to the wear incurred using the raw device
[Figure: Device Wear vs. Full Device Writes. Series: Raw, SALSA. Annotation: 4.6x]
SALSA prolongs the SSD lifetime by 4.6 times!
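
One way to read the 4.6x figure is as a ratio of wear rates: if wear grows roughly linearly with the number of full device writes, the lifetime gain is the raw device's wear per full write divided by SALSA's. The Python sketch below shows only that arithmetic. The function names are hypothetical, the linear-wear model is an assumption, and the sample readings are made-up values chosen so the ratio lands on the reported 4.6x; they are not data read off the chart.

```python
def wear_rate(full_device_writes, wear_readings):
    """Least-squares slope through the origin: wear units per full device write."""
    num = sum(x * w for x, w in zip(full_device_writes, wear_readings))
    den = sum(x * x for x in full_device_writes)
    return num / den


def lifetime_gain(rate_raw, rate_salsa):
    """Under a fixed wear budget, lifetime scales inversely with the wear rate."""
    return rate_raw / rate_salsa


# Hypothetical readings (NOT taken from the slide), one per measurement point.
writes = [2, 4, 6, 8, 10]                   # full device writes
raw_wear = [23.0, 46.0, 69.0, 92.0, 115.0]  # wear counter on the raw device
salsa_wear = [5.0, 10.0, 15.0, 20.0, 25.0]  # wear counter under SALSA

gain = lifetime_gain(wear_rate(writes, raw_wear), wear_rate(writes, salsa_wear))
print(round(gain, 2))  # 4.6
```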

Conclusion
§ Low-cost Flash is in high demand
§ Many workloads could benefit tremendously from capacity-optimized Flash
§ SALSA is a Flash-optimized storage virtualization stack for Linux
  - Shifts the complexity of the FTL to software
  - Transforms user access patterns to be as Flash-friendly as possible
  - Elevates the performance and endurance of low-cost SSDs to enterprise standards
§ File systems & applications do not need to be modified
[Diagram labels: Write Amplification; Performance; Capacity; Device Lifetime]

Questions?
www.research.ibm.com/labs/zurich/cci/
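
As background to the deck's points on log-structured organization, garbage collection, and shared overprovisioning, the toy model below sketches how write amplification arises in a log-structured array. Every write is appended out of place, the old copy is invalidated, and a garbage collector relocates the remaining valid pages of a victim segment before reusing it. This is an illustrative sketch and not SALSA's implementation: the class and function names, the greedy "fewest valid pages" victim policy, and all sizes are assumptions made for the example.

```python
import random


class LogStructuredArray:
    """Toy log-structured array: out-of-place writes plus greedy garbage collection."""

    def __init__(self, num_segments, pages_per_segment, logical_pages):
        # The open segment and one reserve segment must stay free of pinned user data.
        assert 0 < logical_pages < (num_segments - 2) * pages_per_segment
        self.pages_per_segment = pages_per_segment
        self.segments = [[] for _ in range(num_segments)]  # slot -> LBA or None
        self.valid = [0] * num_segments                    # valid pages per segment
        self.location = {}                                 # LBA -> (segment, slot)
        self.free = list(range(1, num_segments))
        self.open = 0
        self.user_writes = 0
        self.physical_writes = 0

    def _append(self, lba):
        """Append one page to the open segment, opening a free one if it is full."""
        if len(self.segments[self.open]) == self.pages_per_segment:
            self.open = self.free.pop()
        seg = self.open
        self.segments[seg].append(lba)
        self.valid[seg] += 1
        self.location[lba] = (seg, len(self.segments[seg]) - 1)
        self.physical_writes += 1

    def write(self, lba):
        """User write: invalidate the old copy, append the new one, then reclaim space."""
        if lba in self.location:
            seg, slot = self.location[lba]
            self.segments[seg][slot] = None
            self.valid[seg] -= 1
        self._append(lba)
        self.user_writes += 1
        while len(self.free) < 2:
            self._collect()

    def _collect(self):
        """Greedy GC: relocate the valid pages of the emptiest closed segment, then free it."""
        free = set(self.free)
        closed = [s for s in range(len(self.segments))
                  if s != self.open and s not in free]
        victim = min(closed, key=lambda s: self.valid[s])
        for lba in [x for x in self.segments[victim] if x is not None]:
            self._append(lba)  # relocation counts as a physical write
        self.segments[victim] = []
        self.valid[victim] = 0
        self.free.append(victim)


def measure_write_amplification(logical_pages, num_segments=64,
                                pages_per_segment=32, overwrites=20000, seed=0):
    """Fill the array once, apply uniform random overwrites, return physical/user writes."""
    rng = random.Random(seed)
    lsa = LogStructuredArray(num_segments, pages_per_segment, logical_pages)
    for lba in range(logical_pages):
        lsa.write(lba)
    for _ in range(overwrites):
        lsa.write(rng.randrange(logical_pages))
    return lsa.physical_writes / lsa.user_writes


if __name__ == "__main__":
    for logical in (1024, 1536):
        print(logical, round(measure_write_amplification(logical), 2))
```

With the physical size held fixed, lowering `logical_pages` leaves more spare space, so victims tend to hold fewer valid pages and the measured write amplification should drop. That is the intuition behind sharing overprovisioning space globally instead of fixing it per device.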