DARC: Design and Evaluation of an I/O Controller for Data Protection
M. Fountoulakis, M. Marazakis, M. Flouris, and A. Bilas
{mfundul,maraz,flouris,bilas}@ics.forth.gr
Institute of Computer Science (ICS), Foundation for Research and Technology – Hellas (FORTH)

Ever-increasing demand for storage capacity
- 6x growth: 161 exabytes in 2006, 988 exabytes projected for 2010
- 1/4 newly created, 3/4 replicas
- 70% created by individuals; 95% unstructured
[Source: IDC report, "The Expanding Digital Universe", 2007]

Motivation
- With increased capacity comes increased probability of unrecoverable read errors: URE probability ~10^-15 for FC/SAS drives (~10^-14 for SATA)
- These are "silent" errors, i.e. exposed only when the data are consumed by applications, much later than the write
- Dealing with silent data errors on storage devices becomes critical as more data are stored on-line, on low-cost disks
- Accumulation of data copies (verbatim or with minor edits) increases the probability of human error
- Device-level & controller-level defenses exist in enterprise storage: disks with EDC/ECC for stored data (520-byte sectors, background data scrubbing), and storage controllers for continuous data protection (CDP)
- What about mainstream systems? Example: mid-scale direct-attached storage servers

Our Approach: Data Protection in the Controller
(1) Use persistent checksums for error detection; if an error is detected, use the second copy of the mirror for recovery
(2) Use versioning to deal with human errors; after a failure, revert to a previous version
- Perform both techniques transparently to (a) the devices, so any type of (low-cost) device can be used, and (b) the file system and host OS, so only a "thin" host driver is needed
- Potential for high-rate I/O: make use of the controller's specialized data path & hardware resources, performing (some) computations on data while they are in transit and offloading work from the host CPUs

Technical Challenges: Error Detection
- Compute an EDC, per data block, on the common I/O path (a minimal sketch follows this list)
- Maintain a persistent EDC per data block
- Minimize the impact of EDC retrieval, and of EDC calculation & comparison
- Large amounts of state/control information need to be computed, stored, and updated in-line with I/O processing
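The slides do not include code for the EDC check; the sketch below is a minimal illustration of a per-4KB-block CRC32-C check, assuming a software crc32c() helper and an in-memory stand-in (edc_table) for the persistent checksum store. In DARC itself the CRC32-C computation is offloaded to the controller's DMA engine during transfers, as described later in the deck.

```c
/* Minimal sketch of per-block EDC compute/verify. The edc_table
 * stand-in and the software crc32c() are illustrative assumptions,
 * not DARC's actual implementation. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096  /* EDC granularity: one checksum per 4KB block */

/* Bitwise CRC32-C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
    }
    return ~crc;
}

/* Stand-in for the persistent per-block checksum store. */
extern uint32_t edc_table[];

/* Write path: compute and store the checksum for a block. */
void edc_update_block(uint64_t lbn, const uint8_t *block)
{
    edc_table[lbn] = crc32c(block, BLOCK_SIZE);
}

/* Read path: 0 if the block passes the check, -1 on a mismatch,
 * which would trigger the RAID-1 recovery protocol (backup slides). */
int edc_verify_block(uint64_t lbn, const uint8_t *block)
{
    return crc32c(block, BLOCK_SIZE) == edc_table[lbn] ? 0 : -1;
}
```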
Technical Challenges: Versioning
- Versioning of storage volumes: a timeline of volume snapshots
- Which blocks belong to each version of a volume?
- Maintain persistent data structures that grow with the capacity of the original volumes, updated upon each write and accessed for each read as well
- Need to sustain high I/O rates for versioned volumes, keeping a timeline of written blocks & purging blocks from discarded versions ... while also verifying the integrity of the accessed data blocks

Outline
- Motivation & Challenges
- Controller Design: Host-Controller Communication; Buffer Management; Context & Transfer Scheduling; Storage Virtualization Services
- Evaluation
- Conclusions

Host-Controller Communication
- Options for the transfer of commands: PIO vs. DMA. PIO is simple, but has high CPU overhead; DMA gives high throughput, but completion detection is complicated (options: polling, interrupts)
- I/O commands [transferred via host-initiated PIO]: SCSI command descriptor block + DMA segments; the DMA segments reference host-side memory addresses
- I/O completions [transferred via controller-initiated DMA]: status code + reference to the originally issued I/O command

Controller memory use
- Uses of memory in the controller: pages to hold data to be read from storage devices; pages to hold data being written out by the host; I/O command descriptors & status information
- The overhead of memory management is critical for the I/O path: state-tracking "scratch space" is needed per I/O command, and arbitrary sizes may appear in DMA segments, not matching block-level I/O size & alignment restrictions
- Dynamic arbitrary-size allocations using the Linux APIs are too expensive at high I/O rates

Buffer Management
- Buffer pools: pre-allocated, fixed-size, with O(1) allocation/de-allocation overhead (see the sketch after this list)
- Two classes: 64KB buffers for application data, 4KB buffers for control information
- Lazy de-allocation: de-allocate only when idle, or under extreme memory pressure; a trade-off between space-efficiency and latency
- Command & completion FIFO queues for host-controller communication: statically allocated, with fixed-size elements
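As a minimal sketch of the buffer-pool idea described above: a pre-allocated free-list pool with O(1) get/put, instantiated once per class (64KB data buffers, 4KB control buffers). All names and the free-list structure are illustrative assumptions, not taken from the DARC sources.

```c
/* Sketch of a pre-allocated, fixed-size buffer pool with O(1)
 * alloc/free: no allocator calls on the I/O path. */
#include <stdlib.h>

struct buf_pool {
    void  **free_list;  /* stack of pointers to free buffers */
    size_t  top;        /* number of currently free buffers */
    size_t  nbufs;      /* pool capacity */
    size_t  buf_size;   /* fixed element size (e.g. 64KB or 4KB) */
};

static int pool_init(struct buf_pool *p, size_t nbufs, size_t buf_size)
{
    p->free_list = malloc(nbufs * sizeof(void *));
    if (!p->free_list)
        return -1;
    for (p->top = 0; p->top < nbufs; p->top++) {
        p->free_list[p->top] = malloc(buf_size);  /* pre-allocate once */
        if (!p->free_list[p->top])
            return -1;
    }
    p->nbufs = nbufs;
    p->buf_size = buf_size;
    return 0;
}

/* O(1): pop a free buffer, or NULL if the pool is exhausted. */
static void *pool_get(struct buf_pool *p)
{
    return p->top ? p->free_list[--p->top] : NULL;
}

/* O(1): push a buffer back onto the free list. */
static void pool_put(struct buf_pool *p, void *buf)
{
    p->free_list[p->top++] = buf;
}
```

Lazy de-allocation, as described on the slide, would defer returning memory to the system until the controller is idle or under extreme memory pressure, trading space-efficiency for latency on the common path.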
Context Scheduling
- Identify the I/O path stages and map stages to threads; don't use FSMs, which are difficult to extend in complex designs
- Each stage serves several I/O requests at a time
- Explicit thread scheduling: yield when waiting
- Overlap transfers with computation: keep I/O commands and completions in flight while device transfers are being initiated, and avoid starvation/blocking of either side!
- No processing in IRQ context
- Default fair scheduler vs. static FIFO scheduler: they differ in yield behavior (evaluated later)

I/O Path - WRITE (no cache, CRC)
[Diagram: WRITE-path stages, from host to devices and back: the ISSUE, NEW-WRITE, and OLD-WRITE work-queues; CRC computation on an ADMA channel; submit_bio() to the SAS/SCSI controller; the IRQ and soft-IRQ I/O-completion handlers; and the WRITE-COMPLETION work-queue, which stores the CRC, checks for DMA completion, and reports back to the host.]

I/O Path - READ (no cache, CRC)
[Diagram: READ-path stages: the ISSUE, NEW-READ, and OLD-READ work-queues; submit_bio() to the SAS/SCSI controller; CRC computation on an ADMA channel; the IRQ and soft-IRQ I/O-completion handlers; and the READ-COMPLETION work-queue, which looks up & checks the stored CRC, checks for DMA completion, and reports back to the host.]

Storage Virtualization Services
- DARC uses the Violin block-driver framework for volume virtualization & versioning [M. Flouris and A. Bilas, Proc. MSST, 2005]
- Volume management: RAID-10
- EDC checking: a 32-bit CRC32-C checksum per 4KB block
- Versioning: a timeline of snapshots of storage volumes
- Persistent data structures, accessed & updated in-line with each I/O access: the logical-to-physical block map, the live-block map, and the block-version map

Storage Virtualization Layers in DARC
[Diagram: the controller's layer stack: Host-Controller Communication & I/O Command Processing, over Versioning, over RAID-0, over two RAID-1 pairs, with each mirror leg passing through an EDC layer on top of /dev/sda, /dev/sdb, /dev/sdc, /dev/sdd.]

Block-level metadata issues
- Performance: every read & write request requires a metadata lookup, and metadata I/Os are small, random, and synchronous
- Can we just store the metadata in memory? Memory footprint: the translation tables need a 64-bit address per 4KB block, i.e. 2 GBytes per TByte of disk space. Too large to fit in memory! Solution: a metadata cache (a sketch follows this list)
- Persistence: metadata are critical, since losing metadata results in data loss! Writes induce metadata updates that must be written to disk, and the only safe way to be persistent, synchronous writes, is too slow! Solutions: journaling, versioning, use of NVRAM, ...
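The slides leave the metadata cache unspecified; the sketch below shows one minimal possibility, a direct-mapped cache over the logical-to-physical block map, keeping only a fraction of the 2 GB-per-TB table resident. The structure, sizing, and miss policy are all illustrative assumptions, not DARC's design.

```c
/* Minimal direct-mapped metadata-cache sketch for the
 * logical-to-physical block map (one 64-bit physical address per 4KB
 * logical block). Only CACHE_ENTRIES translations are resident. */
#include <stdint.h>

#define CACHE_ENTRIES (1u << 20)  /* 1M cached translations: a small
                                     fraction of the full on-disk map */

struct map_entry {
    uint64_t lbn;    /* logical block number (tag) */
    uint64_t pbn;    /* physical block number */
    int      valid;
};

static struct map_entry map_cache[CACHE_ENTRIES];

/* Assumed backing-store accessor: a small, synchronous metadata read,
 * exactly the kind of I/O the slide warns about. */
extern uint64_t map_read_from_disk(uint64_t lbn);

uint64_t map_lookup(uint64_t lbn)
{
    struct map_entry *e = &map_cache[lbn % CACHE_ENTRIES];
    if (!e->valid || e->lbn != lbn) {   /* miss: fetch from disk */
        e->lbn = lbn;
        e->pbn = map_read_from_disk(lbn);
        e->valid = 1;
    }
    return e->pbn;
}
```

Every miss costs a small, random, synchronous read on the common I/O path, which is why the footprint/persistence trade-offs on this slide matter so much at high I/O rates.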
What about controller on-board caching?
- Typically, I/O controllers have an on-board data cache, used to exploit temporal locality (recently-accessed data blocks), to read ahead for spatial locality (prefetch adjacent data blocks), and to coalesce small writes (e.g. partial-stripe updates with RAID-5/6)
- Many intertwined design decisions are needed, and the RAID level affects the cache implementation, for both performance and failures (degraded RAID operation)
- DARC has a simple block cache, but it is not enabled in the evaluation experiments reported in this paper: all available memory is used for buffers that hold in-progress I/O commands, their associated data _and_ metadata for the data-protection functionality

Outline
- Motivation & Challenges
- Controller Design: Host-Controller Communication; Buffer Management; Context & Transfer Scheduling; Storage Virtualization Services
- Evaluation: IOP348 embedded platform; micro-measurements & synthetic I/O patterns; application benchmarks
- Conclusions

Experimental Platform
- Controller: Intel 81348-based development kit; 2 XScale CPU cores; 1GB DRAM; Linux 2.6.24 + Intel patches (isc81xx driver)
- Storage: 8 SAS HDDs, Seagate Cheetah 15.5k (15k RPM, 72GB)
- Host: Tyan S5397, 4GB DRAM, MS Windows 2003 Server (32-bit)
- Comparison with the ARC-1680 SAS controller, which is built on the same hardware platform as our development kit

I/O Stack in DARC - "DAta pRotection Controller"
[Diagram]

Intel IOP348 Data Path
[Diagram: SRAM (128 KB), DMA engines, special-purpose data path, messaging unit.]

Intel IOP348
[Diagram; runs a Linux 2.6.24 kernel (32-bit) + Intel IOP patches (isc81xx driver).]

"Raw" DMA Throughput
[Chart: DMA throughput (MB/sec), host-to-HBA and HBA-to-host, for transfer sizes of 4KB to 64KB.]

Streaming I/O Throughput (RAID-0, IOmeter RS pattern, 8 SAS HDDs)
[Chart: throughput (MB/sec) vs. queue depth (1 to 64) for ARC-1680, DARC, DARC (LARGE-SG), and DARC with the default Linux allocator; the last configuration shows a throughput collapse.]

IOmeter results: RAID-10, OLTP pattern
[Chart: IOPS vs. queue depth (1 to 64) for ARC-1680 and DARC, OLTP (4KB) pattern.]

IOmeter results: RAID-10, FS pattern
[Chart: IOPS vs. queue depth (1 to 64) for ARC-1680 and DARC, FS pattern.]

TPC-H (RAID-10, 10-query sequence)
[Chart: execution time (seconds) for ARC-1680 and for the DARC NO-EDC, EDC, and EDC+VERSION configurations; EDC adds ~12% to the NO-EDC execution time, and versioning a further ~2.5%.]

JetStress (RAID-10, 1000 mailboxes, 1.0 IOPS per mailbox)
[Chart: achieved IOPS on the log volume and the data volume (total, write, read), for ARC-1680 with write-through and write-back caching, and for the DARC NO-EDC, EDC, and EDC+VERSION configurations.]

Conclusions
- Incorporation of data-protection features in a commodity I/O controller: integrity protection using persistent checksums, and versioning of storage volumes
- Several challenges arise in implementing an efficient I/O path between the host machine & the controller
- Based on a prototype implementation, using real hardware: the overhead of EDC checking is 12-20%, depending on the number of concurrent I/Os; the overhead of versioning is 2.5-5% with periodic (frequent) capture & purge, depending on the number and size of writes

Lessons learned from the prototyping effort
- CPU overhead at the controller is an important limitation at high I/O rates; we expect the CPU to have to issue/manage even more operations on data in the future, so offload at every opportunity
- It is essential to be aware of data-path intricacies in order to achieve high I/O rates: overlap transfers to/from the host with transfers to/from the storage devices efficiently
- There is an emerging need for handling persistent metadata along the common I/O path, with increasing complexity and increased consumption of storage-controller resources

Thank you for your attention! Questions?
"DARC: Design and Evaluation of an I/O Controller for Data Protection"
Manolis Marazakis, [email protected]
http://www.ics.forth.gr/carv

Silent Error Recovery using RAID-1 and CRCs
[Diagram]

Recovery Protocol Costs

  Case                                            Data I/Os  CRC I/Os  CRC calc's  Outcome
  RAID-1 pair data differ, CRC matches one block      3          0          2      Data recovery, re-issue I/O
  RAID-1 pair data identical, CRC does not match      2          1          2      CRC recovery
  RAID-1 pair data differ, CRC does not match         2          0          2      Data error, alert Host

(A sketch of the corresponding decision logic follows.)
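The sketch below restates the recovery table as code, assuming the stored CRC is already at hand from the read that failed its check (so it is not counted as a CRC I/O) and that the helpers (read_mirror, rewrite_block, store_crc, alert_host, crc32c) exist; it illustrates the protocol from the table, not the DARC source.

```c
/* Sketch of the RAID-1 + CRC silent-error recovery protocol.
 * All helper functions are assumed for illustration. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

extern void read_mirror(uint64_t lbn, int side, uint8_t *buf); /* data I/O */
extern uint32_t crc32c(const uint8_t *buf, size_t len);        /* CRC calc */
extern void rewrite_block(uint64_t lbn, const uint8_t *buf);   /* data I/O */
extern void store_crc(uint64_t lbn, uint32_t crc);             /* CRC I/O  */
extern void alert_host(uint64_t lbn);

/* 'stored' is the persistent CRC, already fetched by the failed read. */
void recover_block(uint64_t lbn, uint32_t stored)
{
    uint8_t a[BLOCK_SIZE], b[BLOCK_SIZE];

    read_mirror(lbn, 0, a);               /* data I/O #1 */
    read_mirror(lbn, 1, b);               /* data I/O #2 */
    uint32_t ca = crc32c(a, BLOCK_SIZE);  /* CRC calc #1 */
    uint32_t cb = crc32c(b, BLOCK_SIZE);  /* CRC calc #2 */

    if (memcmp(a, b, BLOCK_SIZE) != 0) {
        /* Mirror copies differ. */
        if (ca == stored)
            rewrite_block(lbn, a);  /* data I/O #3: data recovery,
                                       re-issue the I/O with copy A */
        else if (cb == stored)
            rewrite_block(lbn, b);  /* data recovery with copy B */
        else
            alert_host(lbn);        /* both copies suspect: data error */
    } else if (ca != stored) {
        /* Copies identical but CRC mismatched: the CRC itself is bad. */
        store_crc(lbn, ca);         /* CRC I/O #1: CRC recovery */
    }
}
```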
Selection of Memory Regions
- Non-cacheable, no write-combining: the controller's hardware resources (control registers), DMA descriptors, and the Intel SCSI driver's command allocations
- Non-cacheable + write-combining: controller outbound PIO to host memory, and the Completion FIFO
- Cacheable + write-combining (with explicit cache management): CRCs, allocated along with the other data to be processed, and the Command FIFO

[Diagram: the host-controller interface across the PCI Express interconnect: the Command FIFO is filled by host-issued PIO, the Completion FIFO by controller-initiated DMA.]

[Diagram: issue and completion paths inside the controller: SCSI commands are dequeued from the Command FIFO by the Issue Thread (SCSI-to-block translation, CRC generation on the issue path); data move through the Read/Write DMA Threads and the Block I/O Reads/Writes Threads, with the storage services and integrity check on the completion path; the Read and Write Completion Threads enqueue I/O completions to the Completion FIFO via DMA, with completion processing scheduled from the interrupt context.]

Prototype Design Summary

  Challenge            Design Decision
  Host-Controller I/F  PIO for commands, DMA for completions and data
  Buffer management    Pre-allocated buffer pools, lazy de-allocation, fixed-size ring buffers (command/completion FIFOs)
  Context scheduling   Map stages to work-queues (threads), explicit scheduling, no processing in IRQ context
  On-board Cache       [Optional] for data blocks, "closest" to the host
  Data Protection      Violin framework within the Linux kernel: RAID-10 volumes, versioning (based on re-mapping), persistent metadata including EDC: CRC32-C checksums, computed per 4KB block by the DMA engine during transfers, persistently stored within dedicated metadata space

Impact of PIO on DMA Throughput
[Chart: throughput (MB/sec) of 2-way 8KB DMA transfers (to-host, from-host), with host-issued PIO on vs. off.]

IOP348 Micro-benchmarks

  IOP348 clock cycle: 0.833 nsec (1.2 GHz)

  Operation                                                          Latency    Cycles
  Interrupt delay, context switch                                    837 nsec   1004.8
  Memory store                                                       99 nsec    118.8
  Local-bus store                                                    30 nsec    36
  Outbound store (PIO write, to host)                                114 nsec   136.8
  Outbound load (PIO read, from host)                                674 nsec   809.1
  Outbound load with DMA transfers                                   3390 nsec  4069.6
  Outbound load with DMA transfers and inbound PIO writes from host  5970 nsec  7166.8

  Host clock cycle: 0.5 nsec (2.0 GHz); host-initiated PIO write: 100 nsec (200 cycles)

Impact of Linux Scheduling Policy [with PIO completions]
[Chart: RS IOmeter pattern, throughput (MB/sec) vs. queue depth (1 to 64) for ARC-1680, the raw to-host DMA limit, DARC (FAIR-SCHED), and DARC (FIFO-SCHED).]

I/O Workloads
- IOmeter patterns:
  RS, WS: 64KB sequential read/write streams
  OLTP (4KB): random 4KB I/O, 33% writes
  FS: file server (random, misc. sizes, 20% writes): 68% 4KB, 15% 8KB, 2% 16KB, 6% 32KB, 7% 64KB, 1% 128KB, 1% 512KB
  WEB: web server (random, misc. sizes, 100% reads): 80% 4KB, 2% 8KB, 4% 16KB, 4% 32KB, 10% 64KB
- Database workload: TPC-H (4GB dataset, 10 queries)
- Mail-server workload: JetStress (1000 100MB mailboxes, 1.0 IOPS/mbox; 25% insert, 10% delete, 50% replace, 15% read)

Co-operating Contexts (simplified)
[Diagram: the ISSUE context picks up SCSI commands and handles SCSI control commands; pre-allocated buffer pools with lazy de-allocation supply buffers for write data (DMA from host) and read data (DMA to host); block-level I/O is issued as BIOs and completed via END_IO; SCSI completions are returned to the host.]

Application DMA Channel (ADMA)
- Device interface: a chain of transfer descriptors, where a transfer descriptor := (SRC, DST, byte-count, control-bits) and SRC, DST are physical addresses, at the host or the controller
- The chain of descriptors is held in controller memory ... and may be expanded at run time
- Completion detection: ADMA channels report (1) their running/idle state, and (2) the address of the descriptor for the currently-executing (or last) transfer
- Ring buffer of transfer-descriptor IDs, each a (transfer descriptor address, epoch) pair; IDs are reserved/released out-of-order, as DMA transfers complete
- Interface (a usage sketch follows this list):
  DMA_Descriptor_ID post_DMA_transfer(host address, controller address, direction of transfer, size of transfer, CRC32C address)
  Boolean is_DMA_transfer_finished(DMA descriptor identifier)
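A sketch of how this interface might be used to post a transfer and detect its completion. Only the two function names come from the slide; the types, the epoch-tagged descriptor ID layout, and the polling loop are illustrative assumptions.

```c
/* Sketch of posting an ADMA transfer and polling for its completion,
 * using the interface named above. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { TO_HOST, FROM_HOST } dma_dir_t;

/* (Transfer Descriptor Address, Epoch): the epoch disambiguates
 * descriptor slots that get reused as the ring buffer wraps around. */
typedef struct {
    uint32_t desc_addr;
    uint32_t epoch;
} DMA_Descriptor_ID;

extern DMA_Descriptor_ID post_DMA_transfer(uint64_t host_addr,
                                           uint64_t ctrl_addr,
                                           dma_dir_t dir,
                                           uint32_t size,
                                           uint64_t crc32c_addr);
extern bool is_DMA_transfer_finished(DMA_Descriptor_ID id);

/* Example: push one 64KB data buffer to the host, letting the DMA
 * engine deposit the CRC32-C of the data at crc_slot along the way
 * (per the design summary, CRCs are computed during transfers). */
void send_read_data(uint64_t host_buf, uint64_t ctrl_buf,
                    uint64_t crc_slot)
{
    DMA_Descriptor_ID id =
        post_DMA_transfer(host_buf, ctrl_buf, TO_HOST, 65536, crc_slot);

    /* The real I/O path yields to other work-queue threads rather
     * than spinning; see the context-scheduling slides. */
    while (!is_DMA_transfer_finished(id))
        ; /* yield() in the actual design */
}
```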
Command FIFO: Using DMA
[Diagram: the host enqueues commands at the tail of a queue in host memory; the controller initiates DMA across the PCIe interconnect to pull valid elements to its head. The controller needs to know the tail at the host, and the host needs to know the head at the controller.]

Command FIFO: Using PIO
[Diagram: the host executes PIO writes across the PCIe interconnect to enqueue elements directly into the controller's queue, followed by pointer updates. The host needs to know the head at the controller, and the controller needs to know the tail at the host.]

Completion FIFO
- PIO is expensive for the controller CPU, so we use DMA for the Completion FIFO queue
- Completion transfers can be piggy-backed on data transfers (for reads)

Command & Completion FIFO Implementation
- The IOP348 ATU-MU provides circular queues: 4-byte elements, up to 128KB, but with significant management overheads
- Instead, we implemented the FIFOs entirely in software, memory-mapped across PCIe for direct access by DMA and PIO

Context Scheduling
- Multiple I/O commands are in flight at any one time; I/O command processing actually proceeds in discrete stages, with several events/notifications triggered at each (e.g. DMA transfer start/completion, disk I/O start/completion, ...)
- Option I: event-driven. Design (and tune) a dedicated FSM around the many events that occur during I/O processing
- Option II: thread-based. Encapsulate the I/O processing stages in threads, and schedule the threads
- We used the thread-based option, on a full Linux OS: programmable, with the infrastructure in place to build advanced functionality more easily ... but with more software layers, and less control over the timing of events/interactions

Scheduling Policy
- Threads (work-queues) instead of FSMs: simpler to develop, re-factor & debug; threads can block independently of one another; events are grouped within threads
- The default Linux scheduler (SCHED_OTHER) is not optimal: threads need to be explicitly pre-empted when polling on a resource
- Custom scheduling, based on the SCHED_FIFO policy: static priorities, no time-slicing (run until complete/yield); all threads at the same priority level (strict FIFO); no dynamic thread creation; thread order precisely follows the I/O path
- Crucial to understand the exact sequence of events: yield() explicitly when polling, or when "enough" work has been done, and always yield() when a resource is unavailable (a sketch follows this list)
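DARC's stages run as kernel work-queues; the user-space pthread analogue below illustrates the same discipline. Only the SCHED_FIFO policy, the single shared priority level, and the explicit-yield rule come from the slides; the stage helpers, batch size, and priority value are assumptions.

```c
/* Sketch of a work-queue stage thread under SCHED_FIFO: run until
 * "enough" work is done or the resource runs dry, then yield so the
 * next stage (in I/O-path order) runs. */
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

#define WORK_BATCH 16   /* "enough" work before yielding: assumed value */

extern bool stage_has_work(void);    /* e.g. FIFO non-empty, DMA done */
extern void stage_process_one(void);

static void *stage_thread(void *arg)
{
    (void)arg;
    for (;;) {
        int done = 0;
        while (stage_has_work() && done < WORK_BATCH) {
            stage_process_one();
            done++;
        }
        /* Run-until-yield: with all threads at the same SCHED_FIFO
         * priority, this passes control to the next runnable stage. */
        sched_yield();
    }
    return NULL;
}

/* Create a stage thread under SCHED_FIFO at a fixed priority. All
 * stages share one priority level, so the kernel never time-slices
 * them; execution order follows the explicit yields. */
int spawn_stage(pthread_t *t)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 50 }; /* assumed level */

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    return pthread_create(t, &attr, stage_thread, NULL);
}
```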
Controller On-Board Cache
- Typically, I/O controllers have an on-board cache, used to exploit temporal locality (recently-accessed data blocks), to read ahead for spatial locality (prefetch adjacent data blocks), and to coalesce small writes (e.g. partial-stripe updates with RAID-5/6)
- Many design decisions are needed, and RAID affects the cache implementation, for both performance and failures (degraded RAID operation)

On-Board Cache Design Decisions
- Placement of the cache: near the host interface, or near the storage devices
- Mapping function & associativity
- Replacement policy
- Handling of writes: write-back vs. write-through; write-allocate vs. write no-allocate
- Handling of partial hits/misses
- Concurrency/contention: many in-flight requests; dependencies between pending accesses (e.g. a read/write for a block currently being written back); hit-under-miss; mapping conflicts; contention for individual blocks
- Cache access involves several steps (DMA and I/O issue/completion)

A specific cache implementation
- Block-level cache (4KB blocks), placed "near" the host interface: the cache is accessed right after the ISSUE context
- Direct-mapped, write-back + write-allocate
- Supports partial hits/misses (for multi-block I/Os)
- Locking at the granularity of individual blocks, to avoid "stalls" upon block misses

I/O Stack in DARC - "DAta pRotection Controller"
[Diagram: the host I/O stack: user-level applications, system calls, the virtual file system (VFS), the file system and buffer cache (or raw I/O), block-level device drivers, the SCSI layer, and finally the storage controller.]

MS Windows Host S/W Stack
- ScsiPort: half-duplex; StorPort: full-duplex
- Direct manipulation of SCSI CDBs

Half-Duplex: ScsiPort
[Diagram]

Full-Duplex: StorPort
[Diagram]