Presentation

DARC: Design and Evaluation of an I/O
Controller for Data Protection
M. Fountoulakis, M. Marazakis, M. Flouris, and A. Bilas

{mfundul,maraz,flouris,bilas}@ics.forth.gr
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH)
Ever increasing demand for storage capacity
6X growth
2006: 161 Exabytes
2010: 988 Exabytes
¼ newly created, ¾ replicas
70% created by individuals
95% unstructured
[ source: IDC report on “The Expanding Digital Universe”, 2007 ]
SYSTOR 2010 - DARC
Motivation
With increased capacity comes increased probability for
unrecoverable read errors




URE probability ~ 10^-15 for FC/SAS drives (10^-14 for SATA)
“Silent” errors, i.e. exposed only when data are consumed by
applications – much later than write
Dealing with silent data errors on storage devices becomes critical as
more data are stored on-line, on low-cost disks
Accumulation of data copies (verbatim or minor edits)


Increased probability for human errors
Device-level & controller-level defenses in enterprise storage



Disks with EDC/ECC for stored data (520-byte sectors, background
data-scrubbing)
Storage controllers for continuous data protection (CDP)
What about mainstream systems?


example: mid-scale direct-attached storage servers
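To make the URE figures on this slide concrete, the sketch below estimates the chance of hitting at least one unrecoverable read error while reading a given amount of data. It assumes the quoted rates are per bit read and that errors are independent; both are simplifying assumptions, and the function name is illustrative only.

```python
import math

def p_at_least_one_ure(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error.

    Assumes independent errors at a fixed per-bit rate (a simplification).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, evaluated with log1p/expm1 to avoid rounding to zero
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12  # decimal terabyte, as used on drive labels

if __name__ == "__main__":
    for label, rate in (("FC/SAS", 1e-15), ("SATA", 1e-14)):
        for tb in (1, 10, 100):
            p = p_at_least_one_ure(tb * TB, rate)
            print(f"{label:7s} {tb:4d} TB read: P(>=1 URE) = {p:.3f}")
```

Under these assumptions, reading 1 TB from a SATA-class drive already carries a probability of several percent of at least one URE, which is why capacity growth makes detection in the I/O path more pressing.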
Our Approach: Data Protection in the Controller
(1) Use persistent checksums for error detection


If an error is detected, use the second copy of the mirror for recovery
(2) Use versioning for dealing with human errors


After failure, revert to previous version
Perform both techniques transparently to



(a) Devices: can use any type of (low-cost) devices
(b) File-system and host OS (only a “thin” driver is needed)
Potential for high-rate I/O



Make use of specialized data-path & hardware resources
Perform (some) computations on data while they are in transit
Offloading work from Host CPUs, making use of specialized
data-path in the controller

Technical Challenges: Error Detection
Compute EDC, per data block, on the common I/O path
Maintain persistent EDC per data block
Minimize impact of EDC retrieval
Minimize impact of EDC calculation & comparison
Large amounts of state/control information need to be
computed, stored, and updated in-line with I/O
processing
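As an illustration of per-block EDC on the common I/O path, the sketch below computes a CRC32-C checksum per 4KB block on write and re-checks it on read. It is a software model only: in DARC the checksum is computed by the DMA engine during transfers, and the in-memory dictionary here merely stands in for persistent EDC storage. The class and method names are hypothetical.

```python
BLOCK_SIZE = 4096

def _make_crc32c_table():
    poly = 0x82F63B78  # reflected form of the CRC32-C (Castagnoli) polynomial
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_crc32c_table()

def crc32c(data: bytes) -> int:
    """Table-driven CRC32-C over a byte string."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

class EdcStore:
    """Hypothetical per-block EDC table (stand-in for persistent metadata)."""

    def __init__(self):
        self._edc = {}  # block number -> stored checksum

    def on_write(self, block_no: int, data: bytes) -> None:
        self._edc[block_no] = crc32c(data)

    def on_read(self, block_no: int, data: bytes) -> bool:
        """Return True if the data matches the checksum stored at write time."""
        return self._edc.get(block_no) == crc32c(data)
```

Every read needs the stored checksum before it can complete, which is the "EDC retrieval" cost the slide asks to minimize.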





Technical Challenges: Versioning
Versioning of storage volumes


timeline of volume snapshots
Which blocks belong to each version of a volume?



Maintain persistent data structures that grow with the capacity
of the original volumes
Updated upon each write, accessed for each read as well
Need to sustain high I/O rates for versioned volumes,
keeping a timeline of written blocks & purging blocks
from discarded versions
… while verifying the integrity of the accessed data blocks


Outline
Motivation & Challenges
Controller Design






Host-Controller Communication
Buffer Management
Context & Transfer Scheduling
Storage Virtualization Services
Evaluation
Conclusions


Host-Controller Communication
Options for transfer of commands


PIO vs DMA
PIO: simple, but with high CPU overhead
DMA: high throughput, but completion detection is
complicated



Options: Polling, Interrupts
I/O commands [ transferred via Host-initiated PIO ]



SCSI command descriptor block + DMA segments
DMA segments reference host-side memory addresses
I/O completions [transferred via Controller-initiated DMA ]


Status code + reference to originally issued I/O command
Controller memory use
Use of memory in the controller:




Pages to hold data to be read from storage devices
Pages to hold data being written out by the Host
I/O command descriptors & status information
Overhead of memory mgmt is critical for I/O path



State-tracking “scratch-space” needed per I/O command
Arbitrary sizes may appear in DMA segments


Not matching block-level I/O size & alignment restrictions
Dynamic arbitrary-size allocations using Linux APIs are expensive
at high I/O rates
Buffer Management

Buffer pools
Pre-allocated, fixed-size
2 classes: 64KB for application data, 4KB for control information
Trade-off between space-efficiency and latency
O(1) allocation/de-allocation overhead
Lazy de-allocation
De-allocate when: idle, or under extreme memory pressure
Command & completion FIFO queues
Host-Controller communication
Statically allocated
Fixed size elements
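The buffer-pool idea on this slide can be sketched as a free list of pre-allocated, fixed-size buffers. This is a minimal model, not the controller's implementation: the class name, the `trim` method, and the pool sizes are assumptions chosen for illustration.

```python
class BufferPool:
    """Pre-allocated pool of fixed-size buffers with O(1) alloc/free."""

    def __init__(self, buf_size: int, count: int):
        self.buf_size = buf_size
        # All buffers are allocated up front, off the I/O path.
        self._free = [bytearray(buf_size) for _ in range(count)]
        self.in_use = 0

    def alloc(self):
        """Pop a buffer in O(1); return None if the pool is exhausted."""
        if not self._free:
            return None  # caller is expected to yield and retry
        self.in_use += 1
        return self._free.pop()

    def free(self, buf) -> None:
        """Lazy de-allocation: the buffer goes back to the pool, not the OS."""
        self.in_use -= 1
        self._free.append(buf)

    def available(self) -> int:
        return len(self._free)

    def trim(self, keep: int) -> int:
        """Release surplus free buffers (when idle or under memory pressure)."""
        released = max(0, len(self._free) - keep)
        del self._free[keep:]
        return released

# Two size classes, as on the slide (the counts are arbitrary here).
data_pool = BufferPool(64 * 1024, 8)   # application data
ctrl_pool = BufferPool(4 * 1024, 32)   # control information
```

Returning None on exhaustion, rather than blocking, fits the explicit-yield scheduling style described later in the deck.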
Context Scheduling

Identify I/O path stages




Map stages to threads
Don’t use FSMs: difficult to extend in complex designs
Each stage serves several I/O requests at a time
Explicit thread scheduling

Yield when waiting





Overlap transfers with computation
I/O commands and completions in-flight while device transfers
are being initiated
Avoid starvation/blocking of either side!
No processing in IRQ context
Default fair scheduler vs static FIFO scheduler

Yield behavior
I/O Path – WRITE (no cache, CRC)
[Figure: WRITE path through the controller. Diagram labels: From Host; ISSUE work-queue; NEW-WRITE work-queue; OLD-WRITE work-queue; submit_bio(); [ CRC compute ]; ADMA channel; Check for DMA completion; SAS/SCSI controller; IRQ; I/O Completion (soft-IRQ handler); WRITE-COMPLETION work-queue; [ CRC store ]; To Host]
I/O Path – READ (no cache, CRC)
[Figure: READ path through the controller. Diagram labels: From Host; ISSUE work-queue; NEW-READ work-queue; submit_bio(); SAS/SCSI controller; IRQ; I/O Completion (soft-IRQ handler); OLD-READ work-queue; [ CRC compute ]; ADMA channel; Check for DMA completion; READ-COMPLETION work-queue; [ CRC lookup & check ]; To Host]
Storage Virtualization Services

DARC uses the Violin block-driver framework for volume
virtualization & versioning




M. Flouris and A. Bilas – Proc. MSST, 2005
Volume management: RAID-10
EDC checking (32-bit CRC32-C checksum per 4KB)
Versioning


Timeline of snapshots of storage volumes
Persistent data-structures, accessed & updated in-line with each
I/O access:



logical-to-physical block map
live-block map
block-version map
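To illustrate how the three maps listed above could cooperate, the sketch below models re-map-on-write versioning in memory. The slides do not describe Violin's actual on-disk layout or map semantics, so the roles assigned to each map here are assumptions, and keeping a full copy of the map per snapshot is a simplification made for brevity.

```python
class VersionedVolume:
    """Toy re-map-on-write volume with a timeline of snapshots."""

    def __init__(self):
        self.current = 0          # version currently being written
        self.l2p = {}             # logical-to-physical block map (current view)
        self.block_version = {}   # block-version map: physical -> writing version
        self.live = set()         # live-block map: physical blocks still referenced
        self.snapshots = {}       # captured version -> frozen logical-to-physical map
        self.store = {}           # physical block -> data (stands in for the disks)
        self._next_phys = 0

    def write(self, logical: int, data: bytes) -> None:
        phys = self.l2p.get(logical)
        # Blocks written by an older (captured) version are never overwritten.
        if phys is None or self.block_version[phys] != self.current:
            phys = self._next_phys
            self._next_phys += 1
            self.l2p[logical] = phys
            self.block_version[phys] = self.current
            self.live.add(phys)
        self.store[phys] = data

    def read(self, logical: int, version=None):
        mapping = self.l2p if version is None else self.snapshots[version]
        phys = mapping.get(logical)
        return None if phys is None else self.store[phys]

    def capture(self) -> int:
        """Freeze the current view as a snapshot and start a new version."""
        self.snapshots[self.current] = dict(self.l2p)
        self.current += 1
        return self.current - 1

    def purge(self, version: int) -> None:
        """Discard a snapshot and reclaim blocks no version references."""
        del self.snapshots[version]
        referenced = set(self.l2p.values())
        for mapping in self.snapshots.values():
            referenced.update(mapping.values())
        for phys in self.live - referenced:
            del self.store[phys]
            del self.block_version[phys]
        self.live &= referenced
```

Even in this toy form, every write consults and possibly updates the maps, and every read consults them, which is the in-line metadata cost the slide points out.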
Storage Virtualization Layers in DARC Controller
[Figure: layer stack, top to bottom: Host-Controller Communication & I/O Command Processing; Versioning; RAID-0; two RAID-1 pairs; one EDC layer per disk; /dev/sda, /dev/sdb, /dev/sdc, /dev/sdd]
Block-level metadata issues

Performance




Every read & write request requires metadata lookup
Metadata I/Os are small-sized, random, and synchronous
Can we just store the metadata in memory ?
Memory footprint

For translation tables: 64-bit address per 4KB block → 2 GBytes per
TByte of disk-space



Too large to fit in memory!
Solution: metadata cache
Persistence




Metadata are critical: losing metadata results in data loss!
Writes induce metadata updates to be written to disk
Only safe way to be persistent is synchronous writes → too slow!
Solutions: journaling, versioning, use of NVRAM, …
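The memory-footprint figure on this slide follows from simple arithmetic, sketched below. The function name and default parameters are illustrative; binary units (1 TByte = 2^40 bytes) are assumed.

```python
def translation_table_bytes(capacity_bytes: int,
                            block_size: int = 4096,
                            entry_bytes: int = 8) -> int:
    """Size of a flat table holding one 64-bit address per block."""
    return (capacity_bytes // block_size) * entry_bytes

TIB = 2**40
GIB = 2**30

if __name__ == "__main__":
    size = translation_table_bytes(TIB)
    print(f"{size / GIB:.0f} GBytes of translation table per TByte of disk")
```

With 1 GB of controller DRAM on the evaluation platform, a flat table for even a single TByte does not fit, hence the metadata cache.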
What about controller on-board caching ?

Typically, I/O controllers have an on-board data cache:




Exploit temporal locality (recently-accessed data blocks)
Read-ahead for spatial locality (prefetch adjacent data blocks)
Coalescing small writes (e.g. partial-stripe updates with RAID-5/6)
Many intertwined design decisions needed …
RAID levels affect cache implementation:
Performance
Failures (degraded RAID operation)
DARC has a simple block-cache, but it is not enabled in the
evaluation experiments reported in this paper.
All available memory is used for buffers to hold in-progress
I/O commands, their associated data _and_ metadata for the
data protection functionality.
I/O Path Design & Implementation
Outline


Motivation & Challenges
Controller Design
Host-Controller Communication
Buffer Management
Context & Transfer Scheduling
Storage Virtualization Services
Evaluation
IOP348 embedded platform
Micro-measurements & Synthetic I/O patterns
Application Benchmarks
Conclusions
Experimental Platform

Intel 81348-based development kit
2 XScale CPU cores - DRAM: 1GB
Linux 2.6.24 + Intel patches (isc81xx driver)
8 SAS HDDs
Seagate Cheetah 15.5k (15k RPM, 72GB)
Host: MS Windows 2003 Server (32-bit)
Tyan S5397, DRAM: 4 GB
Comparison with ARC-1680 SAS controller
Same hardware platform as our dev. kit
I/O Stack in DARC - “DAta pRotection Controller”
Intel IOP348 Data Path
SRAM
(128 KB)
• DMA engines
• Special-purpose
data-path
• Messaging Unit
Intel IOP348
[ Linux 2.6.24 kernel (32-bit) + Intel IOP patches (isc81xx driver) ]
“Raw” DMA Throughput
[Figure: DMA throughput (MB/sec, axis 800 to 1800) vs transfer size (4, 8, 16, 32, 64 KB). Series: host-to-HBA, HBA-to-host]
Streaming I/O Throughput
RAID-0, IOmeter RS pattern
[ 8 SAS HDDs ]
[Figure: RS IOmeter pattern, throughput (MB/sec, axis 0 to 1050) vs queue-depth (1, 2, 4, 8, 16, 32, 64). Series: DARC; ARC-1680; DARC (LARGE-SG); DARC, DFLT ALLOC. Annotation: “Throughput collapse!”]
IOmeter results: RAID-10, OLTP pattern
[Figure: OLTP (4KB) IOmeter pattern, IOPS (axis 0 to 2000) at queue-depth 1, 4, 16, 64. Series: ARC-1680, DARC]
IOmeter results: RAID-10, FS pattern
[Figure: FS IOmeter pattern, IOPS (axis 0 to 2000) at queue-depth 1, 4, 16, 64. Series: ARC-1680, DARC]
TPC-H (RAID-10, 10-query sequence)
[Figure: TPC-H execution time (seconds, axis 0 to 2000) per configuration: ARC-1680; DARC, NO-EDC; DARC, EDC; DARC, EDC, VERSION. Annotations: +12%, +2.5%]
JetStress (RAID-10, 1000 mboxes, 1.0 IOPS per mbox)
[Figure: JetStress results (IOPS, axis 0 to 1600) for Log Volume, Data Volume, Data Volume (WRITE), Data Volume (READ). Series: ARC-1680, write-through; ARC-1680, write-back; DARC, NO-EDC; DARC, EDC; DARC, EDC, VERSION]
Conclusions

Incorporation of data protection features in a commodity
I/O controller




integrity protection using persistent checksums
versioning of storage volumes
Several challenges in implementing an efficient I/O path
between the host machine & the controller
Based on a prototype implementation, using real hardware:


Overhead of EDC checking: 12 - 20%
Depending on # concurrent I/Os
Overhead of versioning: 2.5 - 5%
With periodic (frequent) capture & purge
Depending on number and size of writes
Lessons learned from prototyping effort

CPU overhead at controller is an important limitation




At high I/O rates
We expect CPU to issue/manage more operations on data in
the future
Offload on every opportunity
Essential to be aware of data-path intricacies


To achieve high I/O rates
Overlap transfers efficiently



To/from host
To/from storage devices
Emerging need for handling persistent metadata


Along the common I/O path, with increasing complexity
Increased consumption of storage controller resources
Thank you for your attention!
Questions?
“DARC: Design and Evaluation of an I/O Controller for
Data Protection”
Manolis Marazakis, maraz@ics.forth.gr
http://www.ics.forth.gr/carv
Silent Error Recovery using RAID-1 and CRCs
Recovery Protocol Costs
Case                                              Data I/Os  CRC I/Os  CRC calc’s  Outcome
RAID-1 pair data differ, CRC matches one block    3          0         2           Data recovery, re-issue I/O
RAID-1 pair data identical, CRC does not match    2          1         2           CRC recovery
RAID-1 pair data differ, CRC does not match       2          0         2           Data error, Alert Host
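The three cases in the table can be read as a decision procedure over the two mirror copies and the stored checksum. The sketch below is one possible rendering, not the controller's code: the function name and outcome strings are made up, and `zlib.crc32` is used only as a stand-in so the example is self-contained (DARC uses CRC32-C, which is a different polynomial).

```python
import zlib
from typing import Callable, Optional, Tuple

def check_mirrored_read(copy_a: bytes, copy_b: bytes, stored_crc: int,
                        crc: Callable[[bytes], int] = zlib.crc32
                        ) -> Tuple[str, Optional[bytes]]:
    """Classify a read of both RAID-1 copies against the stored checksum."""
    crc_a, crc_b = crc(copy_a), crc(copy_b)  # the two CRC calculations
    if copy_a == copy_b:
        if crc_a == stored_crc:
            return "ok", copy_a
        # Data agree but the checksum does not: rewrite the stored CRC.
        return "crc-recovery", copy_a
    if crc_a == stored_crc:
        # Copy B is bad: rewrite it from copy A, then re-issue the I/O.
        return "data-recovery", copy_a
    if crc_b == stored_crc:
        return "data-recovery", copy_b
    # Copies differ and neither matches: nothing trustworthy remains.
    return "data-error", None
```

The extra write in the data-recovery case accounts for the third data I/O in the table, and the checksum rewrite in the CRC-recovery case accounts for its single CRC I/O.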
Selection of Memory Regions

Non-cacheable, no write-combining for
controller’s hardware-resources (control registers)
controller outbound PIO to host memory
Non-cacheable + write-combining for
DMA descriptors
Completion FIFO
Intel SCSI driver command allocations
Cacheable + write-combining
CRCs: allocated along with other data to be processed (explicit cache management)
Command FIFO (explicit cache management)
[Figure: Command FIFO and Completion FIFO between host and controller across PCI Express. Diagram labels: PIO, DMA]
[Figure: controller threads along the Issue Path and the Completion Path. Diagram labels: Command FIFO (dequeue SCSI commands); Issue Thread; SCSI-to-block Translation; Writes DMA Thread; CRC generation; Block IO Writes Thread; Block IO Reads Thread; Storage Services; Interrupt Context (schedule completion processing); Read DMA Thread; Integrity Check; Read Completion Thread; Write Completion Thread; Complete I/O; Completion FIFO (enqueue I/O completions, DMA)]
Prototype Design Summary
Challenge | Design Decision
Host-Controller I/F | PIO for commands/completions, DMA for data
Buffer management | Pre-allocated buffer pools, lazy de-allocation, fixed-size ring buffers (command/completion FIFOs)
Context scheduling | Map stages to work-queues (threads), explicit scheduling, no processing in IRQ-context
On-board Cache | [ Optional ] for data-blocks, “closest” to host
Data Protection | Violin framework within the Linux kernel: RAID-10 volumes, versioning (based on re-map), persistent metadata - including EDC. CRC32-C checksums, computed per-4KB by DMA engine during transfers, persistently stored (within dedicated metadata space)
Impact of PIO on DMA Throughput
[Figure: impact of host-issued PIO (ON vs OFF) on DMA throughput (MB/sec, axis 0 to 2200) for 8KB DMA transfers. Cases: 2-way, to-host, from-host]
IOP348 Micro-benchmarks
IOP348 clock cycle: 0.833 nsec (1.2 GHz)
Operation                                                           Time        Cycles
Interrupt delay, CTX SW                                             837 nsec    1004.8
Memory store                                                        99 nsec     118.8
Local-bus store                                                     30 nsec     36
Outbound store (PIO write, to host)                                 114 nsec    136.8
Outbound load (PIO read, from host)                                 674 nsec    809.1
Outbound load with DMA transfers                                    3390 nsec   4069.6
Outbound load with DMA transfers and inbound PIO-Writes from host   5970 nsec   7166.8
Host clock cycle: 0.5 nsec (2.0 GHz)
Host-initiated PIO write: 100 nsec (200 cycles)
Impact of Linux Scheduling Policy
[ with PIO completions ]
[Figure: RS IOmeter pattern, throughput (MB/sec, axis 0 to 1800) vs queue-depth (1, 2, 4, 8, 16, 32, 64). Series: ARC-1680; DMA (to-host); DARC (FAIR-SCHED); DARC (FIFO-SCHED)]
I/O Workloads

IOmeter patterns:
RS, WS: 64KB sequential read/write stream
OLTP (4KB): random 4KB I/O (33% writes)
FS: file-server (random, misc. sizes, 20% writes)
80% 4KB, 2% 8KB, 4% 16KB, 4% 32KB, 10% 64KB
WEB: web-server (random, misc. sizes, 100% reads)
68% 4KB, 15% 8KB, 2% 16KB, 6% 32KB, 7% 64KB, 1% 128KB, 1% 512KB
Database workload:
TPC-H (4GB dataset, 10 queries)
Mail server workload:
JetStress (1000 100MB mailboxes, 1.0 IOPS/mbox)
25% insert, 10% delete, 50% replace, 15% read
Co-operating Contexts (simplified)
[Figure: co-operating contexts. Diagram labels: ISSUE (SCSI command pickup, SCSI control commands, SCSI completions); Pre-allocated Buffer Pools + Lazy Deallocation (Data for Writes, Data for Reads); DMA from host; DMA to host; BIO (block-level I/O issue); END_IO (SCSI completion to Host)]
Application DMA Channel (ADMA)


Device interface: chain of transfer descriptors
Transfer descriptor := (SRC, DST, byte-count, control-bits)




SRC, DST: physical addresses, at host or controller
Chain of descriptors is held in controller memory
… and may be expanded at run-time
Completion detection:



ADMA channels report (1) running/idle state, and (2) address of the
descriptor for the currently-executing (or last) transfer
Ring-buffer of transfer descriptor IDs: (Transfer Descriptor Address, Epoch)
Reserve/release out-of-order, as DMA transfers complete
•DMA_Descriptor_ID post_DMA_transfer(Host Address,
Controller Address, Direction of Transfer, Size of Transfer,
CRC32C Address)
•Boolean is_DMA_transfer_finished(DMA Descriptor Identifier)
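The ring of descriptor IDs and the two calls listed above might be modeled as follows. This is a software-only sketch under stated assumptions: the slot layout, the `hw_complete` hook (which stands in for the ADMA channel reporting progress), and the `release` method are hypothetical, and the CRC32C address argument is omitted for brevity.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional

class DescriptorId(NamedTuple):
    slot: int    # stands in for the transfer descriptor address
    epoch: int   # distinguishes successive uses of the same slot

@dataclass
class _Slot:
    epoch: int = 0
    busy: bool = False
    done: bool = False
    desc: Optional[tuple] = None  # (src, dst, byte_count)

class DescriptorRing:
    def __init__(self, size: int):
        self.slots = [_Slot() for _ in range(size)]

    def post_dma_transfer(self, host_addr: int, ctrl_addr: int,
                          to_host: bool, size: int) -> Optional[DescriptorId]:
        """Reserve a free slot and describe the transfer; None if ring is full."""
        src, dst = (ctrl_addr, host_addr) if to_host else (host_addr, ctrl_addr)
        for i, slot in enumerate(self.slots):
            if not slot.busy:
                slot.busy, slot.done, slot.desc = True, False, (src, dst, size)
                return DescriptorId(i, slot.epoch)
        return None

    def hw_complete(self, slot: int) -> None:
        """Model of the hardware finishing the transfer held in this slot."""
        self.slots[slot].done = True

    def is_dma_transfer_finished(self, did: DescriptorId) -> bool:
        slot = self.slots[did.slot]
        # A changed epoch means the slot was released and possibly reused,
        # so the transfer named by this ID must already have finished.
        return slot.epoch != did.epoch or slot.done

    def release(self, did: DescriptorId) -> None:
        """Free a slot; may be called in any order as transfers complete."""
        slot = self.slots[did.slot]
        if slot.busy and slot.epoch == did.epoch:
            slot.epoch += 1
            slot.busy = False
```

The epoch is what lets a stale ID be answered correctly after its slot has been recycled for a newer transfer.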
Command FIFO: Using DMA
[Figure: command FIFO shared between Host and Controller across the PCIe interconnect, each side holding head and tail pointers. Legend: valid queue element; element to enqueue; valid element to dequeue. Labels: New-head, New-tail, DMA]
Controller initiates DMA
-Needs to know tail at Host
-Host needs to know head at Controller
Command FIFO: Using PIO
[Figure: command FIFO shared between Host and Controller across the PCIe interconnect, each side holding head and tail pointers. Legend: valid queue element; element already enqueued. Labels: New-tail, PIO, pointer updates]
Host executes PIO-Writes
-Needs to know head at Controller
-Controller needs to know tail at Host
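The head/tail bookkeeping for the PIO variant can be sketched as a single-producer, single-consumer ring. This is a single-process model in which plain assignments stand in for PIO writes across PCIe; the class and method names are hypothetical.

```python
class CommandFifo:
    """Fixed-size ring: the host advances tail, the controller advances head."""

    def __init__(self, size: int):
        self.size = size
        self.slots = [None] * size  # statically allocated, fixed-size elements
        self.head = 0               # next element the controller will dequeue
        self.tail = 0               # next free slot the host will fill

    def host_enqueue(self, cmd) -> bool:
        """Host side: needs the controller's head to detect a full queue."""
        nxt = (self.tail + 1) % self.size
        if nxt == self.head:
            return False            # full (one slot is kept empty)
        self.slots[self.tail] = cmd  # models the PIO write of the element
        self.tail = nxt              # models the PIO write of the new tail
        return True

    def controller_dequeue(self):
        """Controller side: needs the host's tail to detect an empty queue."""
        if self.head == self.tail:
            return None
        cmd = self.slots[self.head]
        self.head = (self.head + 1) % self.size
        return cmd
```

Because each pointer has exactly one writer, neither side needs a lock; each only needs a sufficiently recent copy of the other's pointer.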
Completion FIFO



PIO is expensive for controller CPU
We use DMA for Completion FIFO queue
Completion transfers can be piggy-backed on data transfers

For reads
Command & Completion FIFO Implementation

IOP348 ATU-MU provides circular queues




4 byte elements
Up to 128KB
Significant management overheads
Instead, we implemented FIFOs entirely in software

Memory-mapped across PCIe

For DMA and PIO direct access
Context Scheduling

Multiple in-flight I/O commands at any one time
I/O command processing actually proceeds in discrete stages, with
several events/notifications being triggered at each
Option-I: Event-driven
Design (and tune) dedicated FSM
Many events during I/O processing
Eg: DMA transfer start/completion, disk I/O start/completion, …
Option-II: Thread-based
Encapsulate I/O processing stages in threads, schedule threads
We have used Thread-based, using full Linux OS
Programmable, infrastructure in-place to build advanced functionality
more easily
… but more s/w layers, with less control over timing of
events/interactions
Scheduling Policy

Threads (work-queues) instead of FSMs
Simpler to develop/re-factor code & debug
Can block independently from one another
Default Linux scheduler (SCHED_OTHER) is not optimal
Threads need to be explicitly pre-empted when polling on a resource
Events are grouped within threads
Custom scheduling, based on SCHED_FIFO policy
Static priorities, no time-slicing (run-until-complete/yield)
All threads at same priority level (strict FIFO), no dynamic thread creation
Thread order precisely follows the I/O path
Crucial to understand the exact sequence of events
With explicit yield() when polling, or when "enough" work has been
done - always yield() when a resource is unavailable
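The run-until-yield behavior described above can be imitated with Python generators, where each generator plays one stage thread and a FIFO run queue plays the scheduler. This is a user-space analogy, not kernel code: the stage names, the batch limit, and the helper functions are assumptions.

```python
from collections import deque

def stage(name, inbox, outbox, log, batch=2):
    """One I/O-path stage: process up to `batch` items, then yield."""
    while True:
        done = 0
        while inbox and done < batch:
            item = inbox.popleft()
            log.append((name, item))
            if outbox is not None:
                outbox.append(item)
            done += 1
        # Explicit yield: the input is empty or "enough" work has been done.
        yield

def run_fifo(threads, rounds):
    """Strict FIFO, no time-slicing: each thread runs until it yields."""
    runq = deque(threads)
    for _ in range(rounds * len(threads)):
        thread = runq.popleft()
        next(thread)
        runq.append(thread)

if __name__ == "__main__":
    q_issue, q_dma, q_done = deque([1, 2, 3]), deque(), deque()
    log = []
    # Thread order follows the I/O path.
    threads = [stage("ISSUE", q_issue, q_dma, log),
               stage("DMA", q_dma, q_done, log),
               stage("COMPLETE", q_done, None, log)]
    run_fifo(threads, rounds=2)
    for entry in log:
        print(entry)
```

Because the run queue is ordered like the I/O path, work handed to the next stage is picked up as soon as the current stage yields, which is the property the slide attributes to the custom SCHED_FIFO setup.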
Controller On-Board Cache

Typically, I/O controllers have an on-board cache:





Exploit temporal locality (recently-accessed data blocks)
Read-ahead for spatial locality (prefetch adjacent data blocks)
Coalescing small writes (e.g. partial-stripe updates with RAID-5/6)
Many design decisions needed
RAID affects cache implementation


Performance
Failures (degraded RAID operation)
On-board Cache Design Decisions

Placement of the cache
Near the host interface, near the storage devices
Mapping function & associativity
Replacement policy
Handling of writes


Write-back, write-through
Write-allocate, write no-allocate
Handling of partial hits/misses
Concurrency / Contention




Many in-flight requests
Dependencies between pending accesses


Hit-under-miss, mapping conflicts
Contention for individual blocks

Cache access involves several steps
(DMA and I/O issue/completion)
E.g: Read/Write for a block currently being written-back
A specific cache implementation


Block-level cache (4KB blocks)
Placed “near” the host interface



The cache is accessed right after the ISSUE context
Direct-mapped, write-back + write-allocate
Supports partial hits/misses (for multi-block I/Os)

Locking at the granularity of individual blocks

Avoid “stall” upon block misses
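A direct-mapped, write-back, write-allocate block cache of the kind listed above can be sketched in a few lines. The model below handles whole 4KB blocks only and leaves out per-block locking and partial hits for multi-block I/Os; the class name and the dictionary used as backing store are illustrative.

```python
BLOCK_SIZE = 4096

class DirectMappedCache:
    def __init__(self, num_lines: int, backing: dict):
        self.lines = [None] * num_lines  # each line: [block_no, data, dirty]
        self.backing = backing           # block_no -> data (stands in for disks)
        self.writebacks = 0

    def _index(self, block_no: int) -> int:
        return block_no % len(self.lines)  # direct-mapped: one candidate line

    def _evict(self, idx: int) -> None:
        line = self.lines[idx]
        if line is not None and line[2]:
            self.backing[line[0]] = line[1]  # write-back happens on eviction
            self.writebacks += 1
        self.lines[idx] = None

    def read(self, block_no: int) -> bytes:
        idx = self._index(block_no)
        line = self.lines[idx]
        if line is not None and line[0] == block_no:
            return line[1]  # hit
        self._evict(idx)
        data = self.backing.get(block_no, bytes(BLOCK_SIZE))
        self.lines[idx] = [block_no, data, False]
        return data

    def write(self, block_no: int, data: bytes) -> None:
        idx = self._index(block_no)
        line = self.lines[idx]
        if line is None or line[0] != block_no:
            # Write-allocate: claim the line; a full-block write needs no fetch.
            self._evict(idx)
        self.lines[idx] = [block_no, data, True]
```

In the real controller each of these steps spans DMA and disk I/O, so a block may be mid-write-back when the next request for it arrives; that is where per-block locking comes in.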
I/O Stack in DARC - “DAta pRotection Controller”
User-Level Applications
System Calls
Virtual File System (VFS)
File System
Buffer Cache
Raw I/O
Block-level Device Drivers
SCSI Layer
Storage Controller
MS Windows Host S/W Stack
•ScsiPort: half-duplex
•StorPort: full-duplex
Direct manipulation of SCSI CDBs
Half-Duplex: ScsiPort
Full-duplex: StorPort