
Storage Aggregation for
Performance & Availability:
The Path from Physical RAID to Virtual Objects
Garth Gibson
Co-Founder & CTO, Panasas Inc.
Assoc. Professor, Carnegie Mellon University
November 24, 2004
Changing Computational Architecture
Monolithic Supercomputers: specialized, but expensive
  Price/performance: often > $100M/TFLOPS
Linux Clusters: powerful, scalable, affordable
  Price/performance: often < $1M/TFLOPS
Clusters dominating the Top500 supercomputers (Source: Top500.org)
  1998: 2 clusters
  2002: 94 clusters
  2004: 294 clusters
Matching to Storage Architecture
Traditional Computing: monolithic computers with monolithic storage over a single data path
  Issues: complex scaling, limited bandwidth, I/O bottleneck, inflexible, expensive
  Scale to a bigger box?
Cluster Computing: Linux compute cluster with parallel data paths -- to what storage?
  Scale: file & total bandwidth, file & total capacity, load & capacity balancing
  But at lower $/Gbps
Next Generation Cluster Storage
ActiveScale Storage Cluster: Linux compute cluster with parallel data paths to Object Storage Devices and a control path to Metadata Managers
Scalable performance
  Offloaded data path enables direct disk-to-client access
  Scale clients, network and capacity
  As capacity grows, performance grows
  Single step: perform the job directly from the high-I/O Panasas Storage Cluster
Simplified and dynamic management
  Robust, shared file access by many clients
  Seamless growth within a single namespace eliminates time-consuming admin tasks
Integrated HW/SW solution
  Optimizes performance and manageability
  Ease of integration and support
Redundant Arrays of
Inexpensive Disks
(RAID)
November 24, 2004
Birth of RAID (1986-1991)
Member of 4th Berkeley RISC CPU design team (SPUR: 84-89)
Dave Patterson decides CPU design is a “solved” problem
Sends me to figure out how storage plays in SYSTEM PERFORMANCE
IBM 3380 disk is 4 arms in a 7.5 GB washing machine box
SLED: Single Large Expensive Disk
New PC industry demands cost-effective 100 MB 3.5" disks
  Enabled by new embedded-controller SCSI architecture
Use many PC disks for parallelism
  SIGMOD '88: "A Case for Redundant Arrays of Inexpensive Disks (RAID)"
PS: $10-20 per MB (~1000X now); 100 MB/arm (~1000X now); 20-30 IO/sec/arm (5X now)
But RAID is really about Availability
Arrays have more Hard Disk Assemblies (HDAs) -- more failures
Apply replication and/or error/erasure detection codes
Mirroring wastes 50% of space; RAID 5 wastes only 1/N (one parity disk in N)
Mirroring halves, and RAID 5 quarters, small-write bandwidth
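A minimal sketch (Python, with illustrative names) of why RAID 5 quarters small-write bandwidth: each small write becomes four disk I/Os (read old data, read old parity, write new data, write new parity), versus two writes for mirroring.

# Illustrative RAID 5 small write (read-modify-write), assuming a simple
# dict-of-disks model; real arrays do this in the controller or driver.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(disks, data_disk, parity_disk, block, new_data):
    old_data = disks[data_disk][block]            # I/O 1: read old data
    old_parity = disks[parity_disk][block]        # I/O 2: read old parity
    new_parity = xor(xor(old_parity, old_data), new_data)
    disks[data_disk][block] = new_data            # I/O 3: write new data
    disks[parity_disk][block] = new_parity        # I/O 4: write new parity

# Four I/Os per user write vs. two writes for mirroring: that is the quartering.
disks = {d: {0: bytes(4)} for d in range(4)}      # 3 data disks + 1 parity, zeroed
raid5_small_write(disks, data_disk=0, parity_disk=3, block=0, new_data=b"\x0f\x00\x00\x00")
assert disks[3][0] == b"\x0f\x00\x00\x00"         # parity now reflects the new data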
Off to CMU & More Availability
Parity Declustering “spreads RAID groups” to reduce MTTR
Each parity block protects fewer data disk blocks than the full array width (C)
Virtualizing the RAID group lessens recovery work
Faster recovery, better user response time during recovery, or a mixture of both
RAID over X?
  X = independent fault domains; "disk" is the easiest "X"
  Parity declustering is my first step in RAID virtualization
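A minimal sketch of the idea, assuming a pseudo-random declustered placement (real declustered layouts use balanced block designs); the function names are illustrative, not from the original work.

# Stripes of width G (G-1 data + 1 parity) are spread over C > G disks, so after
# a failure each surviving disk shares only part of the rebuild work.
import random

def declustered_layout(num_stripes, C, G, seed=0):
    """For each stripe, pick G of the C disks (pseudo-random declustering)."""
    rng = random.Random(seed)
    return [rng.sample(range(C), G) for _ in range(num_stripes)]

def rebuild_load(layout, failed_disk):
    """Count the stripes each surviving disk must read to rebuild the failed one."""
    load = {}
    for disks in layout:
        if failed_disk in disks:
            for d in disks:
                if d != failed_disk:
                    load[d] = load.get(d, 0) + 1
    return load

layout = declustered_layout(num_stripes=6000, C=10, G=4)
print(rebuild_load(layout, failed_disk=0))
# Each survivor reads roughly (G-1)/(C-1) of the failed disk's stripes (~33% here)
# instead of all of them as in a conventional G-disk RAID group, so rebuild
# finishes sooner or leaves more bandwidth for user requests during recovery.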
Network-Attached Secure Disks
(NASD, 95-99)
November 24, 2004
Storage Interconnect Evolution
Outboard circuitry increases over time (VLSI density)
Hardware (#hosts, #disks, #paths) sharing increases over time
Logical (information) sharing limited by host SW
1995: Fibre Channel packetizes SCSI over a near-general network
Storage as First Class Network Component
Direct transfer between client and storage
Exploit scalable switched cluster area networking
Split file service into: primitives (in drive) and policies (in manager)
NASD Architecture
Before NASD there was store-and-forward Server-Attached Disks (SAD)
Move access control, consistency and cache decisions out-of-band
Raise the storage abstraction: encapsulate layout, offload data access
Metadata Performance
Command processing of most operations in storage could offload 90% of a small-file/productivity workload from servers
Key in-band attribute updates: size, timestamps, etc.
NFS operation profile (count of operations in the top 2% by work, in thousands; server cycles in billions):

Operation     Count (K)   SAD Cycles (B)  % of SAD   NetSCSI Cycles (B)  % of SAD   NASD Cycles (B)  % of SAD
Attr Read       792.7         26.4          11.8           26.4            11.8           0.0           0.0
Attr Write       10.0          0.6           0.3            0.6             0.3           0.6           0.3
Data Read       803.2         70.4          31.6           26.8            12.0           0.0           0.0
Data Write      228.4         43.2          19.4            7.6             3.4           0.0           0.0
Dir Read       1577.2         79.1          35.5           79.1            35.5           0.0           0.0
Dir RW           28.7          2.3           1.0            2.3             1.0           2.3           1.0
Delete Write      7.0          0.9           0.4            0.9             0.4           0.9           0.4
Open             95.2          0.0           0.0            0.0             0.0          12.2           5.5
Total          3542.4        223.1         100            143.9            64.5          16.1           7.2

Columns compare the File Server (SAD), DMA (NetSCSI) and Object (NASD) architectures.
Fine Grain Access Enforcement
State of the art is a VPN of all out-of-band clients plus all sharable data and metadata
  Accident prone & vulnerable to a subverted client; analogous to single-address-space computing
Object Storage instead uses a digitally signed, object-specific capability on each request:
  1: Client requests access from the file manager (private communication)
  2: File manager returns CapArgs, CapKey
       CapArgs = ObjID, Version, Rights, Expiry, ...
       CapKey = MAC_SecretKey(CapArgs), with SecretKey shared by the file manager and the NASD
  3: Client sends CapArgs, Req, NonceIn, ReqMAC to the NASD
       ReqMAC = MAC_CapKey(Req, NonceIn)
  4: NASD returns Reply, NonceOut, ReplyMAC (NASD integrity/privacy)
       ReplyMAC = MAC_CapKey(Reply, NonceOut)
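A minimal sketch of this capability scheme in Python, assuming HMAC-SHA-256 as the MAC and an illustrative wire format; the real NASD/OSD security protocol is defined by its specifications, not by this code.

import hmac, hashlib, os, time

SECRET_KEY = os.urandom(32)          # shared by the file manager and the drive

def mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

# File manager: issue a capability for one object with specific rights.
def issue_capability(obj_id, version, rights, expiry):
    cap_args = f"{obj_id}|{version}|{rights}|{expiry}".encode()
    cap_key = mac(SECRET_KEY, cap_args)            # CapKey = MAC_SecretKey(CapArgs)
    return cap_args, cap_key

# Client: sign each request with the CapKey it was given (never sees SecretKey).
def sign_request(cap_key, req, nonce):
    return mac(cap_key, req + nonce)               # ReqMAC = MAC_CapKey(Req, NonceIn)

# Drive: recompute CapKey from CapArgs, verify the MAC, then check rights/expiry
# (rights checking is simplified here).
def drive_check(cap_args, req, nonce, req_mac):
    cap_key = mac(SECRET_KEY, cap_args)
    if not hmac.compare_digest(req_mac, mac(cap_key, req + nonce)):
        return False
    obj_id, version, rights, expiry = cap_args.decode().split("|")
    return time.time() < float(expiry) and "read" in rights

cap_args, cap_key = issue_capability(obj_id=42, version=1, rights="read", expiry=time.time() + 60)
nonce = os.urandom(8)
req = b"READ obj=42 off=0 len=4096"
assert drive_check(cap_args, req, nonce, sign_request(cap_key, req, nonce))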
Scalable File System Taxonomy
November 24, 2004
Today’s Ubiquitous NFS
ADVANTAGES
  Familiar, stable & reliable
  Widely supported by vendors
  Competitive market
DISADVANTAGES
  Capacity doesn't scale
  Bandwidth doesn't scale
  Clustered only by customer-exposed namespace partitioning
(Diagram: clients on a host net; file servers with disk arrays on a storage net, each exporting its own sub-file system)
Scale Out w/ Forwarding Servers
Bind many file servers into a single system image with forwarding (e.g., Tricord, Spinnaker)
Mount-point binding is less relevant; allows DNS-style balancing; more manageable
Control and data traverse the mount-point path (in band), passing through two servers
Single-file and single-file-system bandwidth is limited by the back-end server & storage
(Diagram: clients on a host net; a file server cluster forwards requests to back-end servers with disk arrays on a storage net)
Scale Out FS w/ Out-of-Band
Client sees many storage addresses and accesses them in parallel
Zero file servers in the data path allows high bandwidth through scalable networking
E.g.: IBM SanFS, EMC HighRoad, SGI CXFS, Panasas, Lustre, etc.
Mostly built on block-based SANs, where servers must trust all clients
(Diagram: clients access storage directly; file servers sit out of band)
Object Storage Standards
November 24, 2004
Object Storage Architecture
An evolutionary improvement to standard SCSI storage interface (OSD)
Offload most data path work from server to intelligent storage
Finer granularity of security: protect & manage one file at a time
Raises level of abstraction: Object is container for “related” data
Storage understands how different blocks of a “file” are related -> self-management
Per-object extensible attributes are a key expansion of functionality
                 Block-Based Disk            Object-Based Disk
Operations:      Read block, Write block     Create, Delete, Read, Write object
Addressing:      Block range                 [object, byte range]
Allocation:      External                    Internal
Security:        At volume level             At object level
Source: Intel
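A minimal sketch of an object-based disk interface in Python; the class and method names are illustrative, not the T10 OSD command set. The device allocates space internally and keeps per-object extensible attributes alongside the data.

class ObjectStorageDevice:
    def __init__(self):
        self._objects = {}            # obj_id -> bytearray of object data
        self._attrs = {}              # obj_id -> dict of extensible attributes

    def create_object(self, obj_id, **attrs):
        self._objects[obj_id] = bytearray()
        self._attrs[obj_id] = dict(attrs)

    def delete_object(self, obj_id):
        del self._objects[obj_id], self._attrs[obj_id]

    def write_object(self, obj_id, offset, data):
        buf = self._objects[obj_id]
        if len(buf) < offset + len(data):
            buf.extend(b"\0" * (offset + len(data) - len(buf)))   # internal allocation
        buf[offset:offset + len(data)] = data
        self._attrs[obj_id]["size"] = len(buf)                    # in-band attribute update

    def read_object(self, obj_id, offset, length):
        return bytes(self._objects[obj_id][offset:offset + length])

osd = ObjectStorageDevice()
osd.create_object(42, owner="garth")
osd.write_object(42, 0, b"hello")
assert osd.read_object(42, 0, 5) == b"hello"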
OSD is now an ANSI Standard
Timeline, 1995-2005: CMU NASD -> NSIC NASD -> SNIA/T10 OSD; Lustre and Panasas products feeding an emerging OSD market
INCITS ratified T10’s OSD v1.0 SCSI command set standard, ANSI will publish
Co-chaired by IBM and Seagate, protocol is a general framework (transport independent)
Sub-committee leadership includes IBM, Seagate, Panasas, HP, Veritas, ENDL
Product plans from HP/Lustre & Panasas; research projects at IBM, Seagate
www.snia.org/tech_activities/workgroups/osd & www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf
ActiveScale Storage Cluster
November 24, 2004
Object Storage Systems
Expect a wide variety of Object Storage Devices:
  Disk array subsystem (e.g., LLNL with Lustre)
  "Smart" disk for objects: 2 SATA disks (240/500 GB); up to 5 TB per shelf; 16-port GE switch blade with 4 Gbps per shelf to the cluster; orchestrates system activity and balances objects across OSDs
  Prototype Seagate OSD: highly integrated, single disk
Scalable Storage Cluster Architecture
Lesson of compute clusters: Scale out commodity components
Blade server approach provides
  High volumetric density, disk array abstraction
  Incremental growth, pay-as-you-grow model
  Needs single-system-image SW architecture
Building blocks: StorageBlade (2 SATA spindles) -> shelf of blades (5 TB, 4 Gbps) -> single system image (55 TB, 44 Gbps per rack)
Virtual Objects are Scalable
Scale capacity, bandwidth, reliability by striping according to small map
File comprised of: user data, attributes, layout
Scalable object map:
  1. Purple OSD & object
  2. Gold OSD & object
  3. Red OSD & object
  Plus stripe size and RAID level
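A minimal sketch of resolving a file byte offset through such a map; the stripe unit, OSD names and RAID-0-only layout are illustrative assumptions, not the Panasas on-wire format.

# Map a file offset to the component object and offset that hold it.
STRIPE_UNIT = 64 * 1024                       # stripe size from the map
OBJECT_MAP = [("purple-osd", 0x101),          # (OSD address, component object id)
              ("gold-osd",   0x102),
              ("red-osd",    0x103)]

def resolve(file_offset):
    """Return (osd, object_id, object_offset) holding this byte of the file."""
    stripe_unit_index = file_offset // STRIPE_UNIT
    component = stripe_unit_index % len(OBJECT_MAP)   # which component object
    row = stripe_unit_index // len(OBJECT_MAP)        # full stripes before this unit
    object_offset = row * STRIPE_UNIT + file_offset % STRIPE_UNIT
    osd, obj_id = OBJECT_MAP[component]
    return osd, obj_id, object_offset

print(resolve(200_000))    # -> ('purple-osd', 257, 68928)

Growing the file just appends stripe units round-robin across the listed objects; adding capacity, bandwidth or redundancy only changes this small map, not the data path.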
Object Storage Bandwidth
Scalable Bandwidth demonstrated with GE switching
(Chart: aggregate bandwidth in GB/sec [0-12] vs. number of Object Storage Devices [0-350]; lab results)
ActiveScale SW Architecture
(Architecture diagram)
  Management: realm & performance managers + web management server, manager DB, NTP + DHCP server, and per-blade management agents
  DirectFLOW client: POSIX app -> VFS -> DirectFLOW module with client RAID and buffer cache, speaking OSD/iSCSI over TCP/IP
  Protocol servers: NFS and CIFS gateways for UNIX POSIX and Windows NT apps, layered on the same DirectFLOW path
  Metadata managers: virtual sub-managers (file, quota and storage managers) reached over DirectFLOW RPC
  StorageBlades: DirectFLOW file system (DFLOW fs) with RAID 0, zero-copy cache and NVRAM, on Linux with a local buffer cache, serving OSD/iSCSI over TCP/IP
Fault Tolerance
Overall up/down state of blades
  A subset of managers tracks overall state with heartbeats
  They maintain identical state with quorum/consensus
Per-file RAID: no parity for unused capacity
  RAID level per file; small files mirror, large files use RAID 5
  First step toward policy-based quality of storage associated with data
Client-based RAID: do the XOR where all the data already sits in memory
  Traditional RAID stripes hold data of multiple files & metadata
  Per-file RAID covers only the data of one file
  Client-computed RAID risks only data the client could trash anyway
  Client memory is the most efficient place to compute the XOR
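A minimal sketch of the client-side XOR, assuming equal-sized stripe units already in client memory; it illustrates the idea, not the DirectFLOW implementation.

# Per-file, client-computed RAID: XOR the stripe units the client already holds
# to produce the parity unit it writes alongside them.
def xor_blocks(blocks):
    """XOR equal-length blocks into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# The stripe contains only this one file's data, so a buggy or subverted client
# can damage nothing it could not already overwrite.
units = [b"aaaa", b"bbbb", b"cccc"]          # stripe units of one file
parity = xor_blocks(units)
# Losing one unit: rebuild it from the survivors plus parity, as in RAID 5.
assert xor_blocks([units[0], units[2], parity]) == units[1]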
Manageable Storage Clusters
Snapshots: consistency for copying, backing up
Copy-on-write duplication of contents of objects
Named as “…/.snapshot/JulianSnapTimestamp/filename”
Snaps can be scheduled, auto-deleted
Soft volumes: growth management without physical constraints
  Volumes can be quota-bounded, unbounded, or just send email on a threshold
  Multiple volumes can share the space of a set of shelves (the double-disk-failure domain)
Capacity and load balancing: seamless use of a growing set of blades
  All blades track capacity & load; a manager aggregates & ages the utilization metrics
  Unbalanced systems influence allocation and can trigger moves
  Adding a blade simply makes the system unbalanced for a while
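A minimal sketch of that aggregation and placement logic; the aging constant and scoring are illustrative assumptions, not the ActiveScale policy.

# Every blade reports capacity and load; a manager ages the samples and steers
# new allocations toward the least-utilized blades.
AGING = 0.8    # exponential decay applied to old load samples

class BladeStats:
    def __init__(self, name):
        self.name, self.capacity_used, self.load = name, 0.0, 0.0

    def report(self, capacity_used, load):
        # Age the previous load sample so bursts fade over time.
        self.capacity_used = capacity_used
        self.load = AGING * self.load + (1 - AGING) * load

def choose_blades(blades, count):
    """Pick the `count` blades with the lowest combined utilization score."""
    score = lambda b: 0.5 * b.capacity_used + 0.5 * b.load
    return sorted(blades, key=score)[:count]

blades = [BladeStats(f"sb{i}") for i in range(4)]
for b, (cap, load) in zip(blades, [(0.9, 0.7), (0.4, 0.2), (0.1, 0.0), (0.6, 0.9)]):
    b.report(cap, load)
# A newly added (empty, idle) blade naturally attracts new objects until balanced.
print([b.name for b in choose_blades(blades, 2)])    # ['sb2', 'sb1']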
Out-of-band & Clustered NAS
Performance & Scalability for All
Objects: breakthrough data throughput AND random I/O
Source: SPEC.org & Panasas
ActiveScale In Practice
November 24, 2004
Panasas Solution Getting Traction
Wins in HPC labs, seismic processing, biotech & rendering
“We are extremely pleased with the order of magnitude performance gains achieved by the Panasas system…with the Panasas system, we were able to get everything we needed and more.”
  -- Tony Katz, Manager, IT, TGS Imaging

“The system is blazing fast, we’ve been able to eliminate our I/O bottleneck so researchers can analyze data more quickly. The product is ‘plug-and-play’ at all levels.”
  -- Dr. Terry Gaasterland, Associate Professor, Gaasterland Laboratory of Computational Genomics

“We looked everywhere for a solution that could deliver exceptional per-shelf performance. Finally we found a system that wouldn’t choke on our bandwidth requirements.”
  -- Mark Smith, President, MoveDigital

(Other wins: a top seismic processing company, a leading animation / entertainment company)
Panasas in Action: LANL
Los Alamos National Lab: seeking a balanced system
(Charts: computing speed [TFLOP/s], memory [TBs], memory BW [TB/sec], parallel I/O [GB/sec] and disk [TBs] for machines from '96 to 2006. With NFS as the cluster FS, parallel I/O lags the rest: poor application throughput, too little BW. With a scalable cluster FS, I/O keeps pace: balanced application throughput.)
Los Alamos Lightning*
1400 nodes and 60TB (120 TB): Ability to deliver ~ 3 GB/s* (~6 GB/s)
(Diagram: 12 Panasas shelves connected through a switch to the 1400-node Lightning cluster)
* entering production
Pink: A Non-GE Cluster
Non-GE Cluster Interconnects for high bandwidth, low latency
LANL Pink’s 1024 nodes use Myrinet; others use Infiniband or Quadrics
Route storage traffic (iSCSI) through cluster interconnect
Via IO routers (1 per 16 nodes in Pink)
Lowers GE NIC & wire costs; needs less bisection BW in GE switches (possibly no GE switches)
Linux load balancing, OSPF & Equal Cost Multi-Path for route load balancing and failover
Integrate IO node into multi-protocol switch port
E.g. Topspin, Voltaire, Myricom GE line cards head in this direction
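A minimal sketch of spreading storage traffic from compute nodes over a pool of IO routers with a simple failover, purely to illustrate the idea; Pink actually relies on Linux load balancing, OSPF and Equal-Cost Multi-Path rather than anything like this code.

# One IO router per 16 compute nodes; fall back to the next live router.
NODES_PER_ROUTER = 16

def pick_router(node_id, routers_up):
    """Prefer the node's home router; fall back to the next live one."""
    num_routers = len(routers_up)
    home = (node_id // NODES_PER_ROUTER) % num_routers
    for step in range(num_routers):
        candidate = (home + step) % num_routers
        if routers_up[candidate]:
            return candidate
    raise RuntimeError("no IO routers available")

routers_up = [True] * 64
routers_up[3] = False                                    # router 3 is down
print(pick_router(node_id=48, routers_up=routers_up))    # home router 3 down -> 4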
(Diagram: Pink's compute nodes 0-1023 on Myrinet/GM, IO routers 0-63 bridging to GE toward the storage)
Parallel NFS Possible Future
November 24, 2004
Out-of-Band Interoperability Issues
ADVANTAGES
  Capacity scales
  Bandwidth scales
DISADVANTAGES
  Requires a client kernel addition
  Many non-interoperable solutions
  Not necessarily able to replace NFS
EXAMPLE FEATURES
  POSIX plus & minus
  Global mount point
  Fault-tolerant cache coherence
  RAID 0, 1, 5 & snapshots
  Distributed metadata; online growth and upgrade
(Diagram: clients run a Vendor X kernel patch/RPM to reach Vendor X file servers and storage)
File Systems Standards: Parallel NFS
IETF NFSv4 initiative: U. Michigan, NetApp, Sun, EMC, IBM, Panasas, ...
Enable parallel transfer in NFS
NFSv4 extended with orthogonal "disk" metadata attributes; layouts can describe
  1. SBC (blocks)
  2. OSD (objects)
  3. NFS (files)
IETF pNFS documents:
  draft-gibson-pnfs-problem-statement-01.txt
  draft-gibson-pnfs-reqs-00.txt
  draft-welch-pnfs-ops-00.txt
(Diagram: client apps use a pNFS IFS with a disk driver; the pNFS server grants & revokes "disk" metadata over its local file system)
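A minimal sketch of the data flow the pNFS drafts describe: fetch a layout from the metadata server, do the I/O directly and in parallel to the storage the layout names, then return the layout. The class and method names are illustrative, not NFSv4.1 protocol operations.

from concurrent.futures import ThreadPoolExecutor

class PnfsClient:
    """Illustrative pNFS-style client: layouts come from the metadata server,
    data moves directly between the client and the storage devices."""
    def __init__(self, metadata_server, devices):
        self.mds = metadata_server
        self.devices = devices                     # device id -> storage endpoint

    def read_file(self, name):
        layout = self.mds.layout_get(name)         # blocks, objects, or files layout
        with ThreadPoolExecutor() as pool:         # parallel transfer; MDS not in data path
            parts = list(pool.map(
                lambda seg: self.devices[seg["dev"]].read(seg["off"], seg["len"]),
                layout["segments"]))
        self.mds.layout_return(name, layout)       # the server can also recall a layout
        return b"".join(parts)

# Toy stand-ins so the sketch runs; a real deployment would speak NFSv4/OSD/SBC.
class ToyDevice:
    def __init__(self, data): self.data = data
    def read(self, off, length): return self.data[off:off + length]

class ToyMDS:
    def layout_get(self, name):
        return {"segments": [{"dev": "osd0", "off": 0, "len": 5},
                             {"dev": "osd1", "off": 0, "len": 6}]}
    def layout_return(self, name, layout): pass

client = PnfsClient(ToyMDS(), {"osd0": ToyDevice(b"hello"), "osd1": ToyDevice(b" world")})
print(client.read_file("/demo"))                   # b'hello world'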
Cluster Storage for
Scalable Linux Clusters
Garth Gibson
[email protected]
www.panasas.com
November 24, 2004
BACKUP
November 24, 2004
BladeServer Storage Cluster
(Diagram: shelf front holds 1 DirectorBlade and 10 StorageBlades; shelf rear holds an integrated GE switch and a battery module with 2 power units; a midplane routes GE and power)