Storage Aggregation for Performance & Availability: The Path from Physical RAID to Virtual Objects
Garth Gibson
Co-Founder & CTO, Panasas Inc.; Assoc. Professor, Carnegie Mellon University
Commodity Clusters 2004, November 24, 2004

Changing Computational Architecture
- Monolithic supercomputers: specialized but expensive; price/performance often > $100M/TFLOPS.
- Linux clusters: powerful, scalable, affordable; price/performance often < $1M/TFLOPS.
- Clusters dominating the Top500 supercomputers: 1998: 2; 2002: 94; 2004: 294. (Source: Top500.org)

Matching to Storage Architecture
- Traditional computing: monolithic computers with a single data path to monolithic storage. Issues: complex scaling, limited bandwidth, I/O bottleneck, inflexible, expensive. Scale to a bigger box?
- Cluster computing: a Linux compute cluster wants parallel data paths to its storage. Scale: file & total bandwidth, file & total capacity, load & capacity balancing, but at a lower $/Gbps.

Next Generation Cluster Storage: ActiveScale Storage Cluster
- Scalable performance: an offloaded data path enables direct disk-to-client access; scale clients, network and capacity; as capacity grows, performance grows; single step: perform the job directly from the high-I/O Panasas Storage Cluster.
- Simplified and dynamic management: robust, shared file access by many clients; seamless growth within a single namespace eliminates time-consuming admin tasks.
- Integrated HW/SW solution: optimizes performance and manageability; ease of integration and support.
- Architecture: the Linux compute cluster uses parallel data paths to Object Storage Devices and a separate control path to the metadata managers.

Redundant Arrays of Inexpensive Disks (RAID)

Birth of RAID (1986-1991)
- Member of the 4th Berkeley RISC CPU design team (SPUR, 1984-89); Dave Patterson decides CPU design is a "solved" problem and sends me to figure out how storage plays in SYSTEM PERFORMANCE.
- The IBM 3380 disk is 4 arms in a 7.5 GB washing-machine-sized box: the SLED, Single Large Expensive Disk.
- The new PC industry demands cost-effective 100 MB 3.5" disks, enabled by the new SCSI embedded-controller architecture.
- Use many PC disks for parallelism: SIGMOD 1988, "A Case for RAID."
- P.S. At the time: $10-20 per MB (~1000X now), 100 MB per arm (~1000X now), 20-30 IO/sec per arm (5X now).

But RAID is Really About Availability
- Arrays have more Hard Disk Assemblies (HDAs), hence more failures.
- Apply replication and/or error/erasure detection codes.
- Mirroring wastes 50% of the space, while RAID wastes only 1/N.
- Mirroring halves small-write bandwidth, and RAID 5 quarters it: each small write becomes a read of the old data, a read of the old parity, a write of the new data, and a write of the new parity (see the code sketch below).

Off to CMU & More Availability
- Parity declustering "spreads RAID groups" to reduce MTTR: each parity disk block protects fewer than all of the array's data disk blocks.
- Virtualizing the RAID group lessens recovery work: faster recovery, better user response time during recovery, or a mixture of both.
- RAID over X, where X = independent fault domains; "disk" is the easiest X.
- Parity declustering was my first step in RAID virtualization.

Network-Attached Secure Disks (NASD, 1995-99)

Storage Interconnect Evolution
- Outboard circuitry increases over time (VLSI density).
- Hardware sharing (#hosts, #disks, #paths) increases over time.
- Logical (information) sharing is limited by host software.
- 1995: Fibre Channel packetizes SCSI over a near-general network.
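The RAID 5 small-write penalty noted on the availability slide above comes from the read-modify-write parity update. Below is a minimal sketch of that update in Python, with in-memory lists standing in for disks; the helper names and toy layout are illustrative, not any particular array's code.

```python
# Minimal sketch of the RAID 5 small-write parity update (read-modify-write).
# Each "disk" is a plain list of blocks; this only illustrates the 2 reads +
# 2 writes per small write, not a real array implementation.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(disks, data_disk, parity_disk, block_no, new_data):
    """Update one data block and its parity block: 2 reads + 2 writes."""
    old_data = disks[data_disk][block_no]          # read old data
    old_parity = disks[parity_disk][block_no]      # read old parity
    # new parity = old parity XOR old data XOR new data
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    disks[data_disk][block_no] = new_data          # write new data
    disks[parity_disk][block_no] = new_parity      # write new parity

if __name__ == "__main__":
    # Toy stripe: 4 data "disks" plus 1 parity "disk", one block each.
    blocks = [bytes([i]) * 4 for i in range(4)]
    parity = blocks[0]
    for b in blocks[1:]:
        parity = xor_blocks(parity, b)
    disks = [[b] for b in blocks] + [[parity]]

    small_write(disks, data_disk=2, parity_disk=4, block_no=0, new_data=b"\xff" * 4)

    # Parity still equals the XOR of all data blocks after the update.
    check = disks[0][0]
    for d in disks[1:4]:
        check = xor_blocks(check, d[0])
    assert check == disks[4][0]
```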
Storage as a First-Class Network Component
- Direct transfer between client and storage.
- Exploit scalable switched cluster-area networking.
- Split file service into primitives (in the drive) and policies (in the manager).

NASD Architecture
- Before NASD there were store-and-forward Server-Attached Disks (SAD).
- Move access control, consistency and cache decisions out of band.
- Raise the storage abstraction: encapsulate layout, offload data access.

Metadata Performance
- Command processing of most operations in storage could offload 90% of a small-file/productivity workload from servers.
- Key in-band attribute updates: size, timestamps, etc.

  NFS operations in top 2% by work   Count (K)   File Server (SAD)       DMA (NetSCSI)           Object (NASD)
                                                 Cycles (B)  % of SAD    Cycles (B)  % of SAD    Cycles (B)  % of SAD
  Attr Read                            792.7       26.4        11.8        26.4        11.8         0.0        0.0
  Attr Write                            10.0        0.6         0.3         0.6         0.3         0.6        0.3
  Data Read                            803.2       70.4        31.6        26.8        12.0         0.0        0.0
  Data Write                           228.4       43.2        19.4         7.6         3.4         0.0        0.0
  Dir Read                            1577.2       79.1        35.5        79.1        35.5         0.0        0.0
  Dir RW                                28.7        2.3         1.0         2.3         1.0         2.3        1.0
  Delete Write                           7.0        0.9         0.4         0.9         0.4         0.9        0.4
  Open                                  95.2        0.0         0.0         0.0         0.0        12.2        5.5
  Total                               3542.4      223.1       100.0       143.9        64.5        16.1        7.2

Fine-Grain Access Enforcement
- The state of the art is a VPN of all out-of-band clients and all sharable data and metadata: accident-prone and vulnerable to a subverted client, analogous to single-address-space computing.
- NASD integrity/privacy: object storage uses a digitally signed, object-specific capability on each request (see the code sketch below).
- Protocol (the file manager and the NASD share a secret key):
  1: Client requests access from the file manager.
  2: File manager privately returns CapArgs and CapKey to the client.
  3: Client sends CapArgs, Req, NonceIn, ReqMAC to the NASD.
  4: NASD returns Reply, NonceOut, ReplyMAC.
  CapArgs = ObjID, Version, Rights, Expiry, ...
  CapKey = MAC_SecretKey(CapArgs)
  ReqMAC = MAC_CapKey(Req, NonceIn)
  ReplyMAC = MAC_CapKey(Reply, NonceOut)

Scalable File System Taxonomy

Today's Ubiquitous NFS
- Advantages: familiar, stable & reliable; widely supported by vendors; competitive market.
- Disadvantages: capacity doesn't scale; bandwidth doesn't scale; clustering only by customer-exposed namespace partitioning.
- Architecture: clients reach file servers over the host net; the file servers reach disk arrays over a storage net; each server exports a sub-file-system.

Scale Out with Forwarding Servers
- Bind many file servers into a single system image with forwarding (e.g. Tricord, Spinnaker).
- Mount-point binding becomes less relevant, allowing DNS-style balancing; more manageable.
- Control and data traverse the mount-point path (in band), passing through two servers.
- Single-file and single-file-system bandwidth is limited by the back-end server & storage.

Scale Out File Systems with Out-of-Band Data
- The client sees many storage addresses and accesses them in parallel.
- Zero file servers in the data path allows high bandwidth through scalable networking.
- E.g. IBM SanFS, EMC HighRoad, SGI CXFS, Panasas, Lustre, etc.
- Mostly built on block-based SANs where the servers must trust all clients.
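Before moving on to the OSD standard, here is a minimal sketch of the NASD capability scheme above, assuming HMAC-SHA256 as the MAC and a simple string encoding of CapArgs; the real NASD/OSD encodings differ, so this only illustrates the trust structure.

```python
# Sketch of NASD-style capabilities: the file manager signs CapArgs with the
# drive's secret key to mint CapKey; the client MACs each request with CapKey,
# and the drive can recompute both without contacting the manager.
# HMAC-SHA256 and the string form of CapArgs are assumptions for this sketch.
import hmac, hashlib

def mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

SECRET_KEY = b"shared-by-manager-and-drive"        # hypothetical long-term secret

# File manager: mint a capability for one object.
cap_args = b"ObjID=0x2a;Version=7;Rights=READ;Expiry=2004-12-31"
cap_key = mac(SECRET_KEY, cap_args)                # CapKey = MAC_SecretKey(CapArgs)

# Client: sign a request with CapKey (the client never sees SECRET_KEY).
req, nonce_in = b"READ object 0x2a bytes 0-65535", b"nonce-001"
req_mac = mac(cap_key, req + nonce_in)             # ReqMAC = MAC_CapKey(Req, NonceIn)

# Drive: recompute CapKey from CapArgs and verify the request MAC.
drive_cap_key = mac(SECRET_KEY, cap_args)
assert hmac.compare_digest(req_mac, mac(drive_cap_key, req + nonce_in))

# Drive: MAC the reply so the client can check the response's integrity.
reply, nonce_out = b"65536 bytes of object data", b"nonce-002"
reply_mac = mac(drive_cap_key, reply + nonce_out)  # ReplyMAC = MAC_CapKey(Reply, NonceOut)
assert hmac.compare_digest(reply_mac, mac(cap_key, reply + nonce_out))
```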
Object Storage Standards

Object Storage Architecture
- An evolutionary improvement to the standard SCSI storage interface (OSD).
- Offload most data-path work from the server to intelligent storage.
- Finer granularity of security: protect & manage one file at a time.
- Raises the level of abstraction: an object is a container for "related" data, and storage understands how the different blocks of a "file" are related -> self-management.
- Per-object extensible attributes are a key expansion of functionality.
- Block-based disk: operations are read block and write block; addressing is by block range; allocation is external; security is at the volume level.
- Object-based disk: operations are create, delete, read and write object; addressing is by [object, byte range]; allocation is internal; security is at the object level. (Source: Intel)

OSD is Now an ANSI Standard
- Timeline 1995-2005: CMU NASD -> NSIC NASD -> SNIA/T10 OSD, with Lustre and Panasas leading into an OSD market.
- INCITS ratified T10's OSD v1.0 SCSI command set standard; ANSI will publish it.
- Co-chaired by IBM and Seagate; the protocol is a general framework (transport independent).
- Sub-committee leadership includes IBM, Seagate, Panasas, HP, Veritas, ENDL.
- Product plans from HP/Lustre & Panasas; research projects at IBM and Seagate.
- www.snia.org/tech_activities/workgroups/osd & www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf

ActiveScale Storage Cluster

Object Storage Systems
- Expect a wide variety of Object Storage Devices:
  - Disk array subsystem, e.g. LLNL with Lustre.
  - "Smart" disk for objects: 2 SATA disks, 240/500 GB.
  - Prototype Seagate OSD: highly integrated, single disk.
- Shelf integration: a 16-port GE switch blade; a blade that orchestrates system activity and balances objects across OSDs; stores up to 5 TB per shelf; 4 Gbps per shelf to the cluster.

Scalable Storage Cluster Architecture
- Lesson of compute clusters: scale out commodity components.
- A blade-server approach provides high volumetric density with a disk-array abstraction, and incremental, pay-as-you-grow growth.
- Needs a single-system-image software architecture.
- StorageBlade (2 SATA spindles) -> shelf of blades (5 TB, 4 Gbps) -> single system image (55 TB, 44 Gbps per rack).

Virtual Objects are Scalable
- Scale capacity, bandwidth and reliability by striping according to a small map.
- A file comprises user data, attributes, and a layout.
- Scalable object map: 1. purple OSD & object; 2. gold OSD & object; 3. red OSD & object; plus the stripe size and RAID level (see the code sketch below).

Object Storage Bandwidth
- Scalable bandwidth demonstrated with GE switching.
- [Chart, lab results: aggregate bandwidth in GB/sec (up to ~12) versus number of Object Storage Devices (up to ~350).]

ActiveScale SW Architecture
- [Diagram] DirectFLOW clients: a POSIX app above the Linux local buffer cache, with the DirectFLOW module (RAID, buffer cache) over OSD/iSCSI on TCP/IP.
- NFS (UNIX POSIX apps) and CIFS (Windows NT apps) protocol servers provide access for non-DirectFLOW clients.
- Metadata managers: realm & performance managers, web management server, NTP + DHCP server, management agent and manager DB; file, quota and storage managers run as DirectFLOW virtual sub-managers over a DFLOW fs with RAID 0, zero-copy cache and NVRAM.
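A minimal sketch of the per-file map from the "Virtual Objects are Scalable" slide, assuming plain RAID 0 striping and illustrative field names rather than the actual Panasas layout format: it shows how a client can resolve any file offset to an (OSD, component object, object offset) triple from the small map alone, without asking a server.

```python
# Sketch: resolve a file offset against a per-file object map (RAID 0 striping).
# The map fields mirror the slide (component OSDs/objects, stripe size, RAID
# level); the names and layout are illustrative, not Panasas's wire format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectMap:
    components: List[Tuple[str, int]]   # (OSD address, component object id)
    stripe_unit: int                    # bytes per stripe unit
    raid_level: int                     # 0 here; per-file 1 or 5 elsewhere

def locate(m: ObjectMap, file_offset: int) -> Tuple[str, int, int]:
    """Map a file byte offset to (OSD, component object id, object offset)."""
    unit = file_offset // m.stripe_unit          # which stripe unit overall
    idx = unit % len(m.components)               # which component object
    stripe = unit // len(m.components)           # which full stripe
    obj_offset = stripe * m.stripe_unit + file_offset % m.stripe_unit
    osd, obj_id = m.components[idx]
    return osd, obj_id, obj_offset

if __name__ == "__main__":
    m = ObjectMap(components=[("osd-purple", 101), ("osd-gold", 202), ("osd-red", 303)],
                  stripe_unit=64 * 1024, raid_level=0)
    # Byte 200,000 falls in stripe unit 3: back on the first OSD, second stripe.
    print(locate(m, 200_000))   # ('osd-purple', 101, 68928)
```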
VFS Fault Tolerance
- Overall up/down state of blades: a subset of the managers tracks overall state with heartbeats and maintains identical state with quorum/consensus.
- Per-file RAID: no parity for unused capacity; the RAID level is set per file (small files mirror, RAID 5 for large files); a first step toward a policy-driven quality of storage associated with the data.
- Client-based RAID: do the XOR where all the data already sits in memory. Traditional RAID stripes hold data of multiple files plus metadata, but per-file RAID covers only the data of one file, so client-computed RAID risks only data the client could trash anyway, and client memory is the most efficient place to compute the XOR.

Manageable Storage Clusters
- Snapshots: consistency for copying and backing up. Copy-on-write duplication of the contents of objects; named as ".../.snapshot/JulianSnapTimestamp/filename"; snaps can be scheduled and auto-deleted.
- Soft volumes: grow management without physical constraints. Volumes can be quota-bounded, unbounded, or just send email on a threshold; multiple volumes can share the space of a set of shelves (the double-disk-failure domain).
- Capacity and load balancing: seamless use of a growing set of blades. All blades track capacity & load; a manager aggregates & ages the utilization metrics; unbalanced systems influence allocation and can trigger moves; adding a blade simply makes the system unbalanced for a while (see the placement sketch after the LANL slides below).

Out-of-Band & Clustered NAS
- [Figure]

Performance & Scalability for All
- Objects: breakthrough data throughput AND random I/O. (Source: SPEC.org & Panasas)

ActiveScale in Practice

Panasas Solution Getting Traction
- Wins in HPC labs, seismic processing, biotech & rendering, including a top seismic processing company and a leading animation/entertainment company.
- "We are extremely pleased with the order of magnitude performance gains achieved by the Panasas system…with the Panasas system, we were able to get everything we needed and more." -- Tony Katz, Manager, IT, TGS Imaging
- "The system is blazing fast; we've been able to eliminate our I/O bottleneck so researchers can analyze data more quickly. The product is 'plug-and-play' at all levels." -- Dr. Terry Gaasterland, Associate Professor, Gaasterland Laboratory of Computational Genomics
- "We looked everywhere for a solution that could deliver exceptional per-shelf performance. Finally we found a system that wouldn't choke on our bandwidth requirements." -- Mark Smith, President, MoveDigital

Panasas in Action: LANL
- Los Alamos National Lab is seeking a balanced system.
- [Charts: computing speed (TFLOP/s), memory (TB), memory bandwidth (TB/sec), parallel I/O (GB/sec) and disk capacity (TB) from 1996 to a projected 2006. With NFS as the cluster file system, parallel I/O lags the rest of the system: poor application throughput, too little bandwidth. With a scalable cluster file system, parallel I/O keeps pace: balanced application throughput.]

Los Alamos Lightning (entering production)
- 1,400 nodes and 60 TB (growing to 120 TB): able to deliver ~3 GB/s (~6 GB/s).
- Panasas: 12 shelves, attached through a switch to Lightning's 1,400 nodes.
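As referenced on the "Manageable Storage Clusters" slide, here is a minimal sketch of capacity/load-balanced placement: aggregate a per-blade utilization figure and bias new-object placement toward less-utilized OSDs, so a newly added blade is naturally filled first. The utilization metric and weighting rule are assumptions for illustration, not the actual Panasas policy.

```python
# Sketch: bias new-object placement toward less-utilized StorageBlades.
# The utilization metric (a 0..1 blend of capacity and load) and the simple
# free-fraction weighting are illustrative assumptions only.
import random

def choose_osds(utilization: dict, width: int, seed: int = None) -> list:
    """Pick `width` distinct OSDs, favoring lower utilization."""
    rng = random.Random(seed)
    candidates = dict(utilization)
    chosen = []
    for _ in range(min(width, len(candidates))):
        # Weight each OSD by its free fraction (1 - utilization).
        names = list(candidates)
        weights = [1.0 - candidates[n] for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]            # stripe components land on distinct blades
    return chosen

if __name__ == "__main__":
    # A newly added, nearly empty blade ("sb-09") is picked far more often
    # than the nearly full ones, so the system rebalances over time.
    util = {"sb-01": 0.82, "sb-02": 0.79, "sb-03": 0.85, "sb-09": 0.05}
    print(choose_osds(util, width=3, seed=1))
```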
Pink: A Non-GE Cluster
- Non-GE cluster interconnects for high bandwidth and low latency: LANL Pink's 1,024 nodes use Myrinet; others use InfiniBand or Quadrics.
- Route storage traffic (iSCSI) through the cluster interconnect via I/O routers (1 per 16 nodes in Pink).
- Lower GE NIC & wire costs; lower bisection bandwidth needed in the GE switches (possibly no GE switches at all).
- Linux load balancing, OSPF & Equal-Cost Multi-Path for route load balancing and failover.
- Integrate the I/O node into a multi-protocol switch port; e.g. Topspin, Voltaire and Myricom GE line cards head in this direction.
- [Diagram: Pink's compute nodes 0-1023 on GM (Myrinet) reach 64 I/O routers, which bridge to GE.]

Parallel NFS Possible Future

Out-of-Band Interoperability Issues
- Advantages: capacity scales; bandwidth scales.
- Disadvantages: requires a client kernel addition; many non-interoperable solutions; not necessarily able to replace NFS.
- Example features: POSIX plus & minus; global mount point; fault-tolerant cache coherence; RAID 0, 1, 5 & snapshots; distributed metadata with online growth and upgrade.
- Architecture: clients run a vendor X kernel patch/RPM and talk to vendor X file servers and storage.

File System Standards: Parallel NFS
- IETF NFSv4 initiative: U. Michigan, NetApp, Sun, EMC, IBM, Panasas, ....
- Enable parallel transfer in NFS: NFSv4 extended with orthogonal "disk" metadata attributes; the pNFS server grants & revokes "disk" metadata over its local file system (see the code sketch at the end).
- pNFS storage access: 1. SBC (blocks), 2. OSD (objects), 3. NFS (files); client apps sit above a pNFS IFS and a disk driver.
- IETF pNFS documents: draft-gibson-pnfs-problem-statement-01.txt, draft-gibson-pnfs-reqs-00.txt, draft-welch-pnfs-ops-00.txt.

Cluster Storage for Scalable Linux Clusters
Garth Gibson, [email protected], www.panasas.com

BACKUP

BladeServer Storage Cluster
- Integrated GE switch; battery module (2 power units).
- Shelf front: 1 DB, 10 SB; shelf rear: DirectorBlade, StorageBlade.
- Midplane routes GE and power.
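Finally, as referenced on the Parallel NFS slide, a minimal sketch of the out-of-band control/data split that pNFS standardizes: the client asks a metadata server for a layout (the "disk" metadata grant), then moves data directly and in parallel to the storage nodes. All class names and the in-memory "storage nodes" are hypothetical illustration, not the draft pNFS protocol.

```python
# Sketch of the pNFS-style split: a metadata server grants a layout, and the
# client then reads data directly from storage in parallel. The names and the
# in-memory "storage nodes" are hypothetical; this is not the draft protocol.
from concurrent.futures import ThreadPoolExecutor

class MetadataServer:
    """Grants layouts: which storage node holds which stripe unit of a file."""
    def __init__(self, layouts):
        self.layouts = layouts                      # filename -> list of (node, key)
    def layout_get(self, filename):
        return self.layouts[filename]

class StorageNode:
    """Stands in for an OSD / data server reachable out of band."""
    def __init__(self, objects):
        self.objects = objects                      # key -> stripe-unit bytes
    def read(self, key):
        return self.objects[key]

def parallel_read(mds, nodes, filename):
    """Fetch all stripe units concurrently, bypassing the metadata server."""
    layout = mds.layout_get(filename)               # control path only
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        parts = list(pool.map(lambda nk: nodes[nk[0]].read(nk[1]), layout))
    return b"".join(parts)                          # data path: client <-> storage

if __name__ == "__main__":
    nodes = {"osd-a": StorageNode({"f.0": b"hello "}),
             "osd-b": StorageNode({"f.1": b"parallel "}),
             "osd-c": StorageNode({"f.2": b"world"})}
    mds = MetadataServer({"bigfile": [("osd-a", "f.0"), ("osd-b", "f.1"), ("osd-c", "f.2")]})
    print(parallel_read(mds, nodes, "bigfile"))     # b'hello parallel world'
```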