MDSS: Massive Data Storage Service at Indiana University
June 6, 2007

MDSS Defined
• MDSS is a service for researchers to store and access huge amounts of data.
• MDSS is based on the High Performance Storage System (HPSS).
• HPSS is supported by a consortium of IBM and several universities and national labs.
• HPSS is a hierarchical storage management (HSM) system.

MDSS Background
• MDSS was established at IU in 1998.
• MDSS has components on the IUPUI and IUB campuses.
• Over 1.8 petabytes of storage is available, and the system can scale to 4.2 petabytes.
• In aggregate, MDSS can handle as much as 2GB/s of data flowing in or out.

Upgrade Equipment
• More and faster tape drives
• More and larger tape volumes
• Larger disk cache
• More and faster servers
• Increased network capacity

New Functionality
• Larger files
• New access protocols
• New classes of service
• Encrypted transfer option
• Greater flexibility for IUPUI users

User Impact
• New hsi, htar, and pftp_client binaries
• New firewall entries for the new servers
• Plain-text authentication no longer supported
• Support for DFS dropped
• New Kerberos keytabs required

IU: Two Sites, One System with Remote Movers
[Diagram: Indianapolis and Bloomington clients reach both campuses over a TCP/IP wide area network; each campus has a WAN gateway, LAN, HPSS movers, FC SAN, disk arrays, and tape libraries, with the HPSS core servers and DB2 metadata disks at the primary site and remote movers at the other campus.]
• Files from selected classes of service are written to disk or tape at the remote facility.
• The TCP/IP-based architecture lets remote devices and movers appear local to the primary site, with no distance limitation.
• Also enables mirrored tape files.

MDSS Design Detailed

MDSS Components
• 15 IBM P5 575 nodes
  • 8GB RAM
  • 2 10Gb Ethernet adapters
• 40 IBM TS1120 tape drives (500GB tapes)
• 12 STK 9940B tape drives (200GB tapes)
• 150TB of disk cache
• 52 tape drives in total

Accessing MDSS
• Data goes to the disk cache first.
• Data is copied to tape as soon as possible.
• A copy is sent to the other campus by default.
• Data is purged from the disk cache as space is needed.
• Data is staged back to the disk cache as needed.

Accessing MDSS
• Fastest methods:
  • hsi (hsi.mdss.iu.edu) - see the usage sketch below
  • gridftp (gridftp.mdss.iu.edu)
  • pftp_client (ftp.mdss.iu.edu)
  • kerberized ftp (ftp.mdss.iu.edu)
• hsi is from Gleicher Enterprises, LLC.
• gridftp is from Argonne National Laboratory.

Accessing MDSS
• Convenient methods:
  • sftp (sftp.mdss.iu.edu)
  • https (www.mdss.iu.edu)
  • Samba (smb.mdss.iu.edu)
  • hpssfs
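The access methods above are command-line clients. The following is a minimal sketch of driving hsi from a Python script via subprocess; it assumes the hsi binary is on the PATH, already configured to reach hsi.mdss.iu.edu, and authenticated with a valid Kerberos ticket. The "put <local> : <hpss>" command form follows common hsi usage and may differ between versions and sites, and all file names are hypothetical.

```python
import subprocess

def mdss_put(local_path: str, hpss_path: str) -> None:
    # Copy a local file into MDSS; hsi's "put <local> : <hpss>" form is
    # assumed here and may vary between hsi versions and sites.
    subprocess.run(["hsi", f"put {local_path} : {hpss_path}"], check=True)

def mdss_get(hpss_path: str, local_path: str) -> None:
    # Stage a file back from MDSS (disk cache or tape) to local disk.
    subprocess.run(["hsi", f"get {local_path} : {hpss_path}"], check=True)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    mdss_put("results.tar", "project/results.tar")
    mdss_get("project/results.tar", "results_copy.tar")
```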
MDSS Performance
• Performance varies from more than 1GB/s down to 2MB/s or less.
• It depends on:
  • network
  • I/O subsystem
  • protocol
  • stripe width

Network Performance
• Client network connection (1Gb, 100Mb, 10Mb)
• Building network connection
• Firewalls

Protocol Performance
• hsi is best - fast, with more options.
• ftp is fast - no encryption.
• sftp, Samba, and http top out at 15-20MB/s because of:
  • encryption
  • a non-native HPSS interface (gateway)

Stripe Performance
• HPSS can stripe transfers across servers and disks.
• 2-, 4-, 8-, and 16-way stripe widths are available.
• Intended for very large files and very fast clients.
• Current candidates: BigRed and the Data Capacitor.

Storing Data
• Minimum file size should be 1MB.
• Prefer data to be 10MB or larger (see the htar sketch at the end of this document).
• Maximum file size is 10TB.
• By default, data is mirrored to the other campus.

Class of Service
• Each file is assigned to an HPSS Class of Service (COS).
• A COS defines:
  • maximum file size
  • number of copies
  • stripe width
• The COS is usually selected by the user, but the method varies with the tool.

COS Listing
• COS 1 - default; max file size 10GB
• COS 2 - max file size 40GB
• COS 3 - max file size 10TB
• COS 4 - max file size 10TB; 16-way stripe to disk; 4-way to tape
• Others are available.

Quotas
• Default is 1TB.
• Users can request up to 10TB.
• Greater than 10TB can be arranged with cost sharing.
• Quotas are soft.
• Users will be notified after exceeding their quota.

For Further Information
Please visit the IU MDSS web site: http://storage.iu.edu/mdss.html
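The Storing Data slide recommends files of 10MB or larger. One common way to meet that guideline is to bundle many small files into a single archive with htar, one of the new client binaries mentioned under User Impact. The sketch below assumes htar is on the PATH and already configured for MDSS; the tar-like flags follow common htar usage, and the directory and archive names are hypothetical.

```python
import subprocess

def mdss_archive_directory(local_dir: str, hpss_archive: str) -> None:
    # Bundle a directory of small files into one tar archive stored in MDSS,
    # avoiding many sub-megabyte files (tar-like htar flags assumed).
    subprocess.run(["htar", "-cvf", hpss_archive, local_dir], check=True)

if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    mdss_archive_directory("run_output/", "project/run_output.tar")
```

Bundling keeps per-file overhead down and stays well above the 1MB minimum / 10MB preferred file sizes noted in the Storing Data guidance above.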