MDSS: Massive Data Storage Service
at Indiana University
June 6, 2007
MDSS Defined
• MDSS is a service for researchers to store and access huge amounts of data.
• MDSS is based on the High Performance Storage System (HPSS).
• HPSS is supported by a consortium of IBM and several universities and national labs.
• HPSS is a hierarchical storage management (HSM) system.
MDSS Background
• MDSS established in 1998 at IU
• MDSS has components on the IUPUI and IUB campuses
• Over 1.8 petabytes of storage available, scalable to 4.2 petabytes
• In aggregate, MDSS can handle as much as 2GB/s of data flowing in or out
Upgrade Equipment
• More and faster tape drives
• More and larger tape volumes
• Larger disk cache
• More and faster servers
• Increased network capacity
New Functionality
• Larger files
• New access protocols
• New classes of service
• Encrypted transfer option
• Greater flexibility for IUPUI users
User Impact
• New hsi, htar, and pftp_client binaries
• New firewall entries for new servers
• Plain-text authentication no longer supported
• Support for DFS dropped
• New Kerberos keytabs required
IU: Two Sites, One System with Remote Movers
[Diagram: Indianapolis and Bloomington clients, campus LANs and WAN gateways, the HPSS core servers with DB2 metadata disks, HPSS movers and remote movers, FC SAN disk arrays, and tape libraries, all linked over a TCP/IP wide area network.]
• Files from selected classes of service are written to disk or tape at the remote facility.
• The TCP/IP-based architecture enables remote devices and movers to appear local to the primary site, with no distance limitation.
• Also enables mirrored tape files.
MDSS Design Detailed
MDSS Components
• 15 IBM P5 575 nodes
  • 8GB RAM
  • 2 10Gb Ethernet adaptors
• 52 tape drives
  • 40 IBM TS1120 (500GB tapes)
  • 12 STK 9940B (200GB tapes)
• 150TB of disk cache
Accessing MDSS
• Data goes to disk cache first
• Data copied to tape ASAP
• Copy sent to other campus by default
• Data is purged from disk cache as space is needed
• Data is staged back to disk cache as needed (see the sketch below)
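The flow above can be pictured with a small toy model. The sketch below is illustrative Python, not MDSS code: the ToyHSM class, its sizes, and its purge policy are all invented to show how a file lands in the disk cache, gets a tape copy, is purged when the cache fills, and is staged back on the next read.

    # Illustrative sketch (not MDSS code) of the HSM flow described above.
    from collections import OrderedDict


    class ToyHSM:
        def __init__(self, cache_capacity_bytes):
            self.cache_capacity = cache_capacity_bytes
            self.cache = OrderedDict()   # name -> size, oldest entries first
            self.tape = {}               # name -> size (the permanent copy)

        def _cache_used(self):
            return sum(self.cache.values())

        def _make_room(self, size):
            # Purge least-recently-used cached files; their tape copies remain.
            while self._cache_used() + size > self.cache_capacity and self.cache:
                name, _ = self.cache.popitem(last=False)
                print(f"purged {name} from disk cache (tape copy kept)")

        def write(self, name, size):
            self._make_room(size)
            self.cache[name] = size      # data goes to disk cache first
            self.tape[name] = size       # copied to tape "ASAP" in real HPSS
            print(f"wrote {name}: cached and copied to tape")

        def read(self, name):
            if name not in self.cache:   # stage back from tape on demand
                size = self.tape[name]
                self._make_room(size)
                self.cache[name] = size
                print(f"staged {name} back to disk cache from tape")
            self.cache.move_to_end(name) # mark as recently used
            return self.cache[name]


    if __name__ == "__main__":
        hsm = ToyHSM(cache_capacity_bytes=100)
        hsm.write("a.dat", 60)
        hsm.write("b.dat", 60)   # forces a.dat out of the cache
        hsm.read("a.dat")        # staged back from tape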
Accessing MDSS
• Fastest Methods (a usage sketch follows below)
  • hsi (hsi.mdss.iu.edu)
  • gridftp (gridftp.mdss.iu.edu)
  • pftp_client (ftp.mdss.iu.edu)
  • kerberized ftp (ftp.mdss.iu.edu)
hsi is from Gleicher Enterprises, LLC
gridftp is from Argonne National Laboratory
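As a rough illustration of the fastest path, the sketch below drives the hsi client from Python with subprocess. It assumes hsi is installed and configured for MDSS and that Kerberos credentials already exist; the file name is invented, and exact hsi options should be checked with the hsi help output, since they can vary by version.

    # Hedged sketch: shelling out to the hsi client from Python.
    import subprocess


    def hsi(command: str) -> None:
        """Run one hsi command string; assumes the client is set up for MDSS."""
        subprocess.run(["hsi", command], check=True)


    if __name__ == "__main__":
        hsi("put results.tar")      # copy a local file into the HPSS home directory
        hsi("ls -l results.tar")    # confirm it arrived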
Accessing MDSS
• Convenient Methods (an sftp sketch follows below)
  • sftp (sftp.mdss.iu.edu)
  • https (www.mdss.iu.edu)
  • Samba (smb.mdss.iu.edu)
  • hpssfs
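For the convenient sftp route, here is a minimal sketch using the third-party paramiko library (not part of MDSS, and not required by IU). The host name comes from the slide; the username/password prompt is an assumption, since the site may require Kerberos or key-based authentication instead.

    # Hedged sketch of an sftp upload with paramiko (pip install paramiko).
    import getpass
    import paramiko


    def upload(local_path: str, remote_path: str) -> None:
        username = input("IU username: ")
        password = getpass.getpass("Passphrase: ")

        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect("sftp.mdss.iu.edu", username=username, password=password)
        try:
            sftp = client.open_sftp()
            sftp.put(local_path, remote_path)   # single-stream, encrypted transfer
            sftp.close()
        finally:
            client.close()


    if __name__ == "__main__":
        upload("results.tar", "results.tar")

Because sftp encrypts everything and goes through a gateway rather than a native HPSS interface, it trades speed for convenience, as the performance slides below note.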
MDSS Performance
• Performance varies from more than 1GB/s down to 2MB/s or less
• Depends on:
  • network
  • I/O subsystem
  • protocol
  • stripe width
Network Performance
• Client network connection (1Gb, 100Mb, 10Mb); a rough timing sketch follows below
• Building network connection
• Firewalls
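To see why the client connection matters so much, this back-of-the-envelope sketch estimates transfer time for a 10GB file at the link speeds named above. The 80% link efficiency is an assumption for illustration, not a measured MDSS number.

    # Rough transfer-time arithmetic; efficiency factor is assumed.
    def transfer_seconds(file_bytes: float, link_bits_per_s: float,
                         efficiency: float = 0.8) -> float:
        """Approximate time to move a file over a link at the given efficiency."""
        return file_bytes * 8 / (link_bits_per_s * efficiency)


    if __name__ == "__main__":
        file_size = 10 * 1024**3  # a 10GB file
        for label, speed in [("1Gb", 1e9), ("100Mb", 1e8), ("10Mb", 1e7)]:
            t = transfer_seconds(file_size, speed)
            print(f"{label} link: about {t / 60:.1f} minutes")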
Protocol Performance
• hsi is best - fast and more options
• ftp is fast - no encryption
• sftp, Samba, and http top out at 15-20MB/s
  • encryption overhead
  • non-native HPSS interface; gateway
Stripe Performance
• HPSS can stripe transfers across servers and disks
• 2-, 4-, 8-, and 16-way stripe widths available
• For very large files and very fast clients
• Current candidates - BigRed and Data Capacitor
Storing Data
• Minimum file size should be 1MB
• Files of 10MB or larger are preferred (see the htar sketch below for bundling small files)
• Maximum file size is 10TB
• By default, data is mirrored to the other campus
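Since files below 1MB are discouraged, one practical approach is to bundle small outputs with the htar client mentioned earlier, which writes a tar-format archive directly into HPSS. The sketch below is a hedged example: the archive and directory names are invented, and the exact options should be confirmed against your local htar documentation.

    # Hedged sketch: bundling many small files into one HPSS-resident archive.
    import subprocess


    def archive_directory(hpss_archive: str, local_dir: str) -> None:
        """Bundle local_dir into a single tar archive stored in HPSS via htar."""
        subprocess.run(["htar", "-cvf", hpss_archive, local_dir], check=True)


    if __name__ == "__main__":
        # Thousands of small outputs become one appropriately sized object.
        archive_directory("projects/demo/run42.tar", "run42_output/")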
Class of Service
• Each file is assigned to an HPSS Class of Service (COS)
• A COS defines:
  • maximum file size
  • number of copies
  • stripe width
• COS is usually selected by the user, but the method varies with the tool (a selection sketch follows the COS listing)
COS Listing
• COS 1 - default; max file size 10GB
• COS 2 - max file size 40GB
• COS 3 - max file size 10TB
• COS 4 - max file size 10TB; 16-way stripe to disk; 4-way to tape
• Others available
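The listing above can be restated as a tiny lookup, shown below purely as an illustration. In practice the COS is chosen through the transfer tool itself and the mechanism differs by tool, so treat this as a readable restatement of the limits, not an MDSS API; binary units and the omission of the striped COS 4 are simplifications.

    # Illustrative mapping from file size to the listed COS limits.
    GIB = 1024**3
    TIB = 1024**4

    # (COS id, maximum file size in bytes); COS 4 (16-way striped) omitted
    COS_LIMITS = [
        (1, 10 * GIB),   # default
        (2, 40 * GIB),
        (3, 10 * TIB),
    ]


    def pick_cos(file_size_bytes: int) -> int:
        """Return the smallest listed COS whose limit fits the file."""
        for cos_id, limit in COS_LIMITS:
            if file_size_bytes <= limit:
                return cos_id
        raise ValueError("file exceeds the 10TB maximum file size")


    if __name__ == "__main__":
        print(pick_cos(5 * GIB))    # 1: fits the 10GB default class
        print(pick_cos(25 * GIB))   # 2
        print(pick_cos(2 * TIB))    # 3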
Quotas
• Default quota is 1TB
• Can request up to 10TB
• More than 10TB available with cost sharing
• Quotas are soft
• Users are notified after exceeding their quota (a checking sketch follows below)
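As a final illustration of what a soft quota means in practice, the sketch below only warns once usage passes the allocation; nothing is blocked. The numbers reuse the defaults above, and the print-based notification is an invention for the example.

    # Hedged sketch of soft-quota behaviour: warn, never block.
    TB = 1000**4


    def check_quota(used_bytes: int, quota_bytes: int = 1 * TB) -> bool:
        """Return True (and warn) if usage has gone past the soft quota."""
        if used_bytes > quota_bytes:
            over = (used_bytes - quota_bytes) / TB
            print(f"Soft quota exceeded by {over:.2f} TB; "
                  "contact MDSS support about a larger allocation.")
            return True
        return False


    if __name__ == "__main__":
        check_quota(used_bytes=int(1.3 * TB))   # triggers the notice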
For Further Information
please visit the IU MDSS web site
http://storage.iu.edu/mdss.html