Server-Class Energy and Performance Evaluations
Erez Zadok ([email protected])
File systems and Storage Lab, Stony Brook University
http://green.filesystems.org/
Invited talk, ACM SYSTOR 2010, 05/26/2010

Motivation
• For every $1 spent on hardware, another $0.50 is spent on power and cooling [IDC 2007]
• Energy use in U.S. data centers is 1–2% of total U.S. energy use [EPA 2007], with a growth rate of 2x per 5 years
• Even more energy is consumed outside the data center [Forrester 2008]
• Goals: build performance- and energy-efficient systems, and evaluate the efficacy of file systems in achieving this goal

Overview
• Motivation
• Related Work
• Experimental Methodology
• Evaluation Results: Machine 1 (M1) and Machine 2 (M2)
• Conclusion and Future Work

Techniques
Two complementary families:
• Right sizing (reduce Pidle), mostly hardware-based:
  - CPU DVFS
  - Machine ACPI states (standby, hibernate, off, etc.)
  - Opportunistic disk spin-down, DRPM
  - Virtualization/VMs
• Work reduction (reduce Pdynamic), mostly software-based:
  - Aggregation, localization
  - Compression, deduplication
  - Reconfiguration: applications/services, file systems, RAID levels, etc.

Right Sizing Techniques
• Techniques to increase disk sleep time:
  - Massive Array of Idle Disks (MAID) [Colarelli 2002]
  - Popular Data Concentration (PDC) [Pinheiro 2004]
  - Write off-loading [Narayanan 2008]
  - GreenFS [Joukov 2008]
  - Scaling down Hadoop clusters [Leverich 2009]

Work Reduction Techniques
• Grouping/replication and prediction: FS2 [Huang 2005], EEFS [Li 2006], Predictive Data Grouping [Essary 2008]
• Energy-aware prefetching [Manzanares 2006]
• Hybrid, low-power hardware with intelligent data structures: FAWN [Andersen 2009]

Benchmarking Studies
• Benchmarks:
  - SPECpower; metric: operations/second/watt
  - JouleSort; metric: sorted records/joule
• Benchmark studies:
  - RAID evaluation [Gurumurthi 2003]
  - Compression evaluation [Kothiyal 2009]

Experimental Methodology
• Workloads (4): web server, database server, file server, and mail server, all emulated with FileBench (described next)
• File systems (4): Ext2, Ext3, ReiserFS, XFS
  - Mount options: noatime, notail, journaling mode (data=<mode>)
  - Format options: inode size, block size, allocation/block group count
• Hardware: 2 machines
• In total we ran 248 benchmarks, taking 414 clock hours
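To make a sweep of this size concrete, here is a minimal sketch, in Python, of the kind of harness that could drive it. This is not the lab's actual tooling: the device path, mount point, the particular option combinations, and the assumption that the stock FileBench personality files (webserver.f, varmail.f, etc.) are installed are all illustrative.

```python
# Hypothetical benchmark-matrix driver: for each file-system configuration
# (format + mount options), run each FileBench workload on a fresh file system.
# DEVICE, MNT, and the CONFIGS/WORKLOADS lists are illustrative placeholders.
import itertools
import subprocess

DEVICE, MNT = "/dev/sdb1", "/mnt/test"

# name -> (mkfs command, extra format args, extra mount options)
CONFIGS = {
    "ext2-default":    ("mkfs.ext2",     [],                        []),
    "ext3-writeback":  ("mkfs.ext3",     [],                        ["data=writeback"]),
    "ext3-blk2k":      ("mkfs.ext3",     ["-b", "2048"],            []),
    "reiserfs-notail": ("mkfs.reiserfs", ["-q"],                    ["notail"]),
    "xfs-isize1k":     ("mkfs.xfs",      ["-f", "-i", "size=1024"], []),
}
WORKLOADS = ["webserver.f", "fileserver.f", "varmail.f", "oltp.f"]

for (name, (mkfs, fmt_args, mnt_opts)), workload in itertools.product(
        CONFIGS.items(), WORKLOADS):
    # Reformat so each run starts from an identical, empty file system.
    subprocess.run([mkfs, *fmt_args, DEVICE], check=True)
    opts = ",".join(["noatime", *mnt_opts])
    subprocess.run(["mount", "-o", opts, DEVICE, MNT], check=True)
    try:
        # FileBench prints an ops/sec summary that a real harness would capture.
        subprocess.run(["filebench", "-f", workload], check=True)
    finally:
        subprocess.run(["umount", MNT], check=True)
```

Each cell of the matrix would additionally be repeated several times and logged alongside power readings (see the hardware setup below) to obtain stable performance and energy figures.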
FileBench
• Developed at Sun Microsystems (2005), originally for performance analysis of the Solaris OS
• Rich language to emulate complex workloads
• Ships with several emulated workloads and application traces, and recommends parameters for server workloads
• More capable than several other benchmarks, e.g., Bonnie, Postmark, and the Andrew benchmark
• We now maintain it and release new versions

FileBench Workloads
Workload   Avg. file size   Avg. dir. depth   No. of files   I/O size (R/W)   Threads   R/W ratio
Mail       16KB             flat              50,000         1MB/16KB         100       1:1
Database   0.5GB            flat              10             2KB/2KB          200+10    20:1
Web        32KB             3.3               20,000         1MB/16KB         100       10:1
File       256KB            3.6               50,000         1MB/16KB         100       1:2

File System Properties
• Ext2: linear disk layout; fixed-size blocks; fixed number of files; no journaling; block groups
• Ext3: linear disk layout; fixed-size blocks; fixed number of files; journaling modes: ordered, writeback, data; block groups
• ReiserFS: B+ tree layout; fixed-size blocks; variable number of files; journaling modes: ordered, writeback, data, none; tail packing
• XFS: B+ tree layout; variable-size extents with delayed allocation; variable number of files; writeback journaling; allocation groups
OS used: CentOS 5.3, Linux 2.6.18-128.1.16.el5.centos.plus

Hardware Setup
[diagram: A/C power supply feeds the Linux server through a WattsUP Pro ES meter; server power readings are collected over USB]
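Throughout the results, energy efficiency is reported in ops/kjoule: the operations completed during a run divided by the energy the meter records for it. As a minimal sketch (not the lab's actual tooling), assuming a WattsUP-style log with one average-power sample in watts per second:

```python
# Sketch: turn a per-second power log (watts) plus a FileBench ops/sec figure
# into the ops/kjoule efficiency metric used on the following slides.
# The log format (one watt reading per line, 1 Hz) is an assumption.

def energy_joules(log_path: str) -> float:
    """Integrate power over time: at 1 sample/sec, joules = sum of watt samples."""
    with open(log_path) as log:
        return sum(float(line) for line in log if line.strip())

def ops_per_kjoule(ops_per_sec: float, runtime_sec: float, log_path: str) -> float:
    total_ops = ops_per_sec * runtime_sec
    return total_ops / (energy_joules(log_path) / 1000.0)

# Worked example: 1,350 ops/sec over a 600s run with an average draw of ~210W
# gives 810,000 ops and 126 kJ, i.e., roughly 6,430 ops/kjoule -- the same
# order of magnitude as the efficiency bars in the M1 results.
```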
Machine Configurations
                      M1                      M2
Machine age           3+ years (2007)         < 1 year (2009)
CPU model             Intel Xeon              Intel Nehalem (E5530)
CPU speed             2.8GHz                  2.4GHz
No. of CPUs           2 dual-core             1 quad-core
DVFS                  No                      Yes
L1 cache size         16KB                    128KB
L2 cache size         2MB                     1MB
L3 cache size         None                    8MB
FSB speed             800MHz                  1066MHz
RAM size/type         2048MB DIMM             24GB DIMM (2GB used)
Disk type/RPM         15K RPM SCSI            7.2K RPM SATA
Avg. seek time        3.2/3.6ms               10.5/12.5ms
Disk cache            8MB                     16MB

Mail Server (M1)
[bar charts: throughput in ops/sec and energy efficiency in ops/kjoule per file-system configuration; higher is better]
• Performance and energy efficiency track each other nearly linearly
• ReiserFS-notail is the best configuration here, with annotated gains of up to 3.5x
• Ext2's bottleneck is fsync
• Tail packing, on by default in ReiserFS, hurts small-file reads by 17–50%
• XFS's bottleneck is lookup

Database Server (M1)
[bar chart: throughput in ops/sec per configuration]
• Except for Ext2, the default file systems perform similarly
• Matching the file-system block size to the workload's 2KB I/O size boosts efficiency by ~2x
• Journaling helps random writes

Web Server (M1)
[bar chart: throughput in 1000 ops/sec per configuration]
• Ext2 performs far worse (roughly 9x): no journal, and frequent, common inode (atime) updates
• ReiserFS: atime updates take the expensive BKL to search the 'stat' item; with tail packing on, small files get fragmented

File Server (M1)
[bar chart: throughput in ops/sec per configuration]
• Deep directory tree and a mix of metadata and data operations
• Large average file size (256KB)

File System Selection Matrix (M1)
Workload          Best FS (configuration)    Improvement vs. all default FSs
                                             ops/sec        ops/joule
Web server        XFS (inode size 1K)        8% – 9.4x      6% – 7.5x
File server       ReiserFS (default)         0% – 1.9x      0% – 2.0x
Mail server       ReiserFS (notail)          29% – 5.8x     28% – 5.7x
Database server   XFS/Ext3 (2K blocks)       2.0 – 2.4x     2.0 – 2.4x
• The optimal file system often varies with changes in workload, software, and hardware; newer hardware gives different results
• This recommendation matters, but... (see the M2 results below)
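The matrix invites a trivial first cut at the auto-configuration tools mentioned in the future-work slide. As a toy illustration only (the talk presents no code), M1's results reduce to a lookup table:

```python
# Toy illustration: the M1 selection matrix as a lookup table -- a first
# step toward an auto-configuration tool. Keys and options mirror the
# matrix above; the function and its names are hypothetical.
RECOMMENDED_FS_M1 = {
    "web":      ("xfs",      {"format": "inode size 1K"}),
    "file":     ("reiserfs", {}),                        # defaults
    "mail":     ("reiserfs", {"mount": "notail"}),
    "database": ("xfs",      {"format": "2K blocks"}),   # Ext3 w/ 2K blocks ties
}

def recommend(workload: str) -> str:
    fs, opts = RECOMMENDED_FS_M1[workload]
    detail = ", ".join(f"{k}: {v}" for k, v in opts.items()) or "defaults"
    return f"{fs} ({detail})"

print(recommend("mail"))  # -> reiserfs (mount: notail)
```

A real tool would also have to key on the hardware: as the next slides show, M2's best configurations differ from M1's.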
Mail Server (M1 vs. M2)
[bar charts: throughput in ops/sec on M1 (top) and M2 (bottom) per configuration]
• Memory-intensive workload: on M2, the larger cache overcomes Ext2's fsync bottleneck
• All default configurations improve on M2 over M1
• Best configurations differ: ReiserFS-notail on M1 vs. Ext3-default on M2
• Most configurations follow the same trend on M2 as on M1, but some trends change; on M2, increasing the allocation-group count decreases performance (~5–10%)

Database Server (M1 vs. M2)
[bar charts: throughput in ops/sec on M1 (top) and M2 (bottom) per configuration]
• Disk-intensive workload: the performance trend remains the same across M1 and M2
• M2 degrades 35%–86% vs. M1 (M2 has a slower, 7.2K RPM disk)
• Best configurations for both M1 and M2: Ext3 and XFS with 2K blocks
• On M2, the 2K block size increases performance by ~1.5x

Ongoing Work
• We are evaluating the end-to-end impact of workloads on NFSv4 servers
• Several workloads
• Mix of clients and servers, all on the same hardware
• OSes: Linux (Ubuntu, CentOS), FreeBSD, OpenSolaris

Results: Web Server, Server-wise
Peak throughput (ops/sec):
Client \ Server      CentOS   Ubuntu   FreeBSD   OpenSolaris
CentOS-client        744      1973     475       180
Ubuntu-client        794      2467     900       178
FreeBSD-client       730      1576     959       186
OpenSolaris-client   621      1048     653       176
LocalFS-client       397      450      887       201

Results: Mail Server, Server-wise
Peak throughput (ops/sec):
Client \ Server      CentOS   Ubuntu   FreeBSD   OpenSolaris
CentOS-client        2560     2270     1262      329
Ubuntu-client        2668     2356     1052      394
FreeBSD-client       2527     2347     1356      316
OpenSolaris-client   2447     2197     1692      273
LocalFS-client       457      636      471       254

Scaling Web Server Performance
[line chart: operations per second vs. number of files (10,000–160,000) for Ext2, Ext3, ReiserFS, and XFS]

Conclusions
• The Bad: software has gotten too complex; workloads drive performance and energy, which also depend on hardware, software, and configurations
• The Good: significant savings are possible; small savings accumulate over the long run; commercial and research opportunities abound
• The Ugly: we need workload-specific software

Ongoing/Future Work
• Study more dimensions: new file systems, disk schedulers, RAID, LVM, etc.; client/server systems; disk types (SAS, SSD, etc.); cluster storage, SANs, OSes
• Develop auto-configuration tools
• Develop workload-specific storage stacks: I/O schedulers, file systems, caching

Q&A
Server-Class Energy and Performance Evaluations
Erez Zadok ([email protected])
File systems and Storage Lab, Stony Brook University
http://green.filesystems.org/