Challenges in Building a Commercial Deduplication Storage System Kai Li Paul and Marcia Wythes Professor, Princeton University & Chief Scientist, Data Domain at EMC 6/8/11 Disclaimer… The following is my opinion, not EMC 6/8/11 Kai Li 2 Why Did We Do It… 6/8/11 Kai Li 3 !"#$%&'()*! 6/8/11 Kai Li Page 4 +%!,"-./01"2%3"4"%5$14$,% Mirrored storage $$$$ $$$ Dedicated Fibre Clients 6/8/11 Server Primary storage Kai Li 5 +%3"4"%5$14$,%'*.16%3.*)%&40,"6$7% Mirrored storage Dedicate d Fibre Clients Server Primary storage 50*4*% • 8',(9"*$:%;;;;% • <"1-=.-49:%;;;;% • 80=$,:%;;;% 20 ! primary (3-month retention) 20 ! primary WAN Onsite Remote 6/8/11 Kai Li 6 +%3"4"%5$14$,%% '*.16%!"#$%&'()*+,-./+0)1"->(0?&@*4$A% Mirrored storage Dedicated WAN Clients Server Primary storage B"2'$%8,0#0*./01*:% • 8',(9"*$:%C!"#$%2.D,",.$*% • &#"($:%EF?GFH%,$-'(/01% • I+J%<I:%EF?KFH%,$-'(/01% • 80=$,:%CEFH%,$-'(/01% 6/8/11 Onsite WAN Remote Kai Li 7 <$L0,$%MEN%O"()*P% Before: 17 Tape Libraries 6/8/11 Kai Li 8 +Q$,%% After: 3 3-U Systems! 6/8/11 Kai Li 9 Revenue ($million) 3"4"%30A".1%8,0-'(4%O$R$1'$% EEFF% EFFF% XFF% TFF% NFF% VFF% KFF% UFF% GFF% WFF% EFF% F% EEFF% KNF% WNU% F% F% F% FSNT% TSE% UVSU% EWGSV% WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE% A1 A2 6/8/11 B C Kai Li IPO M/A 10 EEFF% EFFF% XFF% TFF% NFF% VFF% KFF% UFF% GFF% WFF% EFF% F% EEFF% 4223<23;23:23923- KNF% 823723- WNU% F% F% F% FSNT% TSE% UVSU% EWGSV% 623- My Time (%) Revenue ($million) Y@%5014,.D'/01*7% 52342323- WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE% A1 A2 6/8/11 B C Kai Li IPO M/A 11 Deduplication Storage 6/8/11 Kai Li 12 Deduplication: Find Redundancy in A Large Window “Local” compression Deduplication ~100k Encode a sliding window [Ziv&Lempel77] WAN ~2-3X compression 6/8/11 ~10-30X compression Kai Li 13 Consolidated … Y$-."% *$,R$,% … Backup data streams 3$-'#%&40,"6$%&@*4$A%L0,%<"()'#*% … data streams 3$-'#%&40,"6$% 5014,022$,% Replication I+J% Y$-."% *$,R$,% Z1.['$% 3"4"% &$6*% Y$4"% -"4"% Physical storage For deduped data Dedup Storage 6/8/11 Kai Li 14 \0=%30$*%]4%I0,)7% Segmented data stream 3$-'#%&40,"6$% 5014,022$,% Z1.['$% 3"4"% &$6*% Y$4"% -"4"% Dedup Storage 6/8/11 Kai Li 15 ^._$-%R*S%B",."D2$%&$6A$14"/01% • ^._$-%*.`$% U)% U)% 4k ... Cannot handle deletes, shifts ... No problem w/ deletes, shifts [Manber93, Brin94] • 5014$14?D"*$-a%R",."D2$%*.`$% A X C D A Y C D A B C D A B fp = 10110110 fp = 10110100 ... fp = 10110000 6/8/11 Kai Li 16 %Y0,$%3$4".2*% Segmented data stream Segmented3$-'#%&40,"6$% data stream 5014,022$,% Y$4"% -"4"% Z1.['$% 3"4"% &$6*% Dedup Storage ^0,%$"(9%*$6A$14% • 50A#'4$%"%*4,016%b16$,#,.14% • Z*$%"1%.1-$_%40%200)'#% 3$-'#%&40,"6$% – ]L%'1.['$% 5014,022$,% • #'4%b16$,#,.14%.140%.1-$_% • "##2@%20("2%(0A#,$**.01%40%*$6A$14% • *40,$%*$6A$14% • Z1.['$% *40,$%A$4"%-"4"% Y$4"% 3"4"% – ]L%-'#2.("4$% -"4"% &$6*% • *40,$%A$4"%-"4"% Dedup Storage 6/8/11 Kai Li 17 Y0,$%3$4".2*%501/1'$-% Segmented data stream 3$-'#%&40,"6$% 5014,022$,% Y$4"% -"4"% Z1.['$% 3"4"% &$6*% 501(',,$14%c",D"6$%5022$(/01% • +%*$6A$14%A"@%D$%*9",$-%D@% A'2/#2$%b2$*% • <"()'#*%1$$-%40%D$%-$2$4$-% *0A$/A$% • c5%("1104%.14$,L$,$%=.49%D"()'#*% Dedup Storage ^E% +% 6/8/11 Kai Li ^W% <% 5% 3% 18 Design Challenges 6/8/11 Kai Li 19 59"22$16$*%L0,%<"()'#%&40,"6$% • ^"(4*% – 3"4"%-0'D2$*%$R$,@%ET%A0149*% – WU%90',*%d%-"@% • O$['.,$A$14*% – 50A#2$4$%D"()'#*%=.49.1%49$%eD"()'#%=.1-0=%/A$f% – ^"*4%,$(0R$,@%L,0A%20("2%0,%,$A04$%D"()'#*% – J0%1$=%D'-6$4% • 59"22$16$*% – g0=%(0*4% – \.69%(0A#,$**.01%,"/0% – ]1(,$"*$%-$-'#2.("/01%49,0'69#'4%"1-%("#"(.4@% – \.69%"R".2"D.2.4@%"1-%-"4"%.14$6,.4@% 6/8/11 Kai Li 20 ^0,($*%.1%3$-'#%&40,"6$%3$*.61% More disks More CPU Less , RAM disks Reliability Hig her Ded up ratio s 6/8/11 Kai Li 21 O$2."D.2.4@%"1-%3"4"%]14$6,.4@% • Y'(9%A0,$%*$,.0'*2@%9$,$%49"1%#,.A",@%*40,"6$% – g04*%0L%,$-'1-"1(@%9"*%D$$1%,$A0R$-%MGF_hP% – g"*4%*40#%0L%-"4"%#,04$(/01% • 3"4"%30A".1i*%"##,0"(9% – 3"4"%.*%*40,$-%.1%"%206%0L%*$2L?-$*(,.D.16%(014".1$,*% – +##$1-%012@%40%"R0.-%0R$,=,.4$*% – B$,[email protected]%(014".1$,*%"1-%b2$*%ME*4%/A$%"1-%"22%49$%/A$P% – Y$4"%-"4"%,$(01*4,'(/01%L,0A%(014".1$,*% – JBO+Y%2066.16%L0,%(,"*9%,$(0R$,@%"1-%L"'24%.*02"/01% – &$2L?(0,,$(/01%L,0A%*0Q=",$%O+]3?V% 6/8/11 Kai Li 22 \.69%3$-'#2.("/01%O"/0% • \.69%-$-'#2.("/01%L"(40,%!%9",-=",$%(0*4% – &A"22$,%*$6A$14*%"(9.$R$%9.69$,%(0A#,$**.01%,"/0*% – &A"22$,%*$6A$14*%.A#2@%9.69$,%,"/0%0L%b16$,#,.14%.1-$_% *.`$%40%#9@*.("2%-.*)%*40,"6$% • 3"4"%30A".1i*%"##,0"(9% – Z1-$,*4"1-%49$%*=$$4%*#04*%0L%*$6A$14%*.`$*% – Y'2/#2$%20("2%(0A#,$**.01%"260,.49A*% 6/8/11 Kai Li 23 Segment Sizes for Backup Data Rule of thumb: 2X segment size will " " " increase space for unique segments by 15% decrease metadata by about 50% deduce disk I/Os for writes and reads 6/8/11 Kai Li 24 Real World Example at Datacenter A 6/8/11 Kai Li 25 Real World Compression at Datacenter A 6/8/11 Kai Li 26 Real World Example at Datacenter B 6/8/11 Kai Li 27 Real World Compression at Datacenter B 6/8/11 Kai Li 28 High Deduplication Throughput " Need to double every 18 months or faster Data grows fast # Backup window time is fixed # Complete backups within the backup window time # " Data Domain’s approach A sophisticated cache for index # Several techniques to reduce memory and CPU requirements # Bet on multicore processors # 6/8/11 Kai Li 29 Why Challenging? Divide data streams into segments Lookup 6/8/11 Fingerprint Index Kai Li Index size for 80TB w/ 8KB segments = (80TB/8KB) * 20B = 200GB! 30 Caching? Divide data streams into segments Miss Lookup Index Cache Fingerprint Index Problem: No locality. 6/8/11 Kai Li 31 Parallel Index Need Many Disks [Venti02] Divide data streams into segments Miss Lookup ... Index Cache Problem: Need a lot of disks. 7200RPM disk does 120 lookups/sec. 1MB/sec with 8KB segment per disk 1GB/sec needs 1,000 disks! 6/8/11 Fingerprint Index Kai Li 32 Staging Needs More Disks Data Streams Divide data streams into segments Fingerprint Index Lookup ... Very Big Disk Buffer Problem: The Buffer needs to be as large or larger than the full backup! Big delay and may still never catch up 6/8/11 Kai Li 33 A Combination of Techniques " Layout data on disk with “duplicate locality” " A sophisticated cache for the fingerprint index Summary data structure for new data # “locality-preserved caching” for old data # " Parallelized software systems to leverage multicore processors Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of The 6th USENIX Conference on File and Storage Technologies (FAST’08). February 2008 6/8/11 Kai Li 34 Summary Vector Goal: Use minimal memory to test for new data ! Summarize what segments have been stored, with Bloom filter (Bloom’70) in RAM ! If Summary Vector says no, it’s new segment Summary Vector Approximation Index Data Structure 6/8/11 35 Stream Informed Segment Layout Goal: Capture “duplicate locality” on disk Segments from the same stream are stored in the same “containers” # Metadata (index data) are also in the containers # 6/8/11 36 Locality Preserved Caching (LPC) Goal: Maintain “duplicate locality” in the cache Disk Index has all <fingerprint, containerID> pairs # Index Cache caches a subset of such pairs # On a miss, lookup Disk Index to find containerID # Load the metadata of a container into Index Cache, replace if needed # Metadata Miss Disk Index Replacement Load metadata ContainerID Data Index Cache Container 6/8/11 37 Putting Them Together A fingerprint Duplicate Index Cache No New Replacement Summary Vector Maybe Disk Index 6/8/11 metadata metadata metadata metadata data data data data 38 Disk I/O Reduction Results Exchange data (2.56TB) Engineering data (2.39TB) 135-daily full backups # disk I/Os No summary, No SISL/LPC % of total 100-day daily inc, weekly full # disk I/Os % of total 328,613,503 100.00% 318,236,712 100.00% Summary only 274,364,788 83.49% 259,135,171 81.43% SISL/LPC only 57,725,844 17.57% 60,358,875 18.97% Summary & SISL/LPC 3,477,129 1.06% 1,257,316 0.40% 6/8/11 Kai Li 39 EEFF% EFFF% XFF% TFF% NFF% VFF% KFF% UFF% GFF% WFF% EFF% F% UFTG% 72226822622258225222- EKFF% 48224222- UF% EFF% WFF% TF% UFF% 8222- Dedup Throughput (MB/sec) Revenue ($million) O$R$1'$%R*S%3$-'#%!9,0'69#'4% WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE% A1 A2 B C IPO M/A Dedup throughput improved by ~100X in 6 years 6/8/11 Kai Li 40 EEFF% EFFF% XFF% TFF% NFF% VFF% KFF% UFF% GFF% WFF% EFF% F% WXF% 622582522- EUF% 482422- UT% T% FSTNK%U% WF% 82- Physical Capacity (TB) Revenue ($million) O$R$1'$%R*S%89@*.("2%5"#"(.4@% 2- WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE% A1 A2 B C IPO M/A Usable physical space increased by ~330X in 6 years 6/8/11 Kai Li 41 Engineering Challenges 6/8/11 Kai Li 42 EEFF% EFFF% XFF% TFF% NFF% VFF% KFF% UFF% GFF% WFF% EFF% F% UFFF% GKFF% GFFF% WKFF% WFFF% EKFF% EFFF% # of Lines (1,000) Revenue ($million) O$R$1'$%R*S%&0Q=",$%50A#2$_.4@% KFF% F% WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE% A1 A2 B C IPO M/A Most challenging Period 6/8/11 Kai Li 43 O$"2%59"22$16$%]*%O0"-A"#% • ^.,*4%#,0-'(4%-$b1./01% – 5'*40A$,*%",$%*'*#.(.0'*%"D0'4%1$=%*40,"6$%#,0-'(4*% – Y'*4%#,0R.-$%$10'69%R"2'$%#,0#0*./01*% – Y'*4%D"2"1($%$16.1$$,.16%$j0,4*% • ^0220=.16%,$2$"*$*% – 8,0R.-$%,.694%R"2'$%#,0#0*./01*% – 50A#$//R$%.1%A",)$4%#2"($% – 3$2.R$,%['"2.4@%,$2$"*$%01%/A$% 6/8/11 Kai Li 44 3.*,'#/R$%&4",4*%"4%g0=%k'"2.4@% 3.*,'#4%"1%$_.*/16%A",)$4% • ]A#,0R$%"%*$,R.($%.1%="@*%49"4%49$%A",)$4%-0$*%104%$_#$(4% Clayton M. Christensen, The Innovator's Dilemma. 1997 6/8/11 Kai Li 45 I9"4i*%&#$(."2%"D0'4%&40,"6$%8,0-'(4% • Y",)$4%$14,@%D",,.$,%.*%9.69$,%L0,%-"4"%($14$,*% • !9$,$%.*%"%A"6.(%1'AD$,%L0,%A.2$"6$% • Y"1@%.1R$*40,*%",$%.A#"/$14% 6/8/11 Kai Li 46 Future… 6/8/11 Kai Li 47 ^'4',$%]A#"(4%"1-%59"22$16$*% • J$",2.1$%*40,"6$% – \"1-2$%A"1@%2",6$%"1-%*A"22%b2$*% – O$2"/R$2@%20=%2"4$1(@%L0,%*A"22%]dl*% • +,(9.R"2%*40,"6$% – J$$-%20().16a%*9,$--.16a%2016%4$,A%,$4$1/01a%m% – ^',49$,%.A#,0R$%(0A#,$**.01%,"/0*% • 8,.A",@%*40,"6$% – O$-'($%49$%(0*4%0L%^2"*9%=.490'4%*"(,.b(.16%#$,L0,A"1($% – O$49.1)%49$%*40,"6$%$(0?*@*4$A%L0,%-"4"%($14$,*% • 520'-%*40,"6$% – I9"4%",$%49$%,.694%D'.2-.16%D20()*% 6/8/11 Kai Li 48 !9"1)%n0'% 6/8/11 Kai Li 49