Presentation

Challenges in Building a
Commercial Deduplication
Storage System
Kai Li
Paul and Marcia Wythes Professor, Princeton University
&
Chief Scientist, Data Domain at EMC
6/8/11
Disclaimer…
The following is my opinion, not EMC
6/8/11
Kai Li
2
Why Did We Do It…
6/8/11
Kai Li
3
!"#$%&'()*!
6/8/11
Kai Li
Page 4
+%!,"-./01"2%3"4"%5$14$,%
Mirrored
storage
$$$$
$$$
Dedicated
Fibre
Clients
6/8/11
Server
Primary
storage
Kai Li
5
+%3"4"%5$14$,%'*.16%3.*)%&40,"6$7%
Mirrored
storage
Dedicate
d
Fibre
Clients
Server
Primary
storage
50*4*%
•  8',(9"*$:%;;;;%
•  <"1-=.-49:%;;;;%
•  80=$,:%;;;%
20 ! primary (3-month retention)
20 ! primary
WAN
Onsite
Remote
6/8/11
Kai Li
6
+%3"4"%5$14$,%%
'*.16%!"#$%&'()*+,-./+0)1"->(0?&@*4$A%
Mirrored
storage
Dedicated
WAN
Clients
Server
Primary
storage
B"2'$%8,0#0*./01*:%
•  8',(9"*$:%C!"#$%2.D,",.$*%
•  &#"($:%EF?GFH%,$-'(/01%
•  I+J%<I:%EF?KFH%,$-'(/01%
•  80=$,:%CEFH%,$-'(/01%
6/8/11
Onsite
WAN
Remote
Kai Li
7
<$L0,$%MEN%O"()*P%
Before: 17 Tape Libraries
6/8/11
Kai Li
8
+Q$,%%
After: 3 3-U Systems!
6/8/11
Kai Li
9
Revenue ($million)
3"4"%30A".1%8,0-'(4%O$R$1'$%
EEFF%
EFFF%
XFF%
TFF%
NFF%
VFF%
KFF%
UFF%
GFF%
WFF%
EFF%
F%
EEFF%
KNF%
WNU%
F%
F%
F%
FSNT% TSE%
UVSU%
EWGSV%
WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE%
A1 A2
6/8/11
B
C
Kai Li
IPO
M/A
10
EEFF%
EFFF%
XFF%
TFF%
NFF%
VFF%
KFF%
UFF%
GFF%
WFF%
EFF%
F%
EEFF%
4223<23;23:23923-
KNF%
823723-
WNU%
F%
F%
F%
FSNT% TSE%
UVSU%
EWGSV%
623-
My Time (%)
Revenue ($million)
Y@%5014,.D'/01*7%
52342323-
WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE%
A1 A2
6/8/11
B
C
Kai Li
IPO
M/A
11
Deduplication Storage
6/8/11
Kai Li
12
Deduplication: Find Redundancy
in A Large Window
“Local”
compression
Deduplication
~100k
Encode a sliding window
[Ziv&Lempel77]
WAN
~2-3X compression
6/8/11
~10-30X compression
Kai Li
13
Consolidated
…
Y$-."%
*$,R$,%
…
Backup data streams
3$-'#%&40,"6$%&@*4$A%L0,%<"()'#*%
…
data streams
3$-'#%&40,"6$%
5014,022$,%
Replication
I+J%
Y$-."%
*$,R$,%
Z1.['$%
3"4"%
&$6*%
Y$4"%
-"4"%
Physical storage
For deduped data
Dedup Storage
6/8/11
Kai Li
14
\0=%30$*%]4%I0,)7%
Segmented
data stream
3$-'#%&40,"6$%
5014,022$,%
Z1.['$%
3"4"%
&$6*%
Y$4"%
-"4"%
Dedup Storage
6/8/11
Kai Li
15
^._$-%R*S%B",."D2$%&$6A$14"/01%
•  ^._$-%*.`$%
U)%
U)%
4k
...
Cannot handle
deletes, shifts
...
No problem w/
deletes, shifts
[Manber93, Brin94]
•  5014$14?D"*$-a%R",."D2$%*.`$%
A X C D A Y C D A B C D A B
fp = 10110110
fp = 10110100
...
fp = 10110000
6/8/11
Kai Li
16
%Y0,$%3$4".2*%
Segmented
data stream
Segmented3$-'#%&40,"6$%
data stream 5014,022$,%
Y$4"%
-"4"%
Z1.['$%
3"4"%
&$6*%
Dedup Storage
^0,%$"(9%*$6A$14%
•  50A#'4$%"%*4,016%b16$,#,.14%
•  Z*$%"1%.1-$_%40%200)'#%
3$-'#%&40,"6$%
–  ]L%'1.['$%
5014,022$,%
•  #'4%b16$,#,.14%.140%.1-$_%
•  "##2@%20("2%(0A#,$**.01%40%*$6A$14%
•  *40,$%*$6A$14%
•  Z1.['$%
*40,$%A$4"%-"4"%
Y$4"%
3"4"%
– 
]L%-'#2.("4$%
-"4"%
&$6*%
•  *40,$%A$4"%-"4"%
Dedup Storage
6/8/11
Kai Li
17
Y0,$%3$4".2*%501/1'$-%
Segmented
data stream
3$-'#%&40,"6$%
5014,022$,%
Y$4"%
-"4"%
Z1.['$%
3"4"%
&$6*%
501(',,$14%c",D"6$%5022$(/01%
•  +%*$6A$14%A"@%D$%*9",$-%D@%
A'2/#2$%b2$*%
•  <"()'#*%1$$-%40%D$%-$2$4$-%
*0A$/A$%
•  c5%("1104%.14$,L$,$%=.49%D"()'#*%
Dedup Storage
^E%
+%
6/8/11
Kai Li
^W%
<%
5%
3%
18
Design Challenges
6/8/11
Kai Li
19
59"22$16$*%L0,%<"()'#%&40,"6$%
•  ^"(4*%
–  3"4"%-0'D2$*%$R$,@%ET%A0149*%
–  WU%90',*%d%-"@%
•  O$['.,$A$14*%
–  50A#2$4$%D"()'#*%=.49.1%49$%eD"()'#%=.1-0=%/A$f%
–  ^"*4%,$(0R$,@%L,0A%20("2%0,%,$A04$%D"()'#*%
–  J0%1$=%D'-6$4%
•  59"22$16$*%
–  g0=%(0*4%
–  \.69%(0A#,$**.01%,"/0%
–  ]1(,$"*$%-$-'#2.("/01%49,0'69#'4%"1-%("#"(.4@%
–  \.69%"R".2"D.2.4@%"1-%-"4"%.14$6,.4@%
6/8/11
Kai Li
20
^0,($*%.1%3$-'#%&40,"6$%3$*.61%
More disks
More
CPU
Less , RAM
disks
Reliability
Hig
her
Ded
up
ratio
s
6/8/11
Kai Li
21
O$2."D.2.4@%"1-%3"4"%]14$6,.4@%
•  Y'(9%A0,$%*$,.0'*2@%9$,$%49"1%#,.A",@%*40,"6$%
–  g04*%0L%,$-'1-"1(@%9"*%D$$1%,$A0R$-%MGF_hP%
–  g"*4%*40#%0L%-"4"%#,04$(/01%
•  3"4"%30A".1i*%"##,0"(9%
–  3"4"%.*%*40,$-%.1%"%206%0L%*$2L?-$*(,.D.16%(014".1$,*%
–  +##$1-%012@%40%"R0.-%0R$,=,.4$*%
–  B$,[email protected]%(014".1$,*%"1-%b2$*%ME*4%/A$%"1-%"22%49$%/A$P%
–  Y$4"%-"4"%,$(01*4,'(/01%L,0A%(014".1$,*%
–  JBO+Y%2066.16%L0,%(,"*9%,$(0R$,@%"1-%L"'24%.*02"/01%
–  &$2L?(0,,$(/01%L,0A%*0Q=",$%O+]3?V%
6/8/11
Kai Li
22
\.69%3$-'#2.("/01%O"/0%
•  \.69%-$-'#2.("/01%L"(40,%!%9",-=",$%(0*4%
–  &A"22$,%*$6A$14*%"(9.$R$%9.69$,%(0A#,$**.01%,"/0*%
–  &A"22$,%*$6A$14*%.A#2@%9.69$,%,"/0%0L%b16$,#,.14%.1-$_%
*.`$%40%#9@*.("2%-.*)%*40,"6$%
•  3"4"%30A".1i*%"##,0"(9%
–  Z1-$,*4"1-%49$%*=$$4%*#04*%0L%*$6A$14%*.`$*%
–  Y'2/#2$%20("2%(0A#,$**.01%"260,.49A*%
6/8/11
Kai Li
23
Segment Sizes for Backup Data
Rule of thumb: 2X segment size will
" 
" 
" 
increase space for unique segments by 15%
decrease metadata by about 50%
deduce disk I/Os for writes and reads
6/8/11
Kai Li
24
Real World Example at Datacenter A
6/8/11
Kai Li
25
Real World Compression at Datacenter A
6/8/11
Kai Li
26
Real World Example at Datacenter B
6/8/11
Kai Li
27
Real World Compression at Datacenter B
6/8/11
Kai Li
28
High Deduplication Throughput
"  Need
to double every 18 months or faster
Data grows fast
#  Backup window time is fixed
#  Complete backups within the backup window time
# 
"  Data
Domain’s approach
A sophisticated cache for index
#  Several techniques to reduce memory and CPU
requirements
#  Bet on multicore processors
# 
6/8/11
Kai Li
29
Why Challenging?
Divide data streams
into segments
Lookup
6/8/11
Fingerprint
Index
Kai Li
Index size for 80TB
w/ 8KB segments
= (80TB/8KB) * 20B
= 200GB!
30
Caching?
Divide data streams
into segments
Miss
Lookup
Index
Cache
Fingerprint
Index
Problem: No locality.
6/8/11
Kai Li
31
Parallel Index Need Many Disks [Venti02]
Divide data streams
into segments
Miss
Lookup
...
Index
Cache
Problem:
Need a lot of disks.
7200RPM disk does 120 lookups/sec.
1MB/sec with 8KB segment per disk
1GB/sec needs 1,000 disks!
6/8/11
Fingerprint
Index
Kai Li
32
Staging Needs More Disks
Data Streams
Divide data streams
into segments
Fingerprint
Index
Lookup
...
Very Big Disk Buffer
Problem: The Buffer needs to be as large or
larger than the full backup!
Big delay and may still never catch up
6/8/11
Kai Li
33
A Combination of Techniques
"  Layout
data on disk with “duplicate locality”
"  A sophisticated cache for the fingerprint index
Summary data structure for new data
#  “locality-preserved caching” for old data
# 
"  Parallelized
software systems to leverage
multicore processors
Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk
Bottleneck in the Data Domain Deduplication File System. In
Proceedings of The 6th USENIX Conference on File and Storage
Technologies (FAST’08). February 2008
6/8/11
Kai Li
34
Summary Vector
Goal: Use minimal memory to test for new data
! Summarize what segments have been stored, with
Bloom filter (Bloom’70) in RAM
! If Summary Vector says no, it’s new segment
Summary Vector
Approximation
Index Data Structure
6/8/11
35
Stream Informed Segment Layout
Goal: Capture “duplicate locality” on disk
Segments from the same stream are stored in the
same “containers”
#  Metadata (index data) are also in the containers
# 
6/8/11
36
Locality Preserved Caching (LPC)
Goal: Maintain “duplicate locality” in the cache
Disk Index has all <fingerprint, containerID> pairs
#  Index Cache caches a subset of such pairs
#  On a miss, lookup Disk Index to find containerID
#  Load the metadata of a container into Index Cache,
replace if needed
# 
Metadata
Miss
Disk
Index
Replacement
Load
metadata
ContainerID
Data
Index
Cache
Container
6/8/11
37
Putting Them Together
A fingerprint
Duplicate
Index
Cache
No
New
Replacement
Summary
Vector
Maybe
Disk
Index
6/8/11
metadata
metadata
metadata
metadata
data
data
data
data
38
Disk I/O Reduction Results
Exchange data (2.56TB) Engineering data (2.39TB)
135-daily full backups
# disk I/Os
No summary,
No SISL/LPC
% of total
100-day daily inc, weekly full
# disk I/Os
% of total
328,613,503 100.00%
318,236,712
100.00%
Summary only
274,364,788
83.49%
259,135,171
81.43%
SISL/LPC only
57,725,844
17.57%
60,358,875
18.97%
Summary &
SISL/LPC
3,477,129
1.06%
1,257,316
0.40%
6/8/11
Kai Li
39
EEFF%
EFFF%
XFF%
TFF%
NFF%
VFF%
KFF%
UFF%
GFF%
WFF%
EFF%
F%
UFTG%
72226822622258225222-
EKFF%
48224222-
UF%
EFF% WFF%
TF%
UFF%
8222-
Dedup Throughput (MB/sec)
Revenue ($million)
O$R$1'$%R*S%3$-'#%!9,0'69#'4%
WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE%
A1 A2
B
C
IPO
M/A
Dedup throughput improved by ~100X in 6 years
6/8/11
Kai Li
40
EEFF%
EFFF%
XFF%
TFF%
NFF%
VFF%
KFF%
UFF%
GFF%
WFF%
EFF%
F%
WXF%
622582522-
EUF%
482422-
UT%
T%
FSTNK%U%
WF%
82-
Physical Capacity (TB)
Revenue ($million)
O$R$1'$%R*S%89@*.("2%5"#"(.4@%
2-
WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE%
A1 A2
B
C
IPO
M/A
Usable physical space increased by ~330X in 6 years
6/8/11
Kai Li
41
Engineering Challenges
6/8/11
Kai Li
42
EEFF%
EFFF%
XFF%
TFF%
NFF%
VFF%
KFF%
UFF%
GFF%
WFF%
EFF%
F%
UFFF%
GKFF%
GFFF%
WKFF%
WFFF%
EKFF%
EFFF%
# of Lines (1,000)
Revenue ($million)
O$R$1'$%R*S%&0Q=",$%50A#2$_.4@%
KFF%
F%
WFFE% WFFW% WFFG% WFFU% WFFK% WFFV% WFFN% WFFT% WFFX% WFEF% WFEE%
A1 A2
B
C
IPO
M/A
Most challenging Period
6/8/11
Kai Li
43
O$"2%59"22$16$%]*%O0"-A"#%
•  ^.,*4%#,0-'(4%-$b1./01%
–  5'*40A$,*%",$%*'*#.(.0'*%"D0'4%1$=%*40,"6$%#,0-'(4*%
–  Y'*4%#,0R.-$%$10'69%R"2'$%#,0#0*./01*%
–  Y'*4%D"2"1($%$16.1$$,.16%$j0,4*%
•  ^0220=.16%,$2$"*$*%
–  8,0R.-$%,.694%R"2'$%#,0#0*./01*%
–  50A#$//R$%.1%A",)$4%#2"($%
–  3$2.R$,%['"2.4@%,$2$"*$%01%/A$%
6/8/11
Kai Li
44
3.*,'#/R$%&4",4*%"4%g0=%k'"2.4@%
3.*,'#4%"1%$_.*/16%A",)$4%
•  ]A#,0R$%"%*$,R.($%.1%="@*%49"4%49$%A",)$4%-0$*%104%$_#$(4%
Clayton M. Christensen,
The Innovator's Dilemma.
1997
6/8/11
Kai Li
45
I9"4i*%&#$(."2%"D0'4%&40,"6$%8,0-'(4%
•  Y",)$4%$14,@%D",,.$,%.*%9.69$,%L0,%-"4"%($14$,*%
•  !9$,$%.*%"%A"6.(%1'AD$,%L0,%A.2$"6$%
•  Y"1@%.1R$*40,*%",$%.A#"/$14%
6/8/11
Kai Li
46
Future…
6/8/11
Kai Li
47
^'4',$%]A#"(4%"1-%59"22$16$*%
•  J$",2.1$%*40,"6$%
–  \"1-2$%A"1@%2",6$%"1-%*A"22%b2$*%
–  O$2"/R$2@%20=%2"4$1(@%L0,%*A"22%]dl*%
•  +,(9.R"2%*40,"6$%
–  J$$-%20().16a%*9,$--.16a%2016%4$,A%,$4$1/01a%m%
–  ^',49$,%.A#,0R$%(0A#,$**.01%,"/0*%
•  8,.A",@%*40,"6$%
–  O$-'($%49$%(0*4%0L%^2"*9%=.490'4%*"(,.b(.16%#$,L0,A"1($%
–  O$49.1)%49$%*40,"6$%$(0?*@*4$A%L0,%-"4"%($14$,*%
•  520'-%*40,"6$%
–  I9"4%",$%49$%,.694%D'.2-.16%D20()*%
6/8/11
Kai Li
48
!9"1)%n0'%
6/8/11
Kai Li
49