Slides

Practical Methods for
Constructing Suffix
Trees
Advisor: Jignesh M. Patel
Yuanyuan Tian
1
Roadmap
„ Background
„
„
& Motivation
Our Approaches
¾
Top-Down Disk-Based Algorithm (TDD)
¾
Merge-Based Algorithm (ST-Merge)
Experiments and Analysis
2
Background
„
„
Sequence data sets are ubiquitous in modern
life science and traditional text applications, and
querying sequences is a common and critical
operation in these applications.
Suffix trees are versatile data structures that can
help execute a variety of sequence matching
queries efficiently.
e.g. pattern matching, biological sequence
alignment (OASIS, MUMer, REPuter, etc.)
3
Suffix Tree
String: ATTAGTACA$
Suffixes:
0 1 2 3 4 5 6 7 8 9
S0: ATTAGTACA$
S1: TTAGTACA$
S5: TACA$
S6: ACA$
S7: CA$
S8: A$
A
$
A
C
G
T
G
C
A
$
T
A
G
T
A
T
A
C
A
A
$
$
C G
A
T
$
A
C
A
$
A
C
S4: GTACA$
T
T
T
A
G
T
A
C
A
$
S3: AGTACA$
$
A
S2: TAGTACA$
C
A
$
$
S9: $
Storage requirement of a suffix tree is O(n).
4
Existing Algorithms
In-Memory Algorithms
„
Quadratic algorithms
¾
¾
¾
„
Brute-force -- Insert a suffix at a time
WOTD -- good locality of reference
Worst case O(n2), average O(nlogn)
Linear time algorithms
¾
¾
¾
¾
Best-known -- Ukkonen, McCreight, Weiner
Use suffix links
Poor locality of memory reference
Only suited for small input size
5
Existing Algorithms
On-Disk Algorithms
„
Practical Algorithms
¾
TOP-Q
™
¾
Hunt’s algorithm -- the best known practical disk-based
algorithm
™
™
„
Buffering strategy to improve the performance of Ukkonen’s
algorithm
Construct different parts of the suffix tree independently
Use Brute-Force -- random access to the tree
Theoretical Algorithms
¾
¾
Reduce complexity to sorting, but tricky to implement
Build a suffix array from which a suffix tree is built
™
SKEW -- linear suffix array construction algorithm
6
Existing Algorithms
Human Chromosome 1 (227MB)
25
20
15
hrs
10
5
0
Ukkonen
Hunt
TDD
>1 day
~1.5hrs
~17mins
7
Motivation
Many sequence datasets are growing at
exponential rates. (e.g. GenBank
x2/16mo.)
„ Existing techniques can only deal with
small trees, for big datasets they are
impractical.
„
8
Roadmap
„ Background
„
„
& Motivation
Our Approaches
¾
Top-Down Disk-Based Algorithm (TDD)
¾
Merge-Based Algorithm (ST-Merge)
Experiments and Analysis
9
Suffix Tree Representation
String: ATTAGTACA$
Suffixes:
Alphabet Order: A<T<C<G<$
0 1 2 3 4 5 6 7 8 9
$
A
S4: GTACA$
S5: TACA$
S6: ACA$
A
$
S7: CA$
A
C
C
A
$
T
A
G
T
A
T
G
T
A
S8: A$
C
A
S9: $
4
9R
$
3
A
8
9
10
2R
7
4R
1
7
4
9R
GTACA$
7
7
CA$
T
6
10
$
TTAGTACA$
7
5
$
GTACA$
A
1
4
GTACA$
12
3
CA$
0
2
A
CA$
1
C
TAGTACA$
0
A
$
C G
A
T
$
A
C
A
$
S3: AGTACA$
G
T
T
T
A
G
T
A
C
A
$
S2: TAGTACA$
$
S1: TTAGTACA$
A
C
S0: ATTAGTACA$
11
12
13
14
15
$
10
Top-Down Disk-Based Algorithm
(TDD)
„
Leverage partitioning to make use of mainmemory efficiently and reduce disk I/Os
¾
„
Minimize random references and try to use
sequential access
¾
„
Use Hunt’s partitioning strategy
Based on WOTD -- good locality behavior concerning
tree access
Manage buffers for large structures so that
disk I/Os are reduced
11
Data Structures
ATTAGTACA$
String
0
12
1
7
7
Tree (8.5x)
S0: ATTAGTACA$
S0: ATTAGTACA$
S3: AGTACA$
S1: TTAGTACA$
S6: ACA$
S2: TAGTACA$
S8: A$
S3: AGTACA$
S1: TTAGTACA$
S4: GTACA$
S2: TAGTACA$
S5: TACA$
S5: TACA$
S6: ACA$
S7: CA$
S7: CA$
S4: GTACA$
S8: A$
S9: $
S9: $
Suffixes (4x)
…
Temp (4x)
12
TDD Execution
Phase 1: partitioning the suffixes of the input string into |A|prefixlen
partitions
S0: ATTAGTACA$
S0: ATTAGTACA$
S1: TTAGTACA$
S3: AGTACA$
S2: TAGTACA$
prefixlen=1
S6: ACA$
S3: AGTACA$
S8: A$
S4: GTACA$
S1: TTAGTACA$
S5: TACA$
S2: TAGTACA$
S6: ACA$
S5: TACA$
S7: CA$
S7: CA$
S8: A$
S4: GTACA$
S9: $
S9: $
Time Complexity O(n*prefixlen)
13
TDD Execution
Phase 2
String: ATTAGTACA$
Longest Common Prefix
S1: TTAGTACA$
LCP=T
S3: AGTACA$
S5: TACA$
S6: ACA$
A
LCP=A
T
C
G
T
A
$
S4: GTACA$
S6: ACA$
S7: CA$
S2: TAGTACA$
A
C
A
$
0
1
2
3
4
5
6
1
2
3
5
2R
7
4R
GTACA$
A
A
S3: AGTACA$
CA$
G
T
S3,
S6S5
S1,
S2,
TAGTACA$
$
C
S2: TAGTACA$
S2: TAGTACA$
T
A
0 1 2 3 4 5 6 7 8 9
T
A
Average Time Complexity: O(nlog|A|n)
S7: CA$
S4: GTACA$
14
Buffer Management
Replacement Policy
Data
Structure
String
Access Pattern Replacement
Policy
Random
Random
Suffixes
Less Random
Random/LRU
Temp
Sequential
MRU
Tree
Mostly
Sequential
LRU
15
Buffer Management
Buffer Allocation
String
Suffix
Temp
Tree
Dataset Size
Small Dataset
Medium Dataset
Large Dataset
0%
20%
40%
60%
80%
100%
Memory allocated (%)
16
Merge-Based Algorithm
(ST-Merge)
When string size is larger than main
memory, the performance of TDD
degrades rapidly.
„ Reduce random accesses to large string
by partitioning suffixes so that each
partition contains adjacent suffixes.
„
17
ST-Merge Algorithm
Suffixes
S0,S1,S2……………………………………………………… …………………….. Sn-1,Sn
Phase 1
Use TDD
without
partitioning
Merge
Phase 2
Merged
2 subroutines:
NodeMerge,EdgeMerge
(stack-based recursive
routines)
Tree
18
A
T
T
A
G
T
A
C
A
$
G
A
T
G
A
T
C
A
A
C
$
A
$
S9: $
S8: A$
S7: CA$
S6: ACA$
S5: TACA$
S4: GTACA$
ST1
S3: AGTACA$
0 1 2 3 4 5 6 7 8 9
S2: TAGTACA$
String: ATTAGTACA$
S1: TTAGTACA$
S0: ATTAGTACA$
Phase 1
ST2
$
A
T
G
T
A
T
C
A
A
G
$
T
A
C
A
$
$
A
C
T
A
C
A
$
$
C
A
$
0
1
2
3
4
5
6
7
8
0
1
2
3
4
5
6
0
7
1
5
4R
3
2R
1
4R
6
5
5
7
9R
7
9R
Access to String:
ATTAGTACA$
ATTAGTACA$
Better locality of reference to the string!
19
String: ATTAGTACA$
0
1
2
3
4
ST1: 0
7
1
5
4R
T
A
A
C
A
G
$
T
A
C
A
$
$
5
6
7
8
3
2R
1
4R
$
C
A
$
C
A
$
0
1
2
3
4
5
6
ST2: 6
5
5
7
9R
7
9R
ST2[0,1]: A
T
T
A
T
A
C
$ A
$
ST1[0,1]: A
G
A
$
C
A
0 1 2 3 4 5 6 7 8 9
G
T
G A
T
G
A
T
C
A
A
C
$
A
$
ST1[2,3]: T
A
T
T
A
G
T
A
C
A
$
T
ST2[2] : TACA$
$
A
A
C
Phase 2
ST2[3] : CA$
ST1[4] : GTACA$
ST2[4] : $
Edge_Merge
ST1[2], ST2[2]
Edge_Merge
Node_Merge
0
1
0
2
1
3
4
7
5
4
6
9R
7
8
9
10
11
12
13
14
15
ST1[0], ST2[0]
20
String: ATTAGTACA$
1
2
3
4
ST1: 0
7
1
5
4R
T
A
A
C
A
G
$
T
A
C
A
$
$
5
6
7
8
3
2R
1
4R
$
C
A
$
$
A
C
A
$
A
$
0
1
2
3
4
5
6
ST2: 6
5
5
7
9R
7
9R
ST1[2,3]: T
T
T
C
LCP=T
G
A
A
$
C
T
ST2[2] : TACA$
A
0
0 1 2 3 4 5 6 7 8 9
G
T
G A
T
G
A
T
C
A
A
C
$
A
$
A
$
T
T
A
G
T
A
C
A
$
T
A
C
A
Node_Merge
Edge_Merge
ST1[5],ST2[2]:1
ST1[2], ST2[2]
Edge_Merge
0
1
0
2
1
3
4
7
5
4
6
9R
7
8
9
10
11
12
13
14
15
ST1[0], ST2[0]
21
String: ATTAGTACA$
1
2
3
4
ST1: 0
7
1
5
4R
T
A
A
C
A
G
$
T
A
C
A
$
$
5
6
7
8
3
2R
1
4R
$
A
C
A
$
T
A
G
T
A
$
A
C
A
$
A
$
1
2
3
4
5
6
ST2: 6
5
5
7
9R
7
9R
: AGTACA$
ST2[2](1): ACA$
T
T
C
0
ST1[5]
G
A
A
$
C
T
ST1[6]
: TAGTACA$
A
0
0 1 2 3 4 5 6 7 8 9
G
T
G A
T
G
A
T
C
A
A
C
$
A
$
A
$
T
T
A
G
T
A
C
A
$
T
A
C
A
C
A
Edge_Merge
Node_Merge
ST1[5]:1,ST2[2]:
ST1[5], 2ST2[2]:1
$
Edge_Merge
0
1
0
2
1
3
7
4
7
5
4
6
9R
7
3
8
9
2R
10
11
12
13
14
15
ST1[0], ST2[0]
22
String: ATTAGTACA$
0 1 2 3 4 5 6 7 8 9
G
T
T
A
A
C
A
G
$
T
A
C
A
$
0
1
2
3
4
ST1: 0
7
1
5
4R
$
5
6
7
8
3
2R
1
4R
0
1
0
2
1
3
7
4
7
5
4
A
C
A
$
A
$
0
1
2
3
4
5
6
ST2: 6
5
5
7
9R
7
9R
T
C
A
$
T
A
G
T
A
A
$
$
G
T
C
C
$
A
A
A
$
C
T
G
T
A
6
9R
C
A
A
G A
T
G
A
T
C
A
A
C
$
A
$
A
$
T
T
A
G
T
A
C
A
$
T
A
C
A
C
A
Edge_Merge
$
ST1[5]:1,ST2[2]:
2
$
7
3
Edge_Merge
8
9
10
2R
10
7
11
4R
12
13
14
15
ST1[0], ST2[0]
23
String: ATTAGTACA$
0 1 2 3 4 5 6 7 8 9
G
T
T
A
A
C
A
G
$
T
A
C
A
$
0
1
2
3
4
ST1: 0
7
1
5
4R
$
5
6
7
8
3
2R
1
4R
1
0
12
2
1
3
7
4
7
5
4
A
C
A
$
A
$
0
1
2
3
4
5
6
ST2: 6
5
5
7
9R
7
9R
G
A
$
0
C
A
$
T
A
G
T
A
$
C
$
T
T
T
A
G
T
A
C
A
$
T
A
C
$
A
C G
A
T
$
A
C
A
$
A
$
C
T
G
T
A
6
9R
C
A
A
G A
T
G
A
T
C
A
A
C
$
A
$
A
$
T
T
A
G
T
A
C
A
$
T
A
C
A
C
A
$
$
7
3
Edge_Merge
8
9
10
2R
10
7
11
4R
12
1
13
14
7
4
15
9R
ST1[0], ST2[0]
24
Analysis of ST-Merge
„
Average Time Complexity
phase 1: O(nlog|A|(n/k))
phase 2: O(n2)
„
Further Minimize Random Accesses
phase 1: access to the string focuses on a small portion of the string
for each partition
phase 2: input trees and output trees are mostly sequentially
accessed; access to the string shows more locality
„
„
When the string size is significantly larger then main
memory, ST-Merge performs better than TDD.
When the string size is smaller than main memory, STMerge reduces to TDD.
25
Roadmap
„ Background
„
„
& Motivation
Our Approaches
¾
Top-Down Disk-Based Algorithm (TDD)
¾
Merge-Based Algorithm (ST-Merge)
Experiments and Analysis
26
Experimental Setup
„
Platform
Intel Pentium 4 2.8GHZ, 2 GB Main Memory, Maxtor
Atlas 10K IV SCSI Disk
„
Use raw device
Eliminate the effect of operating system buffering
27
Performance Comparison
Disk-based Construction
Data
Source
Symbols
(106)
Hunt
(min)
TDD
(min)
Speedup
Trembl
(Protein)
338
236.7
32.0
7.4
H.Chr1
(DNA)
227
97.50
17.83
5.5
Guten
(English)
407
463.3
46.67
9.9
HG
(DNA)
3,000
N/A
30hrs
N/A
TDD is 5-10 times faster!
28
TDD vs. ST-Merge
8000
ƒLimit the main
memory size to
6MB
ST−Merge
TDD Time
7000
Time (min)
6000
5000
ƒUniformly
distributed DNA
sequence data
from 10MB-80MB
4000
3000
2000
1000
0
0
20
40
60
80
100
String Size (MB)
29
Conclusion
„
„
„
„
„
Constructing large persistent suffix trees using
existing algorithms is impractical
We proposed 2 algorithms: TDD & ST-Merge
When string size is roughly same as the
memory size, TDD is faster than the best
known on-disk construction algorithm by 4x10x
ST-Merge beats TDD when the string size is
significantly larger (x3 or more) than main
memory
Using TDD and ST-Merge, large scale suffix
tree construction is now practical
30
Thanks!
Questions?
31
Buffer Allocation
14000
String Buffer
Suffixes Buffer
Temp Buffer
Tree Buffer
Disk Accesses
12000
10000
8000
6000
4000
2000
0
0
20
40
60
80
100
Buffer Size (% of File Size)
32
Main Memory Data Sources
Symbols
(106)
20
Data Sources
Description
Dmelano
Swp20
D. Melanogaster Chr.2
(DNA)
Gutenberg Project, Year
1995 (English Text)
Slice of SwissProt (protein)
Unif4
4-char alphabet, uniform dist. 20
Unif40
40-char alphabet, uniform
dist.
Guten95
20
20
20
33
Main-memory construction:
Cycle Time Breakdown
All data sources:
20 MB
34
On-Disk Data Sources
Data Sources
Description
Symbols(106)
Swp
Entire UniProt/SwissProt (Protein)
53
H.Chr1-50
50 MB slice of Human
Chromosome -1 (DNA)
50
Guten03
2003 Directory of Gutenberg
Project (English Text)
58
Trembl
TrEMBL (Protein)
338
H.Chr1
Entire Human Chromosome-1
(DNA)
227
Guten
Entire Gutenberg Collection
(English Text)
407
HG
Entire Human Genome (DNA)
3,000
35
On-Disk Performance Comparison
Data
Source
Symbols
(106)
Swp
53
H.Chr1-50
50
Guten03
58
Trembl
338
H.Chr1
227
Guten
407
HG
3,000
Hunt
(min)
13.95
11.47
22.5
263.7
97.50
463.3
N/A
TDD
(min)
2.78
2.02
6.03
32.0
17.83
46.67
30hrs
Speedup
5.0
5.7
3.7
7.4
5.5
9.9
N/A
Skew
(min)
3.88
3.21
3.94
>9hrs
>9hrs
>9hrs
N/A
36
Datasets for Buffer Management
Experiment
Data structure SwissProt (53MB) Human DNA(50MB)
(#pages)
(#pages)
String
6,250
6,250
Suffixes
1,250
6,250
Temp
1,250
6,250
Tree
4,100
16,200
Page size: 8KB, prefixlen=1
37
Buffer Size Simulation (DNA)
String Buffer
Suffix Buffer
38
Buffer Size Simulation (DNA)
Temp Buffer
Tree Buffer
39
Buffer Size Simulation (Protein)
String Buffer
Suffix Buffer
40
Buffer Size Simulation (Protein)
Temp Buffer
Tree Buffer
41
LCP Histogram
42
Brute-Force Algorithm
ATTAGT$
ATTAGT$
43
Brute-Force Algorithm
ATTAGT$
TT
T$
AG
AT
T
A
G
T
$
44
Brute-Force Algorithm
ATTAGT$
T
AT
T
A
G
T$
T$
AG
TA
G
T
$
45
Brute-Force Algorithm
ATTAGT$
A
$
G
T
$
T
AG
TA
TT
AG
T
$
GT
$
T
46
Brute-Force Algorithm
ATTAGT$
A
T
GT$
$
$
T
AG
T$
TAG
T
$
TT
A
G
T
G
47
Brute-Force Algorithm
ATTAGT$
A
T
GT$
AGT$
G
T
$
T$
TA
TT
AG
T
$
G
$
48
Brute-Force Algorithm
ATTAGT$
A
$
T
GT$
AGT$
G
T
$
T$
TA
TT
AG
T
$
G
$
49
TDD Architecture
50