Practical Methods for Constructing Suffix Trees Advisor: Jignesh M. Patel Yuanyuan Tian 1 Roadmap Background & Motivation Our Approaches ¾ Top-Down Disk-Based Algorithm (TDD) ¾ Merge-Based Algorithm (ST-Merge) Experiments and Analysis 2 Background Sequence data sets are ubiquitous in modern life science and traditional text applications, and querying sequences is a common and critical operation in these applications. Suffix trees are versatile data structures that can help execute a variety of sequence matching queries efficiently. e.g. pattern matching, biological sequence alignment (OASIS, MUMer, REPuter, etc.) 3 Suffix Tree String: ATTAGTACA$ Suffixes: 0 1 2 3 4 5 6 7 8 9 S0: ATTAGTACA$ S1: TTAGTACA$ S5: TACA$ S6: ACA$ S7: CA$ S8: A$ A $ A C G T G C A $ T A G T A T A C A A $ $ C G A T $ A C A $ A C S4: GTACA$ T T T A G T A C A $ S3: AGTACA$ $ A S2: TAGTACA$ C A $ $ S9: $ Storage requirement of a suffix tree is O(n). 4 Existing Algorithms In-Memory Algorithms Quadratic algorithms ¾ ¾ ¾ Brute-force -- Insert a suffix at a time WOTD -- good locality of reference Worst case O(n2), average O(nlogn) Linear time algorithms ¾ ¾ ¾ ¾ Best-known -- Ukkonen, McCreight, Weiner Use suffix links Poor locality of memory reference Only suited for small input size 5 Existing Algorithms On-Disk Algorithms Practical Algorithms ¾ TOP-Q ¾ Hunt’s algorithm -- the best known practical disk-based algorithm Buffering strategy to improve the performance of Ukkonen’s algorithm Construct different parts of the suffix tree independently Use Brute-Force -- random access to the tree Theoretical Algorithms ¾ ¾ Reduce complexity to sorting, but tricky to implement Build a suffix array from which a suffix tree is built SKEW -- linear suffix array construction algorithm 6 Existing Algorithms Human Chromosome 1 (227MB) 25 20 15 hrs 10 5 0 Ukkonen Hunt TDD >1 day ~1.5hrs ~17mins 7 Motivation Many sequence datasets are growing at exponential rates. (e.g. GenBank x2/16mo.) Existing techniques can only deal with small trees, for big datasets they are impractical. 8 Roadmap Background & Motivation Our Approaches ¾ Top-Down Disk-Based Algorithm (TDD) ¾ Merge-Based Algorithm (ST-Merge) Experiments and Analysis 9 Suffix Tree Representation String: ATTAGTACA$ Suffixes: Alphabet Order: A<T<C<G<$ 0 1 2 3 4 5 6 7 8 9 $ A S4: GTACA$ S5: TACA$ S6: ACA$ A $ S7: CA$ A C C A $ T A G T A T G T A S8: A$ C A S9: $ 4 9R $ 3 A 8 9 10 2R 7 4R 1 7 4 9R GTACA$ 7 7 CA$ T 6 10 $ TTAGTACA$ 7 5 $ GTACA$ A 1 4 GTACA$ 12 3 CA$ 0 2 A CA$ 1 C TAGTACA$ 0 A $ C G A T $ A C A $ S3: AGTACA$ G T T T A G T A C A $ S2: TAGTACA$ $ S1: TTAGTACA$ A C S0: ATTAGTACA$ 11 12 13 14 15 $ 10 Top-Down Disk-Based Algorithm (TDD) Leverage partitioning to make use of mainmemory efficiently and reduce disk I/Os ¾ Minimize random references and try to use sequential access ¾ Use Hunt’s partitioning strategy Based on WOTD -- good locality behavior concerning tree access Manage buffers for large structures so that disk I/Os are reduced 11 Data Structures ATTAGTACA$ String 0 12 1 7 7 Tree (8.5x) S0: ATTAGTACA$ S0: ATTAGTACA$ S3: AGTACA$ S1: TTAGTACA$ S6: ACA$ S2: TAGTACA$ S8: A$ S3: AGTACA$ S1: TTAGTACA$ S4: GTACA$ S2: TAGTACA$ S5: TACA$ S5: TACA$ S6: ACA$ S7: CA$ S7: CA$ S4: GTACA$ S8: A$ S9: $ S9: $ Suffixes (4x) … Temp (4x) 12 TDD Execution Phase 1: partitioning the suffixes of the input string into |A|prefixlen partitions S0: ATTAGTACA$ S0: ATTAGTACA$ S1: TTAGTACA$ S3: AGTACA$ S2: TAGTACA$ prefixlen=1 S6: ACA$ S3: AGTACA$ S8: A$ S4: GTACA$ S1: TTAGTACA$ S5: TACA$ S2: TAGTACA$ S6: ACA$ S5: TACA$ S7: CA$ S7: CA$ S8: A$ S4: GTACA$ S9: $ S9: $ Time Complexity O(n*prefixlen) 13 TDD Execution Phase 2 String: ATTAGTACA$ Longest Common Prefix S1: TTAGTACA$ LCP=T S3: AGTACA$ S5: TACA$ S6: ACA$ A LCP=A T C G T A $ S4: GTACA$ S6: ACA$ S7: CA$ S2: TAGTACA$ A C A $ 0 1 2 3 4 5 6 1 2 3 5 2R 7 4R GTACA$ A A S3: AGTACA$ CA$ G T S3, S6S5 S1, S2, TAGTACA$ $ C S2: TAGTACA$ S2: TAGTACA$ T A 0 1 2 3 4 5 6 7 8 9 T A Average Time Complexity: O(nlog|A|n) S7: CA$ S4: GTACA$ 14 Buffer Management Replacement Policy Data Structure String Access Pattern Replacement Policy Random Random Suffixes Less Random Random/LRU Temp Sequential MRU Tree Mostly Sequential LRU 15 Buffer Management Buffer Allocation String Suffix Temp Tree Dataset Size Small Dataset Medium Dataset Large Dataset 0% 20% 40% 60% 80% 100% Memory allocated (%) 16 Merge-Based Algorithm (ST-Merge) When string size is larger than main memory, the performance of TDD degrades rapidly. Reduce random accesses to large string by partitioning suffixes so that each partition contains adjacent suffixes. 17 ST-Merge Algorithm Suffixes S0,S1,S2……………………………………………………… …………………….. Sn-1,Sn Phase 1 Use TDD without partitioning Merge Phase 2 Merged 2 subroutines: NodeMerge,EdgeMerge (stack-based recursive routines) Tree 18 A T T A G T A C A $ G A T G A T C A A C $ A $ S9: $ S8: A$ S7: CA$ S6: ACA$ S5: TACA$ S4: GTACA$ ST1 S3: AGTACA$ 0 1 2 3 4 5 6 7 8 9 S2: TAGTACA$ String: ATTAGTACA$ S1: TTAGTACA$ S0: ATTAGTACA$ Phase 1 ST2 $ A T G T A T C A A G $ T A C A $ $ A C T A C A $ $ C A $ 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 0 7 1 5 4R 3 2R 1 4R 6 5 5 7 9R 7 9R Access to String: ATTAGTACA$ ATTAGTACA$ Better locality of reference to the string! 19 String: ATTAGTACA$ 0 1 2 3 4 ST1: 0 7 1 5 4R T A A C A G $ T A C A $ $ 5 6 7 8 3 2R 1 4R $ C A $ C A $ 0 1 2 3 4 5 6 ST2: 6 5 5 7 9R 7 9R ST2[0,1]: A T T A T A C $ A $ ST1[0,1]: A G A $ C A 0 1 2 3 4 5 6 7 8 9 G T G A T G A T C A A C $ A $ ST1[2,3]: T A T T A G T A C A $ T ST2[2] : TACA$ $ A A C Phase 2 ST2[3] : CA$ ST1[4] : GTACA$ ST2[4] : $ Edge_Merge ST1[2], ST2[2] Edge_Merge Node_Merge 0 1 0 2 1 3 4 7 5 4 6 9R 7 8 9 10 11 12 13 14 15 ST1[0], ST2[0] 20 String: ATTAGTACA$ 1 2 3 4 ST1: 0 7 1 5 4R T A A C A G $ T A C A $ $ 5 6 7 8 3 2R 1 4R $ C A $ $ A C A $ A $ 0 1 2 3 4 5 6 ST2: 6 5 5 7 9R 7 9R ST1[2,3]: T T T C LCP=T G A A $ C T ST2[2] : TACA$ A 0 0 1 2 3 4 5 6 7 8 9 G T G A T G A T C A A C $ A $ A $ T T A G T A C A $ T A C A Node_Merge Edge_Merge ST1[5],ST2[2]:1 ST1[2], ST2[2] Edge_Merge 0 1 0 2 1 3 4 7 5 4 6 9R 7 8 9 10 11 12 13 14 15 ST1[0], ST2[0] 21 String: ATTAGTACA$ 1 2 3 4 ST1: 0 7 1 5 4R T A A C A G $ T A C A $ $ 5 6 7 8 3 2R 1 4R $ A C A $ T A G T A $ A C A $ A $ 1 2 3 4 5 6 ST2: 6 5 5 7 9R 7 9R : AGTACA$ ST2[2](1): ACA$ T T C 0 ST1[5] G A A $ C T ST1[6] : TAGTACA$ A 0 0 1 2 3 4 5 6 7 8 9 G T G A T G A T C A A C $ A $ A $ T T A G T A C A $ T A C A C A Edge_Merge Node_Merge ST1[5]:1,ST2[2]: ST1[5], 2ST2[2]:1 $ Edge_Merge 0 1 0 2 1 3 7 4 7 5 4 6 9R 7 3 8 9 2R 10 11 12 13 14 15 ST1[0], ST2[0] 22 String: ATTAGTACA$ 0 1 2 3 4 5 6 7 8 9 G T T A A C A G $ T A C A $ 0 1 2 3 4 ST1: 0 7 1 5 4R $ 5 6 7 8 3 2R 1 4R 0 1 0 2 1 3 7 4 7 5 4 A C A $ A $ 0 1 2 3 4 5 6 ST2: 6 5 5 7 9R 7 9R T C A $ T A G T A A $ $ G T C C $ A A A $ C T G T A 6 9R C A A G A T G A T C A A C $ A $ A $ T T A G T A C A $ T A C A C A Edge_Merge $ ST1[5]:1,ST2[2]: 2 $ 7 3 Edge_Merge 8 9 10 2R 10 7 11 4R 12 13 14 15 ST1[0], ST2[0] 23 String: ATTAGTACA$ 0 1 2 3 4 5 6 7 8 9 G T T A A C A G $ T A C A $ 0 1 2 3 4 ST1: 0 7 1 5 4R $ 5 6 7 8 3 2R 1 4R 1 0 12 2 1 3 7 4 7 5 4 A C A $ A $ 0 1 2 3 4 5 6 ST2: 6 5 5 7 9R 7 9R G A $ 0 C A $ T A G T A $ C $ T T T A G T A C A $ T A C $ A C G A T $ A C A $ A $ C T G T A 6 9R C A A G A T G A T C A A C $ A $ A $ T T A G T A C A $ T A C A C A $ $ 7 3 Edge_Merge 8 9 10 2R 10 7 11 4R 12 1 13 14 7 4 15 9R ST1[0], ST2[0] 24 Analysis of ST-Merge Average Time Complexity phase 1: O(nlog|A|(n/k)) phase 2: O(n2) Further Minimize Random Accesses phase 1: access to the string focuses on a small portion of the string for each partition phase 2: input trees and output trees are mostly sequentially accessed; access to the string shows more locality When the string size is significantly larger then main memory, ST-Merge performs better than TDD. When the string size is smaller than main memory, STMerge reduces to TDD. 25 Roadmap Background & Motivation Our Approaches ¾ Top-Down Disk-Based Algorithm (TDD) ¾ Merge-Based Algorithm (ST-Merge) Experiments and Analysis 26 Experimental Setup Platform Intel Pentium 4 2.8GHZ, 2 GB Main Memory, Maxtor Atlas 10K IV SCSI Disk Use raw device Eliminate the effect of operating system buffering 27 Performance Comparison Disk-based Construction Data Source Symbols (106) Hunt (min) TDD (min) Speedup Trembl (Protein) 338 236.7 32.0 7.4 H.Chr1 (DNA) 227 97.50 17.83 5.5 Guten (English) 407 463.3 46.67 9.9 HG (DNA) 3,000 N/A 30hrs N/A TDD is 5-10 times faster! 28 TDD vs. ST-Merge 8000 Limit the main memory size to 6MB ST−Merge TDD Time 7000 Time (min) 6000 5000 Uniformly distributed DNA sequence data from 10MB-80MB 4000 3000 2000 1000 0 0 20 40 60 80 100 String Size (MB) 29 Conclusion Constructing large persistent suffix trees using existing algorithms is impractical We proposed 2 algorithms: TDD & ST-Merge When string size is roughly same as the memory size, TDD is faster than the best known on-disk construction algorithm by 4x10x ST-Merge beats TDD when the string size is significantly larger (x3 or more) than main memory Using TDD and ST-Merge, large scale suffix tree construction is now practical 30 Thanks! Questions? 31 Buffer Allocation 14000 String Buffer Suffixes Buffer Temp Buffer Tree Buffer Disk Accesses 12000 10000 8000 6000 4000 2000 0 0 20 40 60 80 100 Buffer Size (% of File Size) 32 Main Memory Data Sources Symbols (106) 20 Data Sources Description Dmelano Swp20 D. Melanogaster Chr.2 (DNA) Gutenberg Project, Year 1995 (English Text) Slice of SwissProt (protein) Unif4 4-char alphabet, uniform dist. 20 Unif40 40-char alphabet, uniform dist. Guten95 20 20 20 33 Main-memory construction: Cycle Time Breakdown All data sources: 20 MB 34 On-Disk Data Sources Data Sources Description Symbols(106) Swp Entire UniProt/SwissProt (Protein) 53 H.Chr1-50 50 MB slice of Human Chromosome -1 (DNA) 50 Guten03 2003 Directory of Gutenberg Project (English Text) 58 Trembl TrEMBL (Protein) 338 H.Chr1 Entire Human Chromosome-1 (DNA) 227 Guten Entire Gutenberg Collection (English Text) 407 HG Entire Human Genome (DNA) 3,000 35 On-Disk Performance Comparison Data Source Symbols (106) Swp 53 H.Chr1-50 50 Guten03 58 Trembl 338 H.Chr1 227 Guten 407 HG 3,000 Hunt (min) 13.95 11.47 22.5 263.7 97.50 463.3 N/A TDD (min) 2.78 2.02 6.03 32.0 17.83 46.67 30hrs Speedup 5.0 5.7 3.7 7.4 5.5 9.9 N/A Skew (min) 3.88 3.21 3.94 >9hrs >9hrs >9hrs N/A 36 Datasets for Buffer Management Experiment Data structure SwissProt (53MB) Human DNA(50MB) (#pages) (#pages) String 6,250 6,250 Suffixes 1,250 6,250 Temp 1,250 6,250 Tree 4,100 16,200 Page size: 8KB, prefixlen=1 37 Buffer Size Simulation (DNA) String Buffer Suffix Buffer 38 Buffer Size Simulation (DNA) Temp Buffer Tree Buffer 39 Buffer Size Simulation (Protein) String Buffer Suffix Buffer 40 Buffer Size Simulation (Protein) Temp Buffer Tree Buffer 41 LCP Histogram 42 Brute-Force Algorithm ATTAGT$ ATTAGT$ 43 Brute-Force Algorithm ATTAGT$ TT T$ AG AT T A G T $ 44 Brute-Force Algorithm ATTAGT$ T AT T A G T$ T$ AG TA G T $ 45 Brute-Force Algorithm ATTAGT$ A $ G T $ T AG TA TT AG T $ GT $ T 46 Brute-Force Algorithm ATTAGT$ A T GT$ $ $ T AG T$ TAG T $ TT A G T G 47 Brute-Force Algorithm ATTAGT$ A T GT$ AGT$ G T $ T$ TA TT AG T $ G $ 48 Brute-Force Algorithm ATTAGT$ A $ T GT$ AGT$ G T $ T$ TA TT AG T $ G $ 49 TDD Architecture 50