Suf Þ X T Rees And Their Applications

2y ago
43 Views
2 Downloads
519.27 KB
35 Pages
Last View : 1d ago
Last Download : 2m ago
Upload by : Wren Viola
Transcription

Suffix Trees and their ApplicationsMoritz MaaßSeptember 13, 1999Technische Universität MünchenFakultät für InformatikAbstractSuffix trees have many different applications and have been studied extensively.This paper gives an overview of suffix trees and their construction and studies twoapplications of suffix trees.

CONTENTS2Contents1Introduction2Data Structures2.1 Notation . . . . . . . . . . . . . . . . . .2.2-Trees and suffix trees. . . . . . . . .2.3 Suffix Links and Duality of Suffix Trees2.4 Space-Requirements of Suffix Trees . . .3.444783McCreight’s Algorithm3.1 The Underlying Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.3 Complexity of mcc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9910114Ukkonen’s Algorithm4.1 On-Line Construction of Atomic Suffix Trees . . . . . . . . . .4.2 From ast to ukk: On-Line Construction of Compact Suffix Trees4.3 Complexity of ukk . . . . . . . . . . . . . . . . . . . . . . . . . .4.4 Relationship between ukk and mcc . . . . . . . . . . . . . . . .13131417175Problems in the Implementation of Suffix Trees5.1 Taking the Alphabet Size into Account . . . . . . . . . . . . . . . . . .5.2 More Precise Storage Considerations . . . . . . . . . . . . . . . . . . .1919196Suffix Trees and the Assembly of Strings6.1 The Superstring Problem . . . . . . . . . . . . . . . . . . . . . . . . . .6.2 A G REEDY-Heuristic with Suffix Trees . . . . . . . . . . . . . . . . . .2020207Suffix Trees and Data Compression7.1 Simple Data Compression . . . . . . . . . . . . . . . . . . . . . . . . .7.2 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2323258Other Applications and Alternatives8.1 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8.2 Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2626269Conclusion and Prospects27.A Data Compression Figures28B Suffix Tree Construction Steps31

31 IntroductionSuffix trees are very useful for many string processing problems. They are datastructures that are built up from strings and “really turn” the string “inside out”[GK97]. This way they provide answer to a lot of important questions in linear time(e.g. “Does text contain a word ?” in, the length of the word). There aremany algorithms that can build a suffix tree in linear time.The first algorithm was published by Weiner in 1973 [Wei73]. It reads a stringfrom right to left and successively inserts suffixes beginning with the shortest suffix,which leads to a very complex algorithm. Following Giegerich and Kurtz [GK97]the algorithm “has no practical value [.], but remains a true historic monument inthe area of string processing.” I will therefore not bother with a deeper study of thisearly algorithm.Shortly after Weiner McCreight published his “algorithm M” in [McC76]. The algorithm is explained in detail in section 3.A newer idea and a different intuition lead to an on-line algorithm by Ukkonen[Ukk95] that turns out to perform the same abstract operations as McCreight’s algorithm but with a different control structure, which leads to the on-line ability butalso to a slightly weaker performance [GK97]. The algorithm is explained in detailin section 4.I will begin with the explanation of the involved data structures (2). I will mostlyuse the notation and definitions of [GK97]. The next sections will explain the twoalgorithms (3),(4) and make some remarks on the implementation (5).At the end I will present two basic applications of suffix trees, string assembling(6), and data compression (7), and present some available alternatives to as well assome other applications for suffix trees (8).I will finish with a short conclusion (9).

2 DATA STRUCTURES42 Data Structures2.1 ftbranchingLet be a non empty set of characters, the alphabet. A (possibly empty) sequence ofwill denote the reversed stringcharacters from will be denoted like , , and .of . A single letter will be shown like , , or . is the empty string. The actualcharacters from will be denoted as , ,.For now the size of will be assumed constant and independently bound of thelength of any input string encountered.Letdenote the length of a string . Letbe all strings with length ,, and.A prefix of a string is a string such thatfor a (possibly empty) string . I. The prefix is called proper ifis not zero.will writeA suffix of a string is a string such thatfor a (possibly empty) string . Iwill write. The suffix is called proper ifis not zero.Similar, a factor of a string is a substring of string such thatfor (possiblyempty) strings and . I will write.andA factor of a string is called right-branching if can be decomposed asfor some strings , , , and , and where. A left-branching factor isdefined analogous.2.2-Trees and suffix trees.Definition 1. ( -tree) A-tree is a tree with aand edge labels from. Foreach, every node in has at most one edge, whose label starts with . For an edgefrom to with label we will write.!"# %&root'(""!!!!""!!!!"""!!!#!!!!-./0)* ,-./0)* ,-./0)* ,3 &1 2# % &&&&## "%&"# # %'-./0)* ,-./0-./0)* ,-./0)* ,-./0)* ,-./0)* ,568974 ( )* ,(('(')56781234 ?@9:; 1011Figure 1: Example of alocationsubtree-treeIf is a-tree with the node then we letbe the string that is the concatenation of all edge labels from theto . In the example in figure 1 the paths ofand. We will call the locationnode and node are.of , that iswe can denote a node. In theSince every branch is unique, ifrefers to node . We let the subtree at node be.example

2.2-Trees and suffix trees.5The words that are represented in a-tree are given by the set. A wordis inif and only if there is a such thatis a node in .If a string is insuch thatand is a node in we will callthe reference pair of with respect to . If is the longest prefix such thatthe canonical reference pair. We will then writeis a reference pair, we will call.In our exampleis a reference pair forandis the canonicalreference pair.A locationis called explicit ifand implicit otherwise.!"# %&'(!! root"""!!!""!!!""#!!!!!!!-./0)* ,-./0)* ,-./0)* ,3 &21 # # % &&&&# " %&# #' %-./0)* ,-./0)* ,-./0)* ,-./0)* ,-./0)* ,56894((('(')56781234 ?@9:; 1011Figure 2: The compactcanonicalreferencepairexplicitimplicit-tree for figure 1Definition 2. (Atomic and Compact-tree) A-tree where every edge label consistsonly of a single character is called atomic (every location is explicit).A-tree where every node is either, a leaf or a branching node is called compact.!"root'(# %&**))**)**)))))**)))* ")))*)-./0)* ,-./0)* ,-./0)* ,321 &,& ,, &&- . &.,, && ./&&, ".-'-./0)* ,56781234-./0)* ,56781234-./0)* ,56781234685.17.19.14// 00000/0//""1"56781234 ?@9:; 5678123456781234567812349.2107.211 5.2"123456785.3Figure 3: The atomicreferencepair-tree for figure 1Atomic-trees are also known as “tries” [Ukk95, UW93]. Atomic and compact-trees are uniquely determined by the words they contain.atomiccompact

2 DATA STRUCTURES6Figure 2 and figure 3 show the compact and the atomicof figure 1.suffix treereverseprefix treenestedsuffixactivesuffix-tree for the example tree-tree such thatDefinition 3. (Suffix Tree) A suffix tree for a string is ais a factor of . For a string the atomic suffix tree will be denoted bycompact suffix tree will be denoted by., the.The reverse prefix tree of a string is the suffix tree ofA nested suffix of is a suffix that appears also somewhere else in . The longestnested suffix is called the active suffix of .Lemma 1. (Explicit locations in the compact suffix tree) A locationcompact suffix treeif and only if1.is a non nested suffix of or2.is right branching.is explicit in the(inProof. “ ”: If is explicit, it can either be a leaf or a branching node orwhich caseand is a nested suffix of ).If is a leaf it is also a suffix of . Then it must also be a non nested suffix, since if it. The node canwere nested, it would appear somewhere else in :then not be a leaf.If is a branching node then there must exist at least two outgoing edges fromwith different labels. This means that there must exist two different suffixes , ofwithand, whereandwith. Hence w isright branching.“ ”: If is a non nested suffix of , it must be a leaf. If is right branching thereare two suffixes , of withand, where, and is abranching node.openedgessuffix linkNow it is easy to see, why the decision whether a word occurs in string can bedone in, since one just has to check if is an (implicit) location in.The edge labels must be represented by pointers into the string to keep the size ofthe compact suffix tree to(see section 2.4). The edge labeldenotes thesubstringof or the empty string if.Ukkonen [Ukk95] introduced so called open edges for leaf edges (edges that leadto a leaf). Since all leaves of a suffix tree extend to the end of the string, an openinstead of, where is always the maximal length,edge is denoted bythat is . This way all leaves extend automatically to the right, when the string isextended.-tree. Letbe a node inDefinition 4. (Suffix Links) Let be alongest suffix of , such that is a node in . An (unlabeled) edge fromit is called atomic.link. IfProposition 1. Inand, whereand let be theto is a suffix, all suffix links are atomic.

2.3 Suffix Links and Duality of Suffix Treessentinelcharacter7Proof. The character is called a sentinel character. The first part follows from thedefinition, since all locations are explicit. To prove the second proposition we must, is also a node in. Ifis a node in, itshow that for each nodemust be either a leaf or a branching node. If it is a leaf, must be a non nested suffix,of . Because of the sentinel character, following lemma 1 all suffixes (includingthe empty suffix) are explicit, since onlyis a nested suffix. Therefore is also aleaf or. Ifis a branching node, thenis right branching and so is . Henceis explicit by lemma 1.As follows from this proof, the sentinel character guarantees the existence of leavesfor all suffixes. With the sentinel there can be no nested suffixes but the emptysuffix of . If we drop the sentinel character some suffixes might be nested and theirlocations become implicit.2.3 Suffix Links and Duality of Suffix TreesThe suffix links of a suffix treetree bylabeledform a-tree by themselves. We will denote this. It has a nodefor each node in and an edge fromfor each suffix link fromto in .Proposition 2.is ato-tree.Proof. By contradiction: Suppose there is a nodein the suffix link treethathas two -edges. Then there must be two suffix links in the suffix tree fromto and fromto . Hereandbecause ifis explicit, the suffixlinks would point there instead of to .orare inner nodes, thenoris right branching in . But thenIfmust also be right branching and be an explicit location, which is a contradiction.andare leaves,andmust be suffixes of and soIforin which case the suffix link frommust point toor viceversa.Traversing the suffix link chain from toyields a path that is a factor of.Therefore the suffix link treecontains a subset of all words of the suffix tree of.Proposition 3.Proof. All suffix links ofdeduce:is an edge inare nodesis an edgeandinare atomic by proposition 1. Therefore we can simply, iff, iff there are nodesin.is a suffix link inandin, iff there, iff thereFor the compact suffix tree there is the weaker duality that is proven in [GK97].Proposition 4.

2 DATA STRUCTURES8Proof. See [GK97], proposition 2.12 (3)The further exploitation of this duality leads to affix trees that are studied in detailin [Sto95]. Unfortunately the aim of constructing and proving anon-line algorithm for constructing an affix tree is not achieved by Stoye.2.4 Space-Requirements of Suffix TreesProposition 5. Compact suffix trees can be represented inspace.Proof. A suffix tree contains at most one leaf per suffix (exactly one with the sentinelcharacter). Every internal node must be a branching node, hence every internalnode has at least two children. Each branch increases the number of leaves by atleast one, we therefore have at most internal nodes and at most leaves.To represent the string labels of the edges we use indices into the original string asdescribed above. Each node has at most one parent and so the total number of edgesdoes not exceed .Similar each node has at most one suffix link, so the total number of suffix links isalso restricted by .As an example of a suffix tree withnodes consider the tree for.The size of the atomic suffix tree for a string is. (For an example see theatomic suffix tree for, which hasnodes.)affix trees

93 McCreight’s Algorithm3.1 The Underlying IdeaI will first try to give the intuition and then describe McCreight’s algorithm (mcc) infurther detail.mcc starts with an empty tree and inserts suffixes starting with the longest one.(Therefore mcc is not an on-line algorithm.)mcc requires the existence of the sentinel character so that no suffix is the prefix ofanother suffix and there will always be a terminal node per suffix.For the algorithm we will defineto be the current suffix (in step ), andto be the longest prefix of, which is also a prefix of another suffix, where.is defined as.The key idea to mcc is that, since we insert suffixes of decreasing lengths, there is aandthat can be used:relation betweenLemma 2. Iffor letterand (possibly empty) string. Then there is aand, henceProof. Suppose. ThenWe know the location ofmove to the location ofmight not be explicit (iflink might not yet be set for, thenso that.andand if we had the suffix links, we could easilywithout having to find our way from. Butwasn’t explicit in the previous step) and the suffix.!"root'(# %&unvisited pathunvisited path22 34 5followed)* ,-./05 5suffix link 3 )* ,a4 3 2 -./0d116// /711rescanned path//11// path, followed up476/0/8 )* ,: 9-./0new suffixlink2e92-./0)* ,b 6 7 8 9 : 286scanned path5EFGHABCDf.:-./0)* ,c;-./0)* ,gFigure 4: Suffix Link Usage by mccThe solution found be McCreight is to findby two steps of “rescanning” anduntil we find a suffix link, follow“scanning”. We just walk up the tree fromit and then rescan the path we walked up back down to the location of (which is

3 MCCREIGHT’S ALGORITHM10easy because we know the length of and that its location exists, so we don’t haveto read the complete edge labels moving down the tree, but we can just check thestart letter and the length).Figure 4 shows this idea. Instead of trying to find the path fromto node , thealgorithm moves up to , takes the suffix link to , rescans its way to the (possiblyimplicit) location , and only has to find the way to letter by letter.3.2 The AlgorithmThe algorithm has three phases. First it determines the structure of the old header,finds the next available suffix link and follows it. Secondly it rescans the part of theprevious header for which the length is known (called ). Thirdly it sets the suffixlink for, scans the rest of(called ) and inserts a new leaf for.Abranching node is constructed in the second phase of rescanning, if the location ofdoes not exist. In this case no scanning is needed, because ifwere longer,would be right branching, but because of lemma 2is also rightthanbranching, so a node must already exist at. A node is constructed in the thirdis not yet explicit.phase, if the locationAlgorithm mccis the given string.1:;2:;3:;4: forto do5:find , , such thata.,b. if the parent ofis not root, it isc.and.6:ifthen7:follow the suffix link fromto ;8:end if9:;10:set suffix link fromto;11:;12:add leaf for;13: end for, otherwise,thenand is known equally fast as by taking the suffixNote that iflink in line 7.finds the location of. Ifis not explicit yet, a new node isProcedureinserted. This case only occurs when the complete head is already scanned: If themust be the prefix of more thanhead is longer (and the node already exists),is only explicit, if it is a branchingtwo suffixes and also left-branching in . Thewere not left-branching thenmust have been longernode already, and ifbecause it would have met the longer prefix.

3.3 Complexity of mccProcedureis a node and1:;2: whiledo3:find edge4:if5:11a string.withthen;6:split with new node7:return ;8:end if9:;10:;11: end while12: return ;Proceduretion.;and edgesandsearches deeper into the tree until it falls out and returns that posi-Procedureis a node and a string.1:;2: whilewithdo3:;4:whileanddo5:;6:;7:end while8:ifthen9:10:else11:split with new node12:return13:end if14: end while15: returnand edgesand3.3 Complexity of mccProposition 6. mcc takes time. Algorithm mcc’s main loop exeProof. Let be our underlying string andcutes times, each step except forin line 8 andin line 10 takes constanttime.takes time proportional to the number of nodesit visits. The scanningof the current suffix(). Because everytakes place in a suffixnode encountered in rescanning already has a suffix link, there will be a non-empty

3 MCCREIGHT’S ALGORITHM12string(the edges label) that is inbut not infor each node. Therefore. Henceand all calls totake time.cannot simply skip from node to node, but the characters in between are. Thenis the number of characlooked at one by one. Letters scanned. By definition of and. Hence the totalnumber of characters scanned in all calls tois.

134 Ukkonen’s AlgorithmUkkonen developed hissuffix tree algorithm from a different (and more intuitive) idea. He was working on string matching automata and his algorithm isderived from one that constructs the atomic suffix tree (sometimes referred to as“trie”). I will therefore shortly describe that algorithm and then derive Ukkonen’salgorithm.4.1 On-Line Construction of Atomic Suffix TreesAs shown in section 2.4 the atomic suffix tree has size. Therefore any algorithm needs at least that time to build the atomic suffix tree. The following algorithmlengthens the suffixes by one letter in each step. Every intermediate tree after stepsis the atomic suffix tree for. The algorithm starts inserting new leaves atthe longest suffix, and follows the suffix links to the shorter suffixes until it reachesa node, where the corresponding edge already exists. To ensure termination of thisloop an extra node, the “negative” position of all characters, is introduced. Wehave a suffix link fromtoand a labeled edge fromtofor every.Algorithm astis the given string.1:;2: addand to .3: for alldo4:add edge5: end for6:7:8:9:10: for11:12:;;;;todo;whileadd edgeif13:14:15:16:else17:18:end if19:20:21:end while22:23: end fordowith new nodethen;;;;;

4 UKKONEN’S ALGORITHM144.2 From ast to ukk: On-Line Construction of Compact Suffix TreesboundarypathactivepointendpointThe above given algorithm will now be transfered into analgorithm for conast adds a new node and a new edge tostructing compact suffix trees. In stepall nodes on the boundary path. The boundary path is the path from the longest suffixalong the suffix links until alabeled edge is found. Let be the factorof, and let be its location in, withand. Let be thesmallest index, such that for all nodes ,, a node and an edge is inserted. Nowlet be the smallest index, such that for all nodes ,, is a leaf. After stepthe longest suffixis a leaf andalways has a labeled edge, soit is clear thatin step.We callthe active point and the endpoint.is the active suffix of(thelongest nested one). ast inserts two different kinds of -edges: Until the active pointthese are leaves that extend a branch, after that each edge starts a new branch. Whenapplying this to the construction of compact suffix trees one Ukkonen’s key idea wasto introduce open edges as described on page 6. This way the edges would expandautomatically with the string and needed not be added by hand.)Now we just have to describe how to update the active point (that is initiallyand how to add the branching nodes and new leaves from there along the boundarypath to the endpoint. Letbe the com

6 2 D A TA STRUCTURES Figur e 2 and Þ gur e 3 show the compact and the atomic -tree for the example tree of Þ gur e 1. suf Þx tree De Þ nition 3. (Suf Þ x Tree) A suf Þ x tree for a string is a -tree such that is a factor of . For a string the atomic suf Þ x tree will be denoted by , the

Related Documents:

REEs and Electronics REEs have been used in electronics and advanced machinery for nearly three-quarters of a century. Demand for REEs in electronics began in earnest in the 1960s with the introduction of the first color television sets, which initially used europium to produce the color images on the screen.15 Since then,

4/ Les adaptateurs et les seringues avec des joints en caoutchouc doivent être Page 3. . CASTELLINI Ancien modèle CASTELLINI SUF 1 (NV) CASTELLINI SUF 1 200 186 Old type STILLFLEX/TRIX/POLO SKEMA CASTELLINI (NV) CASTELLINI V 200 163 CLEAN WARM 2 SOLARE GOLDSIX CASTELLINI (NV) CASTELLINI C 200 162 FIRMA LOGOS CAST

Some basic accent patterns in nouns 1. always on the same stemvowel 2. on an accented suf,ix, else the ,irstsyllable 3. always on the :irst suf,ix vowel 26 koróv-a borod-á gospož-á nom.sg. koróv-ɨ bórod-ɨ gospož-ı̵́ nom.pl. ‘cow’ ‘beard’ ‘lady’ Last class is

ERL@APS Stage 1 ERL (APS beamlines only) 25 1300 0.013 7 2 4.8 2 x 10 . John Galayda Stanford Linear Accelerator Center: California Georg Hoffstaetter Cornell University: New York Andrew Hutton Thomas Jefferson National Accelerator Facility: Virginia Sam . Guess who has 45 years of experience at ANL? SUF Employee Advisory Committee (SUF-EAC

CASSANDRA CLARE and ROBIN WASSERMAN Shadow Market Enterprises, Inc. Amherst, MA · Los Angeles, CA. Ghosts of the Shadow Market 1. Son of the Dawn by Cassandra Clare and Sarah Rees Brennan 2. Cast Long Shadows by Cassandra Clare and Sarah Rees Brennan 3. Every Exquisite Thing by Cassandra Clare and Maureen Johnson 4. Learn About Loss by Cassandra Clare and Kelly Link 5. A Deeper Love by .

We aim to cover the ground between the standard intermediate micro course, taught largely in two dimensions with little explicit use of mathematics, and the advanced doctoral course. This book is meant to fit between, say, Hal Varian’s excellent Intermediate Microeconomics, and Mas-Colell, Whinston and Green’s magisterial Microeconomic .

Sciences and Disorders Audiology Clinic Resource Manual was developed to serve as a quick reference for student clinician and clinical instructor use with the intent of guiding the beginning clinical practitioner while serving patients in the Audiology Clinic at the Maryjane Rees

API RP 505 «API RP 505 « Recommended Practice for classification of locations for ElectricalRecommended Practice for classification of locations for Electrical Installations at Petroleum facilities classified as Class I, zone 0, zone1, zone2 » Foreword states : « API publications may be used by anyone desiring to do so. Every effort has been made by the Institute to assure the accuracy and .