String Edit Distance (and Intro To Dynamic Programming)

1y ago
3 Views
1 Downloads
800.84 KB
39 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Sutton Moon
Transcription

String Edit Distance(and intro to dynamic programming)Lecture #4Computational LinguisticsCMPSCI 591N, Spring 2006University of Massachusetts AmherstAndrew McCallumAndrew McCallum, UMass Amherst,including material from William Cohen

Dynamic Programming (Not much to do with “programming” in theCS sense.) Dynamic programming is efficient in findingoptimal solutions for cases with lots ofoverlapping sub-problems. It solves problems by recombining solutionsto sub-problems, when the sub-problemsthemselves may share sub-sub-problems.Andrew McCallum, UMass Amherst,including material from William Cohen

Fibonacci Numbers1 1 2 3 5 8 13 21 34 .Andrew McCallum, UMass Amherst,including material from William Cohen

1Andrew McCallum, UMass Amherst,including material from William Cohen

Calculating Fibonacci NumbersF(n) F(n-1) F(n-2),where F(0) 0, F(1) 1.Non-Dynamic Programming implementationdef fib(n):if n 0 or n 1:return nelse:return fib(n-1) fib(n-2)For fib(8), how many calls to function fib(n)?Andrew McCallum, UMass Amherst,including material from William Cohen

DP Example:Calculating Fibonacci NumbersDynamic Programming: avoid repeated calls by rememberingfunction values already calculated.table {}def fib(n):global tableif table.has key(n):return table[n]if n 0 or n 1:table[n] nreturn nelse:value fib(n-1) fib(n-2)table[n] valuereturn valueAndrew McCallum, UMass Amherst,including material from William Cohen

DP Example:Calculating Fibonacci Numbers.or alternately, in a list instead of a dictionary.def fib(n):table [0] * (n 1)table[0] 0table[1] 1for i in range(2,n 1):table[i] table[i-2] table[i-1]return table[n]We will see this pattern many more times in this course:1. Create a table (of the right dimensions to describe our problem.2. Fill the table, re-using solutions to previous sub-problems.Andrew McCallum, UMass Amherst,including material from William Cohen

String Edit DistanceGiven two strings (sequences) return the “distance”between the two strings as measured by.the minimum number of “character edit operations”needed to turn one sequence into the other.AndrewAmdrewz1. substitute m to n2. delete the zDistance 2Andrew McCallum, UMass Amherst,including material from William Cohen

String distance metrics: Levenshtein Given strings s and t– Distance is shortest sequence of editcommands that transform s to t, (or equivalently sto t).– Simple set of operations: Copy character from s over to tDelete a character in sInsert a character in tSubstitute one character for another This is “Levenshtein distance”Andrew McCallum, UMass Amherst,including material from William Cohen(cost 0)(cost 1)(cost 1)(cost 1)

Levenshtein distance - example distance(“William Cohen”, “Willliam Cohon”)sW IL L gap IA M C O H E NalignmenttW I L L L I A M C O H O NeditC C C C IC C C C C C C S Copcost 0 0 0 0 1 1 1 1 1 1 1 1 2 2so far.Andrew McCallum, UMass Amherst,including material from William CohenAlignment is a little bit like a parse.

Finding the MinimumWhat is the minimum number of operations for.?Another fine day in the parkAnybody can see him pick the ballNot so easy. not so clear.Not only are the strings, longer, but is isn’t immediatelyobvious where the alignments should happen.What if we consider all possible alignments by brute force?How many alignments are there?Andrew McCallum, UMass Amherst,including material from William Cohen

Dynamic Program Table for String EditPARKMeasure distance between stringsSPAKEPARKSPAKEAndrew McCallum, UMass Amherst,including material from William Cohencijcij the number of editoperations neededto align PA withSPA.

Dynamic Programming to the Rescue!How to take our big problem and chop it into building-block pieces. Given some partial solution, it isn’t hard to figureout what a good next immediate step is. Partial solution “This is the cost for aligning s up to position iwith t up to position j. Next step “In order to align up to positions x in s and y int, should the last operation be a substitute,insert, or delete?Andrew McCallum, UMass Amherst,including material from William Cohen

Dynamic Program Table for String EditPARKMeasure distance between stringsSPAKEEdit operationsfor turningSPAKEintoPARKPRKdeleteSPAKEAndrew McCallum, UMass Amherst,including material from William CohenAinsertsubstitute

Dynamic Program Table for String EditPARKMeasure distance between stringsSPAKEPRKc00c02 c03c04c05Sc10c11 c12c13c14Pc20c21 c22c23c24Ac30c31 ?KEAndrew McCallum, UMass Amherst,including material from William CohenA

Dynamic Program Table for String EditPSPAARKc00c02 c03c04c05c10c11 c12c13c14c23c24c20substc30insertdeletec21 c22c31 ?KED(i,j) score of best alignment from s1.si to t1.tj minAndrew McCallum, UMass Amherst,including material from William CohenD(i-1,j-1), if si tjD(i-1,j-1) 1, if si! tjD(i-1,j) 1D(i,j-1) 1//copy//substitute//insert//delete

Computing Levenshtein distance - 2D(i,j) score of best alignment from s1.si to t1.tj minD(i-1,j-1) d(si,tj) //subst/copyD(i-1,j) 1//insertD(i,j-1) 1//delete(simplify by letting d(c,d) 0 if c d, 1 else)also let D(i,0) i (for i inserts) and D(0,j) jAndrew McCallum, UMass Amherst,including material from William Cohen

Dynamic Program Table InitializedP0S1P2A3K4E51AR23K4D(i,j) score of best alignment from s1.si to t1.tj minAndrew McCallum, UMass Amherst,including material from William CohenD(i-1,j-1) d(si,tj)D(i-1,j) 1D(i,j-1) 1//substitute//insert//delete

Dynamic Program Table . filling inP01S11P2A3K4E5AR23K4D(i,j) score of best alignment from s1.si to t1.tj minAndrew McCallum, UMass Amherst,including material from William CohenD(i-1,j-1) d(si,tj)D(i-1,j) 1D(i,j-1) 1//substitute//insert//delete

Dynamic Program Table . filling inPAR0123S1123P2A3K4E5K44D(i,j) score of best alignment from s1.si to t1.tj minAndrew McCallum, UMass Amherst,including material from William CohenD(i-1,j-1) d(si,tj)D(i-1,j) 1D(i,j-1) 1//substitute//insert//delete

Dynamic Program Table . filling inPAR0123S1123P21A3K4E5K44D(i,j) score of best alignment from s1.si to t1.tj minAndrew McCallum, UMass Amherst,including material from William CohenD(i-1,j-1) d(si,tj)D(i-1,j) 1D(i,j-1) 1//substitute//insert//delete

Dynamic Program Table . filling inPARK01234S1123P21234A32123K43222E543334Final cost ofaligning all ofboth strings.D(i,j) score of best alignment from s1.si to t1.tj minAndrew McCallum, UMass Amherst,including material from William CohenD(i-1,j-1) d(si,tj)D(i-1,j) 1D(i,j-1) 1//substitute//insert//delete

DP String Edit Distance def stredit (s1,s2):"Calculate Levenstein edit distance for strings s1 and s2."len1 len(s1) # verticallylen2 len(s2) # horizontally# Allocate the tabletable [None]*(len2 1)for i in range(len2 1): table[i] [0]*(len1 1)# Initialize the tablefor i in range(1, len2 1): table[i][0] ifor i in range(1, len1 1): table[0][i] i# Do dynamic programmingfor i in range(1,len2 1):for j in range(1,len1 1):if s1[j-1] s2[i-1]:d 0else:d 1table[i][j] min(table[i-1][j-1] d,table[i-1][j] 1,table[i][j-1] 1)Andrew McCallum, UMass Amherst,including material from William Cohen

Remebering the Alignment (trace)D(i,j) minD(i-1,j-1) d(si,tj) //subst/copyD(i-1,j) 1//insertD(i,j-1) 1//deleteCA trace indicateswhere the minvalue came from,and can be used tofind editoperations and/ora best alignmentM(may be more than 1)Andrew McCallum, UMass Amherst,including material from William CohenCC112OHEN234523453345345O32H43N54233343

Three Enhanced Variants Needleman-Munch– Variable costs Smith-Waterman– Find longest “soft matching” subsequence Affine Gap Distance– Make repeated deletions (insertions) cheaper (Implement one for homework?)Andrew McCallum, UMass Amherst,including material from William Cohen

Needleman-Wunch distanceD(i,j) minD(i-1,j-1) d(si,tj) //subst/copyD(i-1,j) G//insertD(i,j-1) G//deleteG “gap cost”d(c,d) is an arbitrarydistance function oncharacters (e.g. relatedto typo frequencies,amino acidsubstitutibility, etc)Andrew McCallum, UMass Amherst,including material from William CohenWilliam CohenWukkuan Cigeb

Smith-Waterman distance Instead of looking at each sequence in itsentirety, this compares segments of allpossible lengths and chooses whichevermaximize the similarity measure. For every cell the algorithm calculates allpossible paths leading to it. These paths canbe of any length and can contain insertionsand deletions.Andrew McCallum, UMass Amherst,including material from William Cohen

Smith-Waterman distanceD(i,j) minG 1d(c,c) -2d(c,d) 1Andrew McCallum, UMass Amherst,including material from William Cohen0//start overD(i-1,j-1) d(si,tj) //subst/copyD(i-1,j) G//insertD(i,j-1) 3-6-5-3N0-2-5-5-7

Example output from 567890 * 0000000 *-2 -10000 *-1 -100000 *-3 -2 -1000 -2 *-5 -4000 -1 -4 *-7r100000-3-6(My implementation of HW#2, task choice #2. -McCallum)Andrew McCallum, UMass Amherst,including material from William Cohen

Affine gap distances Smith-Waterman fails on some pairs thatseem quite similar:William W. CohenWilliam W. ‘Don’t call me Dubya’ CohenIntuitively,singlea sertionsareis nsertionsthanAndrew McCallum, UMass Amherst,including material from William Cohen

Affine gap distances - 2 Idea:– Current cost of a “gap” of n characters: nG– Make this cost: A (n-1)B, where A is cost of“opening” a gap, and B is cost of “continuing” agap.Andrew McCallum, UMass Amherst,including material from William Cohen

Affine gap distances - 3D(i,j) maxD(i-1,j-1)D(i-1,j-1) d(si,tj)d(si,tj) //subst/copyD(i-1,j)-1IS(I-1,j-1) d(si,tj) //insertD(i,j-1)-1//deleteIT(I-1,j-1) d(si,tj)IS(i,j) maxD(i-1,j) - AIS(i-1,j) - BBest score in which siis aligned with a ‘gap’D(i,j-1) - AIT(i,j-1) - BBest score in which tjis aligned with a ‘gap’IT(i,j) maxAndrew McCallum, UMass Amherst,including material from William Cohen

Affine gap distances as ew McCallum, UMass Amherst,including material from William Cohen-A

Generative version of affine gapautomata (Bilenko&Mooney, TechReport 02)HMM emits pairs: (c,d) instate M, pairs (c,-) in stateD, and pairs (-,d) in state I.For each state there is amultinomial distributionon pairs.The HMM can trained withEM from a sample of pairsof matched strings (s,t)E-step is forward-backward; M-step uses some ad hoc smoothingAndrew McCallum, UMass Amherst,including material from William Cohen

Affine gap edit-distance learning:experiments results (Bilenko & Mooney)Experimental method: parse records into fields; append afew key fields together; sort by similarity; pick athreshold T and call all pairs with distance(s,t) T“duplicates”; picking T to maximize F-measure.Andrew McCallum, UMass Amherst,including material from William Cohen

Affine gap edit-distance learning:experiments results (Bilenko & Mooney)Andrew McCallum, UMass Amherst,including material from William Cohen

Affine gap edit-distance learning:experiments results (Bilenko & Mooney)Precision/recall for MAILING dataset duplicate detectionAndrew McCallum, UMass Amherst,including material from William Cohen

Affine gap distances – experiments(from McCallum, Nigam,Ungar KDD2000) Goal is to match data like this:Andrew McCallum, UMass Amherst,including material from William Cohen

Homework #2 The assignment– Start with my stredit.py code– Make some modifications– Write a little about your experiences Some possible modifications– Implement Needleman-Wunch, Smith-Waterman, or Affine GapDistance.– Create a little spell-checker: if entered word isn’t in the dictionary,return the dictionary word that is closest.– Change implementation to operate on sequences of words ratherthan characters. get an online translation dictionary, and findalignments between English & French or English & Russian!– Try to learn the parameters of the function from data. (Tough.)Andrew McCallum, UMass Amherst,including material from William Cohen

String Edit Distance Andrew Amdrewz 1. substitute m to n 2. delete the z Distance 2 Given two strings (sequences) return the "distance" between the two strings as measured by.the minimum number of "character edit operations" needed to turn one sequence into the other.

Related Documents:

You can also tune your guitar to a keyboard or piano. The open strings of a guitar correspond to certain notes on a keyboard. SESSION 1 3 Starting Off Right Learn &Master Guitar E A D G B E B 6th string 5th string 4th string 3rd string 2nd string 1st string 5th Fret 1st string 6th string 5th string 4th string 3rd string 2nd string E A D GB E .

You can also tune your guitar to a keyboard or piano. The open strings of a guitar correspond to certain notes on a keyboard. SESSION 1 3 Starting Off Right Learn &Master Guitar E A D G B E B 6th string 5th string 4th string 3rd string 2nd string 1st string 5th Fret 1st string 6th string 5th string 4th string 3rd string 2nd string E A D GB E .

The EDIT program provides two editors—EDIT, a line editor, and EDIT VS, a screen editor. The emphasis of this manual is on EDIT. The manual introduces you to the features and capabilities of EDIT, describes the many EDIT commands and their ranges, and provides information on creating and using EDIT files. For those users who wish to use EDIT .

Barber, Samuel String Quartet No.1, Op.11 Bartok, Bela String Quartet No.2, Op.17 String Quartet No.4 Beethoven, Ludwig van String Quartet No.1 in F major, Op.18 No.1 String Quartet No.2 in G major, “Compliments” Op.18 No.2 String Quartet No.6 in B-flat major, Op.18 No.6 String Quartet No.7 in F major, “Rasumovsky 1” Op.59 No.1

String Quartet n. 15 op. 144 Anton Webern String Quartet op. 28 Five Movements for String Quartet Six Bagatelles for String Quartet Alexander Von Zemlinsky String Quartet n. 2 op. 15 2) Toshio Hosokawa UTA-ORI. Weaving Song for string quartet (2020) New composition for String Quartet

query string. Given a query string and a string tuple , the similarity score of and in this class of predicates is of the form weight of the token,where is the query-based in string and weight of the token is the tuple-based in string . 3.2.1 Tf-idf Cosine Similarity The tf-idf cosine similarity [24] between a query string and a string tuple

Alternatively, you can use the operator as follows: a a b; which is equivalent to: a "string A" " and string B"; and equivalent to: a "string A" " " "and string B"; where the middle string is a string with a single whitespace character. Comparing Strings Comparing string values in

Tourism 2020 is a whole-of-government and industry strategy to build the resilience and competitiveness of Australia’s tourism industry and to increase its economic contribution to Australia’s economy. When the Tourism 2020 goal was introduced, it was set at between 115 billion to 140 billion in overnight visitor expenditure, reflecting a range of scenarios, from holding market share to .