## Approximate String Matching

Searching for patterns in text strings is a problem of unquestionable importance.

To deal with inexact string matching, we must first define a cost function telling us how far apart two strings are – i.e., a distance measure between pairs of strings. A reasonable distance measure reflects the number of *changes* that must be made to convert one string to another. There are three natural types of changes:

- Substitution – Replace a single character from pattern P with a different character in text T, such as changing “shot” to “spot”.
- Insertion – Insert a single character into pattern P to help it match text T, such as changing “ago” to “agog”.
- Deletion – Delete a single character from pattern P to help it match text T, such as changing “hour” to “our”.

Properly posing the question of string similarity requires us to set the cost of each of these string transform operations. Assigning each operation an equal cost of 1 defines the *edit distance* between two strings.

Approximate string matching seems like a difficult problem, because we must decide exactly where to delete and insert (potentially) many characters in pattern and text. But let us think about the problem in reverse. What information would we like to have to make the final decision? What can happen to the last character in the matching for each string?

### Edit Distance by Recursion

We can define a recursive algorithm using the observation that the last character in the string must either be matched, substituted,

More precisely, let D[i,j] be the minimum number of differences between P_1, P_2, \cdots, P_i and the segment of T ending at j. D[i,j] is the minimum of the three possible ways to extend smaller strings:

- If (P_i = T_j) then D[i-1, j-1], else D[i-1, j-1] + 1.
- D[i,j-1] + 1. This means that there is an extra character in the pattern to account for.
- D[i-1, j] + 1. This means that there is an extra character in the text to remove.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | #define MATCH 0 /* enumerated type symbol for match */ #define INSERT 1 /* enumerated type symbol for insert */ #define DELETE 2 /* enumerated type symobl for delete */ int string_compare(char *s, char *t, int i, int j) { /* Note: We use index conventions here */ int k; int opt[3]; int lowest_cost; if (i == 0) return (j * indel(' ')); if (j == 0) return (i * indel(' ')); opt[MATCH] = string_compare(s, t, i-1, j-1) + match(s[i], t[j]); opt[INSERT] = string_compare(s, t, i, j-1) + indel(t[j]); opt[DELETE] = string_compare(s, t, i-1, j) + indel(s[i]); lowest_cost = opt[MATCH]; for (k=INSERT; k<=DELETE; k++) if (opt[k] < lowest_cost) lowest_cost = opt[k]; return( lowest_cost ); } |

This program is absolutely correct – convince yourself. It also turns out to be impossibly slow.

Why is the algorithm so slow? It takes exponential time because it recomputes values again and again and again.

### Edit Distance by Dynamic Programming

So how can we make this algorithm practical? The important observation is that most of these recursive calls are computing things that have been previously computed. There can only be |P| \cdot |T| possible unique recursive

A table-based dynamic programming implementation of this algorithm is given below. The table is a two-dimensional matrix m where each of the |P| \cdot |T| cells contains the cost of the optimal solution to a subproblem, as well as a parent pointer explaining how we got to this location:

1 2 3 4 5 6 | typedef struct { int cost; /* cost of reaching this cell */ int parent; /* parent cell */ } cell; cell m[MAXLEN+1][MAXLEN+1]; /* dynamic programming table */ |

Our dynamic programming implementation has three differences from the recursive version. First, it gets its intermediate values using table lookup instead of recursive calls. Second, it updates the `parent`

field of each cell, which will enable us to reconstruct the edit sequence later. Third, it is implemented using a more general `goal_cell()`

function instead of just returning `m[|P|][|T|].cost`

. This will enable us to apply this routine to a wider class of problems.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | int string_compare(char *s, char *t) { int i,j,k; /* counters */ int opt[3]; /* cost of the three options */ for (i=0; i<MAXLEN; i++) { row_init(i); column_init(i); } for (i=1; i<strlen(s); i++) { for (j=1; j<strlen(t); j++) { opt[MATCH] = m[i-1][j-1].cost + match(s[i], t[j]); opt[INSERT] = m[i][j-1].cost + indel(t[j]); opt[DELETE] = m[i-1][j].cost + indel(s[i]); m[i][j].cost = opt[MATCH]; m[i][j].parent = MATCH; for (k=INSERT; k<=DELETE; k++) if (opt[k] < m[i][j].cost) { m[i][j].cost = opt[k]; m[i][j].parent = k; } } } goal_cell(s, t, &i, &j); return( m[i][j].cost ); } |

### Reconstructing the Path

The string comparison function returns the cost of the optimal alignment, but not the alignment itself. Knowing you can convert “though shalt not” to “you should not” in only five moves is dandy, but what is the sequence of editing operations that does it?

The possible solutions to a given dynamic programming problem are described by paths through the dynamic programming matrix, starting from the initial configuration (the pair of empty strings (0,0)) down to the final goal state (the pair of full strings (0,0)) down to the final goal state (the pair of full strings (|P|, |T|)). The key to building the solution is to reconstruct the decision made at every step along the optimal path that leads to the goal state. These decisions have been recorded in the `parent`

field of each array cell.

Reconstructing these decisions is done by walking backward from the goal state, following the `parent`

pointer back to an earlier cell. We repeat this process until we arrive back at `parent`

field for `m[i,j]`

tells us whether the operation at (i,j) was `MATCH`

, `INSERT`

, or `DELETE`

. Tracing back through the parent matrix yields the edit sequence.

Walking backward reconstructs the solution in reverse order. However, clever use of recursion can do the reversing for us.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | reconstruct_path(char *s, char *t, int i, int j) { if (m[i][j].parent == -1) return; if (m[i][j].parent == MATCH) { reconstruct_path(s, t, i-1, j-1); match_out(s, t, i, j); return; } if (m[i][j].parent == INSERT) { reconstruct_path(s, t, i, j-1); insert_out(t,j); return; } if (m[i][j].parent == DELETE) { reconstruct_path(s, t, i-1, j); delete_out(t,i); return; } } |

For many problems, including edit distance, the tour can be reconstructed from the cost matrix without explicitly retaining the last-move pointer array. The trick is working backward from the costs of the three possible ancestor cells and corresponding string characters to reconstruct the move the took you to the current cell at the given cost.

### Varieties of Edit Distance

The `string_compare`

and path reconstruction routines reference several functions that awe have not yet defined. These fall into four categories:

- Table Initialization
- Penalty Costs
- Goal Cell Identification
- Traceback Actions

Although dynamic programming algorithms are easy to design once you understand the technique, getting the details right reuqires carefully thinking and thorough testing.

This may seem to be a lot of infrastructure to develop for such a simple algorithm. However, several important problems can now be solved as special cases of edit distance using only minor changes to some of these stub functions:

- Substring Matching – Suppose that we want to find where a short pattern P best occurs within a long text T – say, searching, for “Skiena” in all its misspellings within a long file. Plugging this search into our original edit distance function will achieve little
sensitivity, since the vast majority of any edit cost will consist of deleting that which is not “Skiena” from the body of the text.

We want an edit distance search where the cost of starting the match is independent of the position in the

- Longest common Subsequence – Perhaps we are interested in finding the longest scattered string of characters included within both strings.

A common subsequence is defined by all the identical-character matches in an edit trace. To maximize number of such matches, we must prevent substitution of nonidentical characters. With substitution forbidden, the only way to get rid of the noncommon subsequence is through insertion and deletion. The minimum cost alignment has the fewest such “in-dels”, so it must preserve the longest common substring. We get the alignment we want by changing the match-cost function to make substitutions expensive:

1 2 3 4 | int match(char c, char d) { if (c == d) return(0); else return(MAXLEN); } |

- Maximum Monotone Subsequence – A numerical sequence is
*monotonical increasing*if the element is at element is at least as big as the (i-1)st element. The*maximum monotone subsequence*problem seeks to delete the fewest number of elements from an input S to leave a monotonically increasing subsequence.

In fact, this is just a longest common subsequence problem, where the second string is the elements of S stored in increasing order. Any common sequence of these two must (a) represent characters in proper order in S, and (b) use only characters with increasing position in the collating sequence – so, the longest one does the job. Of course, this approach can be modified to give the longest descreasing sequence by simply reversing the sorted order.

As you can see, our edit distance routine can be made to do many amazing things easily. The trick is ovserving that your problem is just a special case of approximate string matching.