Algorithms – Possible Improvements to Damerau-Levenshtein Algorithm

algorithm-analysis, algorithms, strings

I recently implemented the Damerau-Levenshtein distance algorithm from the pseudocode on Wikipedia. I couldn't find any explanation of exactly how it works, and the pseudocode uses completely uninformative variable names (DA, DB, i1, and j1) that left me scratching my head.

Here's my implementation in Python: https://gist.github.com/badocelot/5327337

Writing the Python implementation, and renaming the variables to more helpful names, helped me walk through the program and figure out what was happening. I was familiar enough with the Wagner-Fischer approach to calculating the Levenshtein distance that I had a frame of reference.

At the risk of being overly lengthy, here's how I understand Damerau-Levenshtein:

The mystery variables (shown in context in the sketch after this list):

  • DA (last_row in my code) is a kind of map holding the last row where each character was seen; in my code it's an actual Python dictionary
  • DB (last_match_col) holds the last column where the letter in b matched the letter in a for the current row
  • i1 (last_matching_row) is the row number from DA for the current letter in b
  • j1 is a copy of the value of DB/last_match_col before it's potentially updated; in my code I moved where last_match_col is updated and eliminated this variable
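
To make that correspondence concrete, here's a condensed sketch of the implementation using those names (a sketch in the spirit of the gist above, not necessarily identical to it):

    def damerau_levenshtein(a, b):
        # Unrestricted Damerau-Levenshtein distance, following the
        # Wikipedia pseudocode but with the renamed variables.
        INF = len(a) + len(b)  # sentinel no real distance can exceed

        # H gets two extra leading rows/columns of sentinels, so
        # H[i+1][j+1] is the distance between the first i characters
        # of a and the first j characters of b.
        H = [[INF] * (len(b) + 2) for _ in range(len(a) + 2)]
        for i in range(len(a) + 1):
            H[i + 1][1] = i  # deleting an a-prefix down to ""
        for j in range(len(b) + 1):
            H[1][j + 1] = j  # building a b-prefix up from ""

        last_row = {}  # DA: the last row each character was seen on

        for i in range(1, len(a) + 1):
            ch_a = a[i - 1]
            last_match_col = 0  # DB: last matching column in this row

            for j in range(1, len(b) + 1):
                ch_b = b[j - 1]
                # i1: the row where ch_b was last seen in a
                last_matching_row = last_row.get(ch_b, 0)
                cost = 0 if ch_a == ch_b else 1

                H[i + 1][j + 1] = min(
                    H[i][j] + cost,   # substitution (free if a match)
                    H[i + 1][j] + 1,  # addition
                    H[i][j + 1] + 1,  # deletion
                    # transposition; last_match_col still holds j1 here
                    # because its update is deferred until below
                    H[last_matching_row][last_match_col]
                        + (i - last_matching_row - 1)  # deletions between
                        + 1                            # the swap itself
                        + (j - last_match_col - 1),    # additions between
                )

                if cost == 0:
                    last_match_col = j

            last_row[ch_a] = i

        return H[len(a) + 1][len(b) + 1]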

The transposition cost:

H[i1][j1] + (i-i1-1) + 1 + (j-j1-1)

calculates the cost of swapping the current character in b with the last character in b known to be in a (the last match), treating all characters in between as either additions or deletions.

Components of the cost:

  • H[i1][j1] reverts the base cost to the point in the calculations before the transposition, since finding a transposition invalidates previous work
  • (i-i1-1) is the distance between the current row and the last row matching the current character, which is the number of deletions that would be required
  • (j-j1-1) is the distance between the current column and the last column with a match, which is the number of additions
  • The extra + 1 is just the cost of the transposition itself
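
As a worked example, take a = "CA" and b = "ABC" (distance 2: swap the "C" and "A", then add the "B" between them). When the algorithm reaches i = 2 (the "A" in a) and j = 3 (the "C" in b), it has i1 = 1 ("C" was last seen in row 1) and j1 = 1 (the last match in this row was in column 1), so the transposition cost comes out to H[1][1] + (2-1-1) + 1 + (3-1-1) = 0 + 0 + 1 + 1 = 2.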

If this analysis is incorrect, I'd love to know where I went wrong. As I said, I couldn't find any detailed explanation of how the algorithm works online.

Improved version?

Having figured that out, though, it struck me that calculating the cost of both additions and deletions between transposed letters seemed flawed: one addition plus one deletion is equivalent to a substitution, and this isn't checking for that.

If all that's correct, the solution should be trivial: the cost of the letters between the transposed letters should be the higher of the number of additions and the number of deletions. In effect, convert as many addition/deletion pairs into substitutions as possible, then add in whatever additions or deletions are left over.

So the cost would be:

H[i1][j1] + max((i-i1-1), (j-j1-1)) + 1

Here's my code for this version: https://gist.github.com/badocelot/5327427
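
In terms of the sketch above, the only change is the transposition term inside the min() (again a sketch, not necessarily identical to the gist):

    H[i + 1][j + 1] = min(
        H[i][j] + cost,   # substitution (free if a match)
        H[i + 1][j] + 1,  # addition
        H[i][j + 1] + 1,  # deletion
        # modified transposition cost: pair up in-between additions
        # and deletions as substitutions, then charge any leftovers
        H[last_matching_row][last_match_col]
            + max(i - last_matching_row - 1, j - last_match_col - 1)
            + 1,
    )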

From some simple tests, this seems correct. For example, "abcdef" -> "abcfad" gives an edit distance of 2 (transpose "d" and "f", change "e" to "a"), while the original algorithm gives a distance of 3 (either the last three letters are substitutions, or one transposition plus one addition plus one deletion).
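
A quick check, assuming the original sketch above is damerau_levenshtein and the max() variant is wrapped up the same way as damerau_levenshtein_modified (both names are just placeholders):

    print(damerau_levenshtein("abcdef", "abcfad"))           # 3
    print(damerau_levenshtein_modified("abcdef", "abcfad"))  # 2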

Now, I cannot be the first person to have thought of this. So, why didn't I run across it? Did I just not search long enough? Or is there some subtle flaw that prevents this from actually working?

Best Answer

I had to look up the Damerau-Levenshtein distance on Wikipedia, so forgive me if this is wrong. But it looks like it only allows transposing adjacent letters, not arbitrary ones. So your example "abcdef" -> "abcfad" with a transposition of "d" and "f" doesn't work, because those two letters aren't adjacent. It seems to me you have modified the definition of the algorithm and are no longer calculating the Damerau-Levenshtein distance.
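
You can see the adjacency restriction with the implementation sketched in the question (same hypothetical function name as above):

    # adjacent letters: a single transposition suffices
    print(damerau_levenshtein("abcd", "abdc"))      # 1

    # "d" and "f" are not adjacent, so the swap is charged as a
    # transposition plus the addition and deletion needed around it
    print(damerau_levenshtein("abcdef", "abcfad"))  # 3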
