Levenshtein meets Python

My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like “KIRK ANDREWS FORCEFUL” or “AULDREEKIE ADDISON VACUM”.

It also happens that when people are searching for a bull, they may type in the wrong name, for example “AULDREKIE ADDISON VACUUM” or type “JISSELVLIEDT 135 BREAKOUT” instead of “IJSSELVLIEDT 135 BREAKOUT”.

So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:

BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)

While the following have a Levenshtein distance of 2:

PAPER -> TAPE (1 deletion, 1 substituion)
TAPE -> TRADE (1 insertion, 1 substitution)

Wikipedia has a very good description of the algorithm.

They also have the pseudocode which I have implemented in Python. Here it is:

'''
Write a function that calculates the Levenshtein distance between
two words.
'''

def levenshtein(first, second):
    m = len(first) + 1
    n = len(second) + 1

    # Declare an array...
    d = [[0] * n for j in range(m)]

    # Initialise the array...
    for i in range(m):
        d[i][0] = i

    for j in range(n):
        d[0][j] = j

    for i in xrange(1, m):
        for j in range(1, n):
            if first[i-1] == second[j-1]: # no operation required...
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min( d[i - 1][j], d[i][j - 1],
                    d[i - 1][j - 1]) + 1
    return d[m-1][n-1]


def main():
    list1 = 'great grate'.split()
    list2 = 'grate rate ate'.split()

    for word1 in list1:
        for word2 in list2:
            print 'Levenshtein distance between %s and %s is %d' % (
                word1, word2, levenshtein(word1, word2))


if __name__ == '__main__':
    main()

Once again, the Python code looks very similar to the pseudocode.