Levenshtein meets Python
My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like “KIRK ANDREWS FORCEFUL” or “AULDREEKIE ADDISON VACUM”.
It also happens that when people are searching for a bull, they may type in the wrong name, for example “AULDREKIE ADDISON VACUUM” or type “JISSELVLIEDT 135 BREAKOUT” instead of “IJSSELVLIEDT 135 BREAKOUT”.
So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:
BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)
While the following have a Levenshtein distance of 2:
PAPER -> TAPE (1 deletion, 1 substituion)
TAPE -> TRADE (1 insertion, 1 substitution)
Wikipedia has a very good description of the algorithm.
They also have the pseudocode which I have implemented in Python. Here it is:
'''
Write a function that calculates the Levenshtein distance between
two words.
'''
def levenshtein(first, second):
m = len(first) + 1
n = len(second) + 1
# Declare an array...
d = [[0] * n for j in range(m)]
# Initialise the array...
for i in range(m):
d[i][0] = i
for j in range(n):
d[0][j] = j
for i in xrange(1, m):
for j in range(1, n):
if first[i-1] == second[j-1]: # no operation required...
d[i][j] = d[i - 1][j - 1]
else:
d[i][j] = min( d[i - 1][j], d[i][j - 1],
d[i - 1][j - 1]) + 1
return d[m-1][n-1]
def main():
list1 = 'great grate'.split()
list2 = 'grate rate ate'.split()
for word1 in list1:
for word2 in list2:
print 'Levenshtein distance between %s and %s is %d' % (
word1, word2, levenshtein(word1, word2))
if __name__ == '__main__':
main()
Once again, the Python code looks very similar to the pseudocode.