Python Levenshtein distance – Choose Python package wisely

Brad and I were working on some text similarity computation. One of the most popular string distance functions is the Levenshtein distance, which is also called the edit distance. We use Python for its brevity and widely-library support (OK, I lied. The truth is that I was persuaded by Brad about his miserable experience interacting with R…) As we would like to try different distance functions, we picked up Python distance package (pip install distance). Surprisingly, we found the Levenshtein is pretty slow comparing to other distance functions (well, regardless of the complexity of the algorithm itself). The next day, Brad found another Python package – editdistance (pip install editdistance), which is 2 order of magnitude faster than the distance package. In this post, we will exam these 2 packages and the other 2 implementations according to Wikipedia. K.R.K.C.

1. Run the Damn Code!

# A Levenshtein distance demo
# using distance pkg and editdistance pkg
# and yet another 2 implementations
# May 2, 2015
# root@davejingtian.org
# http://davejingtian.org
import cProfile
import editdistance
import distance

loop = 10000
dist = 0
str1 = 'kitten'
str2 = 'sitting'
script1 = 'for i in range(loop):\n\tdist = distance.levenshtein(str1, str2)' 
script2 = 'for i in range(loop):\n\tdist = editdistance.eval(str1, str2)'

# A K.I.S.S. implementation using recursion
# Ref: http://en.wikipedia.org/wiki/Levenshtein_distance
def recLev(s, lenS, t, lenT):
	if lenS == 0 or lenT == 0:
		return 0
	if s[lenS-1] == t[lenT-1]:
		cost = 0
	else:
		cost = 1
	return min(recLev(s, lenS-1, t, lenT)+1, recLev(s, lenS, t, lenT-1)+1, recLev(s, lenS-1, t, lenT-1)+cost)

script3 = 'for i in range(loop):\n\tdist=recLev(str1, len(str1), str2, len(str2))'

# A fast and memory efficient implementation
# by Hjelmqvist, Sten
def fastMemLev(s, t):
	# degenerate cases
	if s == t:
		return 0
	if len(s) == 0:
		return len(t)
	if len(t) == 0:
		return len(s)
 
	# create two work vectors of integer distances
	#int[] v0 = new int[t.Length + 1];
	#int[] v1 = new int[t.Length + 1];
	v0 = []
	v1 = []
 
	# initialize v0 (the previous row of distances)
	# this row is A[0][i]: edit distance for an empty s
	# the distance is just the number of characters to delete from t
	# for (int i = 0; i < v0.Length; i++)
	# v0[i] = i;
	for i in range(len(t)+1):
		v0.append(i)
		v1.append(0)

	for i in range(len(s)): 
		# calculate v1 (current row distances) from the previous row v0
		# first element of v1 is A[i+1][0]
		# edit distance is delete (i+1) chars from s to match empty t
		v1[0] = i + 1
 
		# use formula to fill in the rest of the row
		for j in range(len(t)):
			cost = 0 if s[i] == t[j] else 1;
			v1[j + 1] = min(v1[j]+1, v0[j+1]+1, v0[j]+cost)
 
		# copy v1 (current row) to v0 (previous row) for next iteration
		for j in range(len(t)+1):
			v0[j] = v1[j]
 
	return v1[len(t)]

script4 = 'for i in range(loop):\n\tdist=fastMemLev(str1, str2)'


# Profile
cProfile.run(script1)
print('script1:', dist)
print('='*20)
cProfile.run(script2)
print('script2:', dist)
print('='*20)
cProfile.run(script3)
print('script3:', dist)
print('='*20)
cProfile.run(script4)
print('script4:', dist)

[daveti@daveti python]$ python ./leven.py
540003 function calls in 0.881 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.010 0.010 0.881 0.881 :1()
10000 0.699 0.000 0.871 0.000 _levenshtein.py:6(levenshtein)
20000 0.002 0.000 0.002 0.000 {len}
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
420000 0.139 0.000 0.139 0.000 {min}
90001 0.031 0.000 0.031 0.000 {range}

(‘script1:’, 3L)
====================
10003 function calls in 0.013 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 0.013 0.013 :1()
10000 0.009 0.000 0.009 0.000 {editdistance.bycython.eval}
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
1 0.000 0.000 0.000 0.000 {range}

(‘script2:’, 3L)
====================
396510003 function calls (99150003 primitive calls) in 220.297 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.019 0.019 220.297 220.297 :1()
297370000/10000 198.277 0.000 220.276 0.022 leven.py:20(recLev)
20000 0.002 0.000 0.002 0.000 {len}
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
99120000 22.000 0.000 22.000 0.000 {min}
1 0.000 0.000 0.000 0.000 {range}

(‘script3:’, 3)
====================
900003 function calls in 0.629 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.007 0.007 0.629 0.629 :1()
10000 0.461 0.000 0.622 0.000 leven.py:33(fastMemLev)
170000 0.014 0.000 0.014 0.000 {len}
160000 0.015 0.000 0.015 0.000 {method ‘append’ of ‘list’ objects}
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
420000 0.097 0.000 0.097 0.000 {min}
140001 0.035 0.000 0.035 0.000 {range}

(‘script4:’, 3)
[daveti@daveti python]$

2. Analysis

As one can tell, editdistance is much more faster than distance on Levenshtein. This is not a big surprise when you realize that the editdistance is a C++ implementation while distance is a pure Python implementation. Actually, both of them use iterative method with matrix rather than a straightforward recursive implementation, which is the slowest but with least lines of code. In a word, there is nothing wrong with the distance implementation. A most recent fast and memory efficient algorithm from Hjelmqvist, Sten, using Python is also examined. It is faster than the distance implementation, which proves again that distance implementation is not that evil. However, there should be some improvement for the distance implementation to achieve similar performance comparing to the fastMem. A quick glance at the distance implementation, they are using numpy array, which could be a bottleneck.

3. Be wise

Let those who claim that Python is fast go to hell. It all depends on which Python is comparing to. There is no doubt that Python implementation should be slower than the original C/C++ implementation. If performance is all about, try to find Python packages which are implemented in C/C++ or implement the Python interface by yourself wrapping your favorite C/C++ implementation. Be wise about your choice~

About daveti

Interested in kernel hacking, compilers, machine learning and guitars.
This entry was posted in Programming and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s