我有一个python程序来读取两个列表(一个有错误,另一个有正确的数据)。列表中包含错误的每个元素都需要与正确列表中的每个元素进行比较。比较后,我得到每个比较对之间的所有编辑距离。现在我可以找到给定错误数据的最小编辑距离,并通过获取我的正确数据。
我正在尝试使用levenshtein距离来计算编辑距离,但它将所有编辑距离返回为1,即使这是错误的。
这意味着计算levenshtein距离的代码不正确。我正在努力寻找解决方案。帮助!
我的代码
import csv
def lev(a, b):
if not a: return len(b)
if not b: return len(a)
return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)
if __name__ == "__main__":
with open("all_correct_promo.csv","rb") as file1:
reader1 = csv.reader(file1)
correctPromoList = list(reader1)
#print correctPromoList
with open("all_extracted_promo.csv","rb") as file2:
reader2 = csv.reader(file2)
extractedPromoList = list(reader2)
#print extractedPromoList
incorrectPromo = []
count = 0
for extracted in extractedPromoList:
if(extracted not in correctPromoList):
incorrectPromo.append(extracted)
else:
count = count + 1
#print incorrectPromo
for promos in incorrectPromo:
for correctPromo in correctPromoList:
distance = lev(promos,correctPromo)
print promos, correctPromo , distance
答案 0 :(得分:1)
实施是正确的。我测试了这个:
def lev(a, b):
if not a: return len(b)
if not b: return len(a)
return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)
print lev('abcde','bc') # prints 3, which is correct
print lev('abc','bc') # prints 1, which is correct
正如您在评论中所注意到的那样,您的问题可能就在您致电该方法时:
a = ['NSP-212690']
b = ['FE SV X']
print lev(a,b) # prints 1 which is incorrect because you are comparing arrays, not strings
print lev(a[0],b[0]) # prints 10, which is correct
所以,你能做的是:
在调用“lev(a,b)”之前,提取每个数组的第一个元素
def lev(a, b):
if not a: return len(b)
if not b: return len(a)
return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)
a = ['NSP-212690']
b = ['FE SV X']
a = a[0] # this is the key part
b = b[0] # and this
print lev(a,b) # prints 10, which is correct
无论如何,我不建议您使用递归实现,因为性能很差
我建议使用此实现(来源:wikipedia-levenshtein)
def lev(seq1, seq2):
oneago = None
thisrow = range(1, len(seq2) + 1) + [0]
for x in xrange(len(seq1)):
twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
for y in xrange(len(seq2)):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
return thisrow[len(seq2) - 1]
或者这个稍微修改过的版本:
def lev(seq1, seq2):
if not a: return len(b)
if not b: return len(a)
oneago = None
thisrow = range(1, len(seq2) + 1) + [0]
for x in xrange(len(seq1)):
twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
for y in xrange(len(seq2)):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
return thisrow[len(seq2) - 1]
答案 1 :(得分:0)
有一个Python包可以实现levenshtein距离:python-levenshtein
安装它:
pip install python-levenshtein
使用它:
>>> import Levenshtein
>>> string1 = 'dsfjksdjs'
>>> string2 = 'dsfiksjsd'
>>> print Levenshtein.distance(string1, string2)
3