l1 = [ "I","like","big","yellow","bananas" ]
l2 = [ "I","like","yellow","bananas" ]
The user can also specify how much each operation "costs", let's say:
DeletePrice = 10  # deleting a word from one list
InsertPrice = 1  # inserting a word into one list
SubstitutePrice = 24  # substituting one word for another
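The task is to make the lists match, and the combined price must be as low as possible. There are two obvious ways to match these arrays: either delete the word "big" from the first array (which costs 10) or insert the word "big" into the second array (which costs 1). So the algorithm's answer should be 1.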
I have been thinking that, as a first step, we should use a list comprehension to find the elements that do not match:
def Plagiarism( l1,l2,dPrice,iPrice,sPrice ):
    not_matching_elements = [ [ x for x in l1 if x not in l2 ],[ x for x in l2 if x not in l1 ] ]
Here the not_matching_elements list would give us
[ [ "big" ],[] ]
which might help us move forward. But I can't find a way to develop the algorithm any further. Thanks.
Answer 0 (score: 2)
What you describe is very similar to the Levenshtein distance:
https://en.wikipedia.org/wiki/Levenshtein_distance
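You just need to compare array entries instead of characters.
You can simply take an existing Levenshtein implementation and adapt it to your needs: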
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
This algorithm may even work on arrays as it is ;)
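To make the "array entries instead of characters" point concrete, here is a minimal sketch of the unit-cost Levenshtein recurrence applied to lists of words. The names list_edit_distance and d are made up for this illustration and do not come from the linked pages; it is a plain memoized transcription of the definition, only suitable for short lists.

```python
from functools import lru_cache


def list_edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences of words (illustrative sketch)."""
    a, b = tuple(a), tuple(b)

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # insert the j remaining entries
        if j == 0:
            return i  # delete the i remaining entries
        return min(
            d(i - 1, j) + 1,                            # drop a[i-1]
            d(i, j - 1) + 1,                            # add b[j-1]
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),   # keep or substitute
        )

    return d(len(a), len(b))


l1 = ["I", "like", "big", "yellow", "bananas"]
l2 = ["I", "like", "yellow", "bananas"]
print(list_edit_distance(l1, l2))  # 1
```

Nothing here is specific to strings: the only operation applied to the entries is the `!=` comparison, which works for words just as it does for characters.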
Edit: this works:
def levenshtein(s1, s2):
    # Note: the argument swap and the row initialisation below are only
    # correct while insertionCost == deletionCost == 1.
    insertionCost = 1
    deletionCost = 1
    substitutionCost = 1
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    # len(s1) >= len(s2)
    if len(s2) == 0:
        return min(deletionCost, insertionCost) * len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j+1 instead of j since previous_row and current_row
            # are one entry longer than s2
            insertions = previous_row[j + 1] + insertionCost
            deletions = current_row[j] + deletionCost
            # substitution is free when the two entries are equal
            substitutions = previous_row[j] + substitutionCost * (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

# returns 2
print(levenshtein(['abc','def','ghi'],['abc','ghi','e']))
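If you want the prices from the question rather than unit costs, one possible adaptation is sketched below. It rests on an assumption taken from the question's example: either list may be edited, so removing a surplus word from one list is interchangeable with inserting it into the other, and the cheaper of the two prices is used for every unmatched word. The name match_cost and its parameter names are made up for this sketch.

```python
def match_cost(l1, l2, d_price, i_price, s_price):
    """Cheapest way to make two word lists equal (sketch, see assumption below)."""
    # Assumption: a word present in only one list can be fixed either by
    # deleting it there or by inserting it into the other list, so each
    # unmatched word costs the cheaper of the two prices.
    indel = min(d_price, i_price)

    # Row 0: empty prefix of l1 against every prefix of l2.
    previous_row = [j * indel for j in range(len(l2) + 1)]
    for i, w1 in enumerate(l1, start=1):
        # Column 0: prefix l1[:i] against the empty prefix of l2.
        current_row = [i * indel]
        for j, w2 in enumerate(l2, start=1):
            skip_w1 = previous_row[j] + indel      # w1 has no partner
            skip_w2 = current_row[j - 1] + indel   # w2 has no partner
            pair = previous_row[j - 1] + (0 if w1 == w2 else s_price)
            current_row.append(min(skip_w1, skip_w2, pair))
        previous_row = current_row
    return previous_row[-1]


l1 = ["I", "like", "big", "yellow", "bananas"]
l2 = ["I", "like", "yellow", "bananas"]
print(match_cost(l1, l2, 10, 1, 24))  # 1
```

Unlike the version above, the first row and column are scaled by the price, and there is no need to swap the arguments because both lists are treated the same way. If only one of the lists is allowed to change, keep d_price and i_price separate in the two skip branches instead of taking their minimum.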