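Arrays

Time: 2016-03-02 20:34:40

Tags: python list

I am thinking about a very simple plagiarism detector. For simplicity, suppose you start with two lists, each containing some string elements, for example: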

l1 = [ "I","like","big","yellow","bananas" ]
l2 = [ "I","like","yellow","bananas" ]

The user can also specify how much each operation "costs", let's say:

DeletePrice = 10          # deleting a word from one list
InsertPrice = 1           # inserting a word into one list
SubstitutePrice = 24      # substituting one word for another

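The task is to make the lists match while keeping the combined price as low as possible. There are two obvious ways to match these arrays: delete the word "big" from the first array (which costs 10) or insert the word "big" into the second array (which costs 1). So the algorithm's answer should be 1.

My thinking so far is that we should start by using a list comprehension to find the elements that do not match: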

def Plagiarism( l1,l2,dPrice,iPrice,sPrice ):
    not_matching_elements = [ [ x for x in l1 if x not in l2 ],[ x for x in l2 if x not in l1 ] ]
    return not_matching_elements

For the lists above, not_matching_elements would be [ [ "big" ],[] ], which might help us move forward. But I can't find a way to develop the algorithm any further. Thanks.

1 Answer:

Answer 0: (score: 2)

What you describe is very similar to the Levenshtein distance:

https://en.wikipedia.org/wiki/Levenshtein_distance

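You just need to compare array entries instead of characters.

You can simply take an existing Levenshtein implementation and adapt it to your needs: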

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python

This algorithm might even work on arrays as is ;)

Edit: this works:

def levenshtein(s1, s2):
    # All three operations cost 1 here; the first row and column below rely on that.
    insertionCost = 1
    deletionCost = 1
    substitutionCost = 1

    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return min(deletionCost, insertionCost) * len(s1)

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j + 1 instead of j since previous_row and current_row are one element longer than s2
            insertions = previous_row[j + 1] + insertionCost
            deletions = current_row[j] + deletionCost
            substitutions = previous_row[j] + substitutionCost * (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]


#returns 2
print(levenshtein(['abc','def','ghi'],['abc','ghi','e']))
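The function above fixes every cost at 1, whereas the question uses different prices. Below is a minimal sketch of one way to plug those prices in. It assumes, as the question's own example suggests, that an unmatched word can be handled either by deleting it from its own list or by inserting it into the other list, so skipping a word costs the cheaper of the two. The name match_cost and its parameters are hypothetical, chosen only for illustration.

```python
def match_cost(l1, l2, delete_price, insert_price, substitute_price):
    """Cheapest total price to make two word lists match (illustrative sketch)."""
    # Assumption: an unmatched word is either deleted from its own list or
    # inserted into the other one, whichever is cheaper.
    gap = min(delete_price, insert_price)

    # previous_row[j] = cheapest price to match the words of l1 seen so far with l2[:j]
    previous_row = [j * gap for j in range(len(l2) + 1)]
    for i, w1 in enumerate(l1, start=1):
        current_row = [i * gap]  # all of l1[:i] unmatched against an empty prefix
        for j, w2 in enumerate(l2, start=1):
            skip_w1 = previous_row[j] + gap          # w1 has no partner
            skip_w2 = current_row[j - 1] + gap       # w2 has no partner
            pair = previous_row[j - 1] + (0 if w1 == w2 else substitute_price)
            current_row.append(min(skip_w1, skip_w2, pair))
        previous_row = current_row

    return previous_row[-1]


l1 = ["I", "like", "big", "yellow", "bananas"]
l2 = ["I", "like", "yellow", "bananas"]
print(match_cost(l1, l2, delete_price=10, insert_price=1, substitute_price=24))  # prints 1
```

If the operations were only allowed on the first list, the two skip cases would need their own prices (delete_price for a word missing from l2, insert_price for a word missing from l1) instead of the shared gap value.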