How do I compare each element of a list with every element of another list?

Time: 2016-11-24 06:10:18

Tags: python text analytics

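I want to compare a list of extracted promo codes against a list of correct promo codes.

If a promo code in extract_list has no exact match in the list of correct promo codes, that promo code contains an error. To find the right code in the correct_promo_codes list, I need the promo code with the minimum edit distance (Levenshtein distance) to the code being compared (the one from extract_list).

What I have so far:

Code: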

import csv

with open("all_correct_promo.csv","rb") as file1:
    reader1 = csv.reader(file1)
    correctPromoList = list(reader1)
    #print correctPromoList

with open("all_extracted_promo.csv","rb") as file2:
    reader2 = csv.reader(file2)
    extractedPromoList = list(reader2)
    #print extractedPromoList

incorrectPromo = []
count = 0
for extracted in extractedPromoList:
    if(extracted not in correctPromoList):
        incorrectPromo.append(extracted)
    else:
        count = count + 1
#print incorrectPromo

for promos in incorrectPromo:
    print promos

1 answer:

Answer 0 (score: 0)

According to the nltk docs:

nltk.metrics.distance.edit_distance(s1, s2, transpositions=False)

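Calculate the Levenshtein edit distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted to transform s1 into s2. For example, transforming "rain" to "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could have been done in other orders, but at least three steps are needed.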
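
To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation, assuming Python 3. The name `levenshtein` is a hypothetical helper for illustration, not part of nltk, and it does not handle transpositions, so it corresponds to calling the nltk function with transpositions=False.

```python
def levenshtein(s1, s2):
    """Return the Levenshtein distance between s1 and s2 (no transpositions)."""
    # previous[j] = distance between the prefix of s1 processed so far and s2[:j];
    # before any character of s1 is read, that is simply j insertions.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2 (free on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("rain", "shine"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of s2 rather than with the product of both lengths.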

Coming to the code, I think a few changes to the second half will capture the edit distance:

from nltk.metrics import distance # slow to load

extractedPromoList = ['abc','acd','abd'] # dummy stand-in for the CSV of extracted promo codes
correctPromoList = ['abc','aba','xbz','abz','abx'] # dummy stand-in for the CSV of real promo codes

def find_min_edit(str_, list_):
    distances = {}
    min_dist = float('inf') # no distance computed yet
    for correct_promo in list_:
        # use the str_ argument (not the global loop variable) and pass transpositions by keyword;
        # with transpositions=True an adjacent swap counts as a single edit
        dist = distance.edit_distance(str_, correct_promo, transpositions=True)
        distances[correct_promo] = dist # store each score for real promo codes
        if dist < min_dist:
            min_dist = dist # store min distance
    # return a comma separated string of all real promo codes at the minimum distance
    return ','.join([promo for promo, d in distances.items() if d == min_dist])

incorrectPromo = {}
count = 0
for extracted in extractedPromoList:
    print('Computing %dth promo code...' % count)
    incorrectPromo[extracted] = find_min_edit(extracted, correctPromoList) # get comma separated str of real promo codes nearest to extracted
    count += 1
print(incorrectPromo)

Output

Computing 0th promo code...
Computing 1th promo code...
Computing 2th promo code...
{'abc': 'abc', 'abd': 'abx,aba,abz,abc', 'acd': 'abx,aba,abz,abc'}
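
One thing to watch when moving from the dummy lists to the real files: list(csv.reader(...)) as used in the question yields a list of rows, each of which is itself a list, while edit_distance expects plain strings. The sketch below, assuming Python 3 and one promo code per row in the first column, flattens the rows and keeps only the codes with no exact match. `load_codes` is a hypothetical helper name, and the in-memory `io.StringIO` objects merely stand in for the two CSV files.

```python
import csv
import io


def load_codes(file_obj):
    """Read one promo code per row (first column), skipping blank rows."""
    return [row[0].strip() for row in csv.reader(file_obj) if row and row[0].strip()]


# In-memory stand-ins for all_correct_promo.csv and all_extracted_promo.csv
correct_csv = io.StringIO("abc\naba\nxbz\nabz\nabx\n")
extracted_csv = io.StringIO("abc\nacd\nabd\n")

correct_codes = load_codes(correct_csv)
extracted_codes = load_codes(extracted_csv)

# A set makes each exact-match check a constant-time lookup on average
correct_set = set(correct_codes)
incorrect_codes = [code for code in extracted_codes if code not in correct_set]
print(incorrect_codes)  # ['acd', 'abd']
```

With real files under Python 3, the file objects would come from open("all_correct_promo.csv", newline="") rather than StringIO. Only the codes in incorrect_codes then need to be passed to find_min_edit, which avoids computing distances for codes that already match exactly.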