Question

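My requirement is to find matching names between 2 lists. One list has 400 names and the second has 90,000 names. I get the expected result, but the process takes more than 35 minutes. Evidently there are 2 nested for loops, so it takes O(N*N) operations, which is the bottleneck. I have already removed duplicates from both lists. Can you help improve it? I checked many other questions but somehow could not implement what they suggest. If you think I simply missed an existing post, please point me to it. I will do my best to understand and replicate it.

Below is my code: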

from fuzzywuzzy import fuzz
infile=open('names.txt','r')
name=infile.readline()
name_list=[]
while name:
    name_list.append(name.strip())
    name=infile.readline()

print (name_list)

infile2=open('names2.txt','r')
name2=infile2.readline()
name_list2=[]
while name2:
    name_list2.append(name2.strip())
    name2=infile2.readline()

print (name_list2)

response = {}
for name_to_find in name_list:
    for name_master in name_list2:
        if fuzz.ratio(name_to_find,name_master) > 90:
            response[name_to_find] = name_master
            break

for key, value in response.items():
    print ("Key is ->" + key + "  Value is -> " + value)

Answer 1

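The most obvious approach is to use a hash table. Pseudocode:

Determine the smaller list
Create a hash table based on the smaller list:

hash1 = {name: 1 for name in name_list}

Iterate over the second list and check whether each name exists as a key in the hash table:

for name in name_list2:
    if name in hash1:
        print("found", name)

That's it. You get the list of names that exist in both lists.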
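
As a minimal runnable sketch of the hash-table idea, the function below uses a Python `set` for the lookup. The function name `split_exact_matches` and the sample names are hypothetical, chosen only for illustration. It finds exact matches only; a name that differs by a single character ends up in the leftover list.

```python
def split_exact_matches(names_to_find, master_names):
    """Return (exact, leftover): names found verbatim in master_names, and the rest."""
    master = set(master_names)  # hash-based membership test, O(1) on average
    exact = [n for n in names_to_find if n in master]
    leftover = [n for n in names_to_find if n not in master]
    return exact, leftover


# Hypothetical sample data
exact, leftover = split_exact_matches(["Ann Lee", "Bob Ray"], ["Bob Ray", "Anne Lee"])
print(exact)
print(leftover)
```

Building the set costs one pass over the long list, and each lookup afterwards is constant time on average, so the whole thing is linear in the combined list sizes.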

Answer 2

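Without knowing the algorithm behind fuzz, I doubt there is much we can do to reduce the asymptotic running time. There may be a few tricks for pruning obviously bad pairs, but probably not much beyond that. The other answer assumes you are doing exact matching, which does not apply to fuzzy string matching.

What you can try is to batch the calls and hope that fuzzywuzzy has some logic optimized for batches in its process module. Something like: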

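from fuzzywuzzy import process

response = {}
for name in names400:  # names400 / names90000: the short and the long name list
    # Score name against the whole long list in one call, keeping scores above 90
    matches = filter(lambda x: x[1] > 90, process.extract(name, names90000, limit=90000))
    for match_name, score in matches:
        response[match_name] = name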

Also note that on the fuzzywuzzy GitHub page they mention that using python-Levenshtein can speed up the computation by 4-10x.
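
The pruning of obviously bad pairs mentioned in this answer could look roughly like the sketch below. It assumes a similarity score of the form 2*M/T (M matching characters, T combined length), which is what `difflib.SequenceMatcher.ratio()` computes; here `difflib` serves as a stand-in scorer, and whether `fuzz.ratio` follows the same formula in your installation is an assumption to verify. Since M can never exceed the shorter string's length, the lengths alone give an upper bound, and pairs that cannot reach the threshold are skipped without running the expensive comparison. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def max_possible_ratio(a, b):
    """Upper bound (0-100) on a 2*M/T style score, from the lengths alone."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 200.0 * min(len(a), len(b)) / total


def similarity(a, b):
    """Stand-in scorer (assumption): difflib's ratio scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def find_matches(names_to_find, master_names, threshold=90):
    """Map each name to the first master name that scores above threshold."""
    response = {}
    for name in names_to_find:
        for candidate in master_names:
            # Cheap length check first: skip pairs that cannot pass.
            if max_possible_ratio(name, candidate) <= threshold:
                continue
            if similarity(name, candidate) > threshold:
                response[name] = candidate
                break
    return response


print(find_matches(["jonathan smith"], ["jon", "jonathan smyth"]))
```

This does not change the asymptotic running time; it only lowers the cost of most pairs to a length comparison. With a threshold of 90, a pair survives the check only when the shorter name is more than roughly 82% as long as the longer one.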

Improving fuzzywuzzy - matching names in 2 lists
