Question

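I've created a small program that checks whether an author is present in a database of authors. I haven't been able to find a module made specifically for this problem, so I'm writing it from scratch on top of modules for approximate string matching.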

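The database contains about 6000 authors and is very poorly formatted (many typos, variations, titles such as "Dr.", etc.). A query list usually holds between 500 and 1000 authors (and I have many of these lists), so speed is quite important.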

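My general strategy is to trim and filter the database as much as possible and look for exact matches. If no exact match is found, I move on to approximate string matching.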
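The two-stage strategy described above could be sketched roughly as follows, using only the standard library. The list of titles, the helper names (`normalize`, `build_index`, `lookup`) and the sample data are assumptions made for illustration, not part of the original program.

```python
import difflib
import re

# Hypothetical list of titles to strip; extend it for the real data.
TITLES = re.compile(r"\b(dr|prof|mr|mrs|ms)\b\.?", re.IGNORECASE)

def normalize(name):
    """Drop titles and punctuation, lower-case, collapse whitespace."""
    name = TITLES.sub(" ", name)
    name = re.sub(r"[^\w\s-]", " ", name)
    return " ".join(name.lower().split())

def build_index(authors):
    """Map each normalized name to the original database entries."""
    index = {}
    for author in authors:
        index.setdefault(normalize(author), []).append(author)
    return index

def lookup(query, index, n=3, cutoff=0.8):
    """Try an exact match first; fall back to approximate matching."""
    key = normalize(query)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=n, cutoff=cutoff)
    return [author for k in close for author in index[k]]

# Made-up sample data.
database = ["Dr. John Smith", "Jane  Doe", "Smyth, John"]
index = build_index(database)
print(lookup("john smith", index))  # exact hit after normalization
print(lookup("Jane Do", index))     # falls through to approximate matching
```

Building the index once per database means every exact hit costs a single dictionary lookup, so the slow approximate step only runs for the names that really need it.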

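I'm currently using the built-in difflib.get_close_matches, which does exactly what I want - however, it is extremely slow (several minutes). Therefore, I'm looking for other options: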

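What is the fastest module that, given a query string, can return the best (say 3) matches above some threshold from a database?
What is the fastest module for comparing two strings?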

The only one I've found is fuzzywuzzy, which is even slower than difflib.
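To make the two requirements concrete, here is a minimal standard-library sketch of "best n matches above a threshold" and of a plain pairwise comparison. It follows the same general approach as difflib.get_close_matches (cheap upper bounds first, full ratio only when needed), so it is a reference for the behaviour being asked for rather than a faster replacement. The function names and sample names are invented for the example.

```python
import heapq
from difflib import SequenceMatcher

def best_matches(query, candidates, n=3, threshold=0.6):
    """Return up to n (score, candidate) pairs scoring at least threshold."""
    matcher = SequenceMatcher()
    matcher.set_seq2(query)  # difflib caches data about seq2, so keep it fixed
    scored = []
    for cand in candidates:
        matcher.set_seq1(cand)
        # Cheap upper bounds first; compute the full ratio only if they pass.
        if (matcher.real_quick_ratio() >= threshold
                and matcher.quick_ratio() >= threshold):
            score = matcher.ratio()
            if score >= threshold:
                scored.append((score, cand))
    return heapq.nlargest(n, scored)

def similarity(a, b):
    """Similarity of two strings as a float in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

for score, name in best_matches("smith", ["smyth", "smith", "jones", "smithe"]):
    print("%.2f  %s" % (score, name))
```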

Answer 1

Try fuzzywuzzy with the native-C python-levenshtein lib installed.
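Before timing anything it may help to confirm which backend will actually be used. As far as I know, fuzzywuzzy falls back to the pure-Python difflib matcher (with a warning) when the `Levenshtein` module cannot be imported. The sketch below only checks whether that module is importable; the helper names are made up for the example.

```python
import importlib.util

def has_module(name):
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None

def describe_backend():
    # `Levenshtein` is the import name of the python-Levenshtein package.
    if has_module("Levenshtein"):
        return "C backend (python-Levenshtein)"
    return "pure-Python fallback (difflib)"

print(describe_backend())
```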

I ran a benchmark on my PC that finds the best candidates for 8 words in a list of ~19k words, with and without the C-native levenshtein backend installed (using pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl), and I got these timings:

Without the C-backend:
Compared 151664 words in 48.591717004776 sec (0.00032039058052521366 sec/search).
With the C-backend installed:
Compared 151664 words in 13.034106969833374 sec (8.594067787895198e-05 sec/search).

That is ~x4 faster (but not as much as I expected).
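As a quick check of the arithmetic behind these figures, the per-comparison times and the speedup can be recomputed from the totals reported above. The 151664 comparisons correspond to 8 search words against 18958 dictionary entries; the variable names are only illustrative.

```python
n_search = 8
n_dictionary = 18958
n_comparisons = n_search * n_dictionary  # 151664, as printed by the benchmark

t_pure = 48.591717004776    # seconds, without the C backend
t_c = 13.034106969833374    # seconds, with the C backend

per_cmp_pure = t_pure / n_comparisons
per_cmp_c = t_c / n_comparisons
speedup = t_pure / t_c

print("pure Python: %.3g sec per comparison" % per_cmp_pure)
print("C backend:   %.3g sec per comparison" % per_cmp_c)
print("speedup:     x%.2f" % speedup)
```

The ratio comes out a little under 4, which matches the "~x4" estimate.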

Here are the results:

0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]`
1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]`
2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]`
3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]`
4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]`
5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]`
6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]`
7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]`

And here is the Python code of the benchmark:

import os
import zipfile
from urllib import request as urlrequest
from fuzzywuzzy import process as fzproc
import time
import random

download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip'
zip_name = os.path.basename(download_url)
fname, _ = os.path.splitext(zip_name)

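def fuzzy_match(dictionary, search):
    """ Print the best fuzzy candidates in `dictionary` for each search word. """
    nsearch = len(search)
    for i, s in enumerate(search):
        # By default extractBests returns up to 5 (candidate, score) pairs.
        best = fzproc.extractBests(s, dictionary)
        print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best))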

def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996):
    """ Shuffle and split words-list into `dictionary` and `search-words`. """
    rnd = random.Random(0)
    rnd.shuffle(wordslist)
    nwords = len(wordslist)
    ndictionary = int(dict_split_ratio * nwords)

    dictionary = wordslist[:ndictionary]
    search = wordslist[ndictionary:]
    fuzzy_match(dictionary, search)

    return ndictionary, (nwords - ndictionary)

def run_benchmark():
    if not os.path.exists(zip_name):
        urlrequest.urlretrieve(download_url, filename=zip_name)

    with zipfile.ZipFile(zip_name, 'r') as zfile:
        with zfile.open(fname) as words_file:
            blines = words_file.readlines()
            wordslist = [line.decode('ascii').strip() for line in blines]
            wordslist = wordslist[4:]  # Skip header.

            t_start = time.time()
            ndict, nsearch = benchmark_fuzzy_match(wordslist)
            t_finish = time.time()

            t_elapsed = t_finish - t_start
            ncomparisons = ndict * nsearch
            sec_per_search = t_elapsed / ncomparisons
            msg = "Compared %s words in %s sec (%s sec/search)."
            print(msg % (ncomparisons, t_elapsed, sec_per_search))

if __name__ == '__main__':
    run_benchmark()

Answer 2

Python's Natural Language Toolkit (nltk) might have some additional resources you could try out - this google groups thread looks like a good start. Just an idea.
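One of the resources nltk offers is an edit-distance function (`nltk.edit_distance`). The sketch below is not nltk code; it is a plain-Python illustration of the Levenshtein distance that such functions compute, written here only to show what the measure means for names like the ones in the benchmark above.

```python
def edit_distance(a, b):
    """Levenshtein distance using the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(edit_distance("Peil", "Pfeil"))
print(edit_distance("Tagoe", "Sagoe"))
```

A pure-Python loop like this will be far slower than a C implementation, so it is useful for understanding the measure rather than for the 6000-author workload.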

Approximate string matching of author names - modules and strategies
