Question

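I have 12 million company names in my database and I want to match them against an offline list. What is the best algorithm for this? I have already tried Levenshtein distance, but it does not give the expected results. Can you suggest some algorithms for this? The problem is matching companies like these: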

G corp. ---- this needs to be mapped to G Corporation
water Inc ---- Water Incorporated

Answer 1

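You should probably start by expanding the known suffixes in both lists (the database and the offline list). This takes some manual work to figure out the right mappings, for example with regular expressions: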

\s+inc\.?$ -> Incorporated
\s+corp\.?$ -> Corporation

You may also need other normalization steps, such as lowercasing everything and removing punctuation.
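A minimal sketch of that normalization step, using only the standard library. The `SUFFIXES` table and the `normalize` helper are hypothetical names made up for illustration, and the two patterns are just the ones from this answer; a real mapping would need many more entries found by inspecting the data.

```python
import re

# Hypothetical suffix map (assumption): extend it after inspecting your own data.
# The replacement keeps a leading space because the pattern consumes the whitespace.
SUFFIXES = [
    (re.compile(r"\s+inc\.?$", re.IGNORECASE), " Incorporated"),
    (re.compile(r"\s+corp\.?$", re.IGNORECASE), " Corporation"),
]

def normalize(name: str) -> str:
    """Expand known suffixes, lowercase, strip punctuation, collapse whitespace."""
    name = name.strip()
    for pattern, replacement in SUFFIXES:
        name = pattern.sub(replacement, name)
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)        # drop punctuation
    return re.sub(r"\s+", " ", name).strip()   # collapse runs of whitespace

print(normalize("G corp."))    # g corporation
print(normalize("water Inc"))  # water incorporated
```

Running both lists through the same function means that the examples from the question become exact matches before any fuzzy algorithm is involved.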

After that you can use Levenshtein distance or another fuzzy matching algorithm.
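For reference, here is a rough pure-Python sketch of Levenshtein distance (the textbook dynamic-programming version, not any particular library's implementation). The `similarity` helper is an assumed convenience for scaling the distance to a score between 0 and 1. The first example shows why the raw names from the question score badly: an unexpanded abbreviation costs one edit per missing character.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("g corp", "g corporation"))         # 7
print(levenshtein("g corporation", "g corporation"))  # 0
```

For 12 million names a compiled implementation would be a better choice than this loop; the sketch is only meant to show what is being computed.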

Answer 2

You can use fuzzyset: put all the company names into a fuzzy set, then look up a new term to get a match score. An example:

import fuzzyset

fz = fuzzyset.FuzzySet()
# Create a set of terms we would like to match against in a fuzzy way
for name in ["Diane Abbott", "Boris Johnson"]:
    fz.add(name)

# Now see if our sample terms fuzzy-match any of the stored terms
sample_term = 'Boris Johnstone'
print(fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley'))

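Additionally, if you want to match on semantics rather than just the strings (which works better in cases like this), take a look at spaCy similarity. An example from the spaCy docs: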

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Answer 3

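Interzoid's Company Name Match Advanced API generates similarity keys to help with this problem. You call the API to generate a similarity key that strips out noise and accounts for known synonyms, Soundex, ML, and so on. You then match on the similarity keys rather than on the data itself to get a higher match rate. (Commercial API. Disclaimer: I work for Interzoid.)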

https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance
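To illustrate the general idea of matching on keys instead of raw names, here is a deliberately crude sketch. It is not Interzoid's algorithm or API: `similarity_key`, `group_by_key`, and the `NOISE` word list are all hypothetical and far simpler than what a commercial service would do.

```python
import re
from collections import defaultdict

# Hypothetical noise words (assumption); a real list would be much longer.
NOISE = {"inc", "incorporated", "corp", "corporation", "co", "company",
         "ltd", "limited", "llc", "the"}

def similarity_key(name: str) -> str:
    """Crude key: lowercase, drop punctuation and noise words, sort the tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    kept = [t for t in tokens if t not in NOISE]
    return " ".join(sorted(kept))

def group_by_key(names):
    """Map each key to the list of names that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[similarity_key(name)].append(name)
    return dict(groups)

groups = group_by_key(["G corp.", "G Corporation", "water Inc", "Water Incorporated"])
print(groups)
```

Names that reduce to the same key can then be joined with an ordinary equality lookup, which scales much better than pairwise fuzzy comparison. One caveat of this toy version: a name made up only of noise words collapses to an empty key.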

Answer 4

Use MatchKraft to fuzzy match company names across two lists.

http://www.matchkraft.com/

Levenshtein distance alone is not enough to solve this problem. You also need the following:

Heuristics to improve execution time
Information retrieval (Lucene) and SQL
A database of company names
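As a rough illustration of the first two points, one common heuristic is to build an inverted index over character trigrams and only run the expensive fuzzy comparison on names that share enough trigrams with the query. The sketch below is plain Python and is not how Lucene or MatchKraft work internally; the function names and the `min_shared` threshold are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def trigrams(name: str) -> set:
    """Character 3-grams of a padded, lowercased name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(names):
    """Inverted index: trigram -> ids of the names that contain it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].add(i)
    return index

def candidates(query, names, index, min_shared=3):
    """Names sharing at least `min_shared` trigrams with the query, most overlap first."""
    counts = Counter()
    for gram in trigrams(query):
        for i in index.get(gram, ()):
            counts[i] += 1
    return [names[i] for i, n in counts.most_common() if n >= min_shared]

database = ["g corporation", "water incorporated", "acme limited"]
index = build_index(database)
print(candidates("g corp", database, index))  # ['g corporation']
```

Each lookup then touches only the few names returned as candidates rather than all 12 million, and a distance such as Levenshtein can be used to rank that short list.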

It is better to use an existing tool than to write your own program in Python.

Approximate matching of company names
