Question

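I have a list of properly formatted company names, and I am trying to find where those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa, or American Airlines Group Inc may appear as American Airlines.

How do I loop through the full contents of the document and return the properly formatted company name whenever a close match is found?

I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that they look at each individual word rather than at clusters of words: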

from fuzzywuzzy import process
from difflib import get_close_matches

company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']

text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'

# using fuzzywuzzy: extractOne returns a (best_match, score) tuple for each word
for word in text.split():
    print('- ' + word+', ', ', '.join(map(str,process.extractOne(word, company_name))))

# using get_close_matches: returns a list with at most one candidate per word
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=.4)
    print(match)

Answer 1

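I am working on a similar problem. Fuzzywuzzy uses difflib internally, and both are slow on large datasets.

Chris van den Berg's pipeline converts company names into vectors of character 3-grams weighted with a TF-IDF matrix, then compares those vectors using cosine similarity.

The pipeline is fast and gives accurate results for partially matching strings as well.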
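
To make the idea concrete, here is a minimal standard-library sketch of character 3-gram TF-IDF vectors compared by cosine similarity. It is not Chris van den Berg's actual code; the helper names (char_ngrams, build_index, best_match), the padding with spaces, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

company_name = ['American Tower Inc', 'American Airlines Group Inc',
                'Atlantic American Corp', 'American International Group']

def char_ngrams(s, n=3):
    # Lowercase and pad with spaces so word boundaries become part of the n-grams.
    s = ' ' + s.lower().strip() + ' '
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def normalize(vec):
    # Scale a sparse vector (dict) to unit length; empty vectors are returned as-is.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec

def build_index(names, n=3):
    # Hypothetical helper: one unit-length TF-IDF vector per company name.
    docs = [Counter(char_ngrams(name, n)) for name in names]
    df = Counter(g for doc in docs for g in doc)  # document frequency per n-gram
    # Smoothed IDF (an assumption; other weightings work as well).
    idf = {g: math.log((1 + len(names)) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors

def best_match(query, names, idf, vectors, n=3):
    # Vectorize the query with the same IDF weights, ignoring unseen n-grams.
    counts = Counter(char_ngrams(query, n))
    q = normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    # Both sides are unit length, so the dot product is the cosine similarity.
    scores = [sum(w * v.get(g, 0.0) for g, w in q.items()) for v in vectors]
    best = max(range(len(names)), key=scores.__getitem__)
    if scores[best] == 0.0:
        return None, 0.0  # nothing in common with any known name
    return names[best], scores[best]

idf, vectors = build_index(company_name)
for candidate in ['American Tower', 'American Airlines']:
    print(candidate, '->', best_match(candidate, company_name, idf, vectors))
```

The candidates here are typed in by hand. On a real document you would need to generate them first, for example from sliding windows of a few consecutive words, and keep only matches above a similarity threshold that you tune on your own data.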

Answer 2

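For this type of task I use a record linkage algorithm, which will find those clusters for you with the help of ML. You will have to provide some labeled examples so that the algorithm can learn to label the rest of the dataset correctly.

Here is some information: https://pypi.org/project/pandas-dedupe/

Cheers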

How to find company names in text using Python
