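Abbreviation detection in Python

Asked: 2018-07-18 15:42:08

Tags: python string nlp similarity fuzzy-comparison

I am trying to evaluate the similarity of company names, but I am having trouble matching a name against its abbreviation. For example: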

IBM
The International Business Machines Corporation

I tried using fuzzywuzzy to measure the similarity:

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio("IBM","The International Business Machines Corporation")
33
>>> fuzz.partial_ratio("General Electric","GE Company")
20
>>> fuzz.partial_ratio("LTCG Holdings Corp","Long Term Care Group Inc")
39
>>> fuzz.partial_ratio("Young Innovations Inc","YI LLC")
33

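Do you know of any technique that gives a higher similarity score for abbreviations like these?

1 answer:

Answer 0 (score: 3)

For the examples above, building an initialism from each full name and fuzzy-matching it against the abbreviations seems to give better results: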

from fuzzywuzzy import fuzz, process

companies = ['The International Business Machines Corporation', 'General Electric', 'Long Term Care Group', 'Young Innovations Inc']
abbreviations = ['YI LLC', 'LTCG Holdings Corp', 'IBM', 'GE Company']

# Build an initialism from the first letter of each word, e.g. 'General Electric' -> 'GE'
queries = [''.join(word[0] for word in name.split()) for name in companies]

# Rank every candidate abbreviation against each initialism
for query in queries:
    print(query, process.extract(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))

This produces:

TIBMC [('IBM', 100), ('LTCG Holdings Corp', 50), ('YI LLC', 29), ('GE Company', 20)]
GE [('GE Company', 100), ('LTCG Holdings Corp', 50), ('YI LLC', 0), ('IBM', 0)]
LTCG [('LTCG Holdings Corp', 100), ('YI LLC', 50), ('GE Company', 25), ('IBM', 0)]
YII [('YI LLC', 80), ('LTCG Holdings Corp', 33), ('IBM', 33), ('GE Company', 33)]
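The queries above still carry letters from words such as "The", "Corporation" and "Inc" (hence TIBMC and YII). One possible refinement, not part of the original answer, is to drop such words before taking first letters. The sketch below uses only the standard library; the helper name `initialism` and the two word sets are illustrative assumptions and would need tuning for real data.

```python
# Hypothetical word lists: illustrative assumptions, not an exhaustive set.
STOPWORDS = {"the", "of", "and"}
LEGAL_SUFFIXES = {"inc", "llc", "corp", "corporation", "company", "co", "ltd", "holdings"}


def initialism(name, drop=STOPWORDS | LEGAL_SUFFIXES):
    """Join the upper-cased first letter of every word that is not in `drop`."""
    # Strip trailing punctuation such as the dot in "Inc." before comparing.
    words = [w.strip(".,") for w in name.split()]
    kept = [w for w in words if w and w.lower() not in drop]
    return "".join(w[0].upper() for w in kept)


companies = [
    "The International Business Machines Corporation",
    "General Electric",
    "Long Term Care Group",
    "Young Innovations Inc",
]

for company in companies:
    print(company, "->", initialism(company))
```

Passing `drop=set()` keeps every word, which reproduces the behaviour of the list comprehension in the answer apart from the upper-casing.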

A small modification to the for loop, keeping only the best match for each company:

for query, company in zip(queries, companies):
    print(company, '-', process.extractOne(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))

gives:

The International Business Machines Corporation - ('IBM', 100)
General Electric - ('GE Company', 100)
Long Term Care Group - ('LTCG Holdings Corp', 100)
Young Innovations Inc - ('YI LLC', 80)
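If fuzzywuzzy is not available, a rough stand-in can be sketched with the standard library. This swaps in `difflib.SequenceMatcher` for the fuzzywuzzy scorer, so the scores will differ from those above. It also assumes that the abbreviation is the first word of each candidate string, which holds for these four examples but not in general. The helper names `first_letters` and `best_abbreviation` are hypothetical.

```python
from difflib import SequenceMatcher

companies = [
    "The International Business Machines Corporation",
    "General Electric",
    "Long Term Care Group",
    "Young Innovations Inc",
]
abbreviations = ["YI LLC", "LTCG Holdings Corp", "IBM", "GE Company"]


def first_letters(name):
    """Upper-cased first letter of every word, e.g. 'General Electric' -> 'GE'."""
    return "".join(word[0] for word in name.split()).upper()


def best_abbreviation(company, candidates):
    """Return the candidate whose first word is most similar to the initialism."""
    query = first_letters(company)

    def score(candidate):
        # Assumption: the abbreviation is the first token of the candidate.
        head = candidate.split()[0].upper()
        return SequenceMatcher(None, query, head).ratio()

    return max(candidates, key=score)


for company in companies:
    print(company, "-", best_abbreviation(company, abbreviations))
```

Comparing against the first word only keeps suffixes such as "Holdings Corp" from diluting the score. Names whose abbreviation appears elsewhere in the string would need a different rule.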