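I have a list of strings that all represent the same object, but each string may spell the name slightly differently. I am trying to find the most "consensus" string in the list so that I can use it as a "golden source" value.
An example of such data might be: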
Procter & Gamble Co.
Procter & Gamble co
Procter & Gamble Co (The)
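I have implemented a working example, but its logic is not very clear, and I wonder whether there is a library that can help me do this efficiently. My algorithm basically looks for the best pair of values rather than the best one-to-many set (which I honestly do not know how to compute). It works well because my lists usually have 3-5 elements, but as a list grows, two identical wrong values could end up beating the better result.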
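To make the pair versus one-to-many distinction concrete, here is a minimal sketch that scores each value against all the others and keeps the one with the highest total. It is only an illustration under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, the names `similarity`, `best_pair` and `most_central` are hypothetical, and the two "P&G Corp" entries are made-up data added to show the duplicate problem.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_pair(values):
    """Pair-based selection: return the two values that are most alike."""
    return max(combinations(values, 2), key=lambda pair: similarity(*pair))


def most_central(values):
    """One-to-many selection: return the value most similar to all the others."""
    def total_score(i):
        return sum(similarity(values[i], other)
                   for j, other in enumerate(values) if j != i)
    return values[max(range(len(values)), key=total_score)]


# Hypothetical data: three close variants plus two identical outliers.
names = [
    "Procter & Gamble Co.",
    "Procter & Gamble Co",
    "Procter & Gamble Company (The)",
    "P&G Corp",
    "P&G Corp",
]
print(best_pair(names))     # the two identical outliers win on pair score
print(most_central(names))  # a value close to the majority wins on total score
```

With this data the pair-based choice is captured by the two identical outliers, while the total-score choice stays with the majority spelling. The cost is quadratic in the list length, which should be negligible for lists of a few dozen names.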
My example is below:
import logging

from fuzzywuzzy import fuzz  # assumed source of fuzz.ratio; the import was not shown


def best_name(frame):
    """Return the entry whose name best represents the sources in frame."""
    # Build a list of {'value': ..., 'source': ...} dicts from the frame data.
    # frame2dict and reverse_array are helpers that are not shown in the question.
    data = frame2dict(frame)
    logging.info("Getting the best name, source data: {}".format(data))

    # Compare the values in each row, skipping comparison with self.
    for item in data:
        item['matches'] = dict()
        for each in data:
            if item['source'] != each['source']:
                item['matches'][each['source']] = fuzz.ratio(item['value'], each['value'])
    logging.info("Data with fuzz ratios: {}".format(data))

    # Build a summary array to identify the closest match, storing each pair once.
    summary = list()
    for item in data:
        for match in item['matches']:
            row = [item['source'], item['matches'][match], match]
            if row not in summary and reverse_array(row) not in summary:
                summary.append(row)
    logging.info("Summary table: {}".format(summary))

    # Extract the best match from the summary array.
    best_pair = None
    for item in summary:
        # Keep the whole row (not just its score) so the sources can be looked up below.
        if best_pair is None or best_pair[1] < item[1]:
            best_pair = item
    logging.info("Best pair: {}".format(best_pair))

    # Compare the lengths of the two candidate values and return the shorter one.
    a = next(x for x in data if x['source'] == best_pair[0])
    b = next(x for x in data if x['source'] == best_pair[2])
    logging.info("Two final candidates: {} and {}, returning shortest".format(a, b))
    if len(a['value']) > len(b['value']):
        return b
    return a
Here is the trace from an actual run:
INFO:root:Getting the best name, source data: [{'value': 'Procter & Gamble Co.', 'source': 'WSJ'}, {'value': 'Procter & Gamble Co', 'source': 'RTS'}, {'value': 'Procter & Gamble Company (The)', 'source': 'NYSE'}]
INFO:root:Data with fuzz ratios: [{'value': 'Procter & Gamble Co.', 'source': 'WSJ', 'matches': {'RTS': 97, 'NYSE': 76}}, {'value': 'Procter & Gamble Co', 'source': 'RTS', 'matches': {'WSJ': 97, 'NYSE': 78}}, {'value': 'Procter & Gamble Company (The)', 'source': 'NYSE', 'matches': {'WSJ': 76, 'RTS': 78}}]
INFO:root:Summary table: [['WSJ', 97, 'RTS'], ['WSJ', 76, 'NYSE'], ['RTS', 78, 'NYSE']]
INFO:root:Best pair: ['WSJ', 97, 'RTS']
INFO:root:Two final candidates: {'value': 'Procter & Gamble Co.', 'source': 'WSJ', 'matches': {'RTS': 97, 'NYSE': 76}} and {'value': 'Procter & Gamble Co', 'source': 'RTS', 'matches': {'WSJ': 97, 'NYSE': 78}}, returning shortest
It works, but I wonder whether something like difftools could do this in a smarter way?
Answer 0 (score: 4)
Use the Levenshtein module:
import Levenshtein

variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)",
]
Levenshtein.median(variants)
# => 'Procter & Gamble Co'
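
Note that the median shown above is not one of the three input strings: it is a newly constructed string that sits close to all of them. If the "golden" value has to be a name that actually occurs in the list, a set median (the list member with the smallest total edit distance to the others) may be a better fit. I believe the Levenshtein package documents a `setmedian` function for this, but the sketch below avoids depending on it and uses a plain dynamic-programming edit distance instead. The names `edit_distance` and `set_median` are hypothetical helpers, not part of any library.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def set_median(strings):
    """Return the member of strings with the smallest total distance to the rest."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)",
]
print(set_median(variants))
# => 'Procter & Gamble Co.'
```

Here the total distances are 8, 9 and 13 for the three variants, so the first one is returned. Unlike the pair-based approach in the question, every candidate is scored against all of the others, so a single close pair cannot decide the result on its own.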