Question

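I have a list of strings that all represent the same object, but each string may be a slightly different variant of the name. I am trying to find the most "consensus" string in the list so that I can use it as a "golden source" value.

An example of such data might be: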

Procter & Gamble Co.
Procter & Gamble co
Procter & Gamble Co (The)

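I have implemented a working example, but its logic is clunky, and I would like to know whether there is a library that could help me do this efficiently. My algorithm basically looks for the best pair of values rather than the best one-to-many set (which I don't really know how to do). It works well because my lists usually have 3-5 elements, but as the lists grow I may reach a point where two identical wrong values beat a better one.

My example is below: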

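import logging

from fuzzywuzzy import fuzz  # assumed source of fuzz.ratio; the question does not show its imports


def best_name(frame):
    """Return the most "consensus" name record from frame data."""
    # Build a list of {'source': ..., 'value': ...} dicts from the frame.
    # frame2dict is the author's own helper (not shown).
    data = frame2dict(frame)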
    logging.info("Getting the best name, source data: {}".format(data))

    """compare values in each row, skipping comparison with self"""
    for item in data:
        item['matches'] = dict()
        for each in data:
            # Skip comparing a record with itself
            if item['source'] != each['source']:
                item['matches'][each['source']] = fuzz.ratio(item['value'], each['value'])
    logging.info("Data with fuzz ratios: {}".format(data))

    """Build a summary array to identify the closest match"""
    summary = list()
    for item in data:
        for match in item['matches']:
            row = [item['source'],item['matches'][match], match]
            # Store each pair once; reverse_array is the author's own helper (not shown)
            if row not in summary and reverse_array(row) not in summary:
                summary.append(row)
    logging.info("Summary table: {}".format(summary))

    """Extract the best match from summary array"""
    best_pair = None
    for item in summary:
        # Keep the whole [source, ratio, source] row, not just the ratio,
        # because best_pair[0] and best_pair[2] are looked up below
        if best_pair is None or best_pair[1] < item[1]:
            best_pair = item
    logging.info("Best pair: {}".format(best_pair))

    """Compare len of two candidate values and return the value of shortest"""
    a = next(x for x in data if x['source'] == best_pair[0])
    b = next(x for x in data if x['source'] == best_pair[2])
    logging.info("Two final candidates: {} and {}, returning shortest".format(a, b))

    if len(a['value']) > len(b['value']):
        return b
    else:
        return a

And here is the actual trace:

INFO:root:Getting the best name, source data: [{'value': 'Procter & Gamble Co.', 'source': 'WSJ'}, {'value': 'Procter & Gamble Co', 'source': 'RTS'}, {'value': 'Procter & Gamble Company (The)', 'source': 'NYSE'}]
INFO:root:Data with fuzz ratios: [{'value': 'Procter & Gamble Co.', 'source': 'WSJ', 'matches': {'RTS': 97, 'NYSE': 76}}, {'value': 'Procter & Gamble Co', 'source': 'RTS', 'matches': {'WSJ': 97, 'NYSE': 78}}, {'value': 'Procter & Gamble Company (The)', 'source': 'NYSE', 'matches': {'WSJ': 76, 'RTS': 78}}]
INFO:root:Summary table: [['WSJ', 97, 'RTS'], ['WSJ', 76, 'NYSE'], ['RTS', 78, 'NYSE']]
INFO:root:Best pair: ['WSJ', 97, 'RTS']
INFO:root:Two final candidates: {'value': 'Procter & Gamble Co.', 'source': 'WSJ', 'matches': {'RTS': 97, 'NYSE': 76}} and {'value': 'Procter & Gamble Co', 'source': 'RTS', 'matches': {'WSJ': 97, 'NYSE': 78}}, returning shortest

It works, but I am wondering whether there is something like difftools that could do this in a smarter way?

Answer 1

Use the Levenshtein module:

variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)"
]

import Levenshtein
Levenshtein.median(variants)
# => 'Procter & Gamble Co'
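
Note that the result above is not one of the three input strings: `Levenshtein.median` builds an approximate "median" string rather than picking a member of the list. If the golden value has to be one of the existing variants, the one-to-many idea from the question can be sketched with the standard library only: score each string by its total edit distance to all the others and keep the one with the lowest total. The helper names `edit_distance` and `set_median` below are made up for this illustration; the Levenshtein module also ships a `setmedian` function that should serve the same purpose.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def set_median(strings):
    """Return the member of strings with the smallest total distance to all members."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)",
]

print(set_median(variants))
# => 'Procter & Gamble Co.'
```

This compares every string with every other one, so the cost grows quadratically with the list size, which should be fine for lists of a few dozen names. Ties go to whichever candidate comes first in the list.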
