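Speeding up fuzzy matching of words across two lists

Time: 2014-08-06 01:23:20

Tags: python

I have a list of about 500 items. I want to replace every fuzzy-matched item in that list with the shortest item among its matches.

Is there a way to speed up my fuzzy-matching implementation?

Note: I posted a similar question earlier, but I am rewriting it because it got no responses.

My implementation: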

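from collections import defaultdict

from fuzzywuzzy import fuzz


def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    """
    Group fuzzy-matched strings under the shorter string of each pair.

    Example inputs:
        list1 = list(ds1.Title)
        list2 = list(ds1.Title)
    """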
    matchdict = defaultdict(list)

    for i, u in enumerate(list1):

        for i1, u1 in enumerate(list2):

            #Since list orders are the same, this makes sure this isn't the same item.
            if i != i1:

                if fuzz.partial_token_sort_ratio(u, u1) >= cutoff:
                    pair = (u, u1)

                    # Because there are potential duplicates, the key has to be constant.
                    # Otherwise, using the list1 item as the key would make both
                    # duplicate items serve as keys.
                    #
                    # Potential problem: what if there are different shortstr values?

                    shortstr = min(pair, key=len)
                    longstr = max(pair, key=len)     
                    matchdict[shortstr].append(longstr)
    return matchdict 

1 answer:

Answer 0 (score: 2)

I am assuming you have python-Levenshtein installed; that will give you a 4x speedup.
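One quick way to check whether the C extension is actually available is to look for its module before running a long job. This is only a minimal sketch: it assumes the package exposes a top-level module named `Levenshtein`, and the helper name `has_levenshtein` is made up for illustration.

```python
import importlib.util


def has_levenshtein():
    """Return True if a top-level module named 'Levenshtein' can be found."""
    # Assumption: python-Levenshtein installs a module called "Levenshtein".
    return importlib.util.find_spec("Levenshtein") is not None


if __name__ == "__main__":
    print("python-Levenshtein available:", has_levenshtein())
```

If this reports False, fuzzywuzzy should still work, just with its slower pure-Python fallback.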

Optimizing the loop and the dictionary access:

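import itertools

from fuzzywuzzy import fuzz


def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()

    # Visit every ordered pair of distinct indices (i1 != i2).
    for i1, i2 in itertools.permutations(range(len(list1)), 2):

        u1 = list1[i1]
        u2 = list2[i2]

        if fuzz.partial_token_sort_ratio(u1, u2) >= cutoff:
            # sorted() is stable, so equal-length strings stay distinct.
            shortstr, longstr = sorted((u1, u2), key=len)
            # setdefault stores the new list in the dict; dict.get would not.
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict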
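The loop above visits every ordered pair, so each pair of strings is scored twice. If both arguments are the same list and the scorer is symmetric, each unordered pair only needs to be scored once, which halves the number of calls. The following is a sketch under those assumptions, not a drop-in replacement: `find_matches_once` and its `scorer` parameter are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for `fuzz.partial_token_sort_ratio` so that the example runs without extra packages.

```python
import difflib
import itertools


def difflib_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: not fuzzywuzzy's scorer)."""
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()


def find_matches_once(items, cutoff=90, scorer=difflib_ratio):
    """Score each unordered pair once; group the longer string under the shorter."""
    matchdict = {}
    for u1, u2 in itertools.combinations(items, 2):
        if scorer(u1, u2) >= cutoff:
            # sorted() is stable, so equal-length strings stay distinct.
            shortstr, longstr = sorted((u1, u2), key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict
```

Passing `scorer=fuzz.partial_token_sort_ratio` should give behaviour closer to the original, except that each match is recorded once rather than twice.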

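That is about as fast as it gets, apart from the fuzz call itself. If you read the source, you will see that some preprocessing is done on each string in every iteration. We can do all of that once, up front: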

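import itertools
import sys

from fuzzywuzzy import fuzz
from fuzzywuzzy.string_processing import StringProcessor

# Helper constants modelled on fuzzywuzzy.utils (an assumption; check your version).
PY3 = sys.version_info[0] == 3
bad_chars = str("").join(chr(i) for i in range(128, 256))
if PY3:
    translation_table = dict((ord(c), None) for c in bad_chars)


def _asciionly(s):
    if PY3:
        return s.translate(translation_table)
    else:
        return s.translate(None, bad_chars)


def full_pre_process(s, force_ascii=False):
    # Note: non-ASCII characters are always stripped here; force_ascii is unused.
    s = _asciionly(s)
    # Keep only letters and numbers (see Unicode docs).
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
    # Force into lowercase.
    string_out = StringProcessor.to_lower_case(string_out)
    # Remove leading and trailing whitespace.
    string_out = StringProcessor.strip(string_out)

    # Sort the whitespace-separated tokens (not the characters), as token sort does.
    out = ' '.join(sorted(string_out.split()))
    return out.strip()


def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    if list1 is not list2:
        list1 = [full_pre_process(each) for each in list1]
        list2 = [full_pre_process(each) for each in list2]
    else:
        # If you are comparing a list to itself, preprocess it only once.
        list1 = [full_pre_process(each) for each in list1]
        list2 = list1

    # Note: keys and values below are the preprocessed strings, not the originals.
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]

        # The tokens are already sorted, so plain partial_ratio is enough.
        if fuzz.partial_ratio(u1, u2) >= cutoff:
            # sorted() is stable, so equal-length strings stay distinct.
            shortstr, longstr = sorted((u1, u2), key=len)
            # setdefault stores the new list in the dict; dict.get would not.
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict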
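The question also asks how to replace each matched item with its shortest match, which the function above does not do by itself. One possible way to use the returned dictionary is sketched below. The names `build_replacements` and `replace_with_shortest` are hypothetical, and the sketch assumes the dictionary holds the strings exactly as they appear in the list; with the preprocessed version above, you would first have to map the preprocessed strings back to the originals.

```python
def build_replacements(matchdict):
    """Map each longer string to the shortest key it was grouped under."""
    replacements = {}
    for shortstr, longstrs in matchdict.items():
        for longstr in longstrs:
            if len(shortstr) >= len(longstr):
                continue  # Equal-length matches are left alone in this sketch.
            current = replacements.get(longstr)
            if current is None or len(shortstr) < len(current):
                replacements[longstr] = shortstr
    return replacements


def replace_with_shortest(items, matchdict):
    """Return a copy of items with each string replaced by its shortest match."""
    replacements = build_replacements(matchdict)
    result = []
    for item in items:
        # Follow chains (a -> b -> c); lengths strictly shrink, so this ends.
        while item in replacements:
            item = replacements[item]
        result.append(item)
    return result
```

Because a replacement is only recorded when it is strictly shorter, chains of matches always terminate, at the cost of not merging equal-length near-duplicates.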