How to compare the first element of every tuple with the corresponding second element of the same tuple

Time: 2019-02-18 14:51:09

Tags: python-3.x pandas numpy fuzzywuzzy

I have a list of tuples, as shown below:

terms = [('cat', 'cat'), ('cat', 'bat'), ('cat', 'cat'), ('cat', 'cat'), ('cat', 'bat'), ('cat', 'No Data'), ('cat', 'bat'), ('cat', 'No Data'), ('bat', 'cat'), ('bat', 'bat'), ('bat', 'cat'), ('bat', 'cat'), ('bat', 'bat'), ('bat', 'No Data'), ('bat', 'bat'), ('bat', 'No Data'), ('cat', 'cat'), ('cat', 'bat'), ('cat', 'cat'), ('cat', 'cat'), ('cat', 'bat'), ('cat', 'No Data'), ('cat', 'bat'), ('cat', 'No Data'), ('cat', 'cat'), ('cat', 'bat'), ('cat', 'cat'), ('cat', 'cat'), ('cat', 'bat'), ('cat', 'No Data'), ('cat', 'bat'), ('cat', 'No Data'), ('bat', 'cat'), ('bat', 'bat'), ('bat', 'cat'), ('bat', 'cat'), ('bat', 'bat'), ('bat', 'No Data'), ('bat', 'bat'), ('bat', 'No Data'), ('No Data', 'cat'), ('No Data', 'bat'), ('No Data', 'cat'), ('No Data', 'cat'), ('No Data', 'bat'), ('No Data', 'No Data'), ('No Data', 'bat'), ('No Data', 'No Data'), ('bat', 'cat'), ('bat', 'bat'), ('bat', 'cat'), ('bat', 'cat'), ('bat', 'bat'), ('bat', 'No Data'), ('bat', 'bat'), ('bat', 'No Data'), ('No Data', 'cat'), ('No Data', 'bat'), ('No Data', 'cat'), ('No Data', 'cat'), ('No Data', 'bat'), ('No Data', 'No Data'), ('No Data', 'bat'), ('No Data', 'No Data')]

I want to compare the first element of each tuple with its second element, using a method from the fuzzywuzzy package (https://github.com/seatgeek/fuzzywuzzy).

from fuzzywuzzy import fuzz

from fuzzywuzzy import process

print([fuzz.token_set_ratio(ter1[0], ter1[1]) for ter1 in terms])

Apart from the for loop in the list comprehension (in the statement above), is there another way to do this that would be faster?

I expect the result to look like this:

[100, 28, 31, 23, 32, 41, 28, 38, 41, 31, 36, 26, 22, 35, 39, 39, 52, 52, 38, 30, 40, 35, 44, 44, 35, 39, 24, 32, 32, 55, 42, 37, 50, 34, 46, 30, 24, 30, 47, 26, 38, 58, 38, 29, 38, 38, 57, 47, 26, 40, 38, 40, 55, 25, 38, 62, 38, 38, 46, 44, 47, 56, 39, 57, 52, 55, 40, 48, 47, 55, 40, 30, 22, 55, 38, 55, 38, 26, 47, 55, 47, 50, 52, 47, 44, 45, 49, 52, 52, 43, 60, 38, 55, 52, 31, 56, 39, 46, 46, 50, 28, 100, 41, 48, 21, 25, 31, 86, 38, 34, 44, 29, 45, 28, 26, 26, 33, 38, 40, 27, 28, 40, 36, 36, 32, 41, 30, 30, 30, 33, 33, 40, 36, 32, 32, 32, 30, 33, 24, 24, 36, 41, 30, 39, 36, 34, 26, 24, 39, 23, 41, 33, 16, 30, 35, 34, 36, 35, 15, 36, 49, 29, 30, 26, 38, 28, 36, 33, 38, 16, 24, 27, 33, 33, 30, 16, 32, 31, 29, 16, 20, 28, 38, 24, 27, 47, 30, 38, 38, 30, 24, 36, 16, 38, 40, 29, 31, 15, 15, 24, 31, 41, 100, 31, 24, 28, 22, 27, 29, 38, 24, 39, 43, 30, 28, 28, 37, 35, 38, 22, 20, 35, 33, 33, 41, 39, 27, 27, 27, 30, 36, 37, 28, 46, 23, 38, 24, 22, 33, 32, 24, 32, 38, 43, 38, 32, 23, 13, 53, 25, 31, 40, 27, 33, 42, 44, 38, 33, 17, 28, 29, 32, 31, 23, 26, 24, 40, 36, 41, 27, 27, 44, 44, 42, 37, 27, 43, 35, 37, 27, 21, 30, 26, 13, 22, 35, 22, 26, 26, 27, 27, 29, 27, 26, 44, 32, 33, 17, 17, 27, 23, 48, 31, 100, 40, 28, 22, 27, 29, 38, 42, 26, 27, 25, 33, 33, 22, 26, 27, 22, 20, 19, 22, 22, 35, 33, 20, 38, 38, 24, 24, 44, 33, 34, 40, 30, 24, 37, 33, 26, 24, 39, 32, 29, 30, 32, 40, 20, 32, 30, 31, 35, 36, 33, 33, 44, 29, 33, 17, 33, 35, 16, 27, 40, 26, 18, 20, 30, 24, 36, 27, 30, 40, 24, 23, 36, 30, 17, 42, 36, 21, 40, 26, 20, 30, 35, 22, 26, 26, 27, 20, 29, 36, 26, 25, 16, 28, 17, 17, 27, 32, 21, 24, 40, 100, 43, 34, 22, 30, 32, 31, 27, 33, 31, 34, 34, 31, 27, 22, 31, 26, 20, 29, 29, 36, 23, 28, 33, 33, 31, 31, 38, 29, 29, 35, 26, 24, 15, 28, 20, 24, 40, 28, 30, 22, 25, 29, 28, 22, 31, 32, 26, 38, 35, 24, 32, 29, 23, 27, 23, 24, 25, 28, 29, 27, 31, 28, 38, 30, 38, 28, 31, 32, 19, 24, 38, 26, 27, 38, 38, 32, 31, 27, 28, 23, 36, 22, 27, 27, 28, 34, 20, 38, 27, 26, 25, 34, 27, 
27, 29, 41, 25, 28, 28, 43, 100, 36, 20, 43, 34, 28, 24, 25, 23, 36, 36, 33, 23, 40, 27, 33, 29, 41, 41, 38, 36, 26, 35, 35, 22, 22, 53, 36, 32, 26, 36, 34, 40, 30, 29, 27, 47, 35, 39, 32, 27, 37, 30, 34, 33, 28, 23, 40, 30, 30, 34, 31, 28, 31, 36, 32, 29, 37, 37, 31, 39, 30, 39, 32, 40, 30, 40, 29, 33, 35, 40, 36, 23, 34, 40, 29, 33, 31, 30, 27, 42, 35, 31, 31, 35, 30, 27, 40, 31, 40, 29, 31, 31, 31, 24, 28, 31, 22, 22, 34, 36, 100, 34, 32, 39, 37, 29, 26, 32, 35, 35, 38, 24, 26, 38, 32, 29, 39, 39, 36, 30, 30, 30, 30, 28, 28, 32, 39, 27, 49, 35, 37, 27, 50, 34, 31, 44, 30, 32, 32, 27, 31, 35, 29, 32, 22, 32, 38, 35, 30, 38, 27, 25, 30, 30, 32, 40, 33, 31, 42, 33, 25, 33, 36, 38, 40, 32, 29, 37, 34, 38, 32, 30, 38, 38, 33, 40, 42, 35, 38, 40, 38, 42, 42, 30, 30, 31, 38, 42, 33, 40, 30, 30, 30, 25, 38, 86, 27, 27, 22, 20, 34, 100, 31, 27, 32, 33, 46, 31, 34, 34, 26, 18, 42, 32, 31, 41, 26, 26, 27, 43, 39, 29, 29, 50, 45, 21, 43, 22, 43, 28, 23, 26, 39, 29, 53, 33, 54, 26, 38, 45, 43, 44, 24, 39, 38, 43, 50, 34, 52, 42, 45, 46, 46, 43, 58, 44, 42, 43, 46, 50, 24, 55, 49, 50, 44, 37, 29, 50, 37, 50, 41, 29, 41, 50, 41, 39, 46, 44, 35, 39, 46, 46, 46, 46, 34, 50, 50, 46, 37, 44, 55, 46, 46, 50, 41, 38, 29, 29, 30, 43, 32, 31, 100, 29, 34, 31, 44, 25, 27, 27, 29, 26, 36, 23, 33, 33, 36, 36, 38, 36, 34, 31, 31, 34, 34, 34, 23, 28, 37, 29, 28, 34, 37, 21, 28, 36, 31, 28, 33, 38, 33, 26, 30, 25, 35, 33, 27, 19, 31, 35, 36, 39, 32, 36, 43, 24, 41, 33, 26, 39, 32, 29, 38, 27, 32, 34, 42, 29, 27, 27, 29, 19, 35, 27, 35, 29, 26, 26, 29, 38, 31, 26, 26, 27, 42, 29, 27, 26, 35, 24, 45, 32, 32, 27, 31, 34, 38, 38, 32, 34, 39, 27, 29, 100, 36, 32, 32, 35, 39, 39, 37, 26, 43, 22, 40, 23, 44, 44, 41, 44, 35, 38, 38, 30, 30, 37, 33, 34, 34, 30, 28, 30, 20, 32, 29, 39, 32, 43, 26, 21, 29, 33, 32, 35, 31, 30, 36, 17, 23, 38, 24, 22, 17, 28, 24, 40, 27, 29, 17, 24, 20, 30, 29, 36, 27, 22, 31, 42, 28, 36, 26, 26, 32, 36, 21, 35, 17, 33, 22, 35, 38, 17, 17, 32, 33, 24, 36, 17, 31, 40, 
22, 17, 17, 18, 36, 44, 24, 42, 31, 28, 37, 32, 34, 36, 100, 32, 27, 30, 33, 33, 29, 20, 23, 35, 38, 27, 37, 37, 34, 28, 21, 32, 32, 25, 25, 35, 33, 33, 48, 33, 35, 41, 32, 26, 37, 37, 32, 34, 33, 38, 33, 32, 31, 38, 24, 34, 28, 26, 36, 31, 37, 39, 27, 33, 39, 25, 31, 33, 20, 30, 32, 30, 24, 28, 32, 24, 38, 30, 36, 28, 33, 40, 36, 28, 27, 38, 20, 32, 35, 38, 36, 20, 20, 27, 32, 42, 28, 20, 36, 25, 28, 27, 27, 28, 26, 29, 39, 26, 27, 24, 29, 33, 31, 32, 32, 100, 43, 36, 39, 39, 38, 29, 38, 19, 22, 32, 34, 34, 36, 34, 32, 33, 33, 37, 37, 31, 39, 35, 30, 35, 29, 38, 23, 33, 26, 28, 33, 42, 31, 33, 30, 17, 33, 22, 26, 31, 15, 28, 38, 43, 43, 34, 21, 34, 31, 33, 29, 30, 29, 32, 23, 32, 31, 15, 23, 25, 48, 37, 38, 15, 35, 29, 28, 15, 37, 31, 29, 17, 19, 22, 29, 29, 29, 24, 23, 26, 15, 29, 27, 33, 34, 21, 21, 22, 22, 45, 43, 27, 33, 25, 26, 46, 44, 32, 27, 43, 100, 35, 43, 43, 37, 24, 67, 32, 35, 67, 38, 38, 40, 69, 67, 46, 46, 36, 36, 32, 26, 30, 35, 41, 23, 32, 34, 33, 30, 33, 33, 36, 31, 33, 26, 39, 33, 24, 32, 35, 30, 34, 37, 42, 30, 34, 29, 34, 36, 33, 32, 26, 29, 27, 24, 36, 31, 30, 24, 32, 36, 41, 37, 30, 34, 24, 33, 30, 24, 35, 29, 39, 26, 31, 25, 29, 29, 42, 24, 38, 30, 29, 33, 33, 43, 29, 29, 18, 35, 28, 30, 25, 31, 23, 32, 31, 25, 35, 30, 36, 35, 100, 65, 65, 88, 32, 39, 53, 59, 30, 65, 65, 69, 36, 37, 67, 67, 38, 38, 29, 36, 41, 29, 33, 38, 29, 23, 36, 29, 31, 27, 33, 33, 34, 33, 23, 31, 37, 25, 33, 22, 26, 39, 35, 32, 32, 16, 32, 25, 31, 40, 33, 22, 26, 32, 30, 33, 22, 27, 34, 34, 34, 42, 22, 39, 27, 38, 22, 27, 30, 22, 23, 24, 37, 31, 22, 22, 27, 27, 25, 22, 22, 30, 31, 44, 16, 16, 28, 39, 26, 28, 33, 34, 36, 35, 34, 27, 39, 33, 39, 43, 65, 100, 100, 100, 36, 43, 53, 65, 32, 70, 70, 69, 43, 33, 65, 65, 33, 33, 43, 39, 31, 31, 32, 33, 27, 30, 39, 27, 49, 34, 42, 32, 36, 31, 30, 38, 36, 33, 40, 31, 29, 38, 38, 35, 34, 18, 30, 27, 34, 43, 31, ...........]

1 Answer:

Answer 0 (score: 2)

Since this compares every pair of strings, I think it is unlikely that you can avoid the for loop.

If there are many repeated pairs, you can speed it up with memoization:

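from functools import lru_cache

from fuzzywuzzy import fuzz

@lru_cache(None)  # unbounded cache, keyed on the (hashable) argument
def fuzzy_match(terms):
    # terms is a 2-tuple, or a frozenset of the pair, which
    # collapses to a single element when both strings are equal
    if len(terms) == 1:
        return 100
    return fuzz.token_set_ratio(*terms)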

For your dataset, this gives:

Original version:

%timeit [fuzz.token_set_ratio(*t) for t in terms]
927 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With memoization:

%timeit fuzzy_match.cache_clear(); [fuzzy_match(t) for t in terms]
155 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With memoization, and using the fact that fuzz.token_set_ratio(a, b) == fuzz.token_set_ratio(b, a):

%timeit fuzzy_match.cache_clear(); [fuzzy_match(frozenset(t)) for t in terms]
87.7 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is a speedup of roughly 10x over the original version.
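The gain comes from how few distinct pairs there are: the 64 tuples in the question contain only nine distinct ordered pairs, and six once (a, b) and (b, a) are treated as the same pair. As a minimal sketch of the same idea without lru_cache, the hypothetical helper `score_pairs` below keeps a plain dict keyed on the unordered pair. It assumes the scorer is symmetric, as the answer states for fuzz.token_set_ratio. The fallback `default_scorer` is only a stand-in so the sketch runs without fuzzywuzzy installed; it is not fuzzywuzzy's algorithm.

```python
try:
    # Real scorer, if fuzzywuzzy is installed.
    from fuzzywuzzy import fuzz
    default_scorer = fuzz.token_set_ratio
except ImportError:
    # Stand-in (an assumption, NOT fuzzywuzzy's algorithm):
    # percentage overlap of the two strings' character sets.
    def default_scorer(a, b):
        sa, sb = set(a.lower()), set(b.lower())
        union = sa | sb
        return 100 * len(sa & sb) // len(union) if union else 100


def score_pairs(pairs, scorer=default_scorer):
    """Score every pair, calling scorer once per distinct unordered pair.

    Assumes scorer(a, b) == scorer(b, a).
    """
    cache = {}
    results = []
    for a, b in pairs:
        key = frozenset((a, b))  # (a, b) and (b, a) share one key
        if key not in cache:
            cache[key] = scorer(a, b)
        results.append(cache[key])
    return results
```

Calling `score_pairs(terms)` returns one score per tuple, in the original order. Unlike the lru_cache version above, this sketch still calls the scorer for identical strings instead of returning 100 directly.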

Threading

If that is still not fast enough, you will need to split the data into chunks and process them in multiple threads.
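The chunking idea could look roughly like the sketch below, which uses the standard library's concurrent.futures. The names `chunked`, `score_chunk`, and `score_parallel`, and the default chunk size and worker count, are illustrative assumptions rather than anything from the answer. One caveat: in CPython the global interpreter lock usually prevents threads from speeding up pure-Python, CPU-bound scoring, so it may be worth trying ProcessPoolExecutor in place of ThreadPoolExecutor; that requires a picklable scorer (a module-level function, not a lambda).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_chunk(chunk, scorer):
    """Score every (a, b) pair in one chunk."""
    return [scorer(a, b) for a, b in chunk]


def score_parallel(pairs, scorer, chunk_size=1000, max_workers=4):
    """Score pairs chunk by chunk on a thread pool, preserving input order."""
    chunks = list(chunked(pairs, chunk_size))
    worker = partial(score_chunk, scorer=scorer)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of its inputs.
        parts = pool.map(worker, chunks)
        return [score for part in parts for score in part]
```

With fuzzywuzzy installed, `score_parallel(terms, fuzz.token_set_ratio)` should return the same list as the original list comprehension. For a list as small as the one in the question, the pool overhead will likely outweigh any gain, so this is only worth measuring on much larger inputs.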