Question

I have a list of strings that I want to filter with itertools.compress, given a boolean mask.

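I need to check a large number of strings against a bunch of sentences, so I want to use itertools to save resources. The part that does not work is the boolean masking via compress.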

from itertools import product, starmap, compress

def is_in(string, other_string):
    return string in other_string

to_find = ['hello', 'bye']
some_sentences = ['hello to you', ' hello and bye', 'bye bye']

cartesian = product(to_find, some_sentences)
matched_mask = starmap(is_in, cartesian)
matched = compress(cartesian, matched_mask)
print(list(matched))

actual_result = [('hello', 'hello to you'), ('bye', ' hello and bye')]

expected = [('hello', 'hello to you'),
            ('hello', ' hello and bye'),
            ('bye', ' hello and bye'),
            ('bye', 'bye bye')]

Answer 1

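itertools.product returns an iterator, and iterators are generally "single-pass" (there may be exceptions). Once an element has been iterated over, it will not be iterated over again.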

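However, you use the result of itertools.product in two places: once as the argument to starmap and once as the argument to compress. So if starmap "pops" an element from the product, then the next time compress "pops" an element from the same product, it receives the next element (not the same one).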
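To make the interleaving concrete, here is a minimal sketch with a smaller, hypothetical example (the names `numbers` and `mask` are not from the question). It assumes CPython's behaviour, where compress fetches a data item first and then asks the selector iterator for the matching flag:

```python
from itertools import compress

# One shared, single-pass iterator used for BOTH the data and the mask.
numbers = iter(range(6))
mask = (n % 2 == 1 for n in numbers)   # computing a flag consumes an item

# compress takes an item from `numbers`, then asks `mask` for a flag,
# and that flag is computed from the *following* item.
shared = list(compress(numbers, mask))
print(shared)       # [0, 2, 4] on CPython: flags of 1, 3, 5 applied to 0, 2, 4

# With independent sources the mask lines up with the data.
independent = list(compress(range(6), (n % 2 == 1 for n in range(6))))
print(independent)  # [1, 3, 5]
```

The same thing happens in the question: half of the pairs are consumed by starmap, and each resulting flag is applied to a different pair.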

In most cases I would advise against assigning such iterators to variables, precisely because of their "single-pass" nature.

An obvious fix is to generate the product twice:

matched_mask = starmap(is_in, product(to_find, some_sentences))
matched = compress(product(to_find, some_sentences), matched_mask)
print(list(matched))
# [('hello', 'hello to you'), ('hello', ' hello and bye'), ('bye', ' hello and bye'), ('bye', 'bye bye')]
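Another option, not part of the original answer, is itertools.tee, which splits one iterator into independent copies. A rough sketch, assuming the same `is_in`, `to_find` and `some_sentences` as in the question (the names `pairs_for_mask` and `pairs_for_data` are made up for illustration):

```python
from itertools import compress, product, starmap, tee

def is_in(string, other_string):
    return string in other_string

to_find = ['hello', 'bye']
some_sentences = ['hello to you', ' hello and bye', 'bye bye']

# tee returns two iterators that each see every pair of the product.
pairs_for_mask, pairs_for_data = tee(product(to_find, some_sentences))

matched_mask = starmap(is_in, pairs_for_mask)
matched = list(compress(pairs_for_data, matched_mask))
print(matched)
```

tee buffers items that one copy has consumed and the other has not. Here both copies advance in lockstep, so the buffer should stay small, but the original product iterator must not be used again after it has been passed to tee.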

In this case I think a loop in a generator function is more readable than combining multiple itertools:

from itertools import product

def func(to_find, some_sentences):
    for sub, sentence in product(to_find, some_sentences):
        if sub in sentence:
            yield sub, sentence

Then use it like this:

>>> to_find = ['hello','bye']
>>> some_sentences = ['hello to you', ' hello and bye', 'bye bye']
>>> list(func(to_find, some_sentences))
[('hello', 'hello to you'), 
 ('hello', ' hello and bye'), 
 ('bye', ' hello and bye'), 
 ('bye', 'bye bye')]

Or, if you like one-liners:

>>> [(sub, sentence) for sub, sentence in product(to_find, some_sentences) if sub in sentence]
[('hello', 'hello to you'),
 ('hello', ' hello and bye'),
 ('bye', ' hello and bye'),
 ('bye', 'bye bye')]
