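I have a list of strings that I want to filter with itertools.compress, given a boolean mask.
I need to check a large number of strings against a batch of sentences, so I want to use itertools to save resources. The part that does not work is the boolean masking via compress.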
from itertools import product, starmap, compress
def is_in(string, other_string):
    return string in other_string
to_find = ['hello', 'bye']
some_sentences = ['hello to you', ' hello and bye', 'bye bye']
cartesian = product(to_find, some_sentences)
matched_mask = starmap(is_in, cartesian)
matched = compress(cartesian, matched_mask)
print(list(matched))
actual_result = [('hello', 'hello to you'), ('bye', ' hello and bye')]
expected = [('hello', 'hello to you'),
            ('hello', ' hello and bye'),
            ('bye', ' hello and bye'),
            ('bye', 'bye bye')]
Answer 0 (score: 3)
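itertools.product returns an iterator, and iterators are generally "single-pass" (there may be exceptions). Once an element has been iterated over, it will not be produced again.
However, you use the result of itertools.product in two places: once as the argument to starmap and once as the argument to compress. Both consume the same underlying iterator, so when starmap "pops" an element from the product, the next time compress "pops" an element from that same product it receives the following element, not the same one. That is why the mask and the data fall out of step and you only get two results.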
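To make the single-pass behaviour concrete, here is a minimal sketch. The input lists and the variable names (pairs, first_consumer, second_consumer, remaining, exhausted) are made up for illustration and are not part of the question's code:

```python
from itertools import product

# A single-pass iterator over four pairs (illustrative data only).
pairs = product(['a', 'b'], [1, 2])

# Two "consumers" pulling from the same iterator never see the same element.
first_consumer = next(pairs)    # ('a', 1)
second_consumer = next(pairs)   # ('a', 2), not ('a', 1) again

# Whatever is left can be consumed exactly once.
remaining = list(pairs)         # [('b', 1), ('b', 2)]
exhausted = list(pairs)         # [] because the iterator is now empty
```

In the question, starmap and compress play the roles of the two consumers, which is why every other pair ends up being tested while a different pair is being selected.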
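Because of this single-pass nature, I would in most cases advise against assigning such iterators to variables.
An obvious fix is to generate the product twice: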
matched_mask = starmap(is_in, product(to_find, some_sentences))
matched = compress(product(to_find, some_sentences), matched_mask)
print(list(matched))
# [('hello', 'hello to you'), ('hello', ' hello and bye'), ('bye', ' hello and bye'), ('bye', 'bye bye')]
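As a possible alternative to building the product twice (not something the original answer proposes), itertools.tee can split one iterator into independent copies. The following is a sketch under that assumption; the names for_mask and for_data are hypothetical:

```python
from itertools import compress, product, starmap, tee

def is_in(string, other_string):
    return string in other_string

to_find = ['hello', 'bye']
some_sentences = ['hello to you', ' hello and bye', 'bye bye']

# tee returns two independent iterators over the same underlying product.
# The original product iterator must not be used directly afterwards.
for_mask, for_data = tee(product(to_find, some_sentences))

matched_mask = starmap(is_in, for_mask)
matched = list(compress(for_data, matched_mask))
print(matched)
```

Note that tee buffers elements that one copy has consumed and the other has not, so it is mainly worthwhile when both copies advance roughly in step, as they do here.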
In this case, I think a loop inside a generator function is more readable than chaining several itertools functions:
from itertools import product
def func(to_find, some_sentences):
    # Lazily yield every (substring, sentence) pair where the substring occurs.
    for sub, sentence in product(to_find, some_sentences):
        if sub in sentence:
            yield sub, sentence
Then use it like this:
>>> to_find = ['hello','bye']
>>> some_sentences = ['hello to you', ' hello and bye', 'bye bye']
>>> list(func(to_find, some_sentences))
[('hello', 'hello to you'),
('hello', ' hello and bye'),
('bye', ' hello and bye'),
('bye', 'bye bye')]
Or, if you prefer a one-liner:
>>> [(sub, sentence) for sub, sentence in product(to_find, some_sentences) if sub in sentence]
[('hello', 'hello to you'),
('hello', ' hello and bye'),
('bye', ' hello and bye'),
('bye', 'bye bye')]