I recently asked a question here about finding sublists in a larger list. I have a similar but slightly different question. Suppose I have this list:
[['she', 'is', 'a', 'student'],
['she', 'is', 'a', 'lawer'],
['she', 'is', 'a', 'great', 'student'],
['i', 'am', 'a', 'teacher'],
['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
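I want to query it with matches = ['she', 'is', 'student'] and get back every sublist of the queried list that contains the elements of matches. The only difference from the linked question is that I want to add a range parameter to the find_gappy function, so that it skips sublists in which the gap between matched elements exceeds the specified range. For example, with the list above, I would like a function such that: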
matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches, range=2)]
would return:
[['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
The last element is not returned because, in the sentence she is a very very exceptionally good student, the distance between a and good exceeds the range limit.
How can I write such a function?
Answer 0 (score: 2)
Here is one approach that takes the order of the items in the matches list into account:
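In [102]: def find_gappy(all_sets, matches, gap_range=2):
     ...:     # Adjacent pairs of query words, e.g. ('she', 'is'), ('is', 'student').
     ...:     zip_m = list(zip(matches, matches[1:]))
     ...:     for lst in all_sets:
     ...:         # Map each word to its index (the last index if the word repeats).
     ...:         indices = {j: i for i, j in enumerate(lst)}
     ...:         try:
     ...:             if all(0 <= indices[j]-indices[i] - 1 <= gap_range for i, j in zip_m):
     ...:                 yield lst
     ...:         except KeyError:
     ...:             # A query word is missing from this sublist.
     ...:             pass
     ...: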
Demo:
In [110]: lst = [['she', 'is', 'a', 'student'],
     ...:        ['student', 'she', 'is', 'a', 'lawer'],  # for order check
     ...:        ['she', 'is', 'a', 'great', 'student'],
     ...:        ['i', 'am', 'a', 'teacher'],
     ...:        ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
     ...:
In [111]: list(find_gappy(lst, ['she', 'is', 'student'], gap_range=2))
Out[111]: [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
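One caveat of a word-to-index dictionary built with enumerate is that it keeps only the last position of each word, so a sublist with repeated words may be judged by the wrong occurrence. A minimal sketch of the effect, using a made-up sample sentence (not from the question):

```python
# Hypothetical sample sublist in which 'a' and 'student' both repeat.
words = ['she', 'is', 'a', 'student', 'and', 'a', 'good', 'student']

# Later occurrences overwrite earlier ones, so only the last index survives.
indices = {word: pos for pos, word in enumerate(words)}

print(indices['student'])  # 7, although 'student' also occurs at index 3

# Number of words between 'is' and the recorded 'student'.
gap = indices['student'] - indices['is'] - 1
print(gap)  # 5, even though the occurrence at index 3 has a gap of only 1
```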
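If your sublists contain duplicate words, you can use defaultdict() to keep track of all the indices and itertools.product to compare the gaps across the product of the indices of each ordered word pair.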
In [9]: from collections import defaultdict
In [10]: from itertools import product
In [11]: def find_gappy(all_sets, matches, gap_range=2):
    ...:     zip_m = list(zip(matches, matches[1:]))
    ...:     for lst in all_sets:
    ...:         # Record every position of each word, not just the last one.
    ...:         indices = defaultdict(list)
    ...:         for i, j in enumerate(lst):
    ...:             indices[j].append(i)
    ...:         # k is a position of the first word of a pair, v of the second.
    ...:         # A missing word gives an empty list, so any() is False for it.
    ...:         if all(any(0 <= v - k - 1 <= gap_range for k, v in product(indices[i], indices[j])) for i, j in zip_m):
    ...:             yield lst
    ...:
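When each adjacent pair of query words is checked independently, a sublist with repeated words can in principle pass even though no single chain of positions satisfies every gap, because different occurrences of a middle word may be used for the two pairs it belongs to. The following is a sketch of a stricter per-sublist check, not part of the answer above; the name has_gappy_match and the sample sentence are assumptions made for illustration.

```python
def has_gappy_match(words, matches, gap_range=2):
    """Return True if `matches` occurs in order in `words` with at most
    `gap_range` other words between consecutive matched items."""
    if not matches:
        return True
    # Positions at which the part of `matches` consumed so far can end.
    positions = [i for i, w in enumerate(words) if w == matches[0]]
    for target in matches[1:]:
        # Keep only occurrences of `target` reachable from a previous position.
        positions = [
            j for j, w in enumerate(words)
            if w == target and any(0 <= j - i - 1 <= gap_range for i in positions)
        ]
        if not positions:
            return False
    return bool(positions)


# Hypothetical sample: the 'is' next to 'she' is too far from 'student',
# and the 'is' next to 'student' is too far from 'she'.
tricky = ['she', 'is', 'x', 'x', 'x', 'is', 'student']
print(has_gappy_match(tricky, ['she', 'is', 'student']))  # False
```

It can be applied to a list of sublists with a comprehension such as [s for s in lst if has_gappy_match(s, matches, 2)].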
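Answer 1 (score: 1)
The technique from the linked question is good enough; you only need to count the gap along the way and, since you don't want a global count, reset the counter whenever you encounter a match. Something like: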
import collections

def find_gappy(source, matches, max_gap=-1):
    matches = collections.deque(matches)
    counter = max_gap  # initialize as -1 if you want to begin counting AFTER the first match
    for word in source:
        if word == matches[0]:
            counter = max_gap  # or remove this for global gap counting
            matches.popleft()
            if not matches:
                return True
        else:
            counter -= 1
            if counter == -1:
                return False
    return False

data = [['she', 'is', 'a', 'student'],
        ['she', 'is', 'a', 'lawer'],
        ['she', 'is', 'a', 'great', 'student'],
        ['i', 'am', 'a', 'teacher'],
        ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]
matches = ['she', 'is', 'student']
x = [i for i in data if find_gappy(i, matches, 2)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]
As a bonus, you can still use it like the original function: gap counting is applied only when you pass a positive number as max_gap.
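With the default max_gap=-1 the counter only moves further below -1 and never triggers the early return, so for a non-empty matches list the check reduces to an ordered subsequence test with no gap limit. For comparison, here is a minimal sketch of that gap-free test on its own; the name is_subsequence is an assumption made for illustration and does not come from the linked question.

```python
def is_subsequence(source, matches):
    """Return True if the items of `matches` appear in `source` in order,
    with any number of other items in between."""
    it = iter(source)
    # `word in it` advances the shared iterator until it finds `word`,
    # so each later word is searched for only after the previous match.
    return all(word in it for word in matches)


matches = ['she', 'is', 'student']
long_sentence = ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']
print(is_subsequence(long_sentence, matches))  # True: no gap limit applies
print(is_subsequence(['student', 'she', 'is'], matches))  # False: wrong order
```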