Question

I need to find which lists in a nested list contain a given word and return a boolean numpy array.

nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
word = 'c'
result=[1,0,1,1]

I use this list comprehension to do it, and it works:

np.array([word in x for x in nested_list])

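But I am working with a nested list containing 700k lists, so it takes a long time. Also, I have to do this many times; the lists are static, but the word can change.

One pass with the list comprehension takes 0.36 s. I need a faster way; is there one?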

Answer 1

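We can flatten the elements of all the sublists into a 1D array. Then we simply look for any occurrence of 'c' within the limits of each sublist in that flattened 1D array. Following that idea, there are two approaches, which differ in how the occurrences of 'c' are counted.

Approach #1: One approach with np.bincount -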

lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0

Since, as stated in the question, nested_list does not change across iterations, we can reuse everything and loop only over the final step.
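The reuse described above can be sketched roughly as follows. This is a minimal example, not part of the original answer: the helper name contains_word and the two sample words are hypothetical. It also passes minlength to np.bincount so that the output keeps one entry per sublist even if trailing sublists happen to be empty.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

# One-time setup: depends only on nested_list, which is static.
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size), lens)

def contains_word(word):
    # Hypothetical helper: only this step is repeated for each word.
    return np.bincount(ids, arr == word, minlength=lens.size) != 0

results = {w: contains_word(w) for w in ['c', 'b']}
print(results['c'])  # [ True False  True  True]
```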

Approach #2: Another approach with np.add.reduceat, reusing arr and lens from before -

grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0

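When looping through a list of words, we can keep the final step vectorized by using np.add.reduceat along an axis and broadcasting against the words, which gives a 2D boolean array: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0. Sample run -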

In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

In [345]: words
Out[345]: ['c', 'b']

In [346]: lens = np.array([len(i) for i in nested_list])
     ...: arr = np.concatenate(nested_list)
     ...: grp_idx = np.append(0,lens[:-1].cumsum())
     ...: 

In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]: 
array([[ True, False,  True,  True],    # matches for 'c'
       [ True,  True,  True, False]])   # matches for 'b'
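As a sanity check, the broadcasted result can be compared against the list comprehension from the question. This is a sketch that assumes every sublist is non-empty: for repeated indices np.add.reduceat returns the element at that index rather than an empty reduction, so empty sublists would need separate handling.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
words = ['c', 'b']

# One-time setup, as in the answer
lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
grp_idx = np.append(0, lens[:-1].cumsum())

# Row k holds the matches for words[k]
mask = np.add.reduceat(arr == np.array(words)[:, None], grp_idx, axis=1) != 0

# Reference result from the plain list comprehension
expected = np.array([[w in x for x in nested_list] for w in words])
assert np.array_equal(mask, expected)
```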

Answer 2

A generator expression is the better choice (in terms of performance) when you iterate only once.
A solution using the numpy.fromiter function:

nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)

print(arr)

Output:

[1 0 1 1]

https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
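Since the question asks for a boolean array, a small variant passes bool as the dtype, and the optional count argument lets np.fromiter allocate the output once. This is a sketch, not part of the original answer:

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
word = 'c'

# dtype=bool yields True/False instead of 1/0; count is the known length
arr = np.fromiter((word in l for l in nested_list), dtype=bool,
                  count=len(nested_list))
print(arr)  # [ True False  True  True]
```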

Answer 3

How long does the loop take to complete? In my test case it only takes a few hundred milliseconds.

import random

# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [ [random.choice(a) for x in range(random.randint(1,30))]
                for n in range(700000)]

%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop

Reducing each inner list to a set saves some time...

nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop

Once you have converted it to a list of sets, you can build a list of boolean tuples. No real time savings, though.

%%timeit -n 10
words = list('abcde')
b = [tuple(word in s for word in words) for s in nested_sets]
# 10 loops, best of 3: 749 ms per loop
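Because the question ultimately wants a NumPy array, the set-based lookup can be collected into a 2D boolean table. This is a small sketch on the toy data from the question; the name table is hypothetical, and no timing claim is made for it.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
nested_sets = [set(x) for x in nested_list]  # built once, lists are static

words = list('abc')
# Shape (len(nested_sets), len(words)); column j answers words[j]
table = np.array([[word in s for word in words] for s in nested_sets])
print(table[:, words.index('c')])  # [ True False  True  True]
```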

Lookup on nested lists with large data in Python