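I have a dataset for which I need to output boolean-style data: only 1s and 0s, for true or not true. I am trying to parse a simple dataset I have processed to find subsets of information in a numpy array that has about 100,000 elements in one direction and about 20 in the other. I only need to search along the 20-element axis, but I need to do it for each of the 100,000 entries and get an output that can be mapped.
I have made an array of zeros of the same size, with the aim of simply marking the matching indices with a 1. If I find a long set (I'm working from long sets down to small ones), I don't need to include any of the smaller sets inside it.
Sample: [0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,0,1]
Here I need to find 1 set of 5, starting at index 2, and 1 set of 3, starting at index 9, without returning any subset of the set of 5 as though it were a set of 4 or a set of 3, so that the result stays at zero for all values already covered. I.e., for sets of 3, indices 2, 3, 4, 5 and 6 would all remain zero. It doesn't need to be overly efficient; I don't mind if it searches those positions anyway, I just need the results not to be kept.
Currently, I am using a block of code basically like this for a simple search: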
import numpy

values = numpy.array([0,1,1,1,1,1,0,0,1,1,1])
searchval = [1,2]
N = len(searchval)
# candidate start positions: wherever the first search value occurs
possibles = numpy.where(values == searchval[0])[0]
print(possibles)
solns = []
for p in possibles:
    check = values[p:p+N]
    if numpy.all(check == searchval):
        solns.append(p)
print(solns)
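I have been trying to think of a way to restructure this or similar code to produce the desired result. The end goal is to search for sets of 9 down to sets of 3, and effectively have a matrix of 1s and 0s indicating whether an index has a set of the wanted length starting at it.
Hopefully someone can point me in the right direction to make this work. Thanks!
Answer 0 (score: 0)
Something like this?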
from collections import defaultdict

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
# Keys are number of consecutive 1's, values are indices
results = defaultdict(list)
found = 0
for i, x in enumerate(sample):
    if x == 1:
        found += 1
    elif i == 0 or found == 0:
        continue
    else:
        # a run of 1's just ended; record where it started
        results[found].append(i - found)
        found = 0
# handle a run that reaches the end of the sample
if found:
    results[found].append(i - found + 1)
assert results == {1: [15, 17], 3: [9], 5: [2]}
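To get the 0/1 marker arrays the question asks for, the dictionary above could be turned into one indicator array per run length. The following is only a minimal sketch under that reading of the question: `runs_by_length` wraps the same loop in a function, and `indicator_for_length` is a hypothetical helper name, not something from the answer. Because only maximal runs are recorded, a run of 5 never shows up as a run of 4 or 3.

```python
from collections import defaultdict

import numpy as np

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]


def runs_by_length(seq):
    """Map run length -> list of start indices of maximal runs of 1s."""
    results = defaultdict(list)
    found = 0
    for i, x in enumerate(seq):
        if x == 1:
            found += 1
        elif found:
            # a run just ended at i - 1; it started at i - found
            results[found].append(i - found)
            found = 0
    if found:
        # run that reaches the end of the sequence
        results[found].append(len(seq) - found)
    return results


def indicator_for_length(seq, length):
    """Hypothetical helper: 0/1 array with a 1 where a maximal run
    of exactly `length` ones starts."""
    marks = np.zeros(len(seq), dtype=np.int8)
    starts = np.array(runs_by_length(seq).get(length, []), dtype=np.intp)
    marks[starts] = 1
    return marks


marks_5 = indicator_for_length(sample, 5)  # 1 only at index 2
marks_3 = indicator_for_length(sample, 3)  # 1 only at index 9
```

Stacking `indicator_for_length(sample, n)` for n from 9 down to 3 would give one row per run length, if that is the layout wanted.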
Answer 1 (score: 0)
Use more_itertools, a third-party library (pip install more_itertools):
import more_itertools as mit
sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
groups = [list(c) for c in mit.consecutive_groups((mit.locate(sample)))]
d = {group[0]: len(group) for group in groups}
d
# {2: 5, 9: 3, 15: 1, 17: 1}
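This result reads: "At index 2 is a group of 5. At index 9 is a group of 3," and so on.
Details
more_itertools.locate finds the indices of truthy items by default. more_itertools.consecutive_groups groups consecutive numbers together. Because the result is a dictionary, you can extract different kinds of information: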
>>> # List of starting indices
>>> list(d)
[2, 9, 15, 17]
>>> # List indices for all lonely groups
>>> [k for k, v in d.items() if v == 1]
[15, 17]
>>> # List indices of groups with more than 1 item
>>> [k for k, v in d.items() if v > 1]
[2, 9]
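For comparison, the same start-index-to-length dictionary can be sketched with only the standard library, in case the third-party dependency is not wanted. This uses `itertools.groupby` rather than more_itertools, so it is a swapped-in technique and not what this answer does; `run_lengths` is a hypothetical name.

```python
from itertools import groupby

sample = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1]


def run_lengths(seq):
    """Map the start index of each run of truthy items to its length."""
    d = {}
    pos = 0
    for value, group in groupby(seq):
        n = sum(1 for _ in group)  # length of this run of equal values
        if value:
            d[pos] = n
        pos += n
    return d


d = run_lengths(sample)
print(d)  # {2: 5, 9: 3, 15: 1, 17: 1}
```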
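Answer 2 (score: 0)
Here is a numpy solution. I'm using a small example for demonstration, but it scales easily (20 x 100,000 takes 25 ms on my rather modest laptop; see the timings at the end of this post):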
>>> import numpy as np
>>>
>>>
>>> a = np.random.randint(0, 2, (5, 10), dtype=np.int8)
>>> a
array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]], dtype=int8)
>>>
>>> padded = np.pad(a,((1,1),(0,0)), 'constant')
# compare array to itself with offset to mark all switches from
# 0 to 1 or from 1 to 0
# then use 'where' to extract the coordinates
>>> colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
>>>
# the lengths of sets are the differences between switch points
>>> lengths = rowinds[1::2] - rowinds[::2]
# now we have the lengths we are free to throw the off-switches away
>>> colinds, rowinds = colinds[::2], rowinds[::2]
>>>
# admire
>>> from pprint import pprint
>>> pprint(list(zip(colinds, rowinds, lengths)))
[(0, 2, 1),
 (1, 0, 2),
 (2, 1, 2),
 (2, 4, 1),
 (3, 2, 1),
 (4, 0, 5),
 (5, 0, 1),
 (5, 2, 1),
 (5, 4, 1),
 (6, 1, 1),
 (6, 3, 2),
 (7, 4, 1)]
Timings:
>>> def find_stretches(a):
...     padded = np.pad(a,((1,1),(0,0)), 'constant')
...     colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
...     lengths = rowinds[1::2] - rowinds[::2]
...     colinds, rowinds = colinds[::2], rowinds[::2]
...     return colinds, rowinds, lengths
...
>>> a = np.random.randint(0, 2, (20, 100000), dtype=np.int8)
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=100)
>>> repeat('find_stretches(a)', **kwds)
[2.475784719004878, 2.4715258619980887, 2.4705517270049313]
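The question ultimately wants a matrix of 1s and 0s marking where sufficiently long sets start. One way this could be built from the output of `find_stretches` is sketched below. `start_markers` and its `min_len` / `max_len` parameters are hypothetical names introduced for illustration, and the 3-to-9 default range is taken from the question, not from this answer. As in the example above, runs are searched along axis 0.

```python
import numpy as np


def find_stretches(a):
    # pad with zeros so every run of 1s has both an on- and an off-switch
    padded = np.pad(a, ((1, 1), (0, 0)), 'constant')
    colinds, rowinds = np.where((padded[:-1] != padded[1:]).T)
    lengths = rowinds[1::2] - rowinds[::2]
    colinds, rowinds = colinds[::2], rowinds[::2]
    return colinds, rowinds, lengths


def start_markers(a, min_len=3, max_len=9):
    """Hypothetical helper: array shaped like `a` with a 1 wherever a
    maximal run of 1s (along axis 0) of length min_len..max_len starts."""
    colinds, rowinds, lengths = find_stretches(a)
    keep = (lengths >= min_len) & (lengths <= max_len)
    out = np.zeros_like(a)
    out[rowinds[keep], colinds[keep]] = 1
    return out


# the small example array from above
a = np.array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
              [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]], dtype=np.int8)

markers = start_markers(a, min_len=2, max_len=5)
```

Since only maximal runs are reported, the run of 5 in column 4 is marked once at its start and never again as a shorter run. For the 100,000 x 20 layout in the question, the array would first need to be oriented so that the 20-element axis is axis 0.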