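I have a large dataset (> 200k points) and I am trying to replace sequences of zeros with a value. A run of more than two zeros is an artifact and should be removed by setting it to np.nan.
I have read "Searching a sequence in a NumPy array", but it doesn't quite fit my requirement because I don't have a static pattern.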
np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
# should be converted to this
np.array([0, 1.0, 0, 0, -6.0, 13.0, np.nan, np.nan, np.nan, 1.0, 16.0, np.nan, np.nan, np.nan, np.nan, 1.0, 1.0, 1.0, 1.0])
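If you need more information, please let me know. Thanks in advance!
Results:
Thanks for the answers. Here are my (unprofessional) test results, run on 288240 points: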
divakar took 0.016000ms to replace 87912 points
desiato took 0.076000ms to replace 87912 points
polarise took 0.102000ms to replace 87912 points
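Since @Divakar's solution is both the shortest and the fastest, I am accepting it.
Answer 0 (score: 3)
That is basically a binary closing operation, with a threshold requirement on the gap to be closed. Here is an implementation based on it: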
import numpy as np
from scipy.ndimage import binary_closing

# a is the input array from the question
# Pad with ones so as to make binary closing work around the boundaries too
a_extm = np.hstack((True,a!=0,True))
# Perform binary closing and look for the elements that have not changed, indicating
# that the gaps in those cases were above the threshold requirement for closing
mask = a_extm == binary_closing(a_extm,structure=np.ones(3))
# Out of those, skip the non-zero elements of the original array and set the rest to NaN
out = np.where(~a_extm[1:-1] & mask[1:-1],np.nan,a)
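To make the closing step less of a black box, here is a minimal NumPy-only sketch of a 1-D binary closing with a 3-element structure (dilation followed by erosion). It is an illustration and not SciPy's implementation; `close_gaps_1d` is a hypothetical helper name, and the True padding mirrors the padding used in the answer.

```python
import numpy as np

def close_gaps_1d(nonzero):
    """1-D binary closing with a 3-element structure (hypothetical helper)."""
    # Pad with True, as the answer does, so zeros at the edges behave like interior gaps
    ext = np.concatenate(([True], nonzero, [True]))
    # Dilation: an element becomes True if it or either neighbour is True
    dilated = ext.copy()
    dilated[1:] |= ext[:-1]
    dilated[:-1] |= ext[1:]
    # Erosion: an element stays True only if it and both neighbours are True
    eroded = dilated.copy()
    eroded[1:] &= dilated[:-1]
    eroded[:-1] &= dilated[1:]
    # Drop the padding again
    return eroded[1:-1]

a = np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0,
              0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
closed = close_gaps_1d(a != 0)
# Gaps of one or two zeros were closed; longer gaps survive and become NaN
out = np.where(closed, a, np.nan)
```

Dilation grows every non-zero stretch by one element on each side, so a gap of one or two zeros disappears entirely, while a gap of three or more keeps at least one zero in the middle and is restored to its full width by the erosion.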
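An approach that avoids appending boundary elements as the previous one does, which could get somewhat expensive on large datasets, would look like this: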
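# Create the binary closed mask; the (a==0) guard keeps non-zero values from ever being marked
mask = ~binary_closing(a!=0,structure=np.ones(3)) & (a==0)
# Handle leading and trailing zeros separately (assumes a has at least one non-zero value)
idx = np.where(a)[0]
mask[:idx[0]] = idx[0]>=3
mask[idx[-1]+1:] = a.size - idx[-1] -1 >=3
# Use the mask to set NaNs in a
out = np.where(mask,np.nan,a)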
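For cross-checking the result without SciPy, a run-length approach in plain NumPy can be used. Note that this is a different technique from the binary closing above, not a reimplementation of it; `nan_long_zero_runs` and its `min_len` parameter are hypothetical names chosen for this sketch.

```python
import numpy as np

def nan_long_zero_runs(a, min_len=3):
    """Return a float copy of a with zero runs of at least min_len set to NaN."""
    a = np.asarray(a, dtype=float)
    zero = a == 0
    # Pad with False so that runs touching the edges still get a start and an end
    padded = np.concatenate(([False], zero, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)    # index of the first zero of each run
    ends = np.flatnonzero(edges == -1)     # index one past the last zero of each run
    lengths = ends - starts
    # Give every zero the verdict of its own run: True if the run is long enough
    mask = np.zeros(a.shape, dtype=bool)
    mask[zero] = np.repeat(lengths >= min_len, lengths)
    return np.where(mask, np.nan, a)

a = np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0,
              0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
out = nan_long_zero_runs(a)
```

The input array is left untouched, and `min_len=3` corresponds to the "more than two zeros" rule from the question.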
Answer 1 (score: 1)
Here is a function you could use on a list:
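import numpy as np

def replace(a_list):
    # Slide a window of three elements over the list, left to right
    for i in range(len(a_list) - 2):
        # A fresh run of three zeros
        window_is_zeros = a_list[i] == 0 and a_list[i+1] == 0 and a_list[i+2] == 0
        # Two NaNs written by an earlier step, followed by a zero: the run continues
        run_continues = np.isnan(a_list[i]) and np.isnan(a_list[i+1]) and a_list[i+2] == 0
        if window_is_zeros or run_continues:
            a_list[i] = np.nan
            a_list[i+1] = np.nan
            a_list[i+2] = np.nan
    return a_list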
Since the list is traversed in one direction only, you need just two comparisons: (0, 0, 0) or (NaN, NaN, 0), because you are replacing 0 with NaN as you go.
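To see why these two patterns are enough, the pass can be traced on a short input. The following is a minimal pure-Python sketch; `replace_with_trace` is a hypothetical variant that works on a copy and records which pattern matched at which index.

```python
import math

def replace_with_trace(values):
    """Left-to-right pass on a copy, recording which pattern matched at each index."""
    out = [float(v) for v in values]
    trace = []
    for i in range(len(out) - 2):
        x, y, z = out[i:i + 3]
        if x == 0 and y == 0 and z == 0:
            rule = "(0, 0, 0)"
        elif math.isnan(x) and math.isnan(y) and z == 0:
            rule = "(NaN, NaN, 0)"
        else:
            continue
        out[i:i + 3] = [math.nan] * 3
        trace.append((i, rule))
    return out, trace

out, trace = replace_with_trace([5, 0, 0, 0, 0, 7])
print(trace)  # [(1, '(0, 0, 0)'), (2, '(NaN, NaN, 0)')]
```

The first match starts the run, and every further zero of the same run is then picked up by the (NaN, NaN, 0) pattern one step later.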
Answer 2 (score: 1)
You could solve this with groupby from itertools:
import numpy as np
from itertools import groupby

l = np.array([0, 1, 0, 0, -6, 13, 0, 0, 0, 1, 16, 0, 0, 0, 0])

def _ret_list(k, it):
    # number of elements in the iterator, i.e., length of the run of equal items
    n = sum(1 for _ in it)
    if k == 0 and n > 2:
        # run has more than two zeros: replace each zero by np.nan
        return [np.nan] * n
    else:
        # return the run of equal items unchanged
        return [k] * n

# group consecutive equal items and apply _ret_list to each group
processed_l = [_ret_list(k, g) for k, g in groupby(l)]
# flatten the list and convert to a numpy array
processed_l = np.array([item for sub in processed_l for item in sub])
print(processed_l)
This gives you [ 0. 1. 0. 0. -6. 13. nan nan nan 1. 16. nan nan nan nan]. Note that each int is converted to a float. See here: NumPy or Pandas: Keeping array type as integer while having a NaN value
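The dtype change can be checked directly. NaN is a floating-point value, so an array that has to hold it ends up with a float dtype. A small sketch, with variable names chosen only for illustration:

```python
import numpy as np

ints = np.array([0, 1, -6, 13])        # integer dtype
mixed = np.array([0, 1, np.nan, 13])   # NaN forces a float dtype
print(mixed.dtype)                     # float64

# Assigning NaN into an integer array is rejected
try:
    ints[0] = np.nan
    assignment_failed = False
except ValueError:
    assignment_failed = True

# Converting to float first makes room for NaN
as_float = ints.astype(float)
as_float[0] = np.nan
```

If the data starts out as integers, converting with `astype(float)` before writing NaN avoids the error.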