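I was answering a question about the pandas interpolation method. The OP wanted to interpolate only where the number of consecutive np.nan values is exactly one. The limit=1 option of interpolate fills the first np.nan of a run and stops there. The OP wanted to detect that a run in fact contains more than one np.nan and then not fill even the first one.
I boiled this down to running interpolate as is and masking the consecutive np.nan values after the fact.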
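To make the difference concrete, here is a minimal sketch of both behaviours on a small Series. The names `limited`, `run_id`, `run_len`, `keep` and `result` are made up for illustration, and the run-length grouping is only one assumed way of building the mask, not the approach from the answers.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1],
              dtype=float)

# limit=1 fills the first NaN of every run, even when the run is longer than one.
limited = s.interpolate(limit=1)

# Alternative: interpolate everything, then restore NaN wherever the run
# of consecutive NaNs had length >= 2.
isnan = s.isna()
# A new run starts wherever the NaN flag changes value.
run_id = isnan.astype(int).diff().ne(0).cumsum()
# Number of NaNs in the run each element belongs to (0 for non-NaN runs).
run_len = isnan.astype(int).groupby(run_id).transform("sum")
keep = ~(isnan & (run_len >= 2))
result = s.interpolate().where(keep)

print(limited.isna().tolist())
print(result.isna().tolist())
```

With `limit=1` the runs at positions 1-3 and 8-9 are partially filled, whereas the masked version leaves them untouched and only fills the isolated NaN at position 5.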
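The question is: what is a generalized solution that takes a 1-d array a and an integer x and produces a boolean mask that is False at the positions of x or more consecutive np.nan?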
Consider the 1-d array a
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
I'd expect that for x = 2 the mask would look like this
# assume 1 for True and 0 for False
# a is [   1.  nan  nan  nan   1.  nan   1.   1.  nan  nan   1.   1.]
# mask [   1.   0.   0.   0.   1.   1.   1.   1.   0.   0.   1.   1.]
#                                   ^
#                                   |
# Notice that this is not masked because there is only one np.nan
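I'd expect that for x = 3 the mask would look like this
# assume 1 for True and 0 for False
# a is [   1.  nan  nan  nan   1.  nan   1.   1.  nan  nan   1.   1.]
# mask [   1.   0.   0.   0.   1.   1.   1.   1.   1.   1.   1.   1.]
#                                   ^              ^    ^
#                                   |              |    |
# Notice that these are not masked because there are fewer than 3 np.nan's
I look forward to learning from others' ideas ;-)
Answer 0 (score: 1)
I created this generalized solution, shown first with comments and then without.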
import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]
    # I will stride n. I want to pad with 1 less False than
    # the required number of np.nan's
    n = np.append(np.isnan(a), [False] * (x - 1))
    # prepare the mask and fill it with True
    # (plain bool is used because the np.bool8 alias is deprecated in newer NumPy)
    m = np.empty(k, dtype=bool)
    m.fill(True)
    # stride n into a number of columns equal to
    # the required number of np.nan's to mask
    # this is essentially a rolling all operation on isnull
    # also reshape with `[:, None]` in preparation for broadcasting
    # np.where finds the indices where we successfully start
    # x consecutive np.nan's
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    # since I prepped with `[:, None]` when I add `np.arange(x)`
    # I'm including the subsequent indices where the remaining
    # x - 1 np.nan's are
    i = i + np.arange(x)
    # I use `pd.unique` because it doesn't sort and I don't need to sort
    i = pd.unique(i[i < k])
    m[i] = False
    return m
Without comments:
import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]
    n = np.append(np.isnan(a), [False] * (x - 1))
    m = np.empty(k, dtype=bool)
    m.fill(True)
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    i = i + np.arange(x)
    i = pd.unique(i[i < k])
    m[i] = False
    return m
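The "rolling all" idea can also be sketched with `numpy.lib.stride_tricks.sliding_window_view` (available since NumPy 1.20), which builds the same kind of windows without hand-computed strides. This is a hypothetical variant, not the code from this answer; the name `mask_knans_windows` and the handling of `x` larger than the array are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def mask_knans_windows(a, x):
    """Hypothetical variant: False where x or more consecutive NaNs occur."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    isnan = np.isnan(np.asarray(a, dtype=float))
    k = isnan.shape[0]
    mask = np.ones(k, dtype=bool)
    if x > k:
        # No window of length x fits, so nothing can be masked (assumed behaviour).
        return mask
    # Rolling "all": True at every index where x consecutive NaNs start.
    starts = np.flatnonzero(sliding_window_view(isnan, x).all(axis=1))
    # Each start covers the positions start .. start + x - 1.
    idx = (starts[:, None] + np.arange(x)).ravel()
    mask[idx] = False
    return mask


a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
print(mask_knans_windows(a, 2).astype(int))
print(mask_knans_windows(a, 3).astype(int))
```

For the array from the question this should reproduce the two masks sketched there for x = 2 and x = 3. Duplicate indices are harmless here because assigning False twice has no effect, so no de-duplication step is needed.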
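Answer 1 (score: 1)
I really like numba for problems like this that are easy to grasp but hard to "numpyfy"! Even though the package might be a bit too heavy a dependency for most libraries, it lets you write such "Python"-like functions without losing too much speed: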
import numpy as np
import numba as nb
import math

@nb.njit
def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            # If we just reached the limit we need to backtrack,
            # otherwise just mask current.
            if cnt == limit:
                for subidx in range(idx-limit+1, idx+1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0
    return result
This should be fairly easy to understand, at least if you are used to pure Python, and it should work:
>>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
>>> mask_nan_if_consecutive(a, 1)
array([ 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 1.])
>>> mask_nan_if_consecutive(a, 2)
array([ 1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1.])
>>> mask_nan_if_consecutive(a, 3)
array([ 1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.])
>>> mask_nan_if_consecutive(a, 4)
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
But the really nice thing about the @nb.njit decorator is that this function will be fast:
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
i = 2
res1 = mask_nan_if_consecutive(a, i)
res2 = mask_knans(a, i)
np.testing.assert_array_equal(res1, res2)
%timeit mask_nan_if_consecutive(a, i) # 100000 loops, best of 3: 6.03 µs per loop
%timeit mask_knans(a, i) # 1000 loops, best of 3: 302 µs per loop
So for short arrays this is roughly 50 times faster, and although the difference gets smaller, it is still faster for longer arrays:
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
i = 2
%timeit mask_nan_if_consecutive(a, i) # 10 loops, best of 3: 20.9 ms per loop
%timeit mask_knans(a, i) # 10 loops, best of 3: 154 ms per loop
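For readers who would rather avoid the numba dependency, the same run-counting idea can be approximated in plain NumPy with run-length encoding. This is only a sketch under assumptions: `mask_nan_runs` is a made-up name, it returns a boolean mask rather than the float array of ones and zeros above, and it has not been benchmarked against either function.

```python
import numpy as np


def mask_nan_runs(arr, limit):
    """Hypothetical NumPy-only variant: False where a NaN run has length >= limit."""
    isnan = np.isnan(np.asarray(arr, dtype=float))
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], isnan, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # index of the first NaN of each run
    ends = np.flatnonzero(edges == -1)    # index just past the last NaN of each run
    lengths = ends - starts
    # Spread each run's length back onto its elements; non-NaN positions stay 0.
    run_len = np.zeros(isnan.shape[0], dtype=np.intp)
    run_len[isnan] = np.repeat(lengths, lengths)
    return run_len < limit


a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
print(mask_nan_runs(a, 2).astype(int))
print(mask_nan_runs(a, 3).astype(int))
```

The sketch assumes `limit` is at least 1, since non-NaN positions carry a run length of 0 and must always compare as unmasked.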