Mask only where consecutive NaNs exceed x

Time: 2017-03-29 00:41:07

Tags: python pandas numpy

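This came up while I was answering a question about the pandas interpolation method. The OP wanted to interpolate only where the number of consecutive np.nan values was one. The limit=1 option of interpolate fills the first np.nan of a run and stops there. The OP wanted to be able to tell that there was in fact more than one np.nan and not even bother with the first one.

I boiled this down to executing interpolate as is and masking the consecutive np.nan values after the fact.

The question is: what is a generalized solution that takes a 1-d array a and an integer x, and produces a boolean mask that is False at the positions of x or more consecutive np.nan?

Consider the 1-d array a

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])

For x = 2, I expect the mask to look like this

# assume 1 for True and 0 for False 
# a is [  1.  nan  nan  nan   1.  nan   1.   1.  nan  nan   1.   1.]
# mask [  1.   0.   0.   0.   1.   1.   1.   1.   0.   0.   1.   1.]
#                                  ^
#                                  |
#   Notice that this is not masked because there is only one np.nan

For x = 3, I expect the mask to look like this

# assume 1 for True and 0 for False 
# a is [  1.  nan  nan  nan   1.  nan   1.   1.  nan  nan   1.   1.]
# mask [  1.   0.   0.   0.   1.   1.   1.   1.   1.   1.   1.   1.]
#                                  ^              ^    ^
#                                  |              |    |
# Notice that these are not masked because there are fewer than 3 np.nan's

I look forward to learning from others' ideas ;-)

2 answers:

Answer 0: (score: 1)

I created this generalized solution, shown first with comments and then without.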

import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]

    # I will stride n.  I want to pad with 1 less False than
    # the required number of np.nan's
    n = np.append(np.isnan(a), [False] * (x - 1))

    # prepare the mask and fill it with True
    m = np.ones(k, dtype=bool)

    # stride n into a number of columns equal to
    # the required number of np.nan's to mask
    # this is essentially a rolling all operation on isnull
    # also reshape with `[:, None]` in preparation for broadcasting
    # np.where finds the indices where we successfully start
    # x consecutive np.nan's
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]

    # since I prepped with `[:, None]` when I add `np.arange(x)`
    # I'm including the subsequent indices where the remaining
    # x - 1 np.nan's are
    i = i + np.arange(x)

    # I use `pd.unique` because it doesn't sort and I don't need to sort
    i = pd.unique(i[i < k])

    m[i] = False

    return m

Without comments:

import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]
    n = np.append(np.isnan(a), [False] * (x - 1))
    m = np.ones(k, dtype=bool)
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    i = i + np.arange(x)
    i = pd.unique(i[i < k])
    m[i] = False
    return m
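
The strided trick above amounts to a rolling "all" over windows of length x. As a rough sketch of the same idea without hand-computed strides, NumPy 1.20 and later provide sliding_window_view. The function name mask_knans_windows and the handling of x larger than the array are assumptions made for illustration, not part of the answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mask_knans_windows(a, x):
    """Hypothetical variant: False wherever x or more consecutive NaNs occur."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    isnan = np.isnan(np.asarray(a, dtype=float))
    mask = np.ones(isnan.shape[0], dtype=bool)
    if x > isnan.shape[0]:
        # Assumption: no run can be that long, so nothing is masked.
        return mask
    # Row j of the view is isnan[j:j + x]; an all-True row starts x NaNs in a row.
    starts = np.flatnonzero(sliding_window_view(isnan, x).all(axis=1))
    # Every position covered by a qualifying window gets masked.
    covered = (starts[:, None] + np.arange(x)).ravel()
    mask[covered] = False
    return mask

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
print(mask_knans_windows(a, 2).astype(int))  # [1 0 0 0 1 1 1 1 0 0 1 1]
print(mask_knans_windows(a, 3).astype(int))  # [1 0 0 0 1 1 1 1 1 1 1 1]
```

Overlapping windows produce repeated indices in covered, which is harmless for a boolean assignment, so no de-duplication step is needed here.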

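Answer 1: (score: 1)

I really like such easy-to-grasp but hard-to-"numpyfy" problems! Even though numba might be a bit too heavy a dependency for most libraries, it allows writing such "Python"-like functions without losing too much speed: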

import numpy as np
import numba as nb
import math

@nb.njit
def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            # If we just reached the limit we need to backtrack,
            # otherwise just mask current.
            if cnt == limit:
                for subidx in range(idx-limit+1, idx+1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0

    return result

At least if you are used to pure Python, this should be quite easy to understand, and it should work:

>>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
>>> mask_nan_if_consecutive(a, 1)
array([ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 2)
array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 3)
array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 4)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
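
For readers without numba installed, the same counting loop also runs as plain Python, just slower. The following is a minimal sketch under that assumption; the name mask_nan_runs_py is hypothetical, and unlike the function above it returns a list of booleans rather than a float array.

```python
import math

def mask_nan_runs_py(values, limit):
    """Hypothetical pure-Python variant: False marks NaNs in runs of >= limit."""
    result = [True] * len(values)
    count = 0  # length of the NaN run ending at the current index
    for idx, value in enumerate(values):
        if math.isnan(value):
            count += 1
            if count == limit:
                # The run just reached the limit: mask it from its first NaN.
                for back in range(idx - limit + 1, idx + 1):
                    result[back] = False
            elif count > limit:
                result[idx] = False
        else:
            count = 0
    return result

nan = float("nan")
a = [1, nan, nan, nan, 1, nan, 1, 1, nan, nan, 1, 1]
print([int(flag) for flag in mask_nan_runs_py(a, 2)])
# [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
```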

But the really nice thing about the @nb.njit decorator is that this function will be fast:

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
i = 2

res1 = mask_nan_if_consecutive(a, i)
res2 = mask_knans(a, i)
np.testing.assert_array_equal(res1, res2)

%timeit mask_nan_if_consecutive(a, i)  # 100000 loops, best of 3: 6.03 µs per loop
%timeit mask_knans(a, i)               # 1000 loops, best of 3: 302 µs per loop

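So for short arrays this is approximately 50 times faster, and even though the difference gets smaller, it is still faster for longer arrays: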

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
i = 2

%timeit mask_nan_if_consecutive(a, i)  # 10 loops, best of 3: 20.9 ms per loop
%timeit mask_knans(a, i)               # 10 loops, best of 3: 154 ms per loop
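
To tie the mask back to the interpolation use case from the question, here is a rough sketch of "interpolate everything, then re-mask the long gaps". It is not taken from either answer: the helper name interpolate_short_gaps and the sample values are made up for illustration, and the mask is computed with a pandas run-length idiom instead of the functions above so that the block stands alone.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s, x):
    """Hypothetical helper: interpolate, but leave runs of >= x NaNs untouched."""
    isnan = s.isna().astype(int)
    # A new run id starts whenever the NaN flag changes
    # (the first diff is NaN, which also counts as a change).
    run_id = isnan.diff().ne(0).cumsum()
    # NaN count of the run each element belongs to; 0 for non-NaN runs.
    run_len = isnan.groupby(run_id).transform("sum")
    mask = run_len < x
    # Interpolate as is, then put NaN back wherever the mask is False.
    return s.interpolate().where(mask)

s = pd.Series([1, np.nan, np.nan, np.nan, 5, np.nan, 7, 8, np.nan, np.nan, 11, 12])
result = interpolate_short_gaps(s, 2)
print(result.tolist())
# [1.0, nan, nan, nan, 5.0, 6.0, 7.0, 8.0, nan, nan, 11.0, 12.0]
```

With x = 2 only the single isolated NaN is filled, which matches what the OP of the interpolation question was after.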