numpy / pandas diff: distribute the diff evenly across the enclosed nan elements

Time: 2017-03-01 17:44:07

Tags: python pandas numpy

I have a numpy array (not necessarily sorted):

[2.0, 3.0, nan, nan, nan, 5.0]

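I want to compute the differences of this array. The difference between the last element, 5, and the second element, 3, is 2. I would like this difference of 2 to be distributed evenly over the enclosed nan elements of my numpy array. If I try numpy.diff (I also tried using masked arrays), I get: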

[nan, 1, nan, nan, nan, nan]

The result should look like this:

[nan, 1, 0.5, 0.5, 0.5, 0.5]
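For reference, here is a minimal sketch of the behaviour described above. It assumes the six-element output quoted in the question came from a pandas-style diff, which is a guess on my part: numpy.diff itself returns one element fewer than its input.

```python
import numpy as np
import pandas as pd

a = np.array([2.0, 3.0, np.nan, np.nan, np.nan, 5.0])

# numpy.diff returns len(a) - 1 values; any difference touching a nan is nan.
np_result = np.diff(a)

# pandas keeps the original length and puts nan in the first slot.
pd_result = pd.Series(a).diff().to_numpy()

print(np_result)
print(pd_result)
```

Either way, the step from 3 to 5 is lost, because every pairwise difference inside the gap involves at least one nan.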

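Update:

I got answers for the specific case above, but the given answers do not work in a more general form, for example when there are leading/trailing nans or when nans and values alternate: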

[nan, nan, 2.0, 3.0, nan, nan, nan, 5.0, nan, 6.0, nan]

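4 answers:

Answer 0: (score: 1)

Assuming what you want is to map output[i] to the difference between input[i] and input[i-1], and that in the special case of nans you want to distribute the difference across the nans, I think this is what you want: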

import numpy as np

def arrdiffs(a):
    out = np.zeros(len(a))
    diff=np.nan
    difflen=0
    for i,e in enumerate(a):
        if i==0: 
            # in the first cell we always output nan
            out[i]=np.nan
        elif np.isnan(a[i]): 
            # when the input is nan, just increase difflen
            difflen+=1
        elif np.isnan(a[i-1]):
            # when the previous input is nan, but this one isn't
            # distribute the diff across the previous cells and this one
            difflen+=1
            m=float(abs(a[i]-diff))
            for j in range(i-difflen+1,i+1):
                out[j]=m/difflen
            difflen=0
            diff=a[i]
        else:
            # otherwise simply do the diff locally between this cell and
            # previous
            out[i]=abs(a[i]-a[i-1])
            diff=a[i] # write down diff in case the next input cells are nan
            difflen=0

    return out

a=np.array([2.0,3.0,np.nan,np.nan,np.nan,5.0])
print(arrdiffs(a))

Edit: switched to 4-space indentation instead of 2, turned the if/else into elifs, and added comments on each branch.

When I run it, I get your expected output:

$ python arrdiffs.py
[ nan  1.   0.5  0.5  0.5  0.5]

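Edit: switched the initial value of diff to np.nan to account for the case where the array starts with a run of nans; presumably we should just output nan until we have at least one initial value. Waiting for the OP to clarify what the goal is. Also changed the assignment of diff to a[i] in the case where a[i-1] is nan but a[i] is not (this was a bug). For the new test case provided by the OP: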

[np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 5.0, np.nan, 6.0, np.nan]

This updated code gives:

>>> [ nan  nan  nan  1.   0.5  0.5  0.5  0.5  0.5  0.5  0. ]

Is this what the OP wants? Seeking clarification.

Answer 1: (score: 1)

This should do the job:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: a = [2.0, 3.0, np.nan, np.nan, np.nan, 5.0]

In [4]: s = pd.Series(a)

In [5]: result = s.reset_index()\
   ...:           .dropna()\
   ...:           .diff()\
   ...:           .pipe(lambda x: x[0]/x['index'])\
   ...:           .reindex(s.index)\
   ...:           .bfill()

In [6]: result[0] = np.nan

In [7]: result
Out[7]: 
0    NaN
1    1.0
2    0.5
3    0.5
4    0.5
5    0.5
dtype: float64
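The chain above divides each value step by the index gap between consecutive non-nan entries and back-fills the result into the gap. With leading nans, the back-fill would also reach the positions before the first valid value, so a more general variant has to blank those explicitly. The following is only a sketch of that idea: `spread_diff` is a hypothetical helper name, and it assumes a default integer index (0, 1, 2, ...) and signed differences.

```python
import numpy as np
import pandas as pd

def spread_diff(values):
    """Spread each step between valid values evenly over the gap before it."""
    s = pd.Series(values, dtype=float)
    valid = s.dropna()
    if valid.empty:
        # Nothing to diff: return all-nan with the original length.
        return pd.Series(np.nan, index=s.index)
    # Value step divided by the number of index positions it spans.
    gap = valid.index.to_series().diff()
    per_step = valid.diff() / gap
    # Put the per-step values back on the full index and fill each gap
    # from the valid value that closes it.
    out = per_step.reindex(s.index).bfill()
    # Positions up to and including the first valid value have no predecessor.
    first = int(valid.index[0])
    out.iloc[: first + 1] = np.nan
    return out

general = [np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 5.0, np.nan, 6.0, np.nan]
print(spread_diff(general).to_numpy())
```

In this sketch a trailing nan stays nan, because there is no later valid value to back-fill from.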

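Answer 2: (score: 1)

I would simply interpolate the nans first. That way the two steps stay cleanly separated, which makes it easier to change how the interpolation is done.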

import numpy as np

a = np.array([2.0, 3.0, np.nan, np.nan, np.nan, 5.0])
x = np.arange(a.size)

a_filled = np.interp(x, x[np.isfinite(a)], a[np.isfinite(a)])

np.diff(a_filled)

# results in
array([ 1. ,  0.5,  0.5,  0.5,  0.5])

For fancier interpolation, pandas may be a good choice; it also has a .diff() method for DataFrames.
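By default np.interp clamps positions outside the known range to the edge values, so leading or trailing nans would be filled with a constant and produce zero differences. One possible way to keep them as nan is to pass `left` and `right` explicitly. This is a sketch, not part of the original answer: `interp_diff` is a hypothetical helper name, it assumes at least one finite value in the input, and it prepends a nan so the output has the same length as the input.

```python
import numpy as np

def interp_diff(a):
    """Interpolate interior nans linearly, then diff; edges stay nan."""
    a = np.asarray(a, dtype=float)
    x = np.arange(a.size)
    valid = np.isfinite(a)
    # left/right control the fill value outside the range of valid points.
    filled = np.interp(x, x[valid], a[valid], left=np.nan, right=np.nan)
    # Prepend nan so out[i] lines up with a[i] - a[i-1].
    return np.concatenate(([np.nan], np.diff(filled)))

general = [np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 5.0, np.nan, 6.0, np.nan]
print(interp_diff(general))
```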

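Answer 3: (score: 1)

Thanks to Rutger Kassies, I looked into pandas, which has out-of-the-box methods that solve this general problem:

Convert the array to a DataFrame, interpolate the DataFrame, and take the diff: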

    import numpy as np
    import pandas as pd
    array = [np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 5.0, np.nan, 6.0, np.nan]
    df = pd.DataFrame(array)
    interpolation = df.interpolate()
    diff = interpolation.diff()

The result is:

[NaN, NaN, NaN, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
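The trailing 0.0 appears because the default linear interpolation also carries the last valid value forward into the trailing nan. If that is not wanted, restricting the fill to gaps surrounded by valid values may help. A minimal sketch, assuming a pandas version that supports the `limit_area` argument of `interpolate`:

```python
import numpy as np
import pandas as pd

values = [np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, np.nan, 5.0, np.nan, 6.0, np.nan]
s = pd.Series(values)

# limit_area="inside" fills only nans that lie between two valid values,
# so leading and trailing nans are left untouched.
inside_only = s.interpolate(limit_area="inside")
diff_inside = inside_only.diff()

print(diff_inside.to_numpy())
```

With this variant the last element of the diff is nan rather than 0.0; which of the two is preferable depends on what the trailing gap is supposed to mean.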