Python pandas DataFrame: keep one non-zero, non-NaN value every n rows

Time: 2016-11-28 06:41:14

Tags: python pandas dataframe

I have a pandas DataFrame defined as follows:

2009-11-18  500.0
2009-11-19  500.0
2009-11-20    NaN
2009-11-23  500.0
2009-11-24  500.0
2009-11-25    NaN
2009-11-27    NaN
2009-11-30    NaN
2009-12-01  500.0
2009-12-02  500.0
2009-12-03  500.0
2009-12-04  500.0
2009-12-07    NaN
2009-12-08    NaN
2009-12-09  500.0
2009-12-10  500.0
2009-12-11  500.0
2009-12-14  500.0

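My intent is to keep one non-NaN element every n rows. For example, if n is 4, I would keep 2009-11-18 500 and set everything else up to and including 2009-11-23 to 0, then repeat the same for the rest of the array. Is there an efficient, Pythonic, vectorized way to do this?

To make this more concrete, I want the array to end up looking like this: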

2009-11-18  500.0
2009-11-19  0
2009-11-20  0
2009-11-23  0
2009-11-24  500.0
2009-11-25  0
2009-11-27  0
2009-11-30  0
2009-12-01  500.0
2009-12-02  0
2009-12-03  0
2009-12-04  0
2009-12-07  0
2009-12-08  0
2009-12-09  500.0
2009-12-10  0
2009-12-11  0
2009-12-14  0

2 answers:

Answer 0 (score: 1)

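I think you can first create group labels with floor division of np.arange, then use groupby and take the index of the first non-NaN value in each group with idxmin. Finally, use where to set every row whose index is not in that result to 0. Note that idxmin skips NaN and returns the first occurrence of the group minimum, so it yields the first non-NaN row here only because all non-NaN values are equal: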
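import numpy as np
import pandas as pd

# one group label per row: rows 0-3 -> 0, rows 4-7 -> 1, ...
print (np.arange(len(df.index)) // 4)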
[0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4]

idx = df.col.groupby([np.arange(len(df.index)) // 4]).idxmin()
print (idx)
0   2009-11-18
1   2009-11-24
2   2009-12-01
3   2009-12-09
4   2009-12-11
Name: col, dtype: datetime64[ns]

df.col = df.col.where(df.index.isin(idx), 0)
print (df)
              col
2009-11-18  500.0
2009-11-19    0.0
2009-11-20    0.0
2009-11-23    0.0
2009-11-24  500.0
2009-11-25    0.0
2009-11-27    0.0
2009-11-30    0.0
2009-12-01  500.0
2009-12-02    0.0
2009-12-03    0.0
2009-12-04    0.0
2009-12-07    0.0
2009-12-08    0.0
2009-12-09  500.0
2009-12-10    0.0
2009-12-11  500.0
2009-12-14    0.0
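
For readers who want to run this end to end, here is a minimal, self-contained sketch of the same fixed-block idea. It is a variant rather than the exact code above: instead of idxmin it marks the first non-NaN row of each block with a per-group running count, which does not rely on all non-NaN values being equal. The names `groups`, `rank` and `first_valid` are illustrative, and the frame is rebuilt from the data in the question with a column named `col` as in the answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
dates = pd.to_datetime([
    "2009-11-18", "2009-11-19", "2009-11-20", "2009-11-23", "2009-11-24",
    "2009-11-25", "2009-11-27", "2009-11-30", "2009-12-01", "2009-12-02",
    "2009-12-03", "2009-12-04", "2009-12-07", "2009-12-08", "2009-12-09",
    "2009-12-10", "2009-12-11", "2009-12-14",
])
values = [500.0, 500.0, np.nan, 500.0, 500.0, np.nan, np.nan, np.nan, 500.0,
          500.0, 500.0, 500.0, np.nan, np.nan, 500.0, 500.0, 500.0, 500.0]
df = pd.DataFrame({"col": values}, index=dates)

n = 4
# Fixed block label for every row: 0,0,0,0,1,1,1,1,...
groups = np.arange(len(df)) // n

valid = df["col"].notna()
# Running count of non-NaN values inside each block
rank = valid.astype(int).groupby(groups).cumsum()
# The first non-NaN row of a block is the valid row whose count is 1
first_valid = valid & (rank == 1)

# Keep those rows, set everything else (including NaN) to 0
result = df["col"].where(first_valid, 0)
print(result)
```

A block that contains only NaN simply contributes no kept row in this variant, because no row in it ever reaches a count of 1.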

Workaround for when the last group is shorter than 4, so that the value in that last group is omitted:

arr = np.arange(len(df.index)) // 4
print (arr)
[0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 4 4]

# give the rows of the last group the previous group's label (subtract 1)
arr1 = np.where(arr == arr[-1], arr[-1] - 1, arr)
print (arr1)
[0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 3 3]

idx = df.col.groupby(arr1).idxmin()
print (idx)
0   2009-11-18
1   2009-11-24
2   2009-12-01
3   2009-12-09
Name: col, dtype: datetime64[ns]
df.col = df.col.where(df.index.isin(idx), 0)
print (df)
              col
2009-11-18  500.0
2009-11-19    0.0
2009-11-20    0.0
2009-11-23    0.0
2009-11-24  500.0
2009-11-25    0.0
2009-11-27    0.0
2009-11-30    0.0
2009-12-01  500.0
2009-12-02    0.0
2009-12-03    0.0
2009-12-04    0.0
2009-12-07    0.0
2009-12-08    0.0
2009-12-09  500.0
2009-12-10    0.0
2009-12-11    0.0
2009-12-14    0.0
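
One caveat about the relabeling above: `np.where(arr == arr[-1], ...)` is applied unconditionally, so if the number of rows happens to be an exact multiple of 4, a full last group would also be folded into the previous one. A hedged sketch of a guard follows; `block_labels` is a hypothetical helper name, not something from the answer, and it only merges the tail when it is genuinely short.

```python
import numpy as np

def block_labels(length, n):
    """Label positions 0..length-1 in blocks of n rows.

    A trailing block shorter than n is folded into the previous block;
    a full trailing block is left alone.
    """
    labels = np.arange(length) // n
    # Merge only if there is a previous block and the tail is short
    if length > n and length % n != 0:
        labels[labels == labels[-1]] -= 1
    return labels

print(block_labels(18, 4))  # short tail of 2 rows is merged into block 3
print(block_labels(16, 4))  # exact multiple: four full blocks, nothing merged
```

The resulting array can be passed to groupby in place of `arr1`.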

Answer 1 (score: 1)

IIUC, you restart the counter each time you pick up the next value. In that case I would use a generator. Not vectorized!

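def next4(s):
    # label of the first non-NaN value
    idx = s.first_valid_index()
    while idx is not None:
        loc = s.index.get_loc(idx)
        # emit the kept value as a one-row Series
        yield s.loc[[idx]]
        # restart the search 4 positions after the kept row
        idx = s.iloc[loc+4:].first_valid_index()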
pd.concat(next4(df[1])).reindex(df.index, fill_value=0).to_frame()
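
The same restart-the-counter idea can also be sketched positionally, which makes it easy to check against the expected output in the question. This is an illustrative alternative, not the answerer's code: `keep_one_every_n` is a hypothetical name, and the sample Series is rebuilt from the question's data.

```python
import numpy as np
import pandas as pd

def keep_one_every_n(s, n):
    """Keep a non-NaN value, zero out the rest, and restart the search
    n positions after each kept row."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sorted positions of all non-NaN rows
    valid_pos = np.flatnonzero(s.notna().to_numpy())
    keep = np.zeros(len(s), dtype=bool)
    i = 0
    while i < len(valid_pos):
        pos = valid_pos[i]
        keep[pos] = True
        # Jump to the first non-NaN position at or after pos + n
        i = np.searchsorted(valid_pos, pos + n)
    return s.where(keep, 0)

# Sample data copied from the question
dates = pd.to_datetime([
    "2009-11-18", "2009-11-19", "2009-11-20", "2009-11-23", "2009-11-24",
    "2009-11-25", "2009-11-27", "2009-11-30", "2009-12-01", "2009-12-02",
    "2009-12-03", "2009-12-04", "2009-12-07", "2009-12-08", "2009-12-09",
    "2009-12-10", "2009-12-11", "2009-12-14",
])
values = [500.0, 500.0, np.nan, 500.0, 500.0, np.nan, np.nan, np.nan, 500.0,
          500.0, 500.0, 500.0, np.nan, np.nan, 500.0, 500.0, 500.0, 500.0]
s = pd.Series(values, index=dates)

result = keep_one_every_n(s, 4)
print(result)
```

On this data the kept rows are 2009-11-18, 2009-11-24, 2009-12-01 and 2009-12-09, which is the layout the question asks for. The loop still runs once per kept value, so like the generator it is not fully vectorized.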
