Cumulative count of all NaN values in a specific column

Asked: 2021-05-05 12:03:37

Tags: pandas numpy count nan

I have a DataFrame:

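import numpy as np
import pandas as pd

# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
# freq="M" means month-end; pandas 2.2+ deprecates this alias in favor of "ME"
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")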
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]

# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df

   ID   election_date   stock_price
0   2   2010-07-31       7.0
1   2   2010-06-30       NaN
2   2   2010-05-31       5.0
3   1   2010-04-30       4.0
4   1   2010-03-31       NaN
5   1   2010-02-28       NaN
6   1   2010-01-31       1.0

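For each ID, I want a cumulative count of the np.nan values in the stock_price column. Judging by the result below, the count runs from the earliest date forward and does not include the current row.

The expected result is: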

df

   ID   election_date   stock_price  cum_count_nans
0   2   2010-07-31       7.0            1
1   2   2010-06-30       NaN            0
2   2   2010-05-31       5.0            0   
3   1   2010-04-30       4.0            2  
4   1   2010-03-31       NaN            1
5   1   2010-02-28       NaN            0
6   1   2010-01-31       1.0            0

Any ideas on how to solve this?

1 Answer:

Answer 0 (score: 2)

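The idea is to reverse the row order by position, then, inside a custom function, test for missing values, shift the result by one row so the current row is not counted, and take the cumulative sum: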

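# flag NaNs, shift down one row (fill_value=False keeps the Series boolean), then take a running total
f = lambda x: x.isna().shift(fill_value=False).cumsum()
# iloc[::-1] puts the rows in ascending date order; the assignment aligns the result back to df by index
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print(df)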
   ID election_date  stock_price cum_count_nans
0   2    2010-07-31          7.0              1
1   2    2010-06-30          NaN              0
2   2    2010-05-31          5.0              0
3   1    2010-04-30          4.0              2
4   1    2010-03-31          NaN              1
5   1    2010-02-28          NaN              0
6   1    2010-01-31          1.0              0
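
To see why the shift is needed, it may help to look at the intermediate steps for a single group. The following is only a sketch: the names `rev`, `group`, and `steps` are made up for illustration, and the dates are written out explicitly rather than generated with `pd.date_range`. It walks through the `ID == 1` rows and then applies the same logic to every group.

```python
import numpy as np
import pandas as pd

# Same data as in the question, already sorted newest first.
df = pd.DataFrame({
    "ID": [2, 2, 2, 1, 1, 1, 1],
    "election_date": pd.to_datetime([
        "2010-07-31", "2010-06-30", "2010-05-31", "2010-04-30",
        "2010-03-31", "2010-02-28", "2010-01-31",
    ]),
    "stock_price": [7.0, np.nan, 5.0, 4.0, np.nan, np.nan, 1.0],
})

# Reverse by position so each group is read oldest first (index 6, 5, 4, ...).
rev = df.iloc[::-1]

# Intermediate steps for the ID == 1 group only (index 6, 5, 4, 3).
group = rev.loc[rev["ID"] == 1, "stock_price"]
steps = pd.DataFrame({
    "stock_price": group,                             # 1.0, NaN, NaN, 4.0
    "is_nan": group.isna(),                           # False, True, True, False
    "shifted": group.isna().shift(fill_value=False),  # False, False, True, True
})
steps["cum_count_nans"] = steps["shifted"].cumsum()   # 0, 0, 1, 2
print(steps)

# The same three steps applied per ID; assignment aligns on the index,
# so the reversed result lands on the right rows of df.
df["cum_count_nans"] = rev.groupby("ID")["stock_price"].transform(
    lambda s: s.isna().shift(fill_value=False).cumsum()
)
print(df)
```

Without the shift, each NaN row would count itself, so the rows at index 5 and 4 would show 1 and 2 instead of 0 and 1.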