我有一个数据框:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date stock_price
0 2 2010-07-31 7.0
1 2 2010-06-30 NaN
2 2 2010-05-31 5.0
3 1 2010-04-30 4.0
4 1 2010-03-31 NaN
5 1 2010-02-28 NaN
6 1 2010-01-31 1.0
我想为每个 np.nan
计算列 stock_price
的所有 ID
的累积计数。
预期结果是:
df
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
任何想法如何解决它?
答案 0 :(得分:2)
想法是通过索引改变顺序,然后在自定义函数中测试缺失值、移位和使用的累积和:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print (df)
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0