Pandas Groupby.diff用零填充缺少的行

时间:2018-01-10 16:13:41

标签: python pandas missing-data pandas-groupby

我确信这张贴在某个地方,或者说这么简单,我没有看到它,但我没有找到发布的运气。任何帮助都会被大大占用。

我正在尝试做一个groupby.diff,你可以看到。如果缺少日期,我需要显示负值。

df['delta'] = df.groupby(['ID', 'ticker', 'date'])['shares'].diff()

ID  ticker date         shares  delta
A   AAA    3/31/2012    904180  675010
A   AAA    12/31/2011   229170  NaN
A   BBB    3/31/2012    517756  390117
A   BBB    12/31/2011   127639  NaN
A   CCC    12/31/2011   1757    NaN
A   DDD    12/31/2011   500     NaN
B   AAA    3/31/2012    920920  554920
B   AAA   12/31/2011    366000  NaN
B   BBB    3/31/2012    524     393
B   BBB   12/31/2011    131     NaN

我想我需要填充/填充才能得到这个:

ID  ticker date         shares  delta
A   AAA    3/31/2012    904180  675010
A   AAA    12/31/2011   229170  NaN
A   BBB    3/31/2012    517756  390117
A   BBB    12/31/2011   127639  NaN
A   CCC    3/31/2012    0       -1757
A   CCC    12/31/2011   1757    NaN
A   DDD    3/31/2012    0       -500
A   DDD    12/31/2011   500     NaN
B   AAA    3/31/2012    920920  554920
B   AAA   12/31/2011    366000  NaN
B   BBB    3/31/2012    524     393
B   BBB   12/31/2011    131     NaN

再次感谢,

1 个答案:

答案 0 :(得分:2)

使用unstack + stack

New_df=df.set_index(['ID','ticker','date']).unstack('date').stack(dropna=False).reset_index().fillna(0)
New_df['delta'] = New_df.groupby(['ID', 'ticker', 'date'])['shares'].diff()

# you should not groupby date, it will return all NaN after you did diff
New_df['delta'] = New_df.groupby(['ID', 'ticker'])['shares'].diff()
#New_df['delta'] = New_df.groupby(['ID', 'ticker','date'])['shares'].diff()
New_df
Out[316]: 
   ID ticker        date    shares     delta
0   A    AAA  12/31/2011  229170.0       NaN
1   A    AAA   3/31/2012  904180.0  675010.0
2   A    BBB  12/31/2011  127639.0       NaN
3   A    BBB   3/31/2012  517756.0  390117.0
4   A    CCC  12/31/2011    1757.0       NaN
5   A    CCC   3/31/2012       0.0   -1757.0
6   A    DDD  12/31/2011     500.0       NaN
7   A    DDD   3/31/2012       0.0    -500.0
8   B    AAA  12/31/2011  366000.0       NaN
9   B    AAA   3/31/2012  920920.0  554920.0
10  B    BBB  12/31/2011     131.0       NaN
11  B    BBB   3/31/2012     524.0     393.0

排序后

New_df.sort_values(['ID','ticker','date'],ascending=[True,True,False])
Out[318]: 
   ID ticker        date    shares     delta
1   A    AAA   3/31/2012  904180.0  675010.0
0   A    AAA  12/31/2011  229170.0       NaN
3   A    BBB   3/31/2012  517756.0  390117.0
2   A    BBB  12/31/2011  127639.0       NaN
5   A    CCC   3/31/2012       0.0   -1757.0
4   A    CCC  12/31/2011    1757.0       NaN
7   A    DDD   3/31/2012       0.0    -500.0
6   A    DDD  12/31/2011     500.0       NaN
9   B    AAA   3/31/2012  920920.0  554920.0
8   B    AAA  12/31/2011  366000.0       NaN
11  B    BBB   3/31/2012     524.0     393.0
10  B    BBB  12/31/2011     131.0       NaN