Pandas LTM sum with duplicates

Asked: 2017-11-10 19:59:48

Tags: python pandas pandas-groupby

I am trying to compute a last-twelve-months (LTM) rolling sum of a numeric column, grouped by entity ID. My data looks like this:

    eID    perioddate  123456  
14  ABC    2011-01-31  31773.0 
74  ABC    2011-01-31  31773.0 
35  ABC    2011-01-31  31773.0 
96  ABC    2011-01-31  31773.0 
57  ABC    2011-04-30  11209.0 
18  ABC    2011-04-30  11209.0 
81  ABC    2011-07-31  11451.0 
44  ABC    2011-07-31  11451.0 
07  ABC    2011-07-31  11451.0 
70  ABC    2011-10-31  20062.0 
34  ABC    2011-10-31  20062.0 
98  ABC    2011-10-31  20062.0 
62  ABC    2012-01-31  42512.0 
26  ABC    2012-01-31  42512.0 
90  ABC    2012-01-31  42512.0 
56  ABC    2012-01-31  42512.0 
24  ABC    2012-04-30  41799.0 
92  ABC    2012-04-30  41799.0 
60  ABC    2012-07-31  41874.0 
28  ABC    2012-07-31  41874.0 
99  ABC    2012-07-31  41874.0 
69  ABC    2012-10-31  46783.0 

I want every row to carry the rolling sum, provided there is at least one full year of history, so the new column would look like this:

    eID    perioddate  123456  123456_ltm
14  ABC    2011-01-31  31773.0        
74  ABC    2011-01-31  31773.0        
35  ABC    2011-01-31  31773.0        
96  ABC    2011-01-31  31773.0        
57  ABC    2011-04-30  11209.0        
18  ABC    2011-04-30  11209.0        
81  ABC    2011-07-31  11451.0        
44  ABC    2011-07-31  11451.0        
07  ABC    2011-07-31  11451.0        
70  ABC    2011-10-31  20062.0   74495.0      
34  ABC    2011-10-31  20062.0   74495.0      
98  ABC    2011-10-31  20062.0   74495.0      
62  ABC    2012-01-31  42512.0   85234.0      
26  ABC    2012-01-31  42512.0   85234.0
90  ABC    2012-01-31  42512.0   85234.0
56  ABC    2012-01-31  42512.0   85234.0
24  ABC    2012-04-30  41799.0  115824.0      
92  ABC    2012-04-30  41799.0  115824.0      
60  ABC    2012-07-31  41874.0  146247.0      
28  ABC    2012-07-31  41874.0  146247.0
99  ABC    2012-07-31  41874.0  146247.0
69  ABC    2012-10-31  46783.0  172968.0

Based on similar questions, I tried the following:

fx = lambda x: x.rolling(4).sum()
df[id + '_ltm'] = df.groupby(['eID','perioddate'])[id].apply(fx)

Unfortunately, this gives me NaN. Am I missing something obvious?

1 answer:

Answer 0: (score: 1)

I don't think a groupby is needed here, unless I'm missing something. All you need is a rolling sum + merge:

v = df.set_index('perioddate')\
        .drop_duplicates()['123456'].rolling(4).sum().to_frame()

v

              123456
perioddate          
2011-01-31       NaN
2011-04-30       NaN
2011-07-31       NaN
2011-10-31   74495.0
2012-01-31   85234.0
2012-04-30  115824.0
2012-07-31  146247.0
2012-10-31  172968.0

df.merge(v, left_on='perioddate', right_index=True)

    eID perioddate  123456_x  123456_y
14  ABC 2011-01-31   31773.0       NaN
74  ABC 2011-01-31   31773.0       NaN
35  ABC 2011-01-31   31773.0       NaN
96  ABC 2011-01-31   31773.0       NaN
57  ABC 2011-04-30   11209.0       NaN
18  ABC 2011-04-30   11209.0       NaN
81  ABC 2011-07-31   11451.0       NaN
44  ABC 2011-07-31   11451.0       NaN
7   ABC 2011-07-31   11451.0       NaN
70  ABC 2011-10-31   20062.0   74495.0
34  ABC 2011-10-31   20062.0   74495.0
98  ABC 2011-10-31   20062.0   74495.0
62  ABC 2012-01-31   42512.0   85234.0
26  ABC 2012-01-31   42512.0   85234.0
90  ABC 2012-01-31   42512.0   85234.0
56  ABC 2012-01-31   42512.0   85234.0
24  ABC 2012-04-30   41799.0  115824.0
92  ABC 2012-04-30   41799.0  115824.0
60  ABC 2012-07-31   41874.0  146247.0
28  ABC 2012-07-31   41874.0  146247.0
99  ABC 2012-07-31   41874.0  146247.0
69  ABC 2012-10-31   46783.0  172968.0
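
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same rolling sum + merge idea. It is not the exact code above: it deduplicates on perioddate explicitly (assuming every duplicate row of a period carries the same value), renames the result to 123456_ltm so the merge does not produce _x/_y suffixes, and uses only a subset of the question's rows. It also assumes the periods are quarterly with no gaps, so a window of 4 rows covers 12 months.

```python
import pandas as pd

# Illustrative subset of the question's data, duplicates included.
df = pd.DataFrame({
    "eID": ["ABC"] * 9,
    "perioddate": pd.to_datetime([
        "2011-01-31", "2011-01-31", "2011-04-30", "2011-07-31", "2011-07-31",
        "2011-10-31", "2011-10-31", "2012-01-31", "2012-01-31",
    ]),
    "123456": [31773.0, 31773.0, 11209.0, 11451.0, 11451.0,
               20062.0, 20062.0, 42512.0, 42512.0],
})

col = "123456"

# Keep one row per period (assumption: duplicates within a period are identical),
# then take a 4-row rolling sum, i.e. four consecutive quarters.
ltm = (
    df.drop_duplicates(subset="perioddate")
      .set_index("perioddate")[col]
      .rolling(4)
      .sum()
      .rename(col + "_ltm")
      .to_frame()
)

# Broadcast the per-period result back onto every original row.
out = df.merge(ltm, left_on="perioddate", right_index=True, how="left")
print(out)
```

Rows before 2011-10-31 stay NaN because fewer than four periods are available, which matches the "at least one full year of history" requirement in the question.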

Edit: If you do need the groupby, you can move everything into a dfGroupBy.apply call:

v = df.set_index('perioddate').groupby('eID', group_keys=False)\
          .apply(lambda x: x.drop_duplicates()['123456'].rolling(4).sum()).T

v

eID              ABC
perioddate          
2011-01-31       NaN
2011-04-30       NaN
2011-07-31       NaN
2011-10-31   74495.0
2012-01-31   85234.0
2012-04-30  115824.0
2012-07-31  146247.0
2012-10-31  172968.0

The merge step stays the same.
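
With more than one entity, it may be simpler to keep the result in long form and merge on both keys. The following is only a sketch under assumptions: the data is made up for illustration (XYZ is a hypothetical second entity, and the values are not from the question), periods are quarterly without gaps, and duplicate rows within an (eID, perioddate) pair are identical.

```python
import pandas as pd

# Hypothetical two-entity frame; the last period of each entity is duplicated.
dates = ["2011-01-31", "2011-04-30", "2011-07-31", "2011-10-31", "2011-10-31"]
df = pd.DataFrame({
    "eID": ["ABC"] * 5 + ["XYZ"] * 5,
    "perioddate": pd.to_datetime(dates * 2),
    "123456": [1.0, 2.0, 3.0, 4.0, 4.0, 10.0, 20.0, 30.0, 40.0, 40.0],
})

col = "123456"

# One row per (entity, period), sorted so the rolling window follows time order.
per_period = (
    df.drop_duplicates(subset=["eID", "perioddate"])
      .sort_values(["eID", "perioddate"])
      .copy()
)

# 4-period rolling sum computed separately within each entity.
per_period[col + "_ltm"] = (
    per_period.groupby("eID")[col].transform(lambda s: s.rolling(4).sum())
)

# Merge on both keys so each entity only picks up its own LTM values.
out = df.merge(
    per_period[["eID", "perioddate", col + "_ltm"]],
    on=["eID", "perioddate"],
    how="left",
)
print(out)
```

Merging on both eID and perioddate matters here: merging on perioddate alone would attach one entity's sums to another entity's rows whenever they share a period.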