I'm trying to compute a rolling sum over the last twelve months of a numeric column, grouped by entity ID. My data looks like this:
eID perioddate 123456
14 ABC 2011-01-31 31773.0
74 ABC 2011-01-31 31773.0
35 ABC 2011-01-31 31773.0
96 ABC 2011-01-31 31773.0
57 ABC 2011-04-30 11209.0
18 ABC 2011-04-30 11209.0
81 ABC 2011-07-31 11451.0
44 ABC 2011-07-31 11451.0
07 ABC 2011-07-31 11451.0
70 ABC 2011-10-31 20062.0
34 ABC 2011-10-31 20062.0
98 ABC 2011-10-31 20062.0
62 ABC 2012-01-31 42512.0
26 ABC 2012-01-31 42512.0
90 ABC 2012-01-31 42512.0
56 ABC 2012-01-31 42512.0
24 ABC 2012-04-30 41799.0
92 ABC 2012-04-30 41799.0
60 ABC 2012-07-31 41874.0
28 ABC 2012-07-31 41874.0
99 ABC 2012-07-31 41874.0
69 ABC 2012-10-31 46783.0
I want every row to carry the rolling sum, provided there is at least a full year of history, so the new column would look like this:
eID perioddate 123456 123456_ltm
14 ABC 2011-01-31 31773.0
74 ABC 2011-01-31 31773.0
35 ABC 2011-01-31 31773.0
96 ABC 2011-01-31 31773.0
57 ABC 2011-04-30 11209.0
18 ABC 2011-04-30 11209.0
81 ABC 2011-07-31 11451.0
44 ABC 2011-07-31 11451.0
07 ABC 2011-07-31 11451.0
70 ABC 2011-10-31 20062.0 74495.0
34 ABC 2011-10-31 20062.0 74495.0
98 ABC 2011-10-31 20062.0 74495.0
62 ABC 2012-01-31 42512.0 85234.0
26 ABC 2012-01-31 42512.0 85234.0
90 ABC 2012-01-31 42512.0 85234.0
56 ABC 2012-01-31 42512.0 85234.0
24 ABC 2012-04-30 41799.0 115824.0
92 ABC 2012-04-30 41799.0 115824.0
60 ABC 2012-07-31 41874.0 146247.0
28 ABC 2012-07-31 41874.0 146247.0
99 ABC 2012-07-31 41874.0 146247.0
69 ABC 2012-10-31 46783.0 172968.0
Based on similar questions, I tried the following:
fx = lambda x: x.rolling(4).sum()
df[id + '_ltm'] = df.groupby(['eID','perioddate'])[id].apply(fx)
Unfortunately, this gives me NaN. Am I missing something obvious?
Answer 0 (score: 1)
I don't think a groupby is needed here, unless I'm missing something. All you need is rolling sum + merge.
v = df.set_index('perioddate')\
.drop_duplicates()['123456'].rolling(4).sum().to_frame()
v
123456
perioddate
2011-01-31 NaN
2011-04-30 NaN
2011-07-31 NaN
2011-10-31 74495.0
2012-01-31 85234.0
2012-04-30 115824.0
2012-07-31 146247.0
2012-10-31 172968.0
df.merge(v, left_on='perioddate', right_index=True)
eID perioddate 123456_x 123456_y
14 ABC 2011-01-31 31773.0 NaN
74 ABC 2011-01-31 31773.0 NaN
35 ABC 2011-01-31 31773.0 NaN
96 ABC 2011-01-31 31773.0 NaN
57 ABC 2011-04-30 11209.0 NaN
18 ABC 2011-04-30 11209.0 NaN
81 ABC 2011-07-31 11451.0 NaN
44 ABC 2011-07-31 11451.0 NaN
7 ABC 2011-07-31 11451.0 NaN
70 ABC 2011-10-31 20062.0 74495.0
34 ABC 2011-10-31 20062.0 74495.0
98 ABC 2011-10-31 20062.0 74495.0
62 ABC 2012-01-31 42512.0 85234.0
26 ABC 2012-01-31 42512.0 85234.0
90 ABC 2012-01-31 42512.0 85234.0
56 ABC 2012-01-31 42512.0 85234.0
24 ABC 2012-04-30 41799.0 115824.0
92 ABC 2012-04-30 41799.0 115824.0
60 ABC 2012-07-31 41874.0 146247.0
28 ABC 2012-07-31 41874.0 146247.0
99 ABC 2012-07-31 41874.0 146247.0
69 ABC 2012-10-31 46783.0 172968.0
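For reference, here is a self-contained sketch of the same rolling sum + merge idea that writes the result straight into a column named 123456_ltm, as the question asks. The sample frame is a shortened copy of the question's data. The sketch assumes one value per period and strictly quarterly periods, so a window of 4 rows spans twelve months. It also deduplicates on perioddate explicitly rather than relying on the values differing between periods.

```python
import pandas as pd

# Shortened sample in the shape of the question's data (values copied from it).
df = pd.DataFrame({
    "eID": ["ABC"] * 10,
    "perioddate": pd.to_datetime([
        "2011-01-31", "2011-01-31", "2011-04-30", "2011-07-31", "2011-07-31",
        "2011-10-31", "2012-01-31", "2012-01-31", "2012-04-30", "2012-07-31",
    ]),
    "123456": [31773.0, 31773.0, 11209.0, 11451.0, 11451.0,
               20062.0, 42512.0, 42512.0, 41799.0, 41874.0],
})
col = "123456"

# Keep one row per period, in date order, so 4 consecutive rows are 4 quarters.
per_period = (
    df.drop_duplicates(subset="perioddate")
      .sort_values("perioddate")
      .set_index("perioddate")[col]
)

# rolling(4) stays NaN until four periods of history are available.
ltm = per_period.rolling(4).sum().rename(col + "_ltm").reset_index()

# Attach the per-period result to every original row.
out = df.merge(ltm, on="perioddate", how="left")
print(out)
```

Naming the rolling result before merging avoids the 123456_x / 123456_y suffixes shown above.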
Edit: if you do need the groupby, you can move everything into a dfGroupBy.apply call:
v = df.set_index('perioddate').groupby('eID', group_keys=False)\
.apply(lambda x: x.drop_duplicates()['123456'].rolling(4).sum()).T
v
eID ABC
perioddate
2011-01-31 NaN
2011-04-30 NaN
2011-07-31 NaN
2011-10-31 74495.0
2012-01-31 85234.0
2012-04-30 115824.0
2012-07-31 146247.0
2012-10-31 172968.0
The merge step stays the same.
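With more than one entity, the merge probably has to match on both eID and perioddate, since two entities can share a period. The sketch below is one possible way to do that, not the answer's exact code: it swaps apply for groupby(...).transform so the result stays aligned with the deduplicated rows. The second entity XYZ and its values are made up for illustration, and quarterly periods are again assumed.

```python
import pandas as pd

# "ABC" values come from the question; "XYZ" is a hypothetical second entity.
df = pd.DataFrame({
    "eID": ["ABC"] * 6 + ["XYZ"] * 5,
    "perioddate": pd.to_datetime([
        "2011-01-31", "2011-01-31", "2011-04-30",
        "2011-07-31", "2011-10-31", "2012-01-31",
        "2011-01-31", "2011-04-30", "2011-07-31",
        "2011-10-31", "2011-10-31",
    ]),
    "123456": [31773.0, 31773.0, 11209.0, 11451.0, 20062.0, 42512.0,
               1.0, 2.0, 3.0, 4.0, 4.0],
})
col = "123456"
keys = ["eID", "perioddate"]

# One row per (entity, period), ordered by date within each entity.
per_period = df.drop_duplicates(subset=keys).sort_values(keys).copy()

# Rolling 4-period sum computed separately for each entity.
per_period[col + "_ltm"] = (
    per_period.groupby("eID")[col].transform(lambda s: s.rolling(4).sum())
)

# Merge on both keys so entities that share a period do not get mixed up.
out = df.merge(per_period[keys + [col + "_ltm"]], on=keys, how="left")
print(out)
```

Each entity starts its own window, so the first three periods of every eID come out as NaN.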