I have the following data:
(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
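Update: Following @user3483203's suggestion I have made some progress, but not enough. I added a new row to the sample data above to better illustrate the problem.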
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
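For my requirements, the row (SE0000191827, 2018-03-31) should have a calculated value, because it is followed by four consecutive rows that each have a value. Currently that row is removed by the dropna call, and I cannot figure out how to solve this.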
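One possible way around the dropna problem, sketched here under assumptions rather than offered as a confirmed solution: compute the four-value sum on the real (quarterly) rows first, resample to month ends afterwards, and then back-fill calc within each id so that a gap month inherits the sum of the next row that has data. The names df, desc and monthly are illustrative, val is assumed to be convertible to numbers, and an offset object is passed to resample in place of the 'M' alias used in the question.

```python
import pandas as pd

# Sample data from the question; 'val' is converted to numbers so it can be summed.
df1 = pd.DataFrame({
    'id': ['SE0000195570'] * 4 + ['SE0000191827'] * 5,
    'val': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
    'date': pd.to_datetime([
        '2014-10-23', '2014-07-16', '2014-04-29', '2014-01-31',
        '2018-10-19', '2018-07-11', '2018-04-20', '2018-02-16', '2018-12-29',
    ]),
})
df = df1.assign(val=pd.to_numeric(df1['val']))

# Step 1: rolling sum over the real rows only. Sorting dates in descending
# order turns the trailing window into "this row plus the next three in time".
desc = df.sort_values(['id', 'date'], ascending=[True, False])
df['calc'] = (
    desc.groupby('id')['val']
        .rolling(4)
        .sum()
        .reset_index(level=0, drop=True)  # drop the id level, keep the row labels
)

# Step 2: one row per id and month end, from the first to the last date of each id.
monthly = (
    df.set_index('date')
      .groupby('id')[['val', 'calc']]
      .resample(pd.offsets.MonthEnd())
      .first()
)

# Step 3: a gap month takes the sum of the next row with data inside the same id.
# Grouping by id keeps the fill from reaching into the following id.
monthly['calc'] = monthly.groupby(level='id')['calc'].bfill()

print(monthly)
```

Because the rolling sum only ever sees real rows, no NaN enters a window, and the group-wise back-fill should leave months without four following values as NaN. All three steps are vectorised pandas calls, which may matter at the row counts mentioned in the question, though this has not been benchmarked here.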
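Calculation: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows spanning the first to the last date of each id, and for each month calculate the sum of the four closest consecutive rows of the input data within that id. That is a mouthful. This led me to resample. See the expected output below. I need the data grouped by id and by monthly date.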
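Performance: The data I am testing on now is only for benchmarking, but the solution needs to be performant. I expect to run this on upwards of 100k unique ids, which may result in around 10 million rows (100k ids, dates going back 10 years, 10 years * 12 months = 120 months per id, 100k * 120 = 12 million rows).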
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
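This data looks very nice for my case, since it is neatly grouped by id and the dates are lined up by month. Here it seems I could use something like df['val'].rolling(4), make sure it skips NaN values, and put the result in a new column.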
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
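The first date in calc has a result of 26 because it adds that row's value to the next three values (8 + 7 + 6 + 5). The rest of that id is NaN since four values are not available.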
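Although it looks like the data is grouped by id and date, it seems it is actually grouped by date only. I am not sure how this works. I need the data grouped by id and date.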
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The resample above returns a DatetimeIndexResamplerGroupby, which has no rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What should I do? My guess is that my approach is wrong, but after scanning through the documentation I am not sure where to start.
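For context on the AttributeError: the object returned by groupby(...).resample(...) is a deferred resampler, not data. Applying an aggregation such as first() produces an ordinary Series or DataFrame indexed by (id, date), and rolling is available on that result. The sketch below only illustrates this; the names res, monthly_val and trailing are illustrative, val is assumed to be numeric, and the sums are labelled at the last month of each window instead of the first.

```python
import pandas as pd

# Sample data from the question, with 'val' as numbers.
df = pd.DataFrame({
    'id': ['SE0000195570'] * 4 + ['SE0000191827'] * 5,
    'val': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'date': pd.to_datetime([
        '2014-10-23', '2014-07-16', '2014-04-29', '2014-01-31',
        '2018-10-19', '2018-07-11', '2018-04-20', '2018-02-16', '2018-12-29',
    ]),
})

# Deferred resampler: nothing is computed until an aggregation is called.
res = df.set_index('date').groupby('id')['val'].resample(pd.offsets.MonthEnd())

# first() materialises it into a Series indexed by (id, date); empty months are NaN.
monthly_val = res.first()

# rolling() exists on the materialised result. dropna() makes each window span
# four real values; grouping by the id level keeps windows inside one id.
trailing = (
    monthly_val.dropna()
               .groupby(level='id')
               .rolling(4)
               .sum()
               .reset_index(level=0, drop=True)  # remove the duplicated id level
)

print(trailing)
```

This is a trailing window, so 26 appears on 2018-10-31 and 27 on 2018-12-31; placing the sums on the first month of each window would need an extra step, such as the shift used in the update above.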