                                   id  val        date
id           date
SE0000191827 2018-02-28  SE0000191827    8  2018-02-16
             2018-03-31           NaN  NaN         NaT
             2018-04-30  SE0000191827    7  2018-04-20
             2018-05-31           NaN  NaN         NaT
             2018-06-30           NaN  NaN         NaT
             2018-07-31  SE0000191827    6  2018-07-11
             2018-08-31           NaN  NaN         NaT
             2018-09-30           NaN  NaN         NaT
             2018-10-31  SE0000191827    5  2018-10-19
             2018-11-30           NaN  NaN         NaT
             2018-12-31  SE0000191827    9  2018-12-29
SE0000195570 2014-01-31  SE0000195570    4  2014-01-31
             2014-02-28           NaN  NaN         NaT
             2014-03-31           NaN  NaN         NaT
             2014-04-30  SE0000195570    3  2014-04-29
             2014-05-31           NaN  NaN         NaT
             2014-06-30           NaN  NaN         NaT
             2014-07-31  SE0000195570    2  2014-07-16
             2014-08-31           NaN  NaN         NaT
             2014-09-30           NaN  NaN         NaT
             2014-10-31  SE0000195570    1  2014-10-23
(For convenience, this data can be created from the following pastebin: https://pastebin.com/wMU3esEh)
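If the pastebin is unavailable, the frame can also be rebuilt from the printed table. The following is only a sketch: the variable names obs, spans and full_index are made up for illustration, and it assumes that the date index level holds month-end timestamps and that val is stored as float, neither of which the printed table confirms.

```python
import pandas as pd

# Non-NaN observations copied from the table: (id, index date, val, date column)
obs = [
    ("SE0000191827", "2018-02-28", 8, "2018-02-16"),
    ("SE0000191827", "2018-04-30", 7, "2018-04-20"),
    ("SE0000191827", "2018-07-31", 6, "2018-07-11"),
    ("SE0000191827", "2018-10-31", 5, "2018-10-19"),
    ("SE0000191827", "2018-12-31", 9, "2018-12-29"),
    ("SE0000195570", "2014-01-31", 4, "2014-01-31"),
    ("SE0000195570", "2014-04-30", 3, "2014-04-29"),
    ("SE0000195570", "2014-07-31", 2, "2014-07-16"),
    ("SE0000195570", "2014-10-31", 1, "2014-10-23"),
]
# First and last index date per id, read off the table
spans = {
    "SE0000191827": ("2018-02-28", "2018-12-31"),
    "SE0000195570": ("2014-01-31", "2014-10-31"),
}

# Full month-end index per id, so that the all-NaN rows exist as well
full_index = pd.MultiIndex.from_tuples(
    [(key, day)
     for key, (start, end) in spans.items()
     for day in pd.date_range(start, end, freq=pd.offsets.MonthEnd())],
    names=["id", "date"],
)

data = pd.DataFrame(
    {
        "id": [o[0] for o in obs],
        "val": [float(o[2]) for o in obs],
        "date": pd.to_datetime([o[3] for o in obs]),
    },
    index=pd.MultiIndex.from_tuples(
        [(o[0], pd.Timestamp(o[1])) for o in obs], names=["id", "date"]
    ),
)

# Reindexing onto the full index inserts the NaN / NaT rows
df2 = data.reindex(full_index)
```

The reindex step is what produces the empty rows, so the observed values stay in one short list and the gaps never have to be typed out by hand.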
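I want to apply a rolling function with a window of 4 to the val column, but counting only the rows where val is not NaN. I cannot simply use dropna, because the rows with NaN must also receive a value in the new column. The output I expect is shown below.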
                                   id  val        date  calc
id           date
SE0000191827 2018-02-28  SE0000191827    8  2018-02-16  26.0
             2018-03-31           NaN  NaN         NaT  27.0
             2018-04-30  SE0000191827    7  2018-04-20  27.0
             2018-05-31           NaN  NaN         NaT   NaN
             2018-06-30           NaN  NaN         NaT   NaN
             2018-07-31  SE0000191827    6  2018-07-11   NaN
             2018-08-31           NaN  NaN         NaT   NaN
             2018-09-30           NaN  NaN         NaT   NaN
             2018-10-31  SE0000191827    5  2018-10-19   NaN
             2018-11-30           NaN  NaN         NaT   NaN
             2018-12-31  SE0000191827    9  2018-12-29   NaN
SE0000195570 2014-01-31  SE0000195570    4  2014-01-31  10.0
             2014-02-28           NaN  NaN         NaT   NaN
             2014-03-31           NaN  NaN         NaT   NaN
             2014-04-30  SE0000195570    3  2014-04-29   NaN
             2014-05-31           NaN  NaN         NaT   NaN
             2014-06-30           NaN  NaN         NaT   NaN
             2014-07-31  SE0000195570    2  2014-07-16   NaN
             2014-08-31           NaN  NaN         NaT   NaN
             2014-09-30           NaN  NaN         NaT   NaN
             2014-10-31  SE0000195570    1  2014-10-23   NaN
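Note that the row (SE0000191827, 2018-03-31) should also get the value 27.0. The reason is that there are four val values below that row (7, 6, 5 and 9), so I want it to be counted.
Here is one attempt: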
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
                                   id  val        date  calc
id           date
SE0000191827 2018-02-28  SE0000191827    8  2018-02-16  26.0
             2018-03-31           NaN  NaN         NaT   NaN
             2018-04-30  SE0000191827    7  2018-04-20  27.0
             2018-05-31           NaN  NaN         NaT   NaN
             2018-06-30           NaN  NaN         NaT   NaN
             2018-07-31  SE0000191827    6  2018-07-11   NaN
             2018-08-31           NaN  NaN         NaT   NaN
             2018-09-30           NaN  NaN         NaT   NaN
             2018-10-31  SE0000191827    5  2018-10-19   NaN
             2018-11-30           NaN  NaN         NaT   NaN
             2018-12-31  SE0000191827    9  2018-12-29   NaN
SE0000195570 2014-01-31  SE0000195570    4  2014-01-31  10.0
             2014-02-28           NaN  NaN         NaT   NaN
             2014-03-31           NaN  NaN         NaT   NaN
             2014-04-30  SE0000195570    3  2014-04-29   NaN
             2014-05-31           NaN  NaN         NaT   NaN
             2014-06-30           NaN  NaN         NaT   NaN
             2014-07-31  SE0000195570    2  2014-07-16   NaN
             2014-08-31           NaN  NaN         NaT   NaN
             2014-09-30           NaN  NaN         NaT   NaN
             2014-10-31  SE0000195570    1  2014-10-23   NaN
However, the row (SE0000191827, 2018-03-31) gets no value, because it is removed by dropna.
As far as I know, there is no way to make rolling skip rows that contain NaN. Any help?
Answer 0 (score: 1)
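I suggest using groupby (after dropping the nulls first), then df.reindex(index= <#put original index here>) to bring the original time steps back into the index, and df.fillna on the computed result. The dates that have no value in calc can then be filled as FOCB (first observation carried backward). In pandas terms, these fills are ffill and bfill.
(Basically, add .reindex(df2.index).groupby(level=0).bfill() to the end of the expression in your assign call.)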
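Spelled out, the suggestion above could look roughly like the sketch below. The toy frame is an assumption: it keeps only the val column from the question, and the names vals, starts and frames are made up for illustration. The chain itself is the asker's own attempt with the reindex and the per-group bfill appended.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df2: same ids, month-end dates and val pattern as in the question
vals = {
    "SE0000191827": [8, np.nan, 7, np.nan, np.nan, 6, np.nan, np.nan, 5, np.nan, 9],
    "SE0000195570": [4, np.nan, np.nan, 3, np.nan, np.nan, 2, np.nan, np.nan, 1],
}
starts = {"SE0000191827": "2018-02-28", "SE0000195570": "2014-01-31"}

frames = []
for key, v in vals.items():
    dates = pd.date_range(starts[key], periods=len(v), freq=pd.offsets.MonthEnd())
    idx = pd.MultiIndex.from_product([[key], dates], names=["id", "date"])
    frames.append(pd.DataFrame({"val": v}, index=idx))
df2 = pd.concat(frames)

calc = (
    df2["val"].dropna()
    .groupby(level=0).rolling(4).sum()  # rolling sum over the non-NaN rows only
    .shift(-3)                          # move each sum to the first row of its window
    .reset_index(0, drop=True)          # drop the group-key level added by groupby
    .reindex(df2.index)                 # bring the NaN rows back
    .groupby(level=0).bfill()           # fill each gap from the next value in its group
)
df2 = df2.assign(calc=calc)
```

Doing the bfill per group keeps a value from one id from leaking backward into the trailing rows of the previous id. Note that shift(-3) runs on the concatenated result rather than per group; that is harmless here only because the first three rolling sums of every group are NaN.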
Answer 1 (score: 1)
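You can try a variant that builds the series for each group (using apply) and simply uses bfill on that series to fill the relevant NaN values. It gives the expected result:

    import pandas as pd

    def process(sub):
        # Start from an all-NaN series aligned with this group's rows
        calc = pd.Series(index=sub.index, dtype=float)
        # Forward-looking sum of 4 non-NaN values, written back to the non-NaN rows
        calc.loc[~sub.val.isna()] = sub['val'].dropna().rolling(4).sum().shift(-3)
        return calc.bfill()  # NaN rows take the next result within the group

    df2['calc'] = df2.groupby(level=0).apply(process).reset_index(level=0, drop=True)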