I am trying to write a function that computes a sum/mean over a rolling window keyed on a specific index level.
My data looks like this:
Date (L0)    Date (L1)    Value  4-Period-L0-Sum
12/31/2011   1/25/2012    1321
3/31/2012    4/25/2012    1457
6/30/2012    7/25/2012    2056
9/30/2012    10/26/2012   3461   8295
12/31/2012   1/24/2013    2317   9291
3/31/2013    4/24/2013    2008   9842
6/30/2013    7/24/2013    1885   9671
6/30/2013    7/27/2013    1600   9386
9/30/2013    10/29/2013   1955   7880
9/30/2013    11/1/2013    1400   7325
12/31/2013   1/28/2014    1985   6993
12/31/2013   1/30/2014    1985   6993
3/31/2014    4/24/2014    1382   6367
3/31/2014    4/25/2014    1200   6185
6/30/2014    7/23/2014    2378   6963
9/30/2014    10/21/2014   3826   9389
3/31/2015    4/28/2015    2369   9773
3/31/2015    4/30/2015    2369   9773
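I am trying to produce something like pd.rolling_sum(dataframe, window=4), except that the window should roll over the level=0 index (Date (L0)) and use the last value from each of the previous level=0 index entries. For example, the rolling sum for the period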
[3/31/2014 4/24/2014] = 1382 + 1985 + 1400 + 1600
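To make the rule concrete, here is a minimal brute-force sketch in plain Python. The names `rows` and `window_sum_at` are made up for illustration, and it assumes rows are already sorted so that entries for the same period sit next to each other. It takes the current row's value and adds the last value of each of the three preceding periods.

```python
# (period end, report date, value): the rows from 6/30/2013 to 3/31/2014.
rows = [
    ("6/30/2013", "7/24/2013", 1885),
    ("6/30/2013", "7/27/2013", 1600),
    ("9/30/2013", "10/29/2013", 1955),
    ("9/30/2013", "11/1/2013", 1400),
    ("12/31/2013", "1/28/2014", 1985),
    ("12/31/2013", "1/30/2014", 1985),
    ("3/31/2014", "4/24/2014", 1382),
    ("3/31/2014", "4/25/2014", 1200),
]

def window_sum_at(rows, i, window=4):
    """Value of row i plus the last value of each of the window-1 prior periods."""
    current_period = rows[i][0]
    last_by_period = {}  # period -> last value seen for it before row i
    for period, _, value in rows[:i]:
        if period != current_period:
            last_by_period[period] = value  # later rows overwrite earlier ones
    previous = list(last_by_period.values())
    needed = window - 1
    if len(previous) < needed:
        return None  # not enough earlier periods to fill the window
    return rows[i][2] + sum(previous[len(previous) - needed:])

print(window_sum_at(rows, 6))  # row [3/31/2014 4/24/2014]
```

This is only meant to pin down the expected result for a single row; it is quadratic if run for every row, so it is not a replacement for a vectorized solution.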
My solution is to use an expanding window, group by level 0, then take the tail and sum:
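import numpy as np
import pandas as pd

def custom_sum(datadf, period):
    # Prefix lengths 0..len(datadf); only those >= period are used below.
    idx_range = np.arange(len(datadf) + 1)
    tmpdf = pd.concat(
        map(lambda i:
            # Expanding window: first i rows, last row per level-0 group,
            # then the last `period` of those rows, summed.
            pd.DataFrame(
                datadf.iloc[:i]
                .groupby(level=0).tail(1).tail(period)
                .sum(skipna=False)
            ).T,
            idx_range[period:]))
    tmpdf.index = datadf.index[period-1:]
    return tmpdf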
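It runs very slowly, though. I am sure there must be a better way.
One option might be pd.expanding_apply(), but it does not pass the DataFrame itself to the applied function, so there is no way to get the right groupby index.
Thanks!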
Answer 0 (score: 1)
You can use groupby, as follows:
import pandas as pd
from io import StringIO

text = """DateL1 DateL2 Value Sum
12/31/2011 1/25/2012 1321
3/31/2012 4/25/2012 1457
6/30/2012 7/25/2012 2056
9/30/2012 10/26/2012 3461 8295
12/31/2012 1/24/2013 2317 9291
3/31/2013 4/24/2013 2008 9842
6/30/2013 7/24/2013 1885 9671
6/30/2013 7/27/2013 1600 9386
9/30/2013 10/29/2013 1955 7880
9/30/2013 11/1/2013 1400 7325
12/31/2013 1/28/2014 1985 6993
12/31/2013 1/30/2014 1985 6993
3/31/2014 4/24/2014 1382 6367
3/31/2014 4/25/2014 1200 6185
6/30/2014 7/23/2014 2378 6963
9/30/2014 10/21/2014 3826 9389
3/31/2015 4/28/2015 2369 9773
3/31/2015 4/30/2015 2369 9773"""

# Rows without a Sum value are padded with NaN by the parser.
df = pd.read_csv(StringIO(text), sep=r"\s+", parse_dates=[0], index_col=0)

# Rolling 4-period sum over the last Value reported for each DateL1.
# (pd.rolling_sum was removed from pandas; .rolling(4).sum() replaces it.)
s1 = df.groupby(df.index, sort=False).Value.last().rolling(4).sum()

def f(s):
    # Difference between each row and the last row of its group.
    return s - s.iat[-1]

# Per-row correction: swap the group's last value for the row's own value.
s2 = df.groupby(df.index, sort=False).Value.transform(f).fillna(0)
print(s1 + s2)
Here is the output:
DateL1
2011-12-31 NaN
2012-03-31 NaN
2012-06-30 NaN
2012-09-30 8295
2012-12-31 9291
2013-03-31 9842
2013-06-30 9671
2013-06-30 9386
2013-09-30 7880
2013-09-30 7325
2013-12-31 6993
2013-12-31 6993
2014-03-31 6367
2014-03-31 6185
2014-06-30 6963
2014-09-30 9389
2015-03-31 9773
2015-03-31 9773
dtype: float64
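Since the question asks for both a sum and a mean, the same idea can be wrapped in a small helper. This is only a sketch: the function name `rolling_by_period` and its `how` parameter are hypothetical, and it assumes the Series is indexed by the period date (duplicates allowed) with rows for the same period adjacent and in reporting order. It computes the rolling sum over each period's last value, broadcasts that back to every row, and then swaps the period's last value for the row's own value.

```python
import pandas as pd

def rolling_by_period(values, window=4, how="sum"):
    """Rolling sum or mean over periods, using each earlier period's last value."""
    grouped = values.groupby(level=0, sort=False)
    # One value per period (the last reported one), then a plain rolling sum.
    base = grouped.last().rolling(window).sum()
    # Per-row correction: the row's own value minus its period's last value.
    adjust = values - grouped.transform("last")
    # base has a unique index, so it can be reindexed onto the duplicated one.
    total = adjust + base.reindex(values.index).to_numpy()
    return total / window if how == "mean" else total

# Rows from 6/30/2013 onward, keyed by period end date.
sample = pd.Series(
    [1885, 1600, 1955, 1400, 1985, 1985, 1382, 1200, 2378, 3826, 2369, 2369],
    index=["2013-06-30", "2013-06-30", "2013-09-30", "2013-09-30",
           "2013-12-31", "2013-12-31", "2014-03-31", "2014-03-31",
           "2014-06-30", "2014-09-30", "2015-03-31", "2015-03-31"],
)
result = rolling_by_period(sample, window=4)
print(result)
```

Because this sample starts at 6/30/2013, the first six rows come out as NaN (fewer than four periods of history); from 3/31/2014 onward the values should agree with the Sum column in the question. Passing how="mean" simply divides each total by the window size.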