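I am trying to compute a rolling sum of values over a predefined time interval, e.g. one year (basically a rolling annual sum), on a MultiIndex. The data is laid out below; the index has two levels, rdate and pdate: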
                          Value  Desired Result
Reference  Published
2009-06-30 2009-07-31   745.000         745.000
           2009-08-13   745.000         745.000
2009-09-30 2009-10-30     0.000         745.000
2009-12-31 2010-02-05   496.000        1241.000
           2010-03-02   496.000        1241.000
2010-03-31 2010-04-30    80.000        1321.000
2010-06-30 2010-07-30    30.000         606.000
2010-09-30 2010-11-03  -110.000         496.000
           2010-11-07   437.000        1043.000
2010-12-31 2011-02-04   440.000        1483.000
2011-03-31 2011-05-05  1031.000        1938.000
2011-06-30 2011-07-29    53.000        1961.000
2011-09-30 2011-11-04     2.000        1526.000
2011-12-31 2012-02-03  -191.000         895.000
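To make the example reproducible, the table can be rebuilt roughly as follows. This is a sketch under assumptions: I take test_df to be a Series with the (rdate, pdate) MultiIndex (sum_func below calls to_frame() on its input), and the name `rows` is only for illustration.

```python
import pandas as pd

# (rdate, pdate, Value) triples copied from the table above
rows = [
    ("2009-06-30", "2009-07-31", 745.0),
    ("2009-06-30", "2009-08-13", 745.0),
    ("2009-09-30", "2009-10-30", 0.0),
    ("2009-12-31", "2010-02-05", 496.0),
    ("2009-12-31", "2010-03-02", 496.0),
    ("2010-03-31", "2010-04-30", 80.0),
    ("2010-06-30", "2010-07-30", 30.0),
    ("2010-09-30", "2010-11-03", -110.0),
    ("2010-09-30", "2010-11-07", 437.0),
    ("2010-12-31", "2011-02-04", 440.0),
    ("2011-03-31", "2011-05-05", 1031.0),
    ("2011-06-30", "2011-07-29", 53.0),
    ("2011-09-30", "2011-11-04", 2.0),
    ("2011-12-31", "2012-02-03", -191.0),
]

# Two-level index: reference date (rdate) and publication date (pdate)
index = pd.MultiIndex.from_tuples(
    [(pd.Timestamp(r), pd.Timestamp(p)) for r, p, _ in rows],
    names=["rdate", "pdate"],
)

# Assumption: test_df is a Series, not a DataFrame
test_df = pd.Series([v for _, _, v in rows], index=index, name="Value")
```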
I'm new to pandas, so I may be missing some simple function or optimization, but here is my solution:
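import pandas as pd

def sum_func(df):
    tmp = []
    df = df.to_frame()
    for ((rdate, pdate), value) in df.itertuples():
        # window start: one year and one day before the reference date
        t0 = rdate - pd.DateOffset(years=1, days=1)
        # last published row per rdate inside the window, then summed
        tmp += [df.loc(axis=0)[t0:rdate, t0:pdate].groupby(level=0, axis=0).tail(1).sum(skipna=False)]
    return pd.DataFrame(tmp, index=df.index)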
However, it seems quite slow:
%timeit sum_func(test_df)
10 loops, best of 3: 49.2 ms per loop
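I need to run this thousands of times on much larger datasets, so I need a fast solution. I already have numexpr and bottleneck installed. Indexing and grouping take 92.5% of the time, so I guess that is where the main optimization has to happen: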
%lprun -f sum_func sum_func(test_df)
Timer unit: 3.42123e-07 s
Total time: 0.0825231 s
File: <ipython-input-38-b6071946e5e0>
Function: sum_func at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def sum_func(df):
     2         1            5      5.0      0.0      tmp = []
     3         1         1884   1884.0      0.8      df = df.to_frame()
     4        21          965     46.0      0.4      for ((rdate, pdate),value) in df.itertuples():
     5        20        10867    543.4      4.5          t0 = rdate - pd.DateOffset(years=1,days=1)
     6        20       223104  11155.2     92.5          tmp += [df.loc(axis=0)[t0:rdate,t0:pdate].groupby(level=0,axis=0).tail(1).sum(skipna=False)]
     7         1         4384   4384.0      1.8      return pd.DataFrame(tmp, index=df.index)
Edit:
I managed to compress everything into a single expression, but it is still not much faster:
from pandas.tseries.offsets import QuarterBegin

def stm(ddf):
    # for each row, t0 is the pdate; slice the pdate level back four quarter starts
    return pd.DataFrame(
        [ddf.loc(axis=0)[:, (t0 - QuarterBegin(4)):t0].groupby(level=0, axis=0).tail(1).sum(skipna=False)
         for _, t0 in ddf.index],
        index=ddf.index)
That cut the run time by almost 25%:
%timeit stm(test_df)
10 loops, best of 3: 37.6 ms per loop
Maybe there is a faster way to index in pandas? Any help would be appreciated. Thanks
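One direction that might sidestep the label-based slicing and groupby flagged in the profile is to do the window selection on plain NumPy arrays. The following is only a sketch and has not been timed. `rolling_latest_sum` is a hypothetical helper, and the window `(rdate - 1 year, rdate]` with an exclusive start is my assumption; it is not the same boundary as `t0` in sum_func.

```python
import numpy as np
import pandas as pd


def rolling_latest_sum(values: pd.Series) -> pd.Series:
    """Hypothetical helper: trailing one-year sum of the latest published value per rdate."""
    values = values.sort_index()  # the grouping step relies on (rdate, pdate) order
    rdate = values.index.get_level_values(0)
    r = rdate.to_numpy()
    p = values.index.get_level_values(1).to_numpy()
    v = values.to_numpy(dtype=float)
    # Assumed window: (rdate - 1 year, rdate], i.e. the start is exclusive
    start = (rdate - pd.DateOffset(years=1)).to_numpy()

    out = np.empty(len(v))
    for i in range(len(v)):
        # rows inside the window that were already published at pdate[i]
        idx = np.flatnonzero((r > start[i]) & (r <= r[i]) & (p <= p[i]))
        rr = r[idx]
        # keep the last (latest pdate) row of each rdate group
        is_last = np.append(rr[1:] != rr[:-1], True)
        out[i] = v[idx][is_last].sum()  # NaN propagates, like sum(skipna=False)
    return pd.Series(out, index=values.index)


# First nine rows of the table above, as (rdate, pdate, Value)
rows = [
    ("2009-06-30", "2009-07-31", 745.0),
    ("2009-06-30", "2009-08-13", 745.0),
    ("2009-09-30", "2009-10-30", 0.0),
    ("2009-12-31", "2010-02-05", 496.0),
    ("2009-12-31", "2010-03-02", 496.0),
    ("2010-03-31", "2010-04-30", 80.0),
    ("2010-06-30", "2010-07-30", 30.0),
    ("2010-09-30", "2010-11-03", -110.0),
    ("2010-09-30", "2010-11-07", 437.0),
]
rd, pdt, vals = (list(col) for col in zip(*rows))
sample = pd.Series(
    vals,
    index=pd.MultiIndex.from_arrays(
        [pd.to_datetime(rd), pd.to_datetime(pdt)], names=["rdate", "pdate"]
    ),
    name="Value",
)
result = rolling_latest_sum(sample)
```

On these nine rows the result lines up with the Desired Result column. The Python loop is still there, but each iteration only does boolean masking on arrays, so whether it actually beats the timings above would need to be measured.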