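I often find myself needing to apply a list of filters to a pandas DataFrame. I apply each filter and do some calculation, but this usually results in slow code, and I would like to optimize the performance. I've created an example of the slow solution: it filters a DataFrame over a list of date ranges, computes the sum of a column for the rows that fall inside each date range, and then assigns that value to the row whose date matches the start of the range: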
import datetime

import numpy as np
import pandas as pd


def generateTestDataFrame(N=50, windowSizeInDays=5):
    # Build N consecutive daily rows starting today; each window ends windowSizeInDays later.
    dd = {"AsOfDate": [],
          "WindowEndDate": [],
          "X": []}
    d = datetime.date.today()
    for i in range(N):
        dd["AsOfDate"].append(d)
        dd["WindowEndDate"].append(d + datetime.timedelta(days=windowSizeInDays))
        dd["X"].append(float(i))
        d = d + datetime.timedelta(days=1)
    newDf = pd.DataFrame(dd)
    return newDf


def run():
    numRows = 50
    windowSizeInDays = 5
    print("NumRows: %s" % (numRows))
    print("WindowSizeInDays: %s" % (windowSizeInDays))
    df = generateTestDataFrame(numRows, windowSizeInDays)
    newAggColumnName = "SumOverNdays"
    df[newAggColumnName] = np.nan  # Initialize the column to nan
    for i in range(df.shape[0]):
        row_i = df.iloc[i]
        startDate = row_i["AsOfDate"]
        endDate = row_i["WindowEndDate"]
        # Sum X over rows with startDate <= AsOfDate < endDate (full scan per row)
        inWindow = (df["AsOfDate"] >= startDate) & (df["AsOfDate"] < endDate)
        sumAggOverNdays = df.loc[inWindow, "X"].sum()
        df.loc[df["AsOfDate"] == startDate, newAggColumnName] = sumAggOverNdays
    print(df.head(10))


if __name__ == "__main__":
    run()
This produces the following output:
NumRows: 50
WindowSizeInDays: 5
     AsOfDate WindowEndDate    X  SumOverNdays
0  2019-01-15    2019-01-20  0.0          10.0
1  2019-01-16    2019-01-21  1.0          15.0
2  2019-01-17    2019-01-22  2.0          20.0
3  2019-01-18    2019-01-23  3.0          25.0
4  2019-01-19    2019-01-24  4.0          30.0
5  2019-01-20    2019-01-25  5.0          35.0
6  2019-01-21    2019-01-26  6.0          40.0
7  2019-01-22    2019-01-27  7.0          45.0
8  2019-01-23    2019-01-28  8.0          50.0
9  2019-01-24    2019-01-29  9.0          55.0
Answer 0 (score: 1)
Try using pandas.DataFrame.apply() for the calculation.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
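For readers unfamiliar with it, here is a minimal sketch of what a row-wise apply() looks like, using a small made-up frame (the column names a, b and total are purely illustrative and not from the question). With axis=1 the function receives each row as a Series. Note that this is still a Python-level loop over rows, so it mainly tidies the code rather than vectorizing it.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the call shape
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=1 passes each row to the function as a Series indexed by column name
df["total"] = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(df)
```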
Using your code:
%%timeit
run()
205 ms ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Adapted version:
%%timeit
windowSizeInDays = 5
rows = 50
df_ = pd.DataFrame(index=range(rows),columns=['AsOfDate','WindowEndDate','X','SumOverNdays'])
asofdate = [datetime.date.today() + datetime.timedelta(days=i) for i in range(rows)]
windowenddate = [i + datetime.timedelta(days=windowSizeInDays) for i in asofdate]
df_['AsOfDate'] = asofdate
df_['WindowEndDate'] = windowenddate
df_['X'] = np.arange(float(df_.shape[0]))
df_['SumOverNdays'] = df_.apply(lambda x: df_.loc[ (df_["AsOfDate"] >= x['AsOfDate']) & (df_["AsOfDate"] < x['WindowEndDate']) ]["X"].sum(), axis=1)
df_
112 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not a big difference, but I don't think we can do much better in this particular example...
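As a point of comparison, and not something proposed in the answer above, one direction that avoids the per-row filter is a prefix sum combined with numpy.searchsorted. The sketch below assumes the frame is sorted by AsOfDate in ascending order; the helper name window_sums and the fixed start date are hypothetical choices for illustration. I have not timed it against the versions above.

```python
import datetime

import numpy as np
import pandas as pd


def window_sums(df):
    """Per row, sum X over rows with AsOfDate in [AsOfDate, WindowEndDate).

    Assumption: df is sorted by AsOfDate in ascending order.
    """
    dates = pd.to_datetime(df["AsOfDate"]).to_numpy()
    ends = pd.to_datetime(df["WindowEndDate"]).to_numpy()
    x = df["X"].to_numpy(dtype=float)
    # prefix[k] holds the sum of x[:k], so a range sum is a difference of two entries
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    lo = np.searchsorted(dates, dates, side="left")  # first row with AsOfDate >= start
    hi = np.searchsorted(dates, ends, side="left")   # first row with AsOfDate >= end
    return prefix[hi] - prefix[lo]


# Same shape as the question's test data, with a fixed start date for reproducibility
start = datetime.date(2019, 1, 15)
as_of = [start + datetime.timedelta(days=i) for i in range(50)]
demo = pd.DataFrame({
    "AsOfDate": as_of,
    "WindowEndDate": [d + datetime.timedelta(days=5) for d in as_of],
    "X": np.arange(50, dtype=float),
})
demo["SumOverNdays"] = window_sums(demo)
print(demo.head(10))
```

If the data is not already sorted, sorting by AsOfDate first would be needed for searchsorted to give meaningful positions.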