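I have working code that produces the results I need, but I am currently using an algorithm that iterates over pandas arrays. This is obviously slower than a pure pandas DataFrame computation. I would like to know how to speed up the calculation using pandas functions.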
import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day + 0.001) / 10000
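This is basically a pandas DataFrame with MTD figures. It is purely so that we have some data to work with.
What I need is a new DataFrame that has the start (investment) dates as columns - I populate them with a few beginning-of-month values. The index is all possible dates, and the values should be the YTD figures. I use this DataFrame as a lookup/cache keyed by investment date.
Pseudocode
YTD = (1 + last MTD figure of month 1) * (1 + last MTD figure of month 2) * ... - 1, taken over all months up to the required date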
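To make the pseudocode concrete, here is a minimal sketch in plain Python. The function name `ytd_return` and the return figures are made up for illustration; they are not part of the code in this question.

```python
from math import prod


def ytd_return(month_end_mtd, current_mtd):
    """Compound the final MTD return of each completed month with the
    current month's MTD return, then subtract 1 to get the YTD return."""
    growth = prod(1.0 + r for r in month_end_mtd) * (1.0 + current_mtd)
    return growth - 1.0


# Hypothetical figures: two completed months at 1% and 2%,
# and 0.5% so far in the current month.
example = ytd_return([0.01, 0.02], 0.005)
print(round(example, 6))  # 1.01 * 1.02 * 1.005 - 1 = 0.035351
```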
def calculate_YTD(df):  # slow: takes 3.5 s on my machine
    YTD_df = pd.DataFrame(index=df.index)
    for investment_date in [datetime.datetime(2014, x + 1, 1) for x in range(12)]:
        YTD_df[investment_date] = 1.0  # pre-populate with dummy floats
        for date in df.index:  # iterate over all dates in the period
            h = (df[investment_date:date].groupby('Month')['MTD'].max().fillna(0) + 1).product() - 1
            YTD_df.loc[date, investment_date] = h  # .loc avoids chained assignment
    return YTD_df
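I have hard-coded the list of investment dates to simplify the problem statement. On my machine this code takes 2.5 to 3.5 seconds. Any suggestions on how to speed it up?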
Answer 0 (score: 1)
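Here is an approach that should be considerably faster. There is quite possibly something faster or cleaner, but this should be an improvement.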
# assuming a fixed number of investment dates, build a list
investment_dates = pd.date_range('2014-1-1', periods=12, freq='MS')
# build a table, by month, which contains the cumulative MTD
# return for each investment date. Still have to loop over the investment dates,
# but don't need to loop over each daily value
running_mtd = []
for date in investment_dates:
    curr_mo = (df[df.index >= date].groupby('Month')['MTD'].last() + 1.).cumprod()
    curr_mo.name = date
    running_mtd.append(curr_mo)
running_mtd_df = pd.concat(running_mtd, axis=1)
# shift by one month so each row holds the product of the months before it
running_mtd_df = running_mtd_df.shift(1).fillna(1.)
# merge running MTD returns with the base dataframe
df = df.merge(running_mtd_df, left_on='Month', right_index=True)
# calculate the YTD return for each column / day by multiplying the running
# monthly return with the current MTD value
for date in investment_dates:
    df[date] = np.where(df.index < date, np.nan, df[date] * (1. + df['MTD']) - 1.)
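Since every investment date here falls on the first of a month, the remaining loop over investment dates could probably be dropped as well. The sketch below is a different technique from the merge-based code above: it compounds growth once from 1 January and divides by the growth accumulated before each investment month. It assumes investment dates are always month starts within the same year, and the names `cum_before`, `day_growth`, `start_growth` and `ytd` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day + 0.001) / 10000

investment_dates = pd.date_range('2014-01-01', periods=12, freq='MS')

# Growth factor of each completed month: 1 + its last MTD value.
month_end_growth = 1.0 + df.groupby('Month')['MTD'].last()
# cum_before[m] = product of the growth factors of all months before m.
cum_before = month_end_growth.cumprod().shift(1).fillna(1.0)

# Growth from 1 January up to and including each day.
day_growth = df['Month'].map(cum_before) * (1.0 + df['MTD'])

# Growth already accumulated before each investment month (assumes the
# investment dates are month starts, as in the question).
start_growth = cum_before.loc[investment_dates.month].to_numpy()

# Broadcast days (rows) against investment dates (columns).
values = day_growth.to_numpy()[:, None] / start_growth[None, :] - 1.0
before_start = df.index.to_numpy()[:, None] < investment_dates.to_numpy()[None, :]
values[before_start] = np.nan  # no return before the investment date

ytd = pd.DataFrame(values, index=df.index, columns=investment_dates)
```

The division can introduce rounding differences in the last few floating-point digits compared with multiplying the months directly, so compare results with a tolerance (for example `np.isclose`) rather than exact equality.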