Speeding up the calculation of returns

Question

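I am using Python 2.7. I want to calculate compounded returns from daily returns, and my current code is very slow at calculating them, so I have been looking for places where I can gain efficiency.

What I want to do is pass two dates and a security into a price table and calculate the compounded return between those dates for the given security.

I have a price table (prices_df):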

security_id px_last    asof
    1       3.055   2015-01-05
    1       3.360   2015-01-06
    1       3.315   2015-01-07
    1       3.245   2015-01-08
    1       3.185   2015-01-09

I also have a table with two dates and a security (events_df):

asof            disclosed_on    security_ref_id
2015-01-05  2015-01-09 16:31:00     1
2018-03-22  2018-03-27 16:33:00     3616
2017-08-03  2018-03-27 12:13:00     2591
2018-03-22  2018-03-27 11:33:00     3615
2018-03-22  2018-03-27 10:51:00     3615

Using the two dates in this table, I want to use the price table to calculate returns.

The two functions I am using:

import pandas as pd
# compounds returns
def cum_rtrn(df):
    df_out = df.add(1).cumprod()
    df_out['return'].iat[0] = 1
    return df_out

# calculates compound returns from prices between two dates
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    df['return'] = df.px_last.pct_change()
    df = df[['return']]
    df = cum_rtrn(df)
    return df.iloc[-1][0]

I then iterate over events_df using .iterrows, passing the values to calc_comp_returns each time:

import datetime

# example of how function is called
start = datetime.datetime.strptime('2015-01-05', '%Y-%m-%d').date()
end = datetime.datetime.strptime('2015-01-09', '%Y-%m-%d').date()
calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)

However, this is a very slow process because I have 10K+ iterations, so I am looking for improvements. The solution does not need to be based on pandas.

Answer 1

Here is a solution (about 100x faster on my computer with some dummy data).

import numpy as np

price_df = price_df.set_index('asof')

def calc_comp_returns_fast(price_df, start_date, end_date, security):
    rows = price_df[price_df.security_id == security].loc[start_date:end_date]
    changes = rows.px_last.pct_change()
    comp_rtrn = np.prod(changes + 1)
    return comp_rtrn

Or, as a one-liner:

def calc_comp_returns_fast(price_df, start_date, end_date, security):
    return np.prod(price_df[price_df.security_id == security].loc[start_date:end_date].px_last.pct_change() + 1)

Note that I call the set_index method beforehand; it only needs to be done once on the whole price_df dataframe.

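It is faster because it does not recreate DataFrames at each step. In your code, df is overwritten with a new dataframe on almost every line. Both the initialization and the garbage collection (erasing unused data from memory) take a lot of time.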

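In my code, rows is a slice or "view" of the original data, so it does not need to copy or reinitialize any object. Also, I used the numpy product function directly, which gives the same result as taking the last cumprod element (pandas uses np.cumprod internally anyway).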
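As a quick sanity check of that last claim, here is a minimal sketch using the five sample prices from the question. The variable names are made up for illustration. It relies on Series.cumprod and Series.prod skipping the leading NaN that pct_change produces, which is their default behaviour.

```python
import numpy as np
import pandas as pd

# Sample prices for security 1, taken from the question
px = pd.Series([3.055, 3.360, 3.315, 3.245, 3.185])

changes = px.pct_change()  # first element is NaN (no previous price)

# Asker's approach: cumulative product, then take the last element
via_cumprod = changes.add(1).cumprod().iloc[-1]

# Answer's approach: a single product over all daily growth factors
via_prod = changes.add(1).prod()  # skipna=True by default

print(via_cumprod, via_prod)
```

Both values should agree up to floating-point rounding, so the intermediate cumulative products never need to be materialised.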

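Suggestion: if you are using IPython, Jupyter or Spyder, you can use the magic %prun calc_comp_returns(...) to see which part takes the most time. I ran it on your code, and it was the garbage collector, using more than 50% of the total running time!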
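Outside of IPython, roughly the same information is available from the standard library's cProfile module. The helper below, profile_call, is a hypothetical name and only a sketch; it is written for Python 3, so on Python 2.7 the io.StringIO part would need adjusting.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the ten most expensive entries, sorted by cumulative time
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


# Stand-in workload; replace with calc_comp_returns(prices_df, ...) in practice
result, report = profile_call(sum, [1, 2, 3])
print(report)
```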

Answer 2

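I'm not very familiar with pandas, but I'll give this a shot.

Problem with your solution

Your solution currently does a huge amount of unnecessary computation. This is mainly due to the following line: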

    df['return'] = df.px_last.pct_change()

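This line calculates the percentage change for every date between start and end. Just fixing this should give you a huge speedup. You should only get the start price and the end price and compare the two. The prices in between are completely irrelevant to your calculation. Again, my familiarity with pandas is nil, but you should do something like this instead: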

def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
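    # Growth factor from first to last price (assumes the truncated
    # original line was meant to divide by the start price)
    start_price = df['px_last'].iloc[0]
    return 1 + (df['px_last'].iloc[-1] - start_price) / start_price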

Keep in mind that this code relies on price_df being sorted by date, so make sure you only pass calc_comp_returns a date-sorted price_df.
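The intermediate prices drop out because each daily growth factor is p_t / p_(t-1), so their product collapses to last price / first price. A minimal sketch of an endpoint-only version that also guards against unsorted input might look like this. The function name endpoint_return and the sample data are assumptions for illustration, and asof is assumed to be a datetime column.

```python
import pandas as pd


def endpoint_return(prices, start_date, end_date, security):
    """Growth factor between the first and last price inside the date window."""
    rows = prices[prices.security_id == security].set_index("asof")
    # Label slicing by date needs a sorted index, so sort only when required
    if not rows.index.is_monotonic_increasing:
        rows = rows.sort_index()
    window = rows.loc[start_date:end_date, "px_last"]
    if window.empty:
        return float("nan")  # no prices for this security in the window
    return window.iloc[-1] / window.iloc[0]


# Sample prices from the question, deliberately out of date order
sample_prices = pd.DataFrame({
    "security_id": [1, 1, 1, 1, 1],
    "px_last": [3.185, 3.055, 3.315, 3.360, 3.245],
    "asof": pd.to_datetime(
        ["2015-01-09", "2015-01-05", "2015-01-07", "2015-01-06", "2015-01-08"]
    ),
})

factor = endpoint_return(
    sample_prices, pd.Timestamp("2015-01-05"), pd.Timestamp("2015-01-09"), 1
)
missing = endpoint_return(
    sample_prices, pd.Timestamp("2015-01-05"), pd.Timestamp("2015-01-09"), 3616
)
print(factor, missing)
```

Sorting inside the function costs time on every call, so for 10K+ events it is probably better to sort the price table once up front and drop the check.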

Answer 3

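We'll use pd.merge_asof to get the prices from prices_df. However, when we do, we need the relevant dataframes sorted by the date columns we're using. Also, for convenience, I'll aggregate some pd.merge_asof arguments in dictionaries to be used as keyword arguments.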

prices_df = prices_df.sort_values(['asof'])

aed = events_df.sort_values('asof')
ded = events_df.sort_values('disclosed_on')

aokw = dict(
    left_on='asof', right_on='asof',
    left_by='security_ref_id', right_by='security_id'
)

start_price = pd.merge_asof(aed, prices_df, **aokw).px_last

dokw = dict(
    left_on='disclosed_on', right_on='asof',
    left_by='security_ref_id', right_by='security_id'
)

end_price = pd.merge_asof(ded, prices_df, **dokw).px_last

returns = end_price.div(start_price).sub(1).rename('return')
events_df.join(returns)

        asof        disclosed_on  security_ref_id    return
0 2015-01-05 2015-01-09 16:31:00                1  0.040816
1 2018-03-22 2018-03-27 16:33:00             3616       NaN
2 2017-08-03 2018-03-27 12:13:00             2591       NaN
3 2018-03-22 2018-03-27 11:33:00             3615       NaN
4 2018-03-22 2018-03-27 10:51:00             3615       NaN
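One thing to watch with this approach: as far as I can tell, pd.merge_asof returns a freshly numbered result in the sorted order of its left frame, so the start and end prices only line up with events_df if the rows are mapped back to their original positions. The sketch below is one way to do that. The helper price_asof, the row column and the two-row sample events table are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

prices_df = pd.DataFrame({
    "security_id": [1, 1, 1, 1, 1],
    "px_last": [3.055, 3.360, 3.315, 3.245, 3.185],
    "asof": pd.to_datetime(
        ["2015-01-05", "2015-01-06", "2015-01-07", "2015-01-08", "2015-01-09"]
    ).astype("datetime64[ns]"),
})

# Deliberately not in date order, to exercise the realignment
events_df = pd.DataFrame({
    "asof": pd.to_datetime(["2018-03-22", "2015-01-05"]).astype("datetime64[ns]"),
    "disclosed_on": pd.to_datetime(
        ["2018-03-27 16:33:00", "2015-01-09 16:31:00"]
    ).astype("datetime64[ns]"),
    "security_ref_id": [3616, 1],
})


def price_asof(events, prices, date_col):
    """Latest px_last at or before events[date_col], aligned to events.index."""
    left = events[[date_col, "security_ref_id"]].rename(
        columns={date_col: "event_date"}
    )
    # Remember each row's original label before sorting
    left = left.assign(row=events.index).sort_values("event_date")
    right = prices.rename(columns={"asof": "price_date"}).sort_values("price_date")
    merged = pd.merge_asof(
        left, right,
        left_on="event_date", right_on="price_date",
        left_by="security_ref_id", right_by="security_id",
    )
    # Restore the original row order of the events table
    return merged.set_index("row")["px_last"].reindex(events.index)


start_price = price_asof(events_df, prices_df, "asof")
end_price = price_asof(events_df, prices_df, "disclosed_on")

returns = end_price.div(start_price).sub(1).rename("return")
result = events_df.join(returns)
print(result)
```

Securities with no matching price rows come back as NaN, as in the output above.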
