Apply a generic function in a vectorized fashion using numpy / pandas

Time: 2016-08-16 14:26:01

Tags: pandas numpy time-series vectorization

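I am trying to vectorize my code and, thanks in large part to some users (https://stackoverflow.com/users/3293881/divakar, https://stackoverflow.com/users/625914/behzad-nouri), I have been able to make huge progress. Essentially, I am trying to apply a generic function (in this case max_dd_array_ret) to each of the bins I found (see "vectorize complex slicing with pandas dataframe" for details on the date vectorization, and "Start, End and Duration of Maximum Drawdown in Python" for the rationale behind the drawdown function). The problem is the following: I should be able to obtain the result df_2 by calling ranged_DD(asd_1.values, starts, ends+1), and to some degree that is what I get, except for the tragic effect that the first two bins appear to be merged and the last one is missing, as can be seen from the results.

Any explanation or workaround is very welcome.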

Code:

import pandas as pd
import numpy as np
from time import time
from scipy.stats import binned_statistic

def max_dd_array_ret(xs):
    xs = (xs+1).cumprod()
    i = np.argmax(np.maximum.accumulate(xs) - xs) # end of the period
    j = np.argmax(xs[:i])
    max_dd = abs(xs[j]/xs[i] -1)
    return max_dd if max_dd is not None else 0

def get_ranges_arr(starts,ends):
    # Taken from https://stackoverflow.com/a/37626057/3293881
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def ranged_DD(arr,starts,ends):
    # Get all indices and the IDs corresponding to same groups
    idx = get_ranges_arr(starts,ends)
    id_arr = np.repeat(np.arange(starts.size),ends-starts)

    slice_arr = arr[idx]
    return binned_statistic(id_arr, slice_arr, statistic=max_dd_array_ret)[0]

asd_1 = pd.Series(0.01 * np.random.randn(500), index=pd.date_range('2011-1-1', periods=500)).pct_change()

index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1','2011-7-2', '2011-8-3', '2011-9-1','2011-10-2', '2011-11-3', '2011-12-1','2012-1-2', '2012-2-3', '2012-3-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17','2011-7-17', '2011-8-17', '2011-9-17','2011-10-17', '2011-11-17', '2011-12-17','2012-1-17', '2012-2-17', '2012-3-17',])

starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)

df_2 = pd.DataFrame([max_dd_array_ret(asd_1.loc[i:j]) for i, j in zip(index_1, index_2)], index=index_1)

print(df_2[0].values)
print(ranged_DD(asd_1.values, starts, ends+1))

The two printed outputs are identical except at the ends.

df_2:
[ 1.75893509 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085 1.43863472 1.85322338 1.84767224 1.32605754 1.48688414 5.44786663]

ranged_DD(asd_1.values, starts, ends+1):
[ 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085 1.43863472 1.85322338 1.84767224 1.32605754 1.48688414]

The first two values differ ([ 1.75893509 6.08002911 vs [ 6.08002911), and so do the last two (1.48688414 5.44786663] vs 1.48688414]).
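Note that df_2 holds 12 values while ranged_DD returns only 10. One thing worth checking, offered as a sketch rather than a confirmed diagnosis: binned_statistic is called above without a bins argument, and SciPy's default is bins=10. The toy data below (hypothetical, standing in for id_arr) shows what that default does to 12 integer group IDs.

```python
import numpy as np
from scipy.stats import binned_statistic

# Hypothetical toy data: 12 group IDs (0..11), three samples per group.
ids = np.repeat(np.arange(12), 3)
vals = np.ones(ids.size)

# `bins` is not passed, so SciPy uses its default of 10 equal-width bins
# spanning [ids.min(), ids.max()] = [0, 11]; each bin is 1.1 wide.
counts, edges, _ = binned_statistic(ids, vals, statistic='count')

print(counts.size)  # number of output values
print(counts)       # number of samples that landed in each bin
```

With this toy input the first bin collects IDs 0 and 1 and the last bin collects IDs 10 and 11, so 12 groups collapse into 10 outputs. That pattern resembles the symptom described above, although it has not been verified against the original data.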

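On a closer look at the documentation (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html), I found the following, which may be the problem: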

  

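“All but the last (righthand-most) bin is half-open. In other words, if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4. New in version 0.11.0.”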

The problem is that I do not know how to reset it.
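The quoted rule can be checked directly on the documentation's own edges [1, 2, 3, 4]. The second half of the sketch below is an assumption, not a confirmed fix: if the group IDs are the integers 0..n-1 (as np.repeat(np.arange(starts.size), ...) produces), passing explicit edges np.arange(n + 1) should give one bin per ID, and the closed last bin then contains only the last ID. The data and the names n, ids, vals are hypothetical.

```python
import numpy as np
from scipy.stats import binned_statistic

# The rule from the documentation, on its example edges [1, 2, 3, 4]:
# bins are [1, 2), [2, 3) and the closed last bin [3, 4].
x = np.array([1, 2, 3, 4])
demo_counts = binned_statistic(x, np.ones(x.size),
                               statistic='count', bins=[1, 2, 3, 4])[0]
print(demo_counts)  # the values 3 and 4 share the last bin

# Hypothetical group IDs 0..n-1 (stand-in for id_arr), three samples each.
n = 12
ids = np.repeat(np.arange(n), 3)
vals = np.arange(ids.size, dtype=float)

# Explicit integer edges 0, 1, ..., n give exactly n bins, one per ID:
# [0, 1), [1, 2), ..., [n-1, n]. The closed last bin holds only ID n-1.
per_group_max = binned_statistic(ids, vals, statistic=np.max,
                                 bins=np.arange(n + 1))[0]
print(per_group_max.size)  # one value per group
```

In the question's setting this would correspond to adding bins=np.arange(starts.size + 1) to the binned_statistic call inside ranged_DD. Whether that reproduces df_2 exactly also depends on how the slice end points are handled (ends+1 versus the inclusive .loc slicing).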

0 answers:

There are no answers.