Question

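I have a time series of price data stored as Open, High, Low, Close values in a DataFrame.

I want to create a new column in which each element records how many days back you have to go in the source array to find a higher high.

So for a series like this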

    import pandas as pd
    import numpy as np
    my_vals = pd.Series([10.1, 9.0, 2.4, 8.2, 7.0, 6.1, 5.4, 9.4, 8.7, 11.8, 3.5, 4.7, 5.4, 6.4, 7.8, 8.0, 9.1, 10.2, 11.0, 2.0])

we would get these values: [NaN, 1, 1, 2, 1, 1, 1, 7, 1, NaN, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1]

I wrote this code using rolling_apply, but it is really slow, and I am sure there must be a better way.

    def countDaysSinceHigherHigh(x):
        aaa = pd.Series(x)

        zzz = x[-1]  # looking for values higher than zzz
        bbb = aaa[:-1]  # series without the last element
        ccc = bbb[bbb > zzz]  # elements that are higher than zzz

        ddd = ccc.last_valid_index()
        if ddd is None:
            return np.nan  # or return 10000 to match window length
        else:
            return aaa.last_valid_index() - ddd

Then compute the new column with the function we just defined

    new_col = pd.rolling_apply(my_vals, 10000, countDaysSinceHigherHigh, min_periods=0)

Any suggestions are much appreciated :)
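For readers on a current pandas release, where the top-level `pd.rolling_apply` function is no longer available, the call above can be approximated with `Series.rolling(...).apply(...)`. The following is a minimal sketch, not the asker's code: the function name `days_since_higher` is made up for illustration, and `min_periods=1` is assumed to be an acceptable stand-in for `min_periods=0`, since every window here contains at least one value.

```python
import numpy as np
import pandas as pd

def days_since_higher(window):
    # With raw=True the window arrives as a NumPy array.
    # Positions (within the window) of earlier values that are strictly higher
    # than the last value of the window.
    higher = np.flatnonzero(window[:-1] > window[-1])
    if higher.size == 0:
        return np.nan  # no earlier higher value inside the window
    # Distance from the last element back to the nearest higher one.
    return float(len(window) - 1 - higher[-1])

my_vals = pd.Series([10.1, 9.0, 2.4, 8.2, 7.0, 6.1, 5.4, 9.4, 8.7, 11.8,
                     3.5, 4.7, 5.4, 6.4, 7.8, 8.0, 9.1, 10.2, 11.0, 2.0])

# Same window length as in the question; shorter series simply use all earlier rows.
new_col = my_vals.rolling(window=10_000, min_periods=1).apply(days_since_higher, raw=True)
print(new_col.tolist())
```

This keeps the same quadratic behaviour as the original function, so it only addresses the API change, not the speed.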

Answer 1

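You can do this with two for loops, but the worst-case time complexity may be O(N**2). Here is a method that does it in O(N*log(N)):

Algorithm:

argsort() the array to get an index array (positions ordered by ascending value)
For each element idx of that index array, look at the elements that come after idx in the index array (the positions holding larger values) and find the largest one that is smaller than idx. To do this quickly, you can use a SortedList. Here are two libraries that implement sorted lists: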

http://www.grantjenks.com/docs/sortedcontainers/sortedlist.html

http://stutzbachenterprises.com/blist/sortedlist.html
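The two steps above can be traced with the standard library alone. This is only an illustrative sketch: a plain Python list plus `bisect` stands in for a SortedList, so each removal costs O(N) and the overall O(N*log(N)) bound does not hold here. The function name `days_back_to_higher` is made up for the example.

```python
import bisect

def days_back_to_higher(values):
    """Distance back to the nearest earlier, strictly higher value (0 if none)."""
    n = len(values)
    # Step 1: positions ordered by ascending value. sorted() is stable,
    # so equal values keep their original left-to-right order.
    order = sorted(range(n), key=lambda i: values[i])
    # Positions not yet processed, kept in ascending order.
    remaining = list(range(n))
    result = [0] * n
    for idx in order:
        # Step 2: remove idx, then look just to its left in what is left.
        pos = bisect.bisect_left(remaining, idx)
        remaining.pop(pos)
        # Positions left of idx that are still present hold strictly higher
        # values; the nearest one sits at pos - 1.
        if pos > 0:
            result[idx] = idx - remaining[pos - 1]
    return result

prices = [10.1, 9.0, 2.4, 8.2, 7.0, 6.1, 5.4, 9.4, 8.7, 11.8]
print(days_back_to_higher(prices))
```

Swapping the plain list for a real sorted-list container is what makes removal and lookup logarithmic.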

Here is the code:

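    import numpy as np
    from sortedcontainers import SortedList

    def nearest_hi_value(my_vals):
        # positions ordered by ascending value
        index = np.argsort(my_vals)
        # every position starts out in the sorted list
        sl = SortedList(range(len(index)))
        res = []
        for idx in index.tolist():
            # after this removal, sl holds only positions processed later,
            # i.e. positions whose values are at least as large
            sl.remove(idx)
            idx2 = sl.bisect_left(idx)
            if idx2 > 0:
                # nearest remaining position to the left of idx
                res.append(idx - sl[idx2-1])
            else:
                # no earlier higher value
                res.append(0)
        result = np.zeros_like(index)
        result[index] = res
        return result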

If two consecutive elements in the array are equal, nearest_hi_value() may return 1 for the second one, but this can be fixed easily.
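One possible fix, assuming "higher" should mean strictly higher, is to ask `np.argsort` for a stable sort. Equal values are then processed from left to right, so by the time a later duplicate is looked up, every earlier duplicate has already been removed from the sorted list and cannot be mistaken for a higher value. The snippet below only demonstrates the ordering property on a made-up array; in nearest_hi_value() the change would be `np.argsort(my_vals, kind="stable")`.

```python
import numpy as np

# Made-up values containing duplicates (positions 1, 2 and 4 all hold 3.0).
vals = np.array([5.0, 3.0, 3.0, 4.0, 3.0])

# A stable sort keeps equal values in their original left-to-right order.
order = np.argsort(vals, kind="stable")
print(order.tolist())  # [1, 2, 4, 3, 0]
```

The default sort kind makes no such guarantee, which is why ties can come out in either order without this argument.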

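Here is a check of the results:

    my_vals = np.random.rand(1000)
    res1 = pd.rolling_apply(my_vals, 10000, countDaysSinceHigherHigh, min_periods=0)
    res2 = nearest_hi_value(my_vals)
    # countDaysSinceHigherHigh returns NaN where nearest_hi_value returns 0
    np.allclose(np.nan_to_num(res1), res2)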

Here are the timing results:

    %timeit pd.rolling_apply(my_vals, 10000, countDaysSinceHigherHigh, min_periods=0)
    %timeit nearest_hi_value(my_vals)

Output:

    1 loops, best of 3: 489 ms per loop
    100 loops, best of 3: 10.4 ms per loop

Pandas Series: a faster way to compute the number of periods back to the previous high
