Question

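I have a pandas DataFrame, df, with 4,000,000 timesteps for a single stock.

The task: for each timestep, I want to determine whether the price first rises 0.1% or first falls 0.1%. Right now I convert the DataFrame to NumPy arrays and loop over every timestep, from 0 to 4,000,000.

For each timestep, I iterate over the following timesteps until I find one where the price has moved by 0.1%. If the price rose 0.1% the label is 1; if it fell 0.1% the label is 0. This takes a very long time.

Is it even possible to vectorize this? I tried to come up with a dynamic programming solution to reduce the time complexity, but I'm not sure one exists.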

import numpy as np

target = 0.001  # 0.1% move, expressed as a fraction

high_bid = df['high_bid'].values
high_ask = df['high_ask'].values
low_bid = df['low_bid'].values
low_ask = df['low_ask'].values
open_bid = df['open_bid'].values
open_ask = df['open_ask'].values
# one label per row; NaN means neither threshold was reached
labels = np.full(len(df), np.nan)

for i in range(len(labels)-1):
    for j in range(i+1,len(labels)-1):
        if (open_ask[i] + (open_ask[i]*target) <= high_bid[j]):
            labels[i] = 1
            break
        elif (open_bid[i] - (open_bid[i]*target) >= low_ask[j]):
            labels[i] = 0
            break
df['direction'] = labels

Example

                 time  open_bid  open_ask  high_bid  high_ask  low_bid  low_ask  direction
0 2006-09-19 12:00:00   1.26606   1.26621   1.27063   1.27078  1.26504  1.26519          1
1 2006-09-19 13:00:00   1.27010   1.27025   1.27137   1.27152  1.26960  1.26975          1
2 2006-09-19 14:00:00   1.27076   1.27091   1.27158   1.27173  1.26979  1.26994          0
3 2006-09-19 15:00:00   1.27008   1.27023   1.27038   1.27053  1.26708  1.26723          0
4 2006-09-19 16:00:00   1.26816   1.26831   1.26821   1.26836  1.26638  1.26653          0
5 2006-09-19 17:00:00   1.26648   1.26663   1.26762   1.26777  1.26606  1.26621          1
6 2006-09-19 18:00:00   1.26756   1.26771   1.26781   1.26796  1.26733  1.26748        NaN
7 2006-09-19 19:00:00   1.26763   1.26778   1.26785   1.26800  1.26754  1.26769        NaN
8 2006-09-19 20:00:00   1.26770   1.26785   1.26825   1.26840  1.26765  1.26780        NaN
9 2006-09-19 21:00:00   1.26781   1.26796   1.26791   1.26806  1.26703  1.26718        NaN

I want to add that direction column for all 4 million rows.

Answer 1

First attempt at a solution: Cython. In similar setups, simply adding %%cython to my code has given me a 20-90x speedup.

In one Jupyter cell, load the extension

%load_ext Cython

In a second cell, compile the function (%%cython must be the first line of that cell)

%%cython
cimport numpy as np
import numpy as np

cpdef func(np.ndarray high_bid, np.ndarray high_ask, np.ndarray low_bid, np.ndarray low_ask, np.ndarray open_bid, np.ndarray open_ask, np.ndarray labels, double target=0.001):
    # labels is filled in place; rows where neither threshold is hit keep their initial value
    cdef Py_ssize_t i, j, n = len(labels)
    for i in range(n):
        for j in range(i+1, n):
            # The following is just a copy-paste of your code
            if (open_ask[i] + (open_ask[i]*target) <= high_bid[j]):
                labels[i] = 1
                break
            elif (open_bid[i] - (open_bid[i]*target) >= low_ask[j]):
                labels[i] = 0
                break

In another Jupyter cell

func(high_bid, high_ask, low_bid, low_ask, open_bid, open_ask, labels, target)

Further optimization

Here is an excellent introduction to Cython for pandas

You can speed it up further by adding data types (np.ndarray[double])

Second solution: use cummax on high_bid and cummin on low_ask, both in reverse order

target = 0.001
df['highest_bid_from_on'] = df.high_bid.sort_index(ascending=False).cummax().sort_index(ascending=True)
df['lowest_ask_from_on'] = df.low_ask.sort_index(ascending=False).cummin().sort_index(ascending=True)
df['direction'] = np.nan

df.loc[df.open_bid * (1 - target) >= df.lowest_ask_from_on, 'direction'] = 0
df.loc[df.open_ask * (1 + target) <= df.highest_bid_from_on, 'direction'] = 1
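
The reversed cummax/cummin columns say whether each threshold is reached at the current row or some later one, not which threshold is reached first, and the second .loc assignment overwrites 0 with 1 when both are reached. Below is a minimal sketch of one way to recover the first hit without the quadratic scan. It swaps in a different technique (a monotonic stack with binary search) and is not part of the answer above. The names first_index_at_least and label_first_move are hypothetical, it scans every later row including the last one, and the tie rule (an up move wins when both thresholds are hit on the same row) is assumed from the if/elif order in the question's loop.

```python
from bisect import bisect_right

import numpy as np


def first_index_at_least(values, thresholds):
    """For each i, smallest j > i with values[j] >= thresholds[i], or -1 if none."""
    values = np.asarray(values, dtype=float).tolist()
    thresholds = np.asarray(thresholds, dtype=float).tolist()
    n = len(values)
    out = np.full(n, -1, dtype=np.int64)
    # Stack of later rows that are larger than every row between them and i.
    # Bottom -> top goes from far to near rows, with strictly decreasing values,
    # so the negated values are ascending and can be searched with bisect.
    stack_idx = []
    stack_neg = []
    for i in range(n - 1, -1, -1):
        # Rows with value >= threshold form a prefix of the stack;
        # the last entry of that prefix is the nearest such row.
        p = bisect_right(stack_neg, -thresholds[i]) - 1
        if p >= 0:
            out[i] = stack_idx[p]
        # Row i hides any later row whose value is not larger than its own.
        v = values[i]
        while stack_idx and -stack_neg[-1] <= v:
            stack_idx.pop()
            stack_neg.pop()
        stack_idx.append(i)
        stack_neg.append(-v)
    return out


def label_first_move(open_bid, open_ask, high_bid, low_ask, target=0.001):
    """1.0 if the up threshold is hit first, 0.0 if the down one is, NaN if neither."""
    open_bid = np.asarray(open_bid, dtype=float)
    open_ask = np.asarray(open_ask, dtype=float)
    high_bid = np.asarray(high_bid, dtype=float)
    low_ask = np.asarray(low_ask, dtype=float)
    n = len(open_bid)
    # Same threshold arithmetic as the question's loop.
    up = first_index_at_least(high_bid, open_ask + open_ask * target)
    # low_ask[j] <= t is the same test as -low_ask[j] >= -t.
    down = first_index_at_least(-low_ask, -(open_bid - open_bid * target))
    up_pos = np.where(up >= 0, up, n)        # n stands for "never"
    down_pos = np.where(down >= 0, down, n)
    labels = np.full(n, np.nan)
    labels[down_pos < up_pos] = 0.0
    labels[(up_pos <= down_pos) & (up_pos < n)] = 1.0
    return labels
```

With open_bid = open_ask = [100, 100, 100], high_bid = [100, 100, 101], low_ask = [100, 99, 100] and target=0.005, label_first_move labels row 0 as 0 because the drop to 99 comes before the rise to 101, whereas the cummax/cummin columns label it 1. The loop is still Python-level, but each row costs a binary search rather than a forward scan, so it may also be a reasonable candidate for the Cython treatment described earlier.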

Answer 2

You could also look at the expanding() window function, applied in reverse, to compute max_future_high_bid and min_future_low_ask over the rows after each row:

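# 0.1% increase/decrease
target = 0.001

# new column names
new_columns = [ "max_future_high_bid", "min_future_low_ask" ]

# reverse the frame, take the expanding max/min, reverse back,
# then shift so that each row only sees the rows after it
df[new_columns] = (
    df[::-1].expanding(1)
            .agg({'high_bid': 'max', 'low_ask': 'min'})[::-1]
            .shift(-1)
)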

# after you have these two values, you can calculate the direction with apply() function
def get_direction(x):
    if x.max_future_high_bid >= (1 + target) * x.open_ask :
        return 1
    elif (1 - target) * x.open_bid  >= x.min_future_low_ask:
        return 0
    else:
        return None

# calculate the direction
df['direction'] = df.apply(get_direction, axis=1)
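
df.apply with axis=1 calls get_direction once per row in Python, which is likely to be slow on 4 million rows. Here is a hedged sketch of the same rule written with boolean masks and np.select. The name direction_from_future_extrema is hypothetical, and it uses cummax/cummin on the reversed columns in place of expanding().agg(), which should give the same running extrema. Like the apply version, it reports whether a threshold is reached in any later row, with 1 taking precedence when both are.

```python
import numpy as np
import pandas as pd


def direction_from_future_extrema(df, target=0.001):
    """Vectorized equivalent of get_direction (hypothetical helper)."""
    # Reverse, take the running max/min, reverse back, then shift by one
    # so that row i only sees rows strictly after it.
    future_high = df['high_bid'][::-1].cummax()[::-1].shift(-1)
    future_low = df['low_ask'][::-1].cummin()[::-1].shift(-1)

    # Comparisons against NaN (the last row) are False, so that row stays NaN.
    up = (future_high >= (1 + target) * df['open_ask']).to_numpy()
    down = ((1 - target) * df['open_bid'] >= future_low).to_numpy()

    # np.select takes the first condition that matches, like the if/elif chain.
    direction = np.select([up, down], [1.0, 0.0], default=np.nan)
    return pd.Series(direction, index=df.index, name='direction')
```

It would be used as df['direction'] = direction_from_future_extrema(df, target).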

How to vectorize this Python loop over millions of records
