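I have a pandas DataFrame, df, containing 4,000,000 time steps for a single stock.
For each time step, I want to determine whether the price first rises by 0.1% or first falls by 0.1%. Right now I convert the DataFrame's columns to NumPy arrays and loop over every time step, from 0 to 4,000,000.
For each time step, I iterate over the following time steps until I find one where the price differs by 0.1%. If the price rose by 0.1% the label is 1; if it fell by 0.1% the label is 0. This takes a very long time.
Is it even possible to vectorize this? I tried a dynamic programming solution to reduce the time complexity, but I'm not sure one exists.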
import numpy as np

target = 0.001  # a 0.1% move

high_bid = df['high_bid'].values
high_ask = df['high_ask'].values
low_bid = df['low_bid'].values
low_ask = df['low_ask'].values
open_bid = df['open_bid'].values
open_ask = df['open_ask'].values

# Start with every label unknown (NaN)
labels = np.empty(len(df))
labels[:] = np.nan

for i in range(len(labels)-1):
    # Scan the following rows until one of the two thresholds is reached
    for j in range(i+1,len(labels)-1):
        if (open_ask[i] + (open_ask[i]*target) <= high_bid[j]):
            labels[i] = 1
            break
        elif (open_bid[i] - (open_bid[i]*target) >= low_ask[j]):
            labels[i] = 0
            break

df['direction'] = labels
Example
                  time  open_bid  open_ask  high_bid  high_ask  low_bid  low_ask  direction
0  2006-09-19 12:00:00   1.26606   1.26621   1.27063   1.27078  1.26504  1.26519          1
1  2006-09-19 13:00:00   1.27010   1.27025   1.27137   1.27152  1.26960  1.26975          1
2  2006-09-19 14:00:00   1.27076   1.27091   1.27158   1.27173  1.26979  1.26994          0
3  2006-09-19 15:00:00   1.27008   1.27023   1.27038   1.27053  1.26708  1.26723          0
4  2006-09-19 16:00:00   1.26816   1.26831   1.26821   1.26836  1.26638  1.26653          0
5  2006-09-19 17:00:00   1.26648   1.26663   1.26762   1.26777  1.26606  1.26621          1
6  2006-09-19 18:00:00   1.26756   1.26771   1.26781   1.26796  1.26733  1.26748        NaN
7  2006-09-19 19:00:00   1.26763   1.26778   1.26785   1.26800  1.26754  1.26769        NaN
8  2006-09-19 20:00:00   1.26770   1.26785   1.26825   1.26840  1.26765  1.26780        NaN
9  2006-09-19 21:00:00   1.26781   1.26796   1.26791   1.26806  1.26703  1.26718        NaN
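I want to add this direction column for all 4 million rows.
Answer 0: (score: 0)
First attempt at a solution: Cython. In a similar setting, just adding %%cython to my code gave me a 20-90x speedup.
In one Jupyter cell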
%load_ext Cython

Then, in a cell of its own (%%cython has to be the first line of the cell):

%%cython
cimport numpy as np
import numpy as np

cpdef func(np.ndarray high_bid, np.ndarray high_ask, np.ndarray low_bid, np.ndarray low_ask, np.ndarray open_bid, np.ndarray open_ask, np.ndarray labels, double target=0.001):
    cdef Py_ssize_t i, j, n = len(labels)
    for i in range(n):
        for j in range(i+1, n):
            # The following is just a copy-paste of your code
            if (open_ask[i] + (open_ask[i]*target) <= high_bid[j]):
                labels[i] = 1
                break
            elif (open_bid[i] - (open_bid[i]*target) >= low_ask[j]):
                labels[i] = 0
                break
In another Jupyter cell
func(high_bid, high_ask, low_bid, low_ask, open_bid, open_ask, labels, target)
Further optimization
Here is an excellent introduction to Cython for pandas
You can speed this up further by adding data types (np.ndarray[double])
Second solution: use cummax on high_bid and cummin on low_ask, both computed in reverse order
target = 0.001
df['highest_bid_from_on'] = df.high_bid.sort_index(ascending=False).cummax().sort_index(ascending=True)
df['lowest_ask_from_on'] = df.low_ask.sort_index(ascending=False).cummin().sort_index(ascending=True)
df['direction'] = np.nan
df.loc[df.open_bid * (1 - target) >= df.lowest_ask_from_on, 'direction'] = 0
df.loc[df.open_ask * (1 + target) <= df.highest_bid_from_on, 'direction'] = 1
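One thing to keep in mind about the cummax/cummin columns: they record whether each level is reached at some point from the current row onward, not which level is reached first, so rows that reach both levels can get a different label than the loop in the question. Below is a minimal sketch of a first-hit labeler that keeps the loop over rows but replaces the inner loop with NumPy comparisons over a block of later rows. The names `label_first_hit` and `window` are made up for this illustration, and unlike the question's loop it also scans the final row. The worst case is still quadratic when neither level is reached for a long stretch.

```python
import numpy as np

def label_first_hit(open_bid, open_ask, high_bid, low_ask, target=0.001, window=256):
    """Return 1.0 where the up level is reached first, 0.0 where the down level is, else NaN."""
    n = len(open_ask)
    labels = np.full(n, np.nan)
    up = open_ask * (1 + target)      # level a later high_bid must reach for label 1
    down = open_bid * (1 - target)    # level a later low_ask must reach for label 0
    for i in range(n - 1):
        start = i + 1
        while start < n:
            stop = min(start + window, n)
            hit_up = high_bid[start:stop] >= up[i]
            hit_down = low_ask[start:stop] <= down[i]
            hit = hit_up | hit_down
            if hit.any():
                k = hit.argmax()  # offset of the first later row that reaches either level
                labels[i] = 1.0 if hit_up[k] else 0.0  # up wins ties, like the if/elif
                break
            start = stop          # nothing in this block, look at the next one
    return labels

# The ten sample rows from the question.
open_bid = np.array([1.26606, 1.27010, 1.27076, 1.27008, 1.26816,
                     1.26648, 1.26756, 1.26763, 1.26770, 1.26781])
open_ask = np.array([1.26621, 1.27025, 1.27091, 1.27023, 1.26831,
                     1.26663, 1.26771, 1.26778, 1.26785, 1.26796])
high_bid = np.array([1.27063, 1.27137, 1.27158, 1.27038, 1.26821,
                     1.26762, 1.26781, 1.26785, 1.26825, 1.26791])
low_ask = np.array([1.26519, 1.26975, 1.26994, 1.26723, 1.26653,
                    1.26621, 1.26748, 1.26769, 1.26780, 1.26718])
sample_labels = label_first_hit(open_bid, open_ask, high_bid, low_ask)

# Made-up series: row 1 falls 0.1% below row 0's open before row 2 rises 0.1% above it.
toy_open_bid = np.array([100.00, 99.80, 100.30])
toy_open_ask = toy_open_bid + 0.01
toy_high_bid = np.array([100.05, 99.90, 100.40])
toy_low_ask = np.array([99.96, 99.70, 100.20])
toy_labels = label_first_hit(toy_open_bid, toy_open_ask, toy_high_bid, toy_low_ask)
# Row 0 reaches both levels later on, so an "ever reached" test cannot tell which came first.
toy_ever_up = toy_high_bid.max() >= toy_open_ask[0] * 1.001
```

On the made-up series, row 0 gets label 0 from the first-hit scan even though its up level is also reached later, which is the case where the cummax/cummin result would differ. A larger `window` means fewer Python-level iterations per row but more wasted comparisons when the hit comes early, so it is worth tuning on real data.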
Answer 1: (score: 0)
You could also look at the expanding() window function, applied in reverse, to compute max_future_high_bid and min_future_low_ask over the rows after each row:
# 0.1% increase/decrease
target = 0.001

# new column names
new_columns = [ "max_future_high_bid", "min_future_low_ask" ]

# expanding max/min over the reversed frame, shifted so each row only sees later rows
df[new_columns] = df[::-1].expanding(1)\
    .agg({'high_bid':'max', 'low_ask':'min'})[::-1] \
    .shift(-1)

# after you have these two values, you can calculate the direction with apply() function
def get_direction(x):
    if x.max_future_high_bid >= (1 + target) * x.open_ask :
        return 1
    elif (1 - target) * x.open_bid >= x.min_future_low_ask:
        return 0
    else:
        return None

# calculate the direction
df['direction'] = df.apply(get_direction, axis=1)
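Row-wise apply() calls a Python function once per row, which is likely to be slow on 4,000,000 rows. The sketch below makes the same comparison column-wise with NumPy, assuming the column names from the question. The function name `direction_from_future_extremes` is made up for this illustration, and it returns NaN rather than None for unlabeled rows. Like the code above, it only checks whether each level is reached at some later row, not which one is reached first.

```python
import numpy as np
import pandas as pd

def direction_from_future_extremes(df, target=0.001):
    # Running max/min over the reversed arrays: the extreme from each row to the end.
    max_from_here = np.maximum.accumulate(df["high_bid"].to_numpy()[::-1])[::-1]
    min_from_here = np.minimum.accumulate(df["low_ask"].to_numpy()[::-1])[::-1]
    # Shift by one so each row only sees strictly later rows; the last row has none.
    max_future = np.append(max_from_here[1:], np.nan)
    min_future = np.append(min_from_here[1:], np.nan)
    up = max_future >= (1 + target) * df["open_ask"].to_numpy()
    down = (1 - target) * df["open_bid"].to_numpy() >= min_future
    # Same precedence as get_direction: up is tested first, then down, else NaN.
    return pd.Series(np.where(up, 1.0, np.where(down, 0.0, np.nan)), index=df.index)

# The ten sample rows from the question (only the columns this function reads).
sample = pd.DataFrame({
    "open_bid": [1.26606, 1.27010, 1.27076, 1.27008, 1.26816,
                 1.26648, 1.26756, 1.26763, 1.26770, 1.26781],
    "open_ask": [1.26621, 1.27025, 1.27091, 1.27023, 1.26831,
                 1.26663, 1.26771, 1.26778, 1.26785, 1.26796],
    "high_bid": [1.27063, 1.27137, 1.27158, 1.27038, 1.26821,
                 1.26762, 1.26781, 1.26785, 1.26825, 1.26791],
    "low_ask": [1.26519, 1.26975, 1.26994, 1.26723, 1.26653,
                1.26621, 1.26748, 1.26769, 1.26780, 1.26718],
})
sample["direction"] = direction_from_future_extremes(sample)
```

On these ten rows the result matches the direction column shown in the question, but that agreement is not guaranteed on data where both levels are reached after the same row.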