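I have some code in which a for loop runs over a pandas DataFrame. I would like to vectorize it, because it is currently the bottleneck in my program and can take a while to run.
I have two DataFrames, 'df' and 'symbol_data'.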
df.head()
                         Open Time       Close Time2  Open Price
Close Time
29/09/2016 00:16  29/09/2016 00:01  29/09/2016 00:16      1.1200
29/09/2016 00:17  29/09/2016 00:03  29/09/2016 00:17      1.1205
29/09/2016 00:18  29/09/2016 00:03  29/09/2016 00:18      1.0225
29/09/2016 00:19  29/09/2016 00:07  29/09/2016 00:19      1.0240
29/09/2016 00:20  29/09/2016 00:15  29/09/2016 00:20      1.0241
and
symbol_data.head()
                    OPEN    HIGH     LOW  LAST_PRICE
DATE
29/09/2016 00:01  1.1216  1.1216  1.1215      1.1216
29/09/2016 00:02  1.1216  1.1216  1.1215      1.1215
29/09/2016 00:03  1.1215  1.1216  1.1215      1.1216
29/09/2016 00:04  1.1216  1.1216  1.1216      1.1216
29/09/2016 00:05  1.1216  1.1217  1.1216      1.1217
29/09/2016 00:06  1.1217  1.1217  1.1216      1.1217
29/09/2016 00:07  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:08  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:09  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:10  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:11  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:12  1.1217  1.1218  1.1217      1.1218
29/09/2016 00:13  1.1218  1.1218  1.1217      1.1217
29/09/2016 00:14  1.1217  1.1218  1.1217      1.1218
29/09/2016 00:15  1.1218  1.1218  1.1217      1.1217
29/09/2016 00:16  1.1217  1.1218  1.1217      1.1217
29/09/2016 00:17  1.1217  1.1218  1.1217      1.1217
29/09/2016 00:18  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:19  1.1217  1.1217  1.1217      1.1217
29/09/2016 00:20  1.1217  1.1218  1.1217      1.1218
The for loop is as follows:
for row in range(len(df)):
    df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
    df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
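The code basically takes each row of 'df', which represents an individual trade, and cross-references it against the data in 'symbol_data' to find the minimum and maximum prices reached during the lifetime of that particular trade. It then subtracts the trade's opening price from that maximum or minimum to calculate the furthest distance the trade went "onside" and "offside" while it was open.
I can't figure out how to vectorize the code. I'm relatively new to coding and have always used for loops up to now.
Could anyone point me in the right direction or offer any hints on how to vectorize this?
Thanks.
EDIT:
So I tried the code provided by Grr. I can replicate it and get it to work on the small test data I provided, but when I try to run it on my full data I keep getting this error message: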
ValueError Traceback (most recent call last)
<ipython-input-113-19bc1c85f243> in <module>()
93 shared_times = symbol_data[symbol_data.index.isin(df.index)].index
94
---> 95 df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
96 df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
97
C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\tseries\index.py in wrapper(self, other)
112 elif not isinstance(other, (np.ndarray, Index, ABCSeries)):
113 other = _ensure_datetime64(other)
--> 114 result = func(np.asarray(other))
115 result = _values_from_object(result)
116
C:\Users\stuart.jamieson\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\base.py in _evaluate_compare(self, other)
3350 if isinstance(other, (np.ndarray, Index, ABCSeries)):
3351 if other.ndim > 0 and len(self) != len(other):
-> 3352 raise ValueError('Lengths must match to compare')
3353
3354 # we may need to directly compare underlying
ValueError: Lengths must match to compare
I have narrowed it down to the following piece of code:
shared_times >= df['Open Time']
When I try
shared_times >= df['Open Time'][0]
I get:
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True], dtype=bool)
So I know all the indices are correctly formatted as "DatetimeIndex".
type(shared_times[0])
pandas.tslib.Timestamp
type(df['Open Time'][0])
pandas.tslib.Timestamp
type(df['Close Time2'][0])
pandas.tslib.Timestamp
Could anyone suggest how I can get past this error message?
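Answer 0 (score: 1)
I see a few issues with this code.
Why do you need the 'Close Time2' column? It is just a copy of the index.
Iterating over the rows of a DataFrame can be made a lot easier.
If you use column names without spaces, you can use the following approach:
for row in df.itertuples():
    # print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    df.loc[row.Index, 'Max Pips'] = prices['HIGH'].max() - row.Open_Price
    df.loc[row.Index, 'Min Pips'] = prices['LOW'].min() - row.Open_Price
This should minimize the back-and-forth between the two DataFrames and improve performance, but it is not true vectorization.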
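The attribute access used above (row.Open_Time, row.Open_Price) only works when the column names are valid Python identifiers. A minimal sketch of the renaming step, assuming a two-row sample whose values are taken from the question and kept as plain strings for brevity:

```python
import pandas as pd

# Two rows modeled on the question's df; times are left as strings here.
df = pd.DataFrame(
    {
        "Open Time": ["29/09/2016 00:01", "29/09/2016 00:03"],
        "Open Price": [1.1200, 1.1205],
    },
    index=pd.Index(["29/09/2016 00:16", "29/09/2016 00:17"], name="Close Time"),
)

# Replace spaces with underscores so itertuples() exposes each column
# as an attribute (row.Open_Time, row.Open_Price).
df = df.rename(columns=lambda name: name.replace(" ", "_"))

rows = list(df.itertuples())
first = rows[0]
print(first.Index, first.Open_Time, first.Open_Price)
```

The index label is available as row.Index regardless of the index name.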
You could try vectorizing part of the calculation like this:
price_max = pd.Series(index=df.index, dtype=float)
price_min = pd.Series(index=df.index, dtype=float)
for row in df.itertuples():
    # print(row)
    prices = symbol_data.loc[row.Open_Time:row.Index]
    price_max[row.Index] = prices['HIGH'].max()
    price_min[row.Index] = prices['LOW'].min()
df['Max Pips2'] = price_max - df['Open_Price']
df['Min Pips2'] = price_min - df['Open_Price']
But I don't think this will make much of a difference.
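Answer 1 (score: 1)
So it seems to me there is a lot more going on here than just trying to vectorize some code. Let's break down what you are doing.
Take just the first step of each loop iteration: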
df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
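When you do symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]], behind the scenes pandas builds a pandas.DatetimeIndex constructed by pandas.date_range. So essentially, for every row you are creating an array of tens of thousands of datetimes. Unfortunately, pandas cannot do this for an entire column at once, since you can't do symbol_data.loc[df['Open Time']:df['Close Time2']]. So in this case, this is the step that prevents you from vectorizing your code.
First, let's get a baseline for the code. Using the sample you provided, I wrapped your for loop in a function, calc_time, and timed its execution.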
In [202]: def calc_time():
              df['Max Pips'] = 0.0
              df['Min Pips'] = 0.0
              for row in range(len(df)):
                  df['Max Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['HIGH'].max() - df['Open Price'][row]
                  df['Min Pips'][row] = symbol_data.loc[df['Open Time'][row]:df['Close Time2'][row]]['LOW'].min() - df['Open Price'][row]
In [203]: %time calc_time()
/Users/grr/anaconda/bin/ipython:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
sys.exit(IPython.start_ipython())
/Users/grr/anaconda/bin/ipython:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
CPU times: user 281 ms, sys: 3.46 ms, total: 284 ms
Wall time: 284 ms
So the total time is 284 ms. Not great for 5 rows. Not to mention you get a barrage of warnings. We can do better.
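The warnings above come from chained indexing (df['Max Pips'][row] = ...), where pandas cannot tell whether the assignment reaches the original DataFrame. A minimal sketch of the same loop that assigns through a single .at call instead, assuming a shortened sample in which the price rows come from the question and the close times are made up for illustration:

```python
import pandas as pd

fmt = "%d/%m/%Y %H:%M"
symbol_data = pd.DataFrame(
    {
        "HIGH": [1.1216, 1.1216, 1.1216, 1.1216, 1.1217],
        "LOW": [1.1215, 1.1215, 1.1215, 1.1216, 1.1216],
    },
    index=pd.to_datetime(
        ["29/09/2016 00:01", "29/09/2016 00:02", "29/09/2016 00:03",
         "29/09/2016 00:04", "29/09/2016 00:05"],
        format=fmt,
    ),
)
df = pd.DataFrame(
    {
        "Open Time": pd.to_datetime(["29/09/2016 00:01", "29/09/2016 00:03"], format=fmt),
        # Hypothetical close times, chosen to fall inside the short sample
        "Close Time2": pd.to_datetime(["29/09/2016 00:04", "29/09/2016 00:05"], format=fmt),
        "Open Price": [1.1200, 1.1205],
    }
)
df["Max Pips"] = 0.0
df["Min Pips"] = 0.0

for label in df.index:
    # Label-based slice; both endpoints are included
    window = symbol_data.loc[df.at[label, "Open Time"]:df.at[label, "Close Time2"]]
    # One indexing call per assignment, so the write goes to df itself
    df.at[label, "Max Pips"] = window["HIGH"].max() - df.at[label, "Open Price"]
    df.at[label, "Min Pips"] = window["LOW"].min() - df.at[label, "Open Price"]
```

This removes the warnings but is still a row-by-row loop, so it should not be expected to change the timing much.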
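As I mentioned above, the blocker is the way you index over a date range. One way around this is to find all the index values in symbol_data that are also in df. This can be done with the pandas.Index.isin method.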
In [204]: shared_times = symbol_data[symbol_data.index.isin(df.index)].index
In [205]: shared_times
Out[205]:
Index(['29/09/2016 00:16', '29/09/2016 00:17', '29/09/2016 00:18',
'29/09/2016 00:19', '29/09/2016 00:20'],
dtype='object')
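Note that dtype='object' in the output above indicates that, in this session, the index holds plain strings rather than timestamps. String comparisons are lexicographic, which does not match chronological order for a day-first format once more than one day is involved. A small sketch of the difference, assuming the format string "%d/%m/%Y %H:%M" inferred from the sample data (the second date is made up):

```python
import pandas as pd

raw = pd.Index(["29/09/2016 00:16", "01/10/2016 00:17"])

# As strings, "01/10/2016" sorts before "29/09/2016" even though it is later.
string_order_wrong = raw[1] < raw[0]

# An explicit day-first format avoids ambiguous month/day parsing.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")
chronological = parsed[0] < parsed[1]

print(string_order_wrong, chronological, parsed.dtype)
```

Converting both DataFrames' time columns and indexes this way before comparing or slicing keeps the comparisons chronological.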
Now we can use your logic in a vectorized fashion (dropping the Max Pips and Min Pips columns first to keep the experiment clean):
In [207]: def calc_time_vec():
              df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
              df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [208]: %time calc_time_vec()
CPU times: user 2.98 ms, sys: 167 µs, total: 3.15 ms
Wall time: 3.04 ms
That took only 3.15 ms, roughly a 90x speedup! Or, if you want to be very conservative about the improvement, we can move the assignment of shared_times into the function itself.
In [210]: def calc_time_vec():
              shared_times = symbol_data[symbol_data.index.isin(df.index)].index
              df['Max Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['HIGH'].max() - df['Open Price']
              df['Min Pips'] = symbol_data.loc[(shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])]['LOW'].min() - df['Open Price']
In [211]: %time calc_time_vec()
CPU times: user 3.23 ms, sys: 171 µs, total: 3.4 ms
Wall time: 3.28 ms
Our improvement is still around 84x, which is still pretty good. That said, we can improve the function further. We compute the boolean array for the .loc argument twice. Let's fix that.
In [213]: def calc_time_vec():
              shared_times = symbol_data[symbol_data.index.isin(df.index)].index
              bool_arr = (shared_times >= df['Open Time']) & (shared_times <= df['Close Time2'])
              df['Max Pips'] = symbol_data.loc[bool_arr]['HIGH'].max() - df['Open Price']
              df['Min Pips'] = symbol_data.loc[bool_arr]['LOW'].min() - df['Open Price']
In [214]: %time calc_time_vec()
CPU times: user 2.83 ms, sys: 134 µs, total: 2.96 ms
Wall time: 2.87 ms
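OK. Now we are down to 2.96 ms, roughly a 96x improvement over the original function.
I hope this illustrates how to go about vectorizing and improving a more complex function like this one. Often, even when code is mostly vectorized, there are still gains to be found by using built-in pandas or NumPy methods and by making sure you don't repeat yourself.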
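One caveat worth noting: a comparison such as shared_times >= df['Open Time'] is elementwise, so both sides must have the same length (this is the "Lengths must match to compare" error in the question's edit), and calling .max() on a single selection yields one scalar rather than one value per trade. A possible way to get a per-trade maximum and minimum without a Python-level loop is to locate each trade's window with searchsorted and reduce every window with np.maximum.reduceat and np.minimum.reduceat. This is only a sketch under assumptions, not the method from either answer: window_extremes is a hypothetical helper name, symbol_data is assumed to have a sorted DatetimeIndex, the time columns are assumed to be datetimes already, and the close times in the sample are made up.

```python
import numpy as np
import pandas as pd


def window_extremes(symbol_data, open_times, close_times):
    """Per-trade max of HIGH and min of LOW over [open, close], inclusive.

    Hypothetical helper; assumes symbol_data has a sorted DatetimeIndex.
    """
    times = symbol_data.index.to_numpy()
    opens = np.asarray(open_times, dtype=times.dtype)
    closes = np.asarray(close_times, dtype=times.dtype)
    start = np.searchsorted(times, opens, side="left")   # first row >= open
    stop = np.searchsorted(times, closes, side="right")  # one past last row <= close

    # Interleave as [start0, stop0, start1, stop1, ...]; reduceat then reduces
    # each [start_i:stop_i) window at the even output positions.
    pairs = np.column_stack([start, stop]).ravel()

    # Pad by one element so that stop == len(times) is a valid reduceat index.
    # The pad value never reaches a kept, non-empty window.
    high = np.append(symbol_data["HIGH"].to_numpy(dtype=float), 0.0)
    low = np.append(symbol_data["LOW"].to_numpy(dtype=float), 0.0)

    max_high = np.maximum.reduceat(high, pairs)[::2]
    min_low = np.minimum.reduceat(low, pairs)[::2]

    # A window containing no rows has no meaningful extreme.
    empty = start >= stop
    return np.where(empty, np.nan, max_high), np.where(empty, np.nan, min_low)


# First six price rows from the question's symbol_data
symbol_data = pd.DataFrame(
    {
        "HIGH": [1.1216, 1.1216, 1.1216, 1.1216, 1.1217, 1.1217],
        "LOW": [1.1215, 1.1215, 1.1215, 1.1216, 1.1216, 1.1216],
    },
    index=pd.date_range("2016-09-29 00:01", periods=6, freq="min"),
)
df = pd.DataFrame(
    {
        "Open Time": pd.to_datetime(
            ["2016-09-29 00:01", "2016-09-29 00:03", "2016-09-29 00:05"]),
        # Hypothetical close times that fit inside the short sample
        "Close Time2": pd.to_datetime(
            ["2016-09-29 00:04", "2016-09-29 00:06", "2016-09-29 00:06"]),
        "Open Price": [1.1200, 1.1205, 1.0225],
    }
)

max_high, min_low = window_extremes(symbol_data, df["Open Time"], df["Close Time2"])
df["Max Pips"] = max_high - df["Open Price"].to_numpy()
df["Min Pips"] = min_low - df["Open Price"].to_numpy()
```

Overlapping or unordered trades are fine here, because each window is reduced independently and the values at the odd output positions are simply discarded. Unlike the boolean-mask version, the number of trades does not need to match the number of rows in symbol_data.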