Given the following two pandas DataFrames:
Dataframe 1
      open     high      low    close
0  340.649  340.829  340.374  340.511
1  340.454  340.843  340.442  340.843
2  340.521  340.751  340.241  340.474
3  340.197  340.698  340.145  340.420
4  340.332  340.609  340.123  340.128
5  340.092  340.462  339.993  340.207
6  340.179  340.437  339.810  339.983
7  340.296  340.498  339.977  340.036
8  340.461  340.641  340.189  340.367
9  340.404  340.820  340.338  340.589
Dataframe 2
       ohlc
0  0.374309
1  0.712707
2  0.791436
3  0.761050
4  0.779006
5  0.765193
6  0.578729
7  0.469613
8  0.385359
9  0.511050
and the following function, which takes both DataFrames, normalizes the first one, and compares it with the second:
import numpy as np

def normalizeAndCompare(df1, df2):
    # Scale all four price columns by the window's overall low/high range.
    highest = df1["high"].max()
    lowest = df1["low"].min()
    df1["high"] = ((df1["high"] - lowest) / (highest - lowest))
    df1["low"] = ((df1["low"] - lowest) / (highest - lowest))
    df1["open"] = ((df1["open"] - lowest) / (highest - lowest))
    df1["close"] = ((df1["close"] - lowest) / (highest - lowest))
    df1["ohlc"] = (df1["open"] + df1["high"] + df1["low"] + df1["close"]) / 4
    # Bands based on the rolling standard deviation (NaN until the window is full).
    df1["highstd"] = df1["high"] + df1["ohlc"].rolling(window=10).std()
    df1["lowstd"] = df1["low"] - df1["ohlc"].rolling(window=10).std()
    # Fallback bands of +/- 5% for the rows where the rolling std is NaN.
    df1["highpercent"] = df1["high"] + (df1["high"] * 0.05)
    df1["lowpercent"] = df1["low"] - (df1["low"] * 0.05)
    df1["highstd"] = df1['highstd'].fillna(value=df1['highpercent'])
    df1["lowstd"] = df1['lowstd'].fillna(value=df1['lowpercent'])
    # Count the rows of df2 that fall inside the band.
    result = (np.where(((df2["ohlc"] <= df1['highstd']) & (df2["ohlc"] >= df1['lowstd'])), 1, 0)).sum()
    return result
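How can I change this function so that it runs more efficiently and returns the same result faster?
Since I'm new to Python, I'd really appreciate some help. Here is my setup; perhaps there is room for improvement there as well. I run a loop over dataframe1: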
import pandas as pd

pd_result = pd.DataFrame(columns=('rowNr', 'result'))
batchSize = 10
for rowNr in range(len(dataframe1)):
    # Take the next batchSize rows and renumber them from 0.
    df1_temp = dataframe1[rowNr: rowNr + batchSize]
    df1_temp = df1_temp.reset_index(drop=True)
    result = normalizeAndCompare(df1_temp, dataframe2)
    pd_result.loc[rowNr] = [rowNr, result]
My final result should be pd_result. One more thing to note: dataframe1 is large, with several million rows.
Answer 0 (score: 2)
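Here is a fairly quick conversion from mostly pandas functions to mostly numpy functions (rolling is still done in pandas, but the rest is numpy). For about 10,000 rows, this is roughly 10 times faster.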
def norm_comp(df1, df2):
    # Pull the columns out as plain numpy arrays to skip pandas overhead.
    open_ = df1['open'].values  # trailing underscore avoids shadowing the built-in open()
    high = df1['high'].values
    low = df1['low'].values
    close = df1['close'].values

    highest = high.max()
    lowest = low.min()

    high = (high - lowest) / (highest - lowest)
    low = (low - lowest) / (highest - lowest)
    open_ = (open_ - lowest) / (highest - lowest)
    close = (close - lowest) / (highest - lowest)
    ohlc = (open_ + high + low + close) / 4

    # The rolling standard deviation is the one step still done in pandas.
    roll_std = pd.Series(ohlc).rolling(10).std().values

    # Where the rolling window is not yet full (NaN), fall back to a 5% band.
    highstd = np.where(np.isnan(roll_std), high * 1.05, high + roll_std)
    lowstd = np.where(np.isnan(roll_std), low * .95, low - roll_std)

    return np.where(((df2.ohlc.values <= highstd) &
                     (df2.ohlc.values >= lowstd)), 1, 0).sum()
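Since the goal is the same result in less time, it is worth checking that both versions agree before timing them. The following is a minimal sketch on the sample data from the question. Here normalize_and_compare is a condensed restatement of the question's function, not the original text, and it works on a copy so the input is not modified (an assumption added for this check).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "open":  [340.649, 340.454, 340.521, 340.197, 340.332,
              340.092, 340.179, 340.296, 340.461, 340.404],
    "high":  [340.829, 340.843, 340.751, 340.698, 340.609,
              340.462, 340.437, 340.498, 340.641, 340.820],
    "low":   [340.374, 340.442, 340.241, 340.145, 340.123,
              339.993, 339.810, 339.977, 340.189, 340.338],
    "close": [340.511, 340.843, 340.474, 340.420, 340.128,
              340.207, 339.983, 340.036, 340.367, 340.589],
})
df2 = pd.DataFrame({"ohlc": [0.374309, 0.712707, 0.791436, 0.761050, 0.779006,
                             0.765193, 0.578729, 0.469613, 0.385359, 0.511050]})


def normalize_and_compare(df1, df2):
    """Condensed restatement of the question's pandas-based function."""
    df1 = df1.copy()  # assumption: work on a copy instead of mutating the input
    highest, lowest = df1["high"].max(), df1["low"].min()
    for col in ("open", "high", "low", "close"):
        df1[col] = (df1[col] - lowest) / (highest - lowest)
    ohlc = (df1["open"] + df1["high"] + df1["low"] + df1["close"]) / 4
    std = ohlc.rolling(window=10).std()
    highstd = (df1["high"] + std).fillna(df1["high"] + df1["high"] * 0.05)
    lowstd = (df1["low"] - std).fillna(df1["low"] - df1["low"] * 0.05)
    return int(((df2["ohlc"] <= highstd) & (df2["ohlc"] >= lowstd)).sum())


def norm_comp(df1, df2):
    """The numpy-based version from this answer."""
    open_, high = df1["open"].values, df1["high"].values
    low, close = df1["low"].values, df1["close"].values
    highest, lowest = high.max(), low.min()
    high = (high - lowest) / (highest - lowest)
    low = (low - lowest) / (highest - lowest)
    open_ = (open_ - lowest) / (highest - lowest)
    close = (close - lowest) / (highest - lowest)
    ohlc = (open_ + high + low + close) / 4
    roll_std = pd.Series(ohlc).rolling(10).std().values
    highstd = np.where(np.isnan(roll_std), high * 1.05, high + roll_std)
    lowstd = np.where(np.isnan(roll_std), low * 0.95, low - roll_std)
    ref = df2["ohlc"].values
    return int(((ref <= highstd) & (ref >= lowstd)).sum())


slow = normalize_and_compare(df1, df2)
fast = norm_comp(df1, df2)
print(slow == fast)
```

Note that the fallback band is computed as high + high * 0.05 in one version and high * 1.05 in the other. The two can differ in the last floating-point digit, so a value sitting exactly on a band edge could in principle be counted differently.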
I expanded your sample data to 10,240 rows (10 rows doubled 10 times) as follows:
for i in range(10):
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df1 = pd.concat([df1, df1], ignore_index=True)
    df2 = pd.concat([df2, df2], ignore_index=True)
Here are the timings:
%timeit normalizeAndCompare(df1,df2)
100 loops, best of 3: 9.93 ms per loop
%timeit norm_comp(df1,df2)
1000 loops, best of 3: 957 µs per loop
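The timings above cover a single call. The question's setup calls the function once per 10-row window across millions of rows, so the per-call overhead may still dominate. One possible direction, sketched below, is to process all windows at once with numpy's sliding_window_view (available in numpy 1.20 and later). The function name rolling_norm_comp and its batch_size parameter are hypothetical. The sketch assumes that df2 has exactly batch_size rows and that only full windows are evaluated, so it returns len(df1) - batch_size + 1 results. It also assumes each window has high > low somewhere, otherwise the normalization divides by zero.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_norm_comp(df1, df2, batch_size=10):
    """Hypothetical helper: one count per full window of batch_size rows."""
    ref = df2["ohlc"].to_numpy()
    if len(ref) != batch_size:
        raise ValueError("this sketch assumes len(df2) == batch_size")

    # Each view has shape (n_windows, batch_size) and shares memory with df1.
    open_w, high_w, low_w, close_w = (
        sliding_window_view(df1[col].to_numpy(), batch_size)
        for col in ("open", "high", "low", "close")
    )

    # Per-window normalization range, kept 2-D so it broadcasts across columns.
    lowest = low_w.min(axis=1, keepdims=True)
    span = high_w.max(axis=1, keepdims=True) - lowest

    open_n = (open_w - lowest) / span
    high = (high_w - lowest) / span
    low = (low_w - lowest) / span
    close_n = (close_w - lowest) / span
    ohlc = (open_n + high + low + close_n) / 4

    # Rows 0 .. batch_size-2 of each window use the 5% fallback band.
    highstd = high * 1.05
    lowstd = low * 0.95

    # Only the last row of a window has a full rolling window behind it;
    # ddof=1 matches the pandas default for rolling().std().
    std = ohlc.std(axis=1, ddof=1)
    highstd[:, -1] = high[:, -1] + std
    lowstd[:, -1] = low[:, -1] - std

    # Count, per window, how many reference values fall inside the band.
    return ((ref <= highstd) & (ref >= lowstd)).sum(axis=1)


# Synthetic data for illustration only (not from the question).
rng = np.random.default_rng(0)
prices = np.sort(340 + rng.random((50, 4)), axis=1)  # each row ascending
df1 = pd.DataFrame({"low": prices[:, 0], "open": prices[:, 1],
                    "close": prices[:, 2], "high": prices[:, 3]})
df2 = pd.DataFrame({"ohlc": rng.random(10)})

results = rolling_norm_comp(df1, df2)
print(results.shape)  # (41,)
```

The intermediate arrays hold batch_size values per window, so for several million rows it may be necessary to run this over chunks of df1 that overlap by batch_size - 1 rows. No timing is claimed for this sketch; it would need to be measured against the loop on real data.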