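I have a complex function that takes forever to run on a large pandas df, and I can't find a way to speed it up. Do you have any tips? I've used numba, but that is clearly not enough. I also tried to use index references to get the most out of pandas, but I'm sure there are other approaches I haven't applied.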
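What the function basically does is take a df of events that occur at random time intervals and normalize it to events at one-second intervals. There are three different event types (TRADE, BEST_BID, BEST_ASK), so for every second I should have three rows (one per event type). If no event of a given type occurred during that second, the previous value for that type is reused.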
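To make the expected output concrete, here is a minimal sketch of the "reuse the previous value" rule on made-up data. It assumes the events have already been reduced to at most one row per second and event type; the `second` column, the `fill_missing_events` helper, and the sample prices are hypothetical and only serve the illustration.

```python
import pandas as pd

BAT_TYPES = ["TRADE", "BEST_BID", "BEST_ASK"]

# Made-up per-second values: BEST_ASK is missing at second 1,
# and TRADE is missing at second 2.
agg = pd.DataFrame({
    "second": [0, 0, 0, 1, 1, 2, 2],
    "B.A.T": ["TRADE", "BEST_BID", "BEST_ASK",
              "TRADE", "BEST_BID", "BEST_BID", "BEST_ASK"],
    "price": [10.0, 9.9, 10.1, 10.2, 10.0, 10.1, 10.3],
})


def fill_missing_events(agg, seconds, bat_types=BAT_TYPES):
    """Return one row per (second, event type), reusing the previous value for gaps."""
    full_index = pd.MultiIndex.from_product(
        [seconds, bat_types], names=["second", "B.A.T"]
    )
    # Reindexing creates the missing (second, type) rows with NaN values
    out = agg.set_index(["second", "B.A.T"]).reindex(full_index)
    # Forward-fill within each event type so a BID value never leaks into an ASK row
    out = out.groupby(level="B.A.T").ffill()
    return out.reset_index()


normalized = fill_missing_events(agg, seconds=range(3))
print(normalized)
```

With this input, the BEST_ASK row for second 1 carries over the second-0 price, and the TRADE row for second 2 carries over the second-1 price, giving three rows for every second.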
Thanks for your help!
import numba
import pandas

@numba.jit  # my numba attempt; it cannot compile the pandas calls below, so it gains little
def convertTicksToSeconds(dataFrame_df):
    data = dataFrame_df['data_all']
    # Index labels of the rows where a new second starts
    idxs = data[data['time_change']].index.tolist()
    previous_idx = 0
    progress_i = 0
    # Rows of the normalized data; turned into a df at the end
    rows = []
    BAT_type = ['TRADE', 'BEST_BID', 'BEST_ASK']
    tmp_time = data['timestamp'][0]
    # Last known row per event type, reused when a second has no event of that type
    data_TRADE = {'timestamp': tmp_time, 'B.A.T': 'TRADE', 'price': 0, 'volume': 0, 'asset': data['asset'][0]}
    data_BID = {'timestamp': tmp_time, 'B.A.T': 'BEST_BID', 'price': 0, 'volume': 0, 'asset': data['asset'][0]}
    data_ASK = {'timestamp': tmp_time, 'B.A.T': 'BEST_ASK', 'price': 0, 'volume': 0, 'asset': data['asset'][0]}
    # Walk the second boundaries once and refresh all three event types for each,
    # so that previous_idx advances in step with the rows that are emitted
    for idx in idxs:
        # Ticks since the previous second boundary
        window = data[previous_idx:idx-1]
        for BAT in BAT_type:
            events = window[window['B.A.T'] == BAT]
            if not events.empty:
                timestamp = data['timestamp'][idx]
                price = events['price']
                volume = events['volume']
                total_volume = volume.sum()
                # Volume-weighted average price over the interval
                weighted_price = (price * volume).sum() / total_volume
                volume = volume.mean()
                asset = data['asset'][idx]
                if BAT == 'TRADE':
                    data_TRADE = {'timestamp': timestamp, 'B.A.T': BAT, 'price': weighted_price, 'volume': volume, 'asset': asset}
                elif BAT == 'BEST_BID':
                    data_BID = {'timestamp': timestamp, 'B.A.T': BAT, 'price': weighted_price, 'volume': volume, 'asset': asset}
                elif BAT == 'BEST_ASK':
                    data_ASK = {'timestamp': timestamp, 'B.A.T': BAT, 'price': weighted_price, 'volume': volume, 'asset': asset}
        print(data_TRADE)
        print(data_BID)
        print(data_ASK)
        # One row per event type for this second
        rows.append(data_TRADE)
        rows.append(data_BID)
        rows.append(data_ASK)
        previous_idx = idx
        progress_i += 1
        tmp = (progress_i / len(idxs)) * 100
        print('Progress : ' + str(tmp) + ' %')
    return pandas.DataFrame(rows, columns=['timestamp', 'B.A.T', 'price', 'volume', 'asset'])
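For comparison, the per-interval aggregation in the loop above (volume-weighted price and mean volume per event type) can probably be expressed as a single groupby instead of row-by-row slicing. This is only a sketch under assumptions: it takes `timestamp` to be a datetime column and buckets by flooring it to the second rather than using the `time_change` flag, and `aggregate_per_second`, the helper columns `second` and `pv`, and the sample ticks are all hypothetical.

```python
import pandas as pd


def aggregate_per_second(ticks):
    """Volume-weighted price and mean volume per (second, event type), without a Python loop."""
    df = ticks.copy()
    # Assumption: 'timestamp' is datetime64; floor each tick to its second
    df["second"] = df["timestamp"].dt.floor("s")
    # Price times volume, summed per group below to build the weighted average
    df["pv"] = df["price"] * df["volume"]
    grouped = df.groupby(["second", "B.A.T"])
    out = pd.DataFrame({
        "price": grouped["pv"].sum() / grouped["volume"].sum(),
        "volume": grouped["volume"].mean(),
    })
    return out.reset_index()


# Made-up ticks: two trades and one bid in the first second, one trade in the next
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-01-01 09:00:00.1", "2015-01-01 09:00:00.6",
        "2015-01-01 09:00:00.7", "2015-01-01 09:00:01.2",
    ]),
    "B.A.T": ["TRADE", "TRADE", "BEST_BID", "TRADE"],
    "price": [10.0, 12.0, 9.0, 11.0],
    "volume": [1.0, 3.0, 5.0, 2.0],
})

per_second = aggregate_per_second(ticks)
print(per_second)
```

The result only contains the (second, event type) pairs that actually occurred, so seconds without an event of some type would still need the previous value carried forward afterwards. Whether this is faster on the real data would have to be measured, but it avoids re-filtering the whole frame on every iteration.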