I want to extract data grouped by timestamp, iterate over each group, perform a calculation, and store the results in a separate dataframe. This gets very slow when the data is large, even though it should be perfectly parallelizable since no group depends on any other. Here is the code I currently have:
def get_distance_from_mid_for_size(self, size):
    dfg = self._df.groupby(self._df.index)
    spread_df = pd.DataFrame(index=dfg.groups.keys())
    for timestamp, data in dfg:
        # sort by order-book level so the first row is the top of book
        data = data.sort_values(by='level')
        top = data.iloc[0]
        # size-weighted mid price from the top-of-book quotes
        mid_price = (top.bid_price * top.ask_size + top.ask_price * top.bid_size) \
            / (top.bid_size + top.ask_size)
        spread_df.loc[timestamp, 'bid_spread'] = mid_price - self.get_average_price(data, size, 'bid')
        spread_df.loc[timestamp, 'ask_spread'] = self.get_average_price(data, size, 'ask') - mid_price
        spread_df.loc[timestamp, 'mid_price'] = mid_price
    return spread_df
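If it helps, the per-group work can be pulled out into a single function and run through groupby().apply, which I understand is also the shape Dask's API expects. This is an untested sketch: spread_for_group is a name I made up, df stands in for self._df, and get_average_price stands in for my class method:

import pandas as pd

def spread_for_group(data, size):
    # data holds every order-book level for one timestamp
    data = data.sort_values(by='level')
    top = data.iloc[0]  # top of book after sorting by level
    mid_price = (top.bid_price * top.ask_size + top.ask_price * top.bid_size) \
        / (top.bid_size + top.ask_size)
    return pd.Series({
        'bid_spread': mid_price - get_average_price(data, size, 'bid'),
        'ask_spread': get_average_price(data, size, 'ask') - mid_price,
        'mid_price': mid_price,
    })

# equivalent to the loop above, but still serial
spread_df = df.groupby(df.index).apply(spread_for_group, size)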
I would like to use Dask's groupby on this, but it seems to be intended only for aggregation functions? I'm completely lost on how to implement the logic above with it.
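From the Dask docs I gather that groupby().apply with an explicit meta might do what I want, but I haven't gotten it working. This is only a sketch of what I imagine: the meta dtypes are my guess, npartitions=8 is arbitrary, and it reuses the spread_for_group function sketched above:

import dask.dataframe as dd

pdf = df.reset_index()
pdf.columns = ['timestamp', *pdf.columns[1:]]  # the group key must be a real column
ddf = dd.from_pandas(pdf, npartitions=8)       # npartitions chosen arbitrarily

meta = {'bid_spread': 'f8', 'ask_spread': 'f8', 'mid_price': 'f8'}
spread_df = (
    ddf.groupby('timestamp')
       .apply(spread_for_group, size, meta=meta)
       .compute(scheduler='processes')  # groups are independent, so a process pool should be safe
)

Is this roughly the right approach, or is there a better way to parallelize per-group computations like this?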
A link to sample data is here: https://www.dropbox.com/s/kkl9pxuf8oypmvz/sample.csv?dl=0