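I have a huge df (1 billion rows) of tick data, and I am interested in the volume data. I compute hourly sums with groupby / resample, and I want to express those amounts as a percentage of the daily total. I can compute the daily sums with groupby / resample as well. My question is how to divide the hourly sums by the daily sums without resampling the daily values to hourly and forward-filling them. If I divide the hourly df by the daily one, does pandas broadcast based on the index? Thanks.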
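Answer 0 (score: 1)
NumPy option, for faster results:
Get the hourly volume as a NumPy array:
# selecting 'qty' before sum() avoids summing every column; pandas >= 2.2 prefers the lowercase alias 'h' over 'H'
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0))['qty'].sum().to_numpy()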
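In my case the ticks do not cover a whole number of days, so trim the partial days at both ends to keep a multiple of 24 hours:
h = hourly_vol[22:-16]  # drop the first 22 and the last 16 hourly values
Now that we have (24 * n) values, reshape them into rows of 24 hours, one row per day:
a = h.reshape(-1, 24)  # h is already a NumPy array, so .to_numpy() must not be called on it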
Get the total volume for each day:
dsum = a.sum(axis=1)
Broadcast it to the shape of the 24-column array:
b = np.array([dsum]*24).transpose()  # this may take a while
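Compute the result:
result = a/b
And flatten it so it can be inserted back into the original DataFrame:
result = result.reshape(-1)  # 240 values for this sample (10 full days * 24 hours)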
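Note: remember that in this case I dropped the first 22 and the last 16 rows, so the result has to be written into the matching rows of the hourly DataFrame (df below must be indexed by hour, with a unique index):
df.loc[df.index[22:-16], 'result'] = result  # a single .loc call; df.iloc[...]['result'] = ... assigns to a temporary copy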
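The NumPy steps above can be condensed into a minimal sketch. It assumes the hourly values already cover whole days; the random `hourly` array is hypothetical data used only for illustration, and `keepdims=True` is swapped in for the explicit `np.array([dsum]*24).transpose()` so that no 24-column copy of the daily totals has to be built.

```python
import numpy as np

# Hypothetical hourly volumes for 3 full days (72 values), for illustration only.
rng = np.random.default_rng(0)
hourly = rng.uniform(100.0, 1000.0, size=3 * 24)

# One row per day, one column per hour.
by_day = hourly.reshape(-1, 24)

# keepdims=True keeps the shape (n_days, 1), which broadcasts against (n_days, 24).
daily_total = by_day.sum(axis=1, keepdims=True)

# Each hour's share of its own day, flattened back to one value per hour.
share = (by_day / daily_total).reshape(-1)
```

Each block of 24 consecutive values in `share` should sum to 1, which is a cheap sanity check before writing the result back.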
Pandas solution (not suitable for very large datasets):
Short pandas answer:
daily_vol = stock_df.groupby(pd.Grouper(freq='D', level=0))['qty'].sum()
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0))['qty'].sum()
# reindex() instead of daily_vol[...]: pandas >= 1.0 raises KeyError for labels missing from the daily index
totals_col = daily_vol.reindex(pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")).ffill().bfill()
result = hourly_vol/totals_col
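On the question of whether pandas broadcasts by index: arithmetic between two Series aligns on index labels, so dividing the hourly series directly by the daily one yields NaN everywhere except at the midnight labels both share. One way to avoid building a separate forward-filled column is `groupby(...).transform("sum")`. The following is a sketch on hypothetical random data, not the answer's own method; the names `day_key` and `pct_of_day` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly volumes over the same span as the sample data.
idx = pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")
rng = np.random.default_rng(0)
hourly_vol = pd.Series(rng.uniform(100.0, 1000.0, size=len(idx)), index=idx, name="qty")

# normalize() floors every timestamp to midnight: one group key per calendar day.
day_key = hourly_vol.index.normalize()

# Naive division aligns on labels: only timestamps present in both indexes get a value.
daily_vol = hourly_vol.groupby(day_key).sum()
naive = hourly_vol / daily_vol

# transform("sum") repeats each day's total on every hourly row of that day.
daily_total_per_hour = hourly_vol.groupby(day_key).transform("sum")
pct_of_day = hourly_vol / daily_total_per_hour
```

Here `naive` is non-NaN only at the midnight rows, while `pct_of_day` has a value for every hour, including the partial first and last days, whose shares are taken relative to the hours actually observed.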
Explanation: we start from tick data like the following, which still needs a time index (for example, BTC/USDT trades from binance.com):
df.head(3):
id price qty quoteQty time isBuyerMaker isBestMatch grouper tick_rule dollar_bt abs_theta
0 334736000 9663.87 0.015233 147.209732 2020-06-04 02:37:29.688 False True 0.0 0.0 -147.209732 2.557702e+08
1 334736001 9663.51 0.004417 42.683724 2020-06-04 02:37:29.805 True True 0.0 0.0 -42.683724 2.557701e+08
2 334736002 9663.73 0.016810 162.447301 2020-06-04 02:37:29.813 False True 0.0 1.0 162.447301 2.557703e+08
Get the time index:
df['time'] = pd.to_datetime(df['time'], unit='ms')
stock_df = df.set_index('time')
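As a small illustration of this step: `unit='ms'` applies when the `time` column holds epoch milliseconds as numbers. The three-row `ticks` frame below is hypothetical, with the epoch values chosen to match the timestamps in the sample above; sorting the index afterwards is an extra precaution that the original does not include.

```python
import pandas as pd

# Hypothetical raw ticks; 'time' is assumed to hold epoch milliseconds.
ticks = pd.DataFrame(
    {
        "time": [1591238249688, 1591238249805, 1591238249813],
        "qty": [0.015233, 0.004417, 0.016810],
    }
)

# Convert epoch milliseconds to timestamps and use them as the index.
ticks["time"] = pd.to_datetime(ticks["time"], unit="ms")
stock = ticks.set_index("time").sort_index()
```

With a `DatetimeIndex` in place, `pd.Grouper(freq=..., level=0)` can group the rows by day or by hour.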
Daily totals:
daily_vol = stock_df.groupby(pd.Grouper(freq='D', level=0))['qty'].sum()
time
2020-06-04 53696.704657
2020-06-05 47788.050050
2020-06-06 32752.950893
2020-06-07 57952.848385
2020-06-08 40664.664125
2020-06-09 46024.001289
2020-06-10 47130.762982
2020-06-11 94418.984730
2020-06-12 50119.066932
2020-06-13 27759.784851
2020-06-14 30055.506608
2020-06-15 57688.820941
Freq: D, Name: qty, dtype: float64
Hourly totals:
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0))['qty'].sum()
time
2020-06-04 02:00:00 447.253335
2020-06-04 03:00:00 1631.115302
2020-06-04 04:00:00 1703.933586
2020-06-04 05:00:00 1165.990115
2020-06-04 06:00:00 1441.345409
...
2020-06-15 11:00:00 2492.983349
2020-06-15 12:00:00 1971.762135
2020-06-15 13:00:00 3724.376480
2020-06-15 14:00:00 4531.290738
2020-06-15 15:00:00 811.775574
Freq: H, Name: qty, Length: 278, dtype: float64
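To get each hour's share of its day, every hourly row needs the daily total next to it, so reindex the daily totals onto the hourly range and fill the gaps. Note that the range starts at 02:00 on the first day, so those rows have no earlier daily label to forward-fill from and are back-filled with the next day's total (47788.050050 rather than 53696.704657):
totals_col = daily_vol.reindex(pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")).ffill().bfill()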
2020-06-04 02:00:00 47788.050050
2020-06-04 03:00:00 47788.050050
2020-06-04 04:00:00 47788.050050
2020-06-04 05:00:00 47788.050050
2020-06-04 06:00:00 47788.050050
...
2020-06-15 11:00:00 57688.820941
2020-06-15 12:00:00 57688.820941
2020-06-15 13:00:00 57688.820941
2020-06-15 14:00:00 57688.820941
2020-06-15 15:00:00 57688.820941
Freq: 60T, Name: qty, Length: 278, dtype: float64
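In the output above, the 2020-06-04 hours carry 47788.050050, which is the total of 2020-06-05: the daily label 2020-06-04 00:00 falls outside the hourly range, so forward-filling finds nothing and the back-fill pulls in the next day. A possible alternative, sketched here with the daily totals copied from the earlier output, is to let `reindex` do the forward fill itself via `method="ffill"`, which looks back to the most recent daily label at or before each hour.

```python
import pandas as pd

# Daily totals copied from the daily_vol output shown earlier.
daily_vol = pd.Series(
    [53696.704657, 47788.050050, 32752.950893, 57952.848385,
     40664.664125, 46024.001289, 47130.762982, 94418.984730,
     50119.066932, 27759.784851, 30055.506608, 57688.820941],
    index=pd.date_range("2020-06-04", "2020-06-15", freq="D"),
    name="qty",
)
hours = pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")

# Fill after reindexing: the first partial day is back-filled from 2020-06-05.
totals_bfilled = daily_vol.reindex(hours).ffill().bfill()

# Fill while reindexing: every hour takes the total of its own calendar day.
totals_own_day = daily_vol.reindex(hours, method="ffill")

# Count the hourly rows where the two approaches disagree.
n_different = int((totals_bfilled != totals_own_day).sum())
```

The two series differ only on the 22 hourly rows of 2020-06-04; from 2020-06-05 onward they are identical.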
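Each hour's share of the day can now be computed (as a fraction of the daily total; multiply by 100 for a percentage):
hourly_vol/totals_col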
time
2020-06-04 02:00:00 0.009359
2020-06-04 03:00:00 0.034132
2020-06-04 04:00:00 0.035656
2020-06-04 05:00:00 0.024399
2020-06-04 06:00:00 0.030161
...
2020-06-15 11:00:00 0.043214
2020-06-15 12:00:00 0.034179
2020-06-15 13:00:00 0.064560
2020-06-15 14:00:00 0.078547
2020-06-15 15:00:00 0.014072
Freq: H, Name: qty, Length: 278, dtype: float64