Question

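I have a huge df (1 billion rows) of tick data. I am interested in the volume data. I am computing hourly sums with groupby / resample, and I want to express those sums as a percentage of the daily total. I can compute the daily sums with groupby / resample as well. My question is how to divide the hourly sums by the daily sums without resampling the daily values to hourly and forward-filling. If I divide the hourly df by the daily one, does pandas broadcast based on the index? Thanks.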


Answer 1

NumPy option, for faster results:

Get the hourly volume into a NumPy array:

hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0)).sum()['qty'].to_numpy()

In my case, to keep a whole number of 24-hour days, drop the partial first and last days:

h = hourly_vol[22:-16]

Now we have (24 * n) values; split them into rows of 24, one row per day:

a = h.reshape(-1, 24)  # h is already a NumPy array, so no .to_numpy() here

Get the total for each day:

dsum = a.sum(axis=1)

Broadcast the daily totals to a 24-column array:

b = np.array([dsum]*24).transpose()  # this may take a while

Get the result:

result = a/b
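The reshape, sum and divide steps above can be condensed by letting NumPy do the broadcasting itself. The following is a minimal sketch, not the answer author's code: the `h` array here is synthetic (3 hypothetical full days), and `keepdims=True` is swapped in for the explicit tiling into `b`.

```python
import numpy as np

# Hypothetical hourly volumes for 3 full days (72 hours); not the author's data.
rng = np.random.default_rng(0)
h = rng.uniform(100.0, 5000.0, size=72)

a = h.reshape(-1, 24)                 # one row per day, one column per hour
dsum = a.sum(axis=1, keepdims=True)   # shape (n_days, 1) instead of (n_days,)
result = a / dsum                     # the (n_days, 1) column broadcasts across the 24 hours

# Same values as the explicit tiling step, without materialising b.
b = np.array([a.sum(axis=1)] * 24).transpose()
assert np.allclose(result, a / b)
```

Because `dsum` keeps a trailing axis of length 1, no copy of the daily totals 24 columns wide is ever allocated, which may matter at the data sizes mentioned in the question.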

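And reshape it for insertion into the original dataframe:

result = result.reshape(-1)  # back to 1-D (240 values in this example)

Note: remember that in this case I dropped the first 22 and the last 16 rows, so the result has to be written back into the matching rows of the dataframe (df here must have one row per hour):

df.loc[df.index[22:-16], 'result'] = result  # single .loc call; df.iloc[22:-16]['result'] = ... would assign to a copy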

Pandas solution (not suitable for very large datasets):

Pandas short answer:

daily_vol = stock_df.groupby(pd.Grouper(freq='D', level=0)).sum()['qty']
hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0)).sum()['qty']
totals_col = daily_vol.reindex(pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")).ffill().bfill()
result = hourly_vol/totals_col

Explanation: we have tick data like this (for example, BTC/USDT from binance.com), but it needs a time index:

df.head(3):
    id          price       qty         quoteQty    time                    isBuyerMaker isBestMatch    grouper     tick_rule   dollar_bt   abs_theta
0   334736000   9663.87     0.015233    147.209732  2020-06-04 02:37:29.688     False   True    0.0     0.0     -147.209732     2.557702e+08
1   334736001   9663.51     0.004417    42.683724   2020-06-04 02:37:29.805     True    True    0.0     0.0     -42.683724  2.557701e+08
2   334736002   9663.73     0.016810    162.447301  2020-06-04 02:37:29.813     False   True    0.0     1.0     162.447301  2.557703e+08

Get a time index:

df['time'] = pd.to_datetime(df['time'], unit='ms')
stock_df = df.set_index('time')

Daily totals:

daily_vol = stock_df.groupby(pd.Grouper(freq='D', level=0)).sum()['qty']

time
2020-06-04    53696.704657
2020-06-05    47788.050050
2020-06-06    32752.950893
2020-06-07    57952.848385
2020-06-08    40664.664125
2020-06-09    46024.001289
2020-06-10    47130.762982
2020-06-11    94418.984730
2020-06-12    50119.066932
2020-06-13    27759.784851
2020-06-14    30055.506608
2020-06-15    57688.820941
Freq: D, Name: qty, dtype: float64

Hourly totals:

hourly_vol = stock_df.groupby(pd.Grouper(freq='H', level=0)).sum()['qty']

time
2020-06-04 02:00:00     447.253335
2020-06-04 03:00:00    1631.115302
2020-06-04 04:00:00    1703.933586
2020-06-04 05:00:00    1165.990115
2020-06-04 06:00:00    1441.345409
                          ...     
2020-06-15 11:00:00    2492.983349
2020-06-15 12:00:00    1971.762135
2020-06-15 13:00:00    3724.376480
2020-06-15 14:00:00    4531.290738
2020-06-15 15:00:00     811.775574
Freq: H, Name: qty, Length: 278, dtype: float64

To get each hour's share of its day, every hourly row needs the daily total next to it, so we build that column first:

totals_col = daily_vol.reindex(pd.date_range("2020-06-04 02:00", "2020-06-15 15:00", freq="60min")).ffill().bfill()

2020-06-04 02:00:00    47788.050050
2020-06-04 03:00:00    47788.050050
2020-06-04 04:00:00    47788.050050
2020-06-04 05:00:00    47788.050050
2020-06-04 06:00:00    47788.050050
                           ...     
2020-06-15 11:00:00    57688.820941
2020-06-15 12:00:00    57688.820941
2020-06-15 13:00:00    57688.820941
2020-06-15 14:00:00    57688.820941
2020-06-15 15:00:00    57688.820941
Freq: 60T, Name: qty, Length: 278, dtype: float64
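One thing to watch in the output above: the rows for 2020-06-04 carry 47788.050050, which is the 2020-06-05 total in daily_vol. The hourly range starts at 02:00, so nothing exists to forward-fill from on the first day and the back-fill pulls in the next day's value. A possible way around this, sketched below with small made-up numbers (the series values and the `hourly_index` name are illustrative assumptions), is to let `reindex` do the forward fill directly, so each hour looks up the most recent daily label at or before it.

```python
import pandas as pd

# Hypothetical daily totals standing in for daily_vol (values are illustrative).
daily_vol = pd.Series(
    [100.0, 200.0, 300.0],
    index=pd.date_range("2020-06-04", periods=3, freq="D"),
    name="qty",
)

# Hourly index starting mid-day, like the 02:00 start in the answer.
hourly_index = pd.date_range("2020-06-04 02:00", "2020-06-06 15:00", freq="60min")

# method="ffill" picks, for each hour, the last daily label at or before it,
# so 2020-06-04 02:00 gets the 2020-06-04 total rather than the next day's.
totals_col = daily_vol.reindex(hourly_index, method="ffill")
```

This relies on the daily index being sorted, which is the case for the output of a time-based groupby.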

Now each hour's share of its day can be computed:

hourly_vol/totals_col

time
2020-06-04 02:00:00    0.009359
2020-06-04 03:00:00    0.034132
2020-06-04 04:00:00    0.035656
2020-06-04 05:00:00    0.024399
2020-06-04 06:00:00    0.030161
                         ...   
2020-06-15 11:00:00    0.043214
2020-06-15 12:00:00    0.034179
2020-06-15 13:00:00    0.064560
2020-06-15 14:00:00    0.078547
2020-06-15 15:00:00    0.014072
Freq: H, Name: qty, Length: 278, dtype: float64
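On the question of whether pandas broadcasts on the index: arithmetic between two Series aligns on matching labels rather than broadcasting, so dividing hourly by daily directly leaves NaN everywhere except at midnight. One way to avoid both the alignment problem and the reindex-and-fill step is a groupby with `transform`, which returns the per-day sum already shaped like the hourly series. The sketch below uses synthetic data and is an alternative technique, not the answer author's method; the names `day`, `daily_total`, `pct_of_day` and `aligned` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly volumes over 3 full days; not the author's data.
idx = pd.date_range("2020-06-04", periods=72, freq="60min")
hourly_vol = pd.Series(np.arange(1.0, 73.0), index=idx, name="qty")

# Group each hour by its calendar day (timestamp floored to midnight) and
# broadcast that day's sum back to every hour in the group.
day = hourly_vol.index.normalize()
daily_total = hourly_vol.groupby(day).transform("sum")
pct_of_day = hourly_vol / daily_total

# Direct division aligns on labels instead: only the midnight timestamps
# exist in both indexes, so every other hour comes out as NaN.
daily_vol = hourly_vol.groupby(day).sum()
aligned = hourly_vol / daily_vol
```

Here `pct_of_day` sums to 1 within each day, while `aligned` has a value only on the three midnight rows.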
