Question

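I have this loop that iterates over a dataframe and builds a cumulative value. My dataframe has about 450k rows, and the loop takes more than 30 minutes to finish.

Here is the head of my dataframe: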

timestamp            open    high    low     close    volume   vol_thrs  flg
1970-01-01 09:30:59  136.01  136.08  135.94  136.030  5379100  0.0       0.0
1970-01-01 09:31:59  136.03  136.16  136.01  136.139  759900   0.0       0.0
1970-01-01 09:32:59  136.15  136.18  136.10  136.180  609000   0.0       0.0
1970-01-01 09:33:59  136.18  136.18  136.07  136.100  510900   0.0       0.0
1970-01-01 09:34:59  136.11  136.15  136.05  136.110  306400   0.0       0.0

The timestamp column is the index.

Any ideas on how I can make this faster?

for (i, (idx, row)) in enumerate(df.iterrows()):
    if i == 0:
        tmp_cum = df.loc[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.loc[idx, 'volume']

    if tmp_cum >= df.loc[idx, 'vol_thrs']:
        tmp_cum = 0
        df.loc[idx, 'flg'] = 1

Answer 1

Try using df.at instead of df.loc:

for (i, (idx, row)) in enumerate(df.iterrows()):
    if i == 0:
        tmp_cum = df.at[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.at[idx, 'volume']

    if tmp_cum >= df.at[idx, 'vol_thrs']:
        tmp_cum = 0
        df.at[idx, 'flg'] = 1

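In theory, df.at should perform better. df.at is the better choice when you access a single scalar value, which is exactly what your loop does. df.loc also supports slicing and label-based selection of multiple values, but for single-cell reads and writes df.at wins.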
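If df.at is still too slow, another option is to skip per-cell pandas access entirely and run the same loop over plain NumPy arrays. The snippet below is only a sketch of that idea, not something from the original question: the function name flag_volume_resets is made up for illustration, and it assumes the volume, vol_thrs and flg columns are numeric, as in the sample shown above. The running total still has to be computed row by row because it resets whenever the threshold is reached, but indexing NumPy arrays usually costs far less than df.loc or df.at lookups.

```python
import pandas as pd


def flag_volume_resets(df):
    """Hypothetical helper: same logic as the loop above, on NumPy arrays."""
    # Pull the columns out once instead of indexing the DataFrame per row.
    volume = df['volume'].to_numpy()
    vol_thrs = df['vol_thrs'].to_numpy()
    flg = df['flg'].to_numpy().copy()  # copy so the original column is not mutated in place

    tmp_cum = 0
    for i in range(len(volume)):
        tmp_cum += volume[i]
        if tmp_cum >= vol_thrs[i]:
            # Threshold reached: reset the running total and flag this row.
            tmp_cum = 0
            flg[i] = 1

    # Write the flags back in a single assignment.
    df['flg'] = flg
    return df
```

You would call it as df = flag_volume_resets(df). If that is still not fast enough, the inner loop is simple enough that it could probably be compiled with a tool such as Numba, but I have not measured that here.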
