Question

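I have a large DataFrame that I need to iterate over, but for very large DataFrames this takes a long time. I know that iteration is slow and vectorization is faster, but I don't know how to rewrite an iterative loop in vectorized form.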

My DataFrame looks like this:

print(df_toe.head(10))

   z_toe  dn50_toe  Nod  ht/h  output_ok
0   -3.5  0.067171  NaN   NaN        1.0
1   -3.5  0.082472  NaN   NaN        1.0
2   -3.5  0.095543  NaN   NaN        1.0
3   -3.5  0.196341  NaN   NaN        1.0
4   -3.5  0.232024  NaN   NaN        1.0
5   -3.5  0.347270  NaN   NaN        1.0
6   -3.5  0.353661  NaN   NaN        1.0
7   -3.5  0.404841  NaN   NaN        1.0
8   -3.5  0.632502  NaN   NaN        1.0
9   -3.5  0.922923  NaN   NaN        1.0

With some additional parameters:

z_bed = -4.5 
swl = 1.8

The loop that iterates over the DataFrame df_toe is written as follows:

def dftoe_det_2nd(df_toe):

    for i in df_toe.index:
        # Define input variables
        # (.at replaces get_value/set_value, which were removed in pandas 1.0)
        z_toe = df_toe.at[i, 'z_toe']
        dn50_toe = df_toe.at[i, 'dn50_toe']

        # Define restrictions between which it can operate for z_toe/h
        h = swl - z_bed
        ht = swl - z_toe
        df_toe.at[i, 'ht/h'] = abs(ht / h)

        if z_toe < z_bed:
            df_toe.at[i, 'output_ok'] = 0

        # Show all water heights (Nodtoe() is defined elsewhere, not shown here)
        df_toe.at[i, 'Nod'] = Nodtoe()

        if 0.90 < abs(ht / h) or 0.4 > abs(ht / h):
            df_toe.at[i, 'output_ok'] = 0

        if h > 25:
            df_toe.at[i, 'output_ok'] = 0

    df_toe = df_toe[df_toe['output_ok'] == 1]
    del df_toe['output_ok']
    return df_toe

Does anyone know how to optimize this for speed and computation time?

Answer 1

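You can follow https://stackoverflow.com/a/28490706/3528612 and try OpenMP in the loop. Alternatively, if you have enough resources (i.e. more processors), you can try mpi4py and parallelize the loop in small chunks to speed up processing.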
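
Before parallelizing, it may be worth checking whether the loop can be vectorized directly, since every condition in it depends only on the current row and on the constants z_bed and swl. The following is a minimal sketch under a few assumptions: the function name dftoe_det_vectorized is hypothetical, the Nod column is left untouched because Nodtoe() is not shown in the question, and the input is copied rather than modified in place.

```python
import numpy as np
import pandas as pd

z_bed = -4.5
swl = 1.8


def dftoe_det_vectorized(df_toe, z_bed=z_bed, swl=swl):
    """Vectorized sketch of the loop in the question (hypothetical name).

    The 'Nod' column is not filled in, because Nodtoe() is not shown.
    """
    out = df_toe.copy()                  # assumption: do not modify the input

    h = swl - z_bed                      # scalar, identical for every row
    ht = swl - out['z_toe']              # Series, one value per row
    ratio = (ht / h).abs()
    out['ht/h'] = ratio

    # Same rejection rules as the loop, evaluated for all rows at once.
    rejected = (
        (out['z_toe'] < z_bed)
        | (ratio > 0.90)
        | (ratio < 0.4)
        | (h > 25)                       # scalar condition, applied to every row
    )

    # Keep rows that were flagged ok on input and were not rejected above.
    keep = (out['output_ok'] == 1) & ~rejected
    return out.loc[keep].drop(columns='output_ok')


# Small made-up frame: the last two rows violate the restrictions.
df_toe = pd.DataFrame({
    'z_toe': [-3.5, -3.5, -5.0, 1.0],
    'dn50_toe': [0.067171, 0.082472, 0.095543, 0.196341],
    'Nod': np.nan,
    'ht/h': np.nan,
    'output_ok': 1.0,
})
result = dftoe_det_vectorized(df_toe)
print(result)
```

If Nodtoe() can itself be expressed with column arithmetic, it could be assigned the same way as 'ht/h'; if it cannot, that single step may still need a loop or a parallel approach like the ones mentioned above.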
