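I have a large DataFrame that I need to iterate over, but on very large frames this takes a long time. I know that row-by-row iteration is slow and that vectorized operations are faster, but I don't know how to rewrite my loop in vectorized form.
My DataFrame looks like this: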
print(df_toe.head(10))
   z_toe  dn50_toe  Nod  ht/h  output_ok
0   -3.5  0.067171  NaN   NaN        1.0
1   -3.5  0.082472  NaN   NaN        1.0
2   -3.5  0.095543  NaN   NaN        1.0
3   -3.5  0.196341  NaN   NaN        1.0
4   -3.5  0.232024  NaN   NaN        1.0
5   -3.5  0.347270  NaN   NaN        1.0
6   -3.5  0.353661  NaN   NaN        1.0
7   -3.5  0.404841  NaN   NaN        1.0
8   -3.5  0.632502  NaN   NaN        1.0
9   -3.5  0.922923  NaN   NaN        1.0
There are also a few additional parameters:
z_bed = -4.5
swl = 1.8
The loop that iterates over df_toe is written as follows:
def dftoe_det_2nd(df_toe):
    for i in df_toe.index:
        # Define input variables
        z_toe = df_toe.at[i, 'z_toe']
        dn50_toe = df_toe.at[i, 'dn50_toe']

        # Define restrictions between which it can operate for z_toe/h
        h = swl - z_bed
        ht = swl - z_toe
        df_toe.at[i, 'ht/h'] = abs(ht / h)
        if z_toe < z_bed:
            df_toe.at[i, 'output_ok'] = 0

        # Show all water heights
        df_toe.at[i, 'Nod'] = Nodtoe()
        if 0.90 < abs(ht / h) or 0.4 > abs(ht / h):
            df_toe.at[i, 'output_ok'] = 0
        if h > 25:
            df_toe.at[i, 'output_ok'] = 0

    # Keep only the rows that passed every check
    df_toe = df_toe[df_toe['output_ok'] == 1]
    del df_toe['output_ok']
    return df_toe
Does anyone know how to optimize this for speed and computation time?
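Answer 0 (score: 0)
You could follow https://stackoverflow.com/a/28490706/3528612 and try OpenMP in the loop. Alternatively, if you have enough resources (i.e. more processors), you could try mpi4py and parallelize the loop over small chunks to speed up processing.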
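Before reaching for parallelism, note that h is the same scalar for every row and every check in the loop is a plain comparison, so the loop can likely be replaced by boolean masks over whole columns. The following is only a minimal sketch under some assumptions: the function name dftoe_det_vectorized is made up for illustration, z_bed and swl are passed in as arguments instead of being read as globals, and the Nod column is left untouched because Nodtoe() is not defined in the question.

```python
import numpy as np
import pandas as pd


def dftoe_det_vectorized(df_toe, z_bed, swl):
    """Hypothetical vectorized counterpart of the loop in the question.

    Assumption: the 'Nod' column is not filled here, since Nodtoe()
    is not shown in the question.
    """
    out = df_toe.copy()               # leave the caller's frame unchanged
    h = swl - z_bed                   # scalar, identical for every row
    ht = swl - out["z_toe"]           # Series, one value per row
    ratio = (ht / h).abs()
    out["ht/h"] = ratio

    # Each mask marks rows that the loop would have set to output_ok = 0.
    below_bed = out["z_toe"] < z_bed
    out_of_range = (ratio > 0.90) | (ratio < 0.4)

    ok = (out["output_ok"] == 1) & ~below_bed & ~out_of_range
    if h > 25:
        # The depth check does not depend on the row, so it rejects all rows.
        ok = pd.Series(False, index=out.index)

    return out.loc[ok].drop(columns="output_ok")


# Small made-up frame in the same layout as the question's data.
demo = pd.DataFrame({
    "z_toe": [-3.5, -5.0, 1.0, -3.5],
    "dn50_toe": [0.067171, 0.082472, 0.095543, 0.196341],
    "Nod": np.nan,
    "ht/h": np.nan,
    "output_ok": [1.0, 1.0, 1.0, 0.0],
})
print(dftoe_det_vectorized(demo, z_bed=-4.5, swl=1.8))
```

If Nodtoe() turns out to depend only on column values, it may be possible to express it on whole columns in the same way; if it does not, that part would still need a per-row call.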