Question

for i in range(1, len(df)):
    if df.loc[i]["identification"] == df.loc[i-1]["identification"] and df.loc[i]["date"] == df.loc[i-1]["date"]:
        df.loc[i, "duplicate"] = 1
    else:
        df.loc[i, "duplicate"] = 0

This simple for loop runs very slowly on large DataFrames.

Any suggestions?

Answer 1

Try a vectorized approach instead of a loop:

import numpy as np

# Flag rows whose identification and date both match the previous row
df['duplicate'] = np.where(
    (df.identification == df.identification.shift())
    & (df.date == df.date.shift()),
    1, 0)
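
As a rough check of how the shift-based comparison behaves, here is a minimal sketch on a toy DataFrame. The sample values are made up for illustration; only the column names come from the question. Note that the first row has no previous row to compare against, so it ends up as 0 here, whereas the original loop never assigns it and leaves it as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "identification": [1, 1, 2, 2, 2],
    "date": ["2020-01-01", "2020-01-01", "2020-01-02",
             "2020-01-03", "2020-01-03"],
})

# shift() moves each column down by one row, so every row is compared
# with the row directly above it. The first row is compared with NaN.
same_id = df["identification"] == df["identification"].shift()
same_date = df["date"] == df["date"].shift()

# 1 where both columns match the previous row, 0 otherwise.
df["duplicate"] = np.where(same_id & same_date, 1, 0)

print(df)
```

Rows 1 and 4 (zero-based) are the only ones that repeat both values of the row above them, so they are the only ones flagged.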

Answer 2

It looks like you are just checking whether values are duplicated. In that case, you can use:

df.sort_values(by=['identification', 'date'], inplace=True)
df['duplicate'] = df.duplicated(subset=['identification', 'date']).astype(int)
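
One thing worth keeping in mind: `duplicated` flags every later occurrence of a key pair anywhere in the frame, not only adjacent rows, which is presumably why the sort is there. The sketch below uses made-up sample data to show the difference. It also assigns the sorted result to a new variable (`sorted_df`, a name chosen here for illustration) instead of using `inplace=True`; that is a style choice, not something the answer requires.

```python
import pandas as pd

# Hypothetical sample data where duplicates are NOT adjacent.
df = pd.DataFrame({
    "identification": [2, 1, 2, 1, 3],
    "date": ["2020-01-02", "2020-01-01", "2020-01-02",
             "2020-01-01", "2020-01-05"],
})
keys = ["identification", "date"]

# Without sorting: later occurrences are flagged in original row order.
unsorted_flags = df.duplicated(subset=keys).astype(int)

# Adjacent-row comparison on the unsorted data finds nothing,
# because no row repeats the row directly above it.
adjacent_flags = (df[keys] == df[keys].shift()).all(axis=1).astype(int)

# With sorting: equal key pairs become adjacent, so both views agree.
sorted_df = df.sort_values(by=keys)
sorted_df["duplicate"] = sorted_df.duplicated(subset=keys).astype(int)

print(sorted_df)
```

If the original row order matters, the flags computed without sorting may already be what you want, since `duplicated` does not depend on the rows being adjacent.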

How can I avoid a slow for loop when working with a pandas DataFrame?

2 answers: