I have the following code, which updates the current row based on the status of the previous row:
prev_status = 0
for idx, row in df.iterrows():
    # Only touch rows whose predecessor had status 1 or 2
    if prev_status in [1, 2] and row[column_a] != 0:
        row[column_b] += row[column_a]
        row[column_c] = 0
        row[column_d] = 0
        row[column_a] = 0
    # Remember this row's status for the next iteration
    prev_status = row[status]
    df.loc[idx] = row
However, this is very slow when run on 1 GB of data. Is there any way to optimize it?
Answer 0 (score: 0)
For example, use shift:
df["new_column"] = df["column_name"].shift(x)
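This creates a new column whose values are those of another column shifted by x rows. Once the previous row's value sits in the same row, you can do a vectorized computation on whole columns, which is much faster than applying a function to every row of the DataFrame.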
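As a small illustration of how shift lines up each row with its predecessor, here is a minimal sketch. The sample data and the names df_demo and prev_is_1_or_2 are made up for the example; only the status column comes from the question. It assumes the first row should be treated as if the previous status were 0, as in the question's loop.

```python
import pandas as pd

# Hypothetical sample data; only the "status" column name comes from the question.
df_demo = pd.DataFrame({"status": [0, 1, 2, 0, 1]})

# shift(1) moves every value down one row, so each row sees the previous row's status.
# The first row has no predecessor; fill_value=0 mirrors the loop's initial prev_status = 0.
df_demo["previous_status"] = df_demo["status"].shift(1, fill_value=0)

# A vectorized membership test replaces the per-row `prev_status in [1, 2]` check.
df_demo["prev_is_1_or_2"] = df_demo["previous_status"].isin([1, 2])
```

Without fill_value, the first row of the shifted column would be NaN and the column would be converted to float; isin([1, 2]) would still evaluate to False for that row.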
Answer 1 (score: 0)
Try this:
df['previous_status'] = df['status'].shift(1)
# `in` does not work element-wise on a Series, so use isin();
# the comparison needs parentheses because & binds tighter than !=
mask = df['previous_status'].isin([1, 2]) & (df['column_a'] != 0)
df.loc[mask, 'column_b'] += df.loc[mask, 'column_a']
# Zero the three columns only after column_a has been added to column_b
df.loc[mask, ['column_a', 'column_c', 'column_d']] = 0
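If you want to convince yourself that the vectorized version matches the original loop, a rough check along these lines might help. The function names update_loop and update_vectorized and the sample data are hypothetical. The sketch assumes the status column is never modified inside the loop (true for the code in the question), that the column names are the literal strings used here, and that the index is unique.

```python
import pandas as pd


def update_loop(df):
    """Reference version: the row-by-row logic from the question (slow)."""
    out = df.copy()
    prev_status = 0
    for idx, row in out.iterrows():
        if prev_status in [1, 2] and row["column_a"] != 0:
            row["column_b"] += row["column_a"]
            row["column_c"] = 0
            row["column_d"] = 0
            row["column_a"] = 0
        prev_status = row["status"]
        out.loc[idx] = row
    return out


def update_vectorized(df):
    """Same update expressed with shift() and a boolean mask."""
    out = df.copy()
    # Previous row's status; 0 for the first row, like the loop's initial value.
    prev_status = out["status"].shift(1, fill_value=0)
    mask = prev_status.isin([1, 2]) & (out["column_a"] != 0)
    # Add column_a into column_b first, then zero the other columns.
    out.loc[mask, "column_b"] += out.loc[mask, "column_a"]
    out.loc[mask, ["column_a", "column_c", "column_d"]] = 0
    return out


# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "status":   [1, 0, 2, 2, 0],
    "column_a": [5, 3, 0, 4, 7],
    "column_b": [1, 1, 1, 1, 1],
    "column_c": [9, 9, 9, 9, 9],
    "column_d": [8, 8, 8, 8, 8],
})

result = update_vectorized(sample)
expected = update_loop(sample)
```

The shift works here only because status itself is never changed by the update, so the previous row's status is known up front. If the loop wrote to status as well, each row would depend on the result of the row before it and this approach would no longer be equivalent.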