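I need to set every row in a dataframe based on its adjacent row on the side nearest to a given index. The context is that the dataframe is full of estimates, and each row can be corrected using the row next to it, so good results can be propagated outward starting from the best row.
Example code demonstrating the desired result: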
{{1}}
Is there a better way? I will be working with large datasets and would like to avoid Python loops entirely.
Answer 0: (score: 0)
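Here is an approach that is functionally very similar to the original, but it splits the array into two parts and indexes each part as a whole, in a vectorized manner, rather than working element by element: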
# Absolute differences from the reference row for the two parts (before it, and from it onwards); the cumulative sums are taken in the next step
p1 = np.abs(df['guess'][:correct_row_index] - df['guess'][correct_row_index])
p2 = np.abs(df['guess'][correct_row_index:] - df['guess'][correct_row_index])
# Perform cumsum, concatenate and store into the output column
df['cumulative_error'] = np.concatenate((p1[::-1].cumsum()[::-1],p2.cumsum()))
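To make the reverse-cumsum step easier to follow, here is a minimal sketch of the same idea on a plain NumPy array, traced on six values. The helper name `cumulative_error_from_anchor` and the sample data are assumptions made up for illustration; they do not come from the question.

```python
import numpy as np

def cumulative_error_from_anchor(guess, anchor):
    """Hypothetical helper: accumulate |guess[i] - guess[anchor]| outward from `anchor`."""
    guess = np.asarray(guess)
    diffs = np.abs(guess - guess[anchor])
    # Rows before the anchor: reverse, cumsum, reverse back, so the sum
    # grows as we move away from the anchor towards row 0.
    before = diffs[:anchor][::-1].cumsum()[::-1]
    # Rows from the anchor onwards: a plain cumsum; the anchor itself adds 0.
    after = diffs[anchor:].cumsum()
    return np.concatenate((before, after))

# Illustrative data: the anchor is position 2 (value 8).
# diffs = [3, 5, 0, 2, 1, 6] -> before = [8, 5], after = [0, 2, 3, 9]
example = cumulative_error_from_anchor([5, 3, 8, 10, 7, 2], anchor=2)
print(example)  # [8 5 0 2 3 9]
```

The two halves are handled separately because the accumulation runs in opposite directions on either side of the reference row.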
Runtime test
Function definitions:
def original_app(df,correct_row_index):
    df['cumulative_error'] = 0
    for i in range(correct_row_index - 1, -1, -1):
        df.iloc[i, df.columns.get_loc('cumulative_error')] = \
            abs(df['guess'].iloc[i] - df['guess'].iloc[correct_row_index]) + \
            df['cumulative_error'].iloc[i + 1]
    for i in range(correct_row_index + 1, len(df), 1):
        df.iloc[i, df.columns.get_loc('cumulative_error')] = \
            abs(df['guess'].iloc[i] - df['guess'].iloc[correct_row_index]) + \
            df['cumulative_error'].iloc[i - 1]
    return df

def vectorized_app(df,correct_row_index):
    p1 = np.abs(df['guess'][:correct_row_index] - df['guess'][correct_row_index])
    p2 = np.abs(df['guess'][correct_row_index:] - df['guess'][correct_row_index])
    df['cumulative_error'] = np.concatenate((p1[::-1].cumsum()[::-1],p2.cumsum()))
    return df
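One caveat worth noting: `vectorized_app` combines an integer slice, which pandas treats as positional, with a scalar lookup `df['guess'][correct_row_index]`, which is label-based when the index holds integers. The two agree for the default RangeIndex used in the timings, but may not for other indexes. The sketch below is one possible variant that works purely by position, assuming `correct_row_index` is meant as a 0-based row position; the name `vectorized_app_positional` and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

def vectorized_app_positional(df, correct_row_pos):
    # Hypothetical variant: correct_row_pos is a 0-based row position,
    # not an index label, so the frame's index does not matter.
    guess = df['guess'].to_numpy()
    diffs = np.abs(guess - guess[correct_row_pos])
    df['cumulative_error'] = np.concatenate((
        diffs[:correct_row_pos][::-1].cumsum()[::-1],  # rows before the reference row
        diffs[correct_row_pos:].cumsum(),              # reference row and everything after
    ))
    return df

# Illustrative frame with a non-default integer index.
demo = pd.DataFrame({'guess': [5, 3, 8, 10, 7, 2]}, index=[10, 11, 12, 13, 14, 15])
demo = vectorized_app_positional(demo, 2)
```

Working on the underlying NumPy array also skips index alignment during the subtraction, though whether that changes the timings noticeably has not been measured here.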
Timings:
In [304]: # Inputs
...: df = pd.DataFrame(np.random.randint(0,100,2001), columns=['guess'])
...: correct_row_index = 1000
...:
...: # Save copies for benchmarking
...: df_copy1 = df.copy()
...: df_copy2 = df.copy()
...:
In [305]: %timeit original_app(df_copy1,correct_row_index)
1 loops, best of 3: 945 ms per loop
In [306]: %timeit vectorized_app(df_copy2,correct_row_index)
1000 loops, best of 3: 1.33 ms per loop