Question

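I have a pandas DataFrame that I want to filter based on whether certain conditions are met. I ran a loop and an .apply() and used %%timeit to compare their speed. The dataset has about 45,000 rows. The code for the loop is: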

%%timeit
qualified_actions = []
for row in all_actions.index:
    if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
        qualified_actions.append(True)
    else:
        qualified_actions.append(False)

1.44 s ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The .apply() version is:

%%timeit
qualified_actions = all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)

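6.71 s ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought .apply() was supposed to be much faster than looping over the rows in pandas. Can someone explain why it is slower in this case?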

Answer 1

apply uses loops under the hood, so if you need better performance, the best and fastest options are vectorized alternatives.

No loops, just a vectorized solution that chains the two conditions:

m1 = all_actions['Lower'] <= all_actions['Mid']
m2 = all_actions['Mid'] <= all_actions['Upper']
qualified_actions = m1 & m2

Thanks to Jon Clements for another solution:

all_actions.Mid.between(all_actions.Lower, all_actions.Upper)

Timings:

import numpy as np
import pandas as pd

np.random.seed(2017)
N = 45000
all_actions=pd.DataFrame(np.random.randint(50, size=(N,3)),columns=['Lower','Mid','Upper'])

#print (all_actions)
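
The measured numbers are not reproduced here, so below is a minimal sketch for checking the claim yourself. It rebuilds the same random frame, confirms that the row-wise apply, the two chained masks, and between all flag the same rows, and then times each once with the standard-library timeit module. The helper names (with_apply, with_masks, with_between, candidates) are made up for this illustration and are not part of the original answer; absolute timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same synthetic data as the setup above.
np.random.seed(2017)
N = 45000
all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)),
                           columns=['Lower', 'Mid', 'Upper'])


def with_apply(df):
    # Row-wise: pandas hands each row to the lambda as a Series.
    return df.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)


def with_masks(df):
    # Two column-wise comparisons combined with a boolean AND.
    return (df['Lower'] <= df['Mid']) & (df['Mid'] <= df['Upper'])


def with_between(df):
    # Series.between includes both bounds by default.
    return df['Mid'].between(df['Lower'], df['Upper'])


# Hypothetical registry of the approaches being compared.
candidates = {'apply': with_apply, 'masks': with_masks, 'between': with_between}

# All three should flag exactly the same rows.
results = {name: func(all_actions).to_numpy().astype(bool)
           for name, func in candidates.items()}
assert np.array_equal(results['apply'], results['masks'])
assert np.array_equal(results['apply'], results['between'])

# One run each is enough to see the gap between row-wise and vectorized code.
timings = {name: timeit.timeit(lambda f=func: f(all_actions), number=1)
           for name, func in candidates.items()}
for name, seconds in timings.items():
    print(f'{name:>8}: {seconds:.4f} s')
```

In a notebook you can get steadier numbers by running each call under %timeit instead, as in the question.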
