I want to compare the performance of two different approaches to filtering pandas DataFrames. So I created a test set with n points in the plane and filtered out all points that are not in the unit square (here, -1 < x < 1 and -1 < y < 1). I was surprised that one approach is so much faster than the other. The larger n gets, the bigger the difference becomes. What is the explanation for that?
This is my script:
import numpy as np
import time
import pandas as pd
# Test set with points
n = 100000
test_x_points = np.random.uniform(-10, 10, size=n)
test_y_points = np.random.uniform(-10, 10, size=n)
# Build the frame directly from the two arrays (avoids relying on a lazy zip object)
df = pd.DataFrame({'x': test_x_points, 'y': test_y_points})
# Method a
start_time = time.time()
result_a = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)
# Method b
start_time = time.time()
result_b = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)
# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))
When I compare it with a native Python list comprehension approach, method a is still much faster.
Results for different values of n:
For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.
For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.
For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.
For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.
For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.
For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.
Why?
Answer 0 (score: 1)
If you follow the Pandas source code for apply, you will see that in the general case it ends up running a Python for __ in __ loop.
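To make that concrete, here is a minimal sketch of what a row-wise apply amounts to conceptually. It is not the actual pandas implementation: the helper name `rowwise_mask` is hypothetical, and the sample size and seed are my own choices. It only illustrates that the predicate is called once per row from Python.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.uniform(-10, 10, size=1000),
                   'y': rng.uniform(-10, 10, size=1000)})

def rowwise_mask(frame, predicate):
    """Hypothetical stand-in for frame.apply(predicate, axis=1):
    call the predicate once per row in a plain Python loop."""
    results = [predicate(row) for _, row in frame.iterrows()]
    return pd.Series(results, index=frame.index)

def inside(row):
    # Same test as method b in the question
    return -1 < row['x'] < 1 and -1 < row['y'] < 1

mask_loop = rowwise_mask(df, inside)
mask_apply = df.apply(inside, axis=1)

# Both masks agree, and both cost one Python-level call per row.
print((mask_loop == mask_apply).all())
```

Since the work grows with one interpreted function call per row, the cost of this style scales linearly with n with a large constant factor, which matches the timings in the question.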
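However, Pandas DataFrames are made up of Pandas Series, which in turn are backed by numpy arrays. Filtering with a boolean mask uses the fast vectorized operations that numpy arrays allow. For why this is faster than a plain Python loop (which is what .apply does), see Why are NumPy arrays so fast?
The answer there says: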
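Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.
Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.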
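The contrast described in that quote can be sketched on a single column. This snippet is illustrative only: the array size, the seed, and the use of `timeit` are my own assumptions rather than part of the original question, and absolute timings will vary from machine to machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100_000)  # dense float64 buffer
x_list = x.tolist()                     # list of pointers to Python floats

def mask_vectorized():
    # Two comparisons and one AND, each run as a compiled loop over the buffer.
    return (x > -1) & (x < 1)

def mask_python_loop():
    # One interpreted comparison per element, with dynamic type checks.
    return [-1 < v < 1 for v in x_list]

# Both approaches compute the same mask.
assert mask_vectorized().tolist() == mask_python_loop()

t_vec = timeit.timeit(mask_vectorized, number=20)
t_loop = timeit.timeit(mask_python_loop, number=20)
print('vectorized: {:.4f} s, Python loop: {:.4f} s'.format(t_vec, t_loop))
```

The list comprehension here avoids the per-row Series construction that apply performs, so it typically lands between method a and method b, which is consistent with the asker's observation that method a still wins.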