I've noticed that Pandas groupby().filter() is very slow on large datasets, much slower than an equivalent merge. Here is my example:
import numpy as np
import pandas as pd

size = 50000000
df = pd.DataFrame({'M': np.random.randint(10, size=size),
                   'A': np.random.randn(size),
                   'B': np.random.randn(size)})
%%time
gb = df.groupby('M').filter(lambda x: x['A'].count() % 2 == 0)
Wall time: 14 s
%%time
gb_int = df.groupby('M').count() % 2 == 0
gb_int = gb_int[gb_int['A'] == True]
gb = df.merge(gb_int, left_on='M', right_index=True)
Wall time: 8.39 s
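Just to confirm that the two snippets really keep the same groups of M (the merge result only carries the extra count columns along), here is a quick sanity check against the df defined above; it reruns both groupbys, so it takes a while at this size:

# Sanity check: both approaches keep exactly the groups of M whose 'A' count is even.
kept_by_filter = set(
    df.groupby('M').filter(lambda x: x['A'].count() % 2 == 0)['M'].unique()
)

counts_even = df.groupby('M').count() % 2 == 0
kept_by_merge = set(counts_even[counts_even['A'] == True].index)

assert kept_by_filter == kept_by_merge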
Can anyone help me understand why the groupby filter is so slow?
Answer 0 (score: 1)
With %%prun you can see that the faster merge relies on inner_join and pandas.hashtable.Int64Factorizer, while the slower filter goes through groupby_indices and sort (only calls consuming more than 0.02 seconds are shown):
The faster `merge`: 3361 function calls (3285 primitive calls) in 5.420 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.092 1.092 1.092 1.092 {pandas.algos.inner_join}
4 0.768 0.192 0.768 0.192 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
1 0.578 0.578 0.578 0.578 {pandas.algos.take_2d_axis1_float64_float64}
4 0.512 0.128 0.512 0.128 {method 'take' of 'numpy.ndarray' objects}
1 0.425 0.425 0.425 0.425 {method 'get_labels' of 'pandas.hashtable.Int64HashTable' objects}
1 0.381 0.381 0.381 0.381 {pandas.algos.take_2d_axis0_float64_float64}
1 0.296 0.296 0.296 0.296 {pandas.algos.take_2d_axis1_int64_int64}
1 0.203 0.203 1.563 1.563 groupby.py:3730(count)
1 0.194 0.194 0.194 0.194 merge.py:746(_get_join_keys)
1 0.130 0.130 5.420 5.420 <string>:2(<module>)
2 0.109 0.054 0.109 0.054 common.py:250(_isnull_ndarraylike)
3 0.099 0.033 0.107 0.036 internals.py:4768(needs_filling)
2 0.099 0.050 0.875 0.438 merge.py:687(_factorize_keys)
2 0.094 0.047 0.200 0.100 groupby.py:3740(<genexpr>)
2 0.083 0.041 0.083 0.041 {pandas.algos.take_2d_axis1_bool_bool}
1 0.081 0.081 0.772 0.772 algorithms.py:156(factorize)
7 0.058 0.008 1.406 0.201 common.py:733(take_nd)
1 0.049 0.049 2.521 2.521 merge.py:322(_get_join_info)
1 0.035 0.035 2.196 2.196 merge.py:516(_get_join_indexers)
1 0.030 0.030 0.030 0.030 {built-in method numpy.core.multiarray.putmask}
1 0.030 0.030 0.033 0.033 merge.py:271(_maybe_add_join_keys)
1 0.028 0.028 3.725 3.725 merge.py:26(merge)
28 0.021 0.001 0.021 0.001 {method 'reduce' of 'numpy.ufunc' objects}
The slower `filter`, by contrast, spends its time in groupby_indices and sort.
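For completeness, profiles like the one above can also be collected outside a notebook with the standard cProfile and pstats modules, which is roughly what the %%prun cell magic does under the hood. A minimal sketch follows; the internal pandas names in the output will differ across pandas versions:

import cProfile
import pstats

import numpy as np
import pandas as pd

size = 50000000  # same setup as in the question; shrink this to iterate faster
df = pd.DataFrame({'M': np.random.randint(10, size=size),
                   'A': np.random.randn(size),
                   'B': np.random.randn(size)})

def run_filter():
    return df.groupby('M').filter(lambda x: x['A'].count() % 2 == 0)

def run_merge():
    counts_even = df.groupby('M').count() % 2 == 0
    counts_even = counts_even[counts_even['A'] == True]
    return df.merge(counts_even, left_on='M', right_index=True)

for name, fn in [('filter', run_filter), ('merge', run_merge)]:
    profiler = cProfile.Profile()
    profiler.runcall(fn)  # profile a single call of each approach
    print('---', name, '---')
    pstats.Stats(profiler).sort_stats('tottime').print_stats(15)  # top 15 by internal time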