Why is pandas groupby filter slower than merge?

Asked: 2016-06-01 09:54:31

Tags: python performance pandas

I've noticed that pandas groupby().filter() is slow on large datasets, much slower than the equivalent merge. Here is my example:

import numpy as np
import pandas as pd
size = 50000000
df = pd.DataFrame({'M': np.random.randint(10, size=size), 'A': np.random.randn(size), 'B': np.random.randn(size)})

%%time 
gb = df.groupby('M').filter(lambda x : x['A'].count()%2==0)

Wall time: 14 s

%%time
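# Flag each group by whether its row count is even (boolean frame indexed by M)
gb_int = df.groupby('M').count() % 2 == 0
gb_int = gb_int[gb_int['A']]
# Inner-join back onto df so that only rows from even-sized groups survive
gb = df.merge(gb_int, left_on='M', right_index=True)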

Wall time: 8.39 s
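As a sanity check, the two approaches can be compared on a tiny frame where the group sizes are known in advance. This is only a sketch: the nine-row data below is made up for illustration and is not the 50,000,000-row frame timed above. It also shows that the merge result is not column-for-column identical to the filter result, because the boolean A and B columns from the grouped frame are joined in with suffixes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 0 has 2 rows, group 1 has 3 rows, group 2 has 4 rows.
df = pd.DataFrame({
    "M": [0, 0, 1, 1, 1, 2, 2, 2, 2],
    "A": np.arange(9.0),
    "B": np.arange(9.0) * 2,
})

# Approach 1: groupby().filter() calls the Python lambda once per group.
filtered = df.groupby("M").filter(lambda g: g["A"].count() % 2 == 0)

# Approach 2: compute the per-group flag once, then inner-join it back.
gb_int = df.groupby("M").count() % 2 == 0
gb_int = gb_int[gb_int["A"]]
merged = df.merge(gb_int, left_on="M", right_index=True)

# Both keep the rows of the even-sized groups (0 and 2).
print(len(filtered), len(merged))
print(sorted(filtered["M"].unique().tolist()))
print(sorted(merged["M"].unique().tolist()))

# The merge output carries suffixed columns from both sides of the join.
print(list(merged.columns))
```

If only the original columns are wanted from the merge route, the suffixed boolean columns would need to be dropped or renamed afterwards.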

Can anyone help me understand why groupby filter is so slow?

1 Answer:

Answer 0 (score: 1)

Using %%prun, you can see that the faster merge relies on inner_join and pandas.hashtable.Int64Factorizer, while the slower filter uses groupby_indices and sort (only calls that consume more than 0.02 seconds are shown):

`merge`:

         3361 function calls (3285 primitive calls) in 5.420 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.092    1.092    1.092    1.092 {pandas.algos.inner_join}
        4    0.768    0.192    0.768    0.192 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
        1    0.578    0.578    0.578    0.578 {pandas.algos.take_2d_axis1_float64_float64}
        4    0.512    0.128    0.512    0.128 {method 'take' of 'numpy.ndarray' objects}
        1    0.425    0.425    0.425    0.425 {method 'get_labels' of 'pandas.hashtable.Int64HashTable' objects}
        1    0.381    0.381    0.381    0.381 {pandas.algos.take_2d_axis0_float64_float64}
        1    0.296    0.296    0.296    0.296 {pandas.algos.take_2d_axis1_int64_int64}
        1    0.203    0.203    1.563    1.563 groupby.py:3730(count)
        1    0.194    0.194    0.194    0.194 merge.py:746(_get_join_keys)
        1    0.130    0.130    5.420    5.420 <string>:2(<module>)
        2    0.109    0.054    0.109    0.054 common.py:250(_isnull_ndarraylike)
        3    0.099    0.033    0.107    0.036 internals.py:4768(needs_filling)
        2    0.099    0.050    0.875    0.438 merge.py:687(_factorize_keys)
        2    0.094    0.047    0.200    0.100 groupby.py:3740(<genexpr>)
        2    0.083    0.041    0.083    0.041 {pandas.algos.take_2d_axis1_bool_bool}
        1    0.081    0.081    0.772    0.772 algorithms.py:156(factorize)
        7    0.058    0.008    1.406    0.201 common.py:733(take_nd)
        1    0.049    0.049    2.521    2.521 merge.py:322(_get_join_info)
        1    0.035    0.035    2.196    2.196 merge.py:516(_get_join_indexers)
        1    0.030    0.030    0.030    0.030 {built-in method numpy.core.multiarray.putmask}
        1    0.030    0.030    0.033    0.033 merge.py:271(_maybe_add_join_keys)
        1    0.028    0.028    3.725    3.725 merge.py:26(merge)
       28    0.021    0.001    0.021    0.001 {method 'reduce' of 'numpy.ufunc' objects}
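For anyone who wants to collect a comparable profile outside IPython, here is a minimal sketch using the standard library's cProfile and pstats modules, which is the same kind of report that %%prun prints. The helper names (run_merge, run_filter, profile) and the reduced frame size are assumptions made for illustration; function names and timings in the report will differ across pandas versions and machines.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Assumption: a much smaller frame than the question's, to keep the run short.
size = 100_000
df = pd.DataFrame({
    "M": np.random.randint(10, size=size),
    "A": np.random.randn(size),
    "B": np.random.randn(size),
})


def run_filter():
    """The groupby().filter() approach from the question."""
    return df.groupby("M").filter(lambda g: g["A"].count() % 2 == 0)


def run_merge():
    """The count-then-merge approach from the question."""
    gb_int = df.groupby("M").count() % 2 == 0
    gb_int = gb_int[gb_int["A"]]
    return df.merge(gb_int, left_on="M", right_index=True)


def profile(func, top=15):
    """Profile func and return the report text, ordered by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)  # keep only the top entries
    return buffer.getvalue()


merge_report = profile(run_merge)
filter_report = profile(run_filter)
print(merge_report)
print(filter_report)
```

Sorting by "tottime" matches the "Ordered by: internal time" header in the listing above, so the heaviest internal calls appear first in each report.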

filter