Speeding up a pandas DataFrame groupby

Date: 2015-01-28 16:14:52

Tags: python performance pandas dataframe

I am adding some columns to a DataFrame based on groupings of other columns: I do a groupby, count, and finally join the result back onto the original DataFrame.

The full data set has 1M rows; I first tried the approach on 20k rows and it works fine. The data has one entry for every item a customer adds to an order.

Here is some sample data:

import numpy as np
import pandas as pd

# One row per item that a customer added to an order
data = np.array([[101, 201, 301], [101, 201, 302], [101, 201, 303],
                 [101, 202, 301], [101, 202, 302], [101, 203, 301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id', 'item_id'])

# Total number of items the customer has ordered over their lifetime
df['total_nitems_user_lifetime'] = df.join(
    df.groupby('customer_id').count()['order_id'],
    on='customer_id', rsuffix='_x')['order_id_x']
# Number of items in each order
df['nitems_in_order'] = df.join(
    df.groupby('order_id').count()['customer_id'],
    on='order_id', rsuffix='_x')['customer_id_x']

For the sample data above, the desired output is:

| customer_id | order_id | item_id | total_nitems_user_lifetime | nitems_in_order |
| 101         | 201      | 301     | 6                          | 3               |
| 101         | 201      | 302     | 6                          | 3               |
| 101         | 201      | 303     | 6                          | 3               |
| 101         | 202      | 301     | 6                          | 2               |
| 101         | 202      | 302     | 6                          | 2               |
| 101         | 203      | 301     | 6                          | 1               |
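
For anyone who wants to reproduce the timings at full scale, a frame of roughly the right shape can be generated like this. This is only a sketch: the ID cardinalities (50k customers, 300k orders) are made up, not properties of the real data.

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 10 ** 6  # the full data size mentioned above

# Hypothetical cardinalities; note that orders are not nested under
# customers here, which is good enough for timing the groupby costs.
big = pd.DataFrame({
    'customer_id': rng.randint(0, 50000, n),
    'order_id': rng.randint(0, 300000, n),
    'item_id': rng.randint(0, 1000, n),
})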

The snippet that runs relatively quickly, even with 1M rows, is:

df['total_nitems_user_lifetime'] = df.join(
    df.groupby('customer_id').count()['order_id'],
    on='customer_id', rsuffix='_x')['order_id_x']

But the analogous join takes a very long time, on the order of hours:

df['nitems_in_order'] = df.join(
    df.groupby('order_id').count()['customer_id'],
    on='order_id', rsuffix='_x')['customer_id_x']

I am hoping there is a smarter way to get the same aggregate values. I understand why the second case takes so long: the number of groups is much larger. Thanks.

2 Answers:

Answer 0 (score: 0)

OK, I can see what you're trying to achieve. On this sample size it's more than twice as fast, and I think it may also scale better. Basically, instead of joining/merging the result of your groupby onto the original df, just call transform:

In [24]:

%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
df
100 loops, best of 3: 2.66 ms per loop
100 loops, best of 3: 2.85 ms per loop
Out[24]:
   customer_id  order_id  item_id  total_nitems_user_lifetime  nitems_in_order
0          101       201      301                           6                3
1          101       201      302                           6                3
2          101       201      303                           6                3
3          101       202      301                           6                2
4          101       202      302                           6                2
5          101       203      301                           6                1
In [26]:


%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
df
100 loops, best of 3: 6.4 ms per loop
100 loops, best of 3: 6.46 ms per loop
Out[26]:
   customer_id  order_id  item_id  total_nitems_user_lifetime  nitems_in_order
0          101       201      301                           6                3
1          101       201      302                           6                3
2          101       201      303                           6                3
3          101       202      301                           6                2
4          101       202      302                           6                2
5          101       203      301                           6                1

Interestingly, when I try a 600,000-row df:

In [34]:

%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 231 ms per loop
In [36]:

%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
10 loops, best of 3: 208 ms per loop
10 loops, best of 3: 215 ms per loop

My first transform is ~25% faster, but the second is actually slower than your method, so I think it's worth trying this on your actual data to see whether it yields any speed improvement.

If we combine the column creation so that it happens in one statement:

In [40]:

%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.groupby('customer_id')['order_id'].transform('count'),  df.groupby('order_id')['customer_id'].transform('count')
1 loops, best of 3: 425 ms per loop
In [42]:

%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x'] , df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
1 loops, best of 3: 447 ms per loop

We can see that my combined code is only marginally faster than yours (425 ms vs. 447 ms), so not much is saved by doing this. Normally you can apply multiple aggregation functions to a single groupby and get multiple columns back, but the problem here is that you are grouping by different columns, so we have to perform two expensive groupby operations.
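
To make that last point concrete, here is a sketch I'm adding for illustration (not timed above): when the statistics share a grouping key, one groupby with agg returns several columns in a single pass. Your two counts are keyed on customer_id and order_id respectively, so no single groupby covers both.

import pandas as pd

df = pd.DataFrame({'customer_id': [101] * 6,
                   'order_id': [201, 201, 201, 202, 202, 203],
                   'item_id': [301, 302, 303, 301, 302, 301]})

# Several aggregates from one pass over a single grouping key
per_customer = df.groupby('customer_id')['order_id'].agg(['count', 'nunique'])
print(per_customer)
#              count  nunique
# customer_id
# 101              6        3

# nitems_in_order, however, is keyed on order_id, so it still
# needs its own groupby.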

Answer 1 (score: 0)

The original approach, with 1M rows:

df['nitems_in_order'] = df.join(
    df.groupby('order_id').count()['customer_id'],
    on='order_id', rsuffix='_x')['customer_id_x']
time: 0:00:02.422288

With the transform suggested by @EdChum:

df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
time: 0:00:04.713601

Using groupby, then selecting one column, counting, converting back to a DataFrame, and finally joining. The result is much faster:

df = df.join(
    df.groupby('order_id')['order_id'].count().to_frame('nitems_in_order'),
    on='order_id')
time: 0:00:00.406383
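
For completeness, a map-based variant I haven't timed: value_counts computes the per-order row count once, and map broadcasts it back onto every row without any join. Whether it beats the join above will depend on the data and the pandas version.

import pandas as pd

df = pd.DataFrame({'customer_id': [101] * 6,
                   'order_id': [201, 201, 201, 202, 202, 203],
                   'item_id': [301, 302, 303, 301, 302, 301]})

# Count rows per order once, then broadcast the counts back with map
order_sizes = df['order_id'].value_counts()
df['nitems_in_order'] = df['order_id'].map(order_sizes)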

Thanks.