I am adding some columns to a dataframe based on groupings over other columns. I do a groupby, count, and finally join the result back onto the original dataframe.
The full data has 1M rows; I first tried the approach on 20k rows and it worked fine. The data has one entry for every item a customer adds to an order.
Here is the sample data:
import numpy as np
import pandas as pd
data = np.array([[101, 201, 301], [101, 201, 302], [101, 201, 303],
                 [101, 202, 301], [101, 202, 302], [101, 203, 301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id','item_id'])
# lifetime item count per customer: group by customer, count, join back
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
    ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
# item count per order: group by order, count, join back
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
    ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
For the sample data above, the desired output is:
| customer_id | order_id | item_id | total_nitems_user_lifetime | nitems_in_order |
| ----------- | -------- | ------- | -------------------------- | --------------- |
| 101         | 201      | 301     | 6                          | 3               |
| 101         | 201      | 302     | 6                          | 3               |
| 101         | 201      | 303     | 6                          | 3               |
| 101         | 202      | 301     | 6                          | 2               |
| 101         | 202      | 302     | 6                          | 2               |
| 101         | 203      | 301     | 6                          | 1               |
Even with 1M rows, the snippet that runs relatively quickly is:
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
But the analogous join takes quite a long time, around a couple of hours:
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
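The difference in group cardinality can be confirmed directly; this quick check is illustrative and was not part of the original code:
# distinct customers: few groups, so the first groupby is fast
print(df['customer_id'].nunique())
# distinct orders: many more groups, so the second groupby is slow
print(df['order_id'].nunique())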
I am hoping there is a smarter way to get the same aggregated values. I understand why the second case takes so long: the number of groups is much larger. Thanks.
Answer 0 (score: 0)
OK, I can see what you are trying to achieve. On this sample size it is over 2x faster, and I think it may also scale better: essentially, instead of joining/merging the result of your groupby back onto the original df, just call transform:
In [24]:
%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
df
100 loops, best of 3: 2.66 ms per loop
100 loops, best of 3: 2.85 ms per loop
Out[24]:
customer_id order_id item_id total_nitems_user_lifetime nitems_in_order
0 101 201 301 6 3
1 101 201 302 6 3
2 101 201 303 6 3
3 101 202 301 6 2
4 101 202 302 6 2
5 101 203 301 6 1
In [26]:
%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
df
100 loops, best of 3: 6.4 ms per loop
100 loops, best of 3: 6.46 ms per loop
Out[26]:
customer_id order_id item_id total_nitems_user_lifetime nitems_in_order
0 101 201 301 6 3
1 101 201 302 6 3
2 101 201 303 6 3
3 101 202 301 6 2
4 101 202 302 6 2
5 101 203 301 6 1
Interestingly, when I tried a 600,000-row df:
In [34]:
%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 231 ms per loop
In [36]:
%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
10 loops, best of 3: 208 ms per loop
10 loops, best of 3: 215 ms per loop
My first transform is about 25% faster, but the second one is actually slower than your method, so I think it is worth trying this on your actual data to see whether it yields any speed improvement.
If we combine the column creation so that it happens on a single line:
In [40]:
%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.groupby('customer_id')['order_id'].transform('count'), df.groupby('order_id')['customer_id'].transform('count')
1 loops, best of 3: 425 ms per loop
In [42]:
%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x'] , df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
1 loops, best of 3: 447 ms per loop
We can see that my combined code is only marginally faster than yours, so there is not much to be saved by doing this. Normally you can apply multiple aggregation functions to one groupby and return multiple columns at once; the problem here is that you are grouping by different columns, so we have to perform two expensive groupby operations.
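As a side note, when the aggregations do share a grouping key, a single groupby can return several columns in one pass. A minimal sketch on the sample df above (the particular aggregations chosen here are illustrative, not from the question):
# one pass over the groups: total items and distinct items per order
order_stats = df.groupby('order_id')['item_id'].agg(['count', 'nunique'])
df = df.join(order_stats, on='order_id')  # align on order_id, no second groupby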
Answer 1 (score: 0)
The original approach, with 1M rows:
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
time: 0:00:02.422288
Using the transform suggested by @EdChum:
df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
time: 0:00:04.713601
Using groupby, then selecting a single column, counting, converting back to a DataFrame, and finally joining. The result: much faster:
df = df.join(df.groupby(['order_id'])['order_id'].count().to_frame('nitems_in_order'),on='order_id')
time: 0:00:0.406383
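A further variant that may be worth benchmarking (it is not timed here) maps value_counts back onto the key column and avoids the join entirely:
# value_counts yields a Series indexed by order_id; map aligns the
# counts back to each row without an explicit join
df['nitems_in_order'] = df['order_id'].map(df['order_id'].value_counts())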
Thanks.