I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame([{"order_id": 1234, "product": "milk"},
                   {"order_id": 1234, "product": "butter"},
                   {"order_id": 4321, "product": "bread"},
                   {"order_id": 4321, "product": "milk"},
                   {"order_id": 4321, "product": "butter"},
                   {"order_id": 1111, "product": "corn"},
                   {"order_id": 1111, "product": "cereal"},
                   {"order_id": 8888, "product": "milk"}])
   order_id product
0      1234    milk
1      1234  butter
2      4321   bread
3      4321    milk
4      4321  butter
5      1111    corn
6      1111  cereal
7      8888    milk
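I need to find the most frequent product combination, without having to specify in advance how many products a combination contains.
This example should return milk and butter, since those are the two products most often bought together.
I tried grouping by order_id, but couldn't find a way to build the combinations within each group.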
Answer 0 (score: 3)
We can use merge and groupby.size to find the product pairs:
# merge on id to pair up the products
new_df = df.merge(df, on='order_id')

# keep each unordered pair once: this drops identical products
# as well as the mirrored (y, x) duplicates
(new_df[new_df['product_x'].lt(new_df['product_y'])]
   .groupby(['order_id', 'product_x', 'product_y'])  # group
   .size()                        # count (id, prod1, prod2)
   .groupby(level=[1, 2]).sum()   # sum over (prod1, prod2); .sum(level=...) was removed in pandas 2.0
   .idxmax()                      # get (prod1, prod2) with max count
)
This gives:
('butter', 'milk')
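
To see the count behind every pair rather than only the winner, the same merge idea can be sketched as below. This is a minimal sketch, not the answerer's code: `pair_counts` is a hypothetical helper name, and the `drop_duplicates` step assumes a product listed twice in one order should count only once.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "milk", "butter",
                "corn", "cereal", "milk"],
})


def pair_counts(frame):
    """Count how many orders contain each unordered product pair (hypothetical helper)."""
    # Assumption: a product repeated within one order counts once.
    unique = frame.drop_duplicates(["order_id", "product"])
    # Self-join on order_id pairs every product with every product in the same order.
    pairs = unique.merge(unique, on="order_id")
    # Keep each unordered pair once (alphabetical order, no self-pairs).
    pairs = pairs[pairs["product_x"] < pairs["product_y"]]
    return (pairs.groupby(["product_x", "product_y"])
                 .size()
                 .sort_values(ascending=False))


counts = pair_counts(df)
print(counts)
```

Calling `counts.idxmax()` returns the same `('butter', 'milk')` tuple as above, and the full Series shows how far ahead that pair is.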
Answer 1 (score: 1)
itertools.combinations and pandas.Series.mode
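from itertools import combinations

# sort each order's products so (milk, butter) and (butter, milk) count as the same pair
pd.Series([
    t for _, d in df.groupby('order_id')['product']
    for t in combinations(sorted(d), 2)
]).mode()

0    (butter, milk)
dtype: object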
collections.Counter
Similar to the above, but using Counter instead of pandas.Series.mode:
from itertools import combinations
from collections import Counter

# sort each order's products so the pair's order is consistent across orders
Counter([
    t for _, d in df.groupby('order_id')['product']
    for t in combinations(sorted(d), 2)
]).most_common(1)

[(('butter', 'milk'), 2)]
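
The question asks for combinations without fixing their size, while the snippets above only count pairs. One possible extension of the Counter approach is to count every combination of two or more products per order. This is a sketch under assumptions: `combo_counts` and `min_size` are hypothetical names, and the number of combinations grows exponentially with order size, so it only suits small baskets.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "milk", "butter",
                "corn", "cereal", "milk"],
})


def combo_counts(frame, min_size=2):
    """Count combinations of at least `min_size` products per order (hypothetical helper)."""
    counts = Counter()
    for _, products in frame.groupby("order_id")["product"]:
        # Deduplicate and sort so each combination has one canonical form.
        items = sorted(set(products))
        # Enumerate every combination size from min_size up to the whole order.
        for size in range(min_size, len(items) + 1):
            counts.update(combinations(items, size))
    return counts


counts = combo_counts(df)
print(counts.most_common(1))  # [(('butter', 'milk'), 2)]
```

Because any larger combination can occur at most as often as the pairs it contains, the top result is usually a pair; raising `min_size` restricts the search to larger baskets.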