How do I find the most common combination of items in a grouped DataFrame?

Date: 2019-07-10 13:53:20

Tags: python pandas pandas-groupby

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame([{"order_id": 1234, "product": "milk"},
                   {"order_id": 1234, "product": "butter"},
                   {"order_id": 4321, "product": "bread"},
                   {"order_id": 4321, "product": "milk"},
                   {"order_id": 4321, "product": "butter"},
                   {"order_id": 1111, "product": "corn"},
                   {"order_id": 1111, "product": "cereal"},
                   {"order_id": 8888, "product": "milk"}])

    order_id    product
0   1234    milk
1   1234    butter
2   4321    bread
3   4321    milk
4   4321    butter
5   1111    corn
6   1111    cereal
7   8888    milk

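I need to find the most common combination of products, without having to specify in advance how many products go into a combination.

This example should return milk and butter, because those are the two products most often bought together.

I tried grouping by order_id, but I couldn't find a way to generate the combinations within each group.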

2 Answers:

Answer 0: (score: 3)

We can find product pairs with merge and groupby.size:

# merge on order_id to pair every product with every product in the same order
new_df = df.merge(df, on='order_id')

# keep product_x < product_y: drops self-pairs and mirrored duplicates
(new_df[new_df['product_x'].lt(new_df['product_y'])]
    .groupby(['order_id', 'product_x', 'product_y'])   # group
    .size()                       # count (id, prod1, prod2)
    .groupby(level=[1, 2]).sum()  # sum over (prod1, prod2); Series.sum(level=...) was removed in pandas 2.0
    .idxmax()                     # get (prod1, prod2) with max count
)

which gives

('butter', 'milk')
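
The self-merge can be hard to picture, so here is a minimal sketch of the intermediate steps, using the question's data. It assumes each product appears at most once per order (duplicates would inflate the counts), and the names `pairs`, `unique_pairs`, `pair_counts` and `top_pair` are only illustrative. It also groups directly on the product pair, which should be equivalent to the two-step count above under that assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "milk", "butter",
                "corn", "cereal", "milk"],
})

# Self-merge: within each order, every product is paired with every
# product, including itself and both orderings of each pair.
pairs = df.merge(df, on="order_id")

# Keeping only product_x < product_y drops the self-pairs and keeps a
# single (alphabetical) ordering of each pair.
unique_pairs = pairs[pairs["product_x"] < pairs["product_y"]]

# Each remaining row is one (order, pair) occurrence, so counting rows
# per pair counts the orders that contain that pair.
pair_counts = unique_pairs.groupby(["product_x", "product_y"]).size()
top_pair = pair_counts.idxmax()
```

Because the filter compares strings, the resulting pair is always in alphabetical order, which is why the answer shows butter before milk.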

Answer 1: (score: 1)

itertools.combinations and pandas.Series.mode

from itertools import combinations

# build every pair of products within each order, then take the mode
pd.Series([
    t for _, d in df.groupby('order_id')['product']
    for t in combinations(d, 2)
]).mode()

0    (milk, butter)
dtype: object
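
One caveat with this approach: `combinations` keeps the row order, so ('milk', 'butter') and ('butter', 'milk') would be counted as different pairs if two orders list the products differently. A minimal sketch of one way around this, sorting the products within each order first. The data below is a hypothetical variant of the question's data in which order 4321 lists butter before milk.

```python
from itertools import combinations

import pandas as pd

# Hypothetical variant of the question's data: order 4321 lists
# butter before milk, unlike order 1234.
df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "butter", "milk",
                "corn", "cereal", "milk"],
})

# Sorting the (de-duplicated) products of each order makes every pair
# come out in alphabetical order, regardless of the row order.
pairs = [
    pair
    for _, products in df.groupby("order_id")["product"]
    for pair in combinations(sorted(set(products)), 2)
]

most_common = pd.Series(pairs).mode()
```

With the sorting step both orders contribute the same ('butter', 'milk') tuple, so the pair is counted twice instead of being split into two tuples counted once each.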

collections.Counter

Similar to the above, but using Counter instead of pandas.Series.mode

from itertools import combinations
from collections import Counter

# count every pair of products within each order, then take the top one
Counter([
    t for _, d in df.groupby('order_id')['product']
    for t in combinations(d, 2)
]).most_common(1)

[(('milk', 'butter'), 2)]
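
The question asks for combinations without fixing their size in advance, while both answers only look at pairs. A rough sketch of how the Counter approach might be extended to every combination size from 2 upward; the helper `combo_counts` and its `min_size` parameter are hypothetical names introduced here, not part of pandas or of the answers above.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "milk", "butter",
                "corn", "cereal", "milk"],
})


def combo_counts(frame, min_size=2):
    """Count every product combination of at least min_size items per order."""
    counts = Counter()
    for _, products in frame.groupby("order_id")["product"]:
        # Sort so the same set of products always yields the same tuple.
        items = sorted(set(products))
        for size in range(min_size, len(items) + 1):
            counts.update(combinations(items, size))
    return counts


counts = combo_counts(df)
top = counts.most_common(1)
```

Two things to keep in mind with this sketch: a larger combination can never occur more often than the pairs it contains, so the top result will usually be a pair (or tie with one), and the number of combinations grows exponentially with the number of products in an order, so this is only practical for small baskets.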