Question

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame([{"order_id": 1234, "product": "milk"},
                   {"order_id": 1234, "product": "butter"},
                   {"order_id": 4321, "product": "bread"},
                   {"order_id": 4321, "product": "milk"},
                   {"order_id": 4321, "product": "butter"},
                   {"order_id": 1111, "product": "corn"},
                   {"order_id": 1111, "product": "cereal"},
                   {"order_id": 8888, "product": "milk"}])

   order_id product
0      1234    milk
1      1234  butter
2      4321   bread
3      4321    milk
4      4321  butter
5      1111    corn
6      1111  cereal
7      8888    milk

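I need to find the most common product combinations, without fixing in advance how many products go into a combination.

This example should return milk and butter, since these are the two products most often bought together.

I tried grouping by order_id, but could not find a way to build the combinations within each group.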

Answer 1

We can find product pairs with merge and groupby.size:

# merge on id to pair up the products
new_df = df.merge(df, on='order_id')

# keep each unordered pair once: this drops self-pairs and mirrored duplicates
(new_df[new_df['product_x'].lt(new_df['product_y'])]
    .groupby(['order_id', 'product_x', 'product_y'])  # group
    .size()                      # count (id, prod1, prod2)
    .groupby(level=[1, 2]).sum() # sum over (prod1, prod2); .sum(level=...) was removed in pandas 2.0
    .idxmax()                    # get (prod1, prod2) with max count
)

which gives

('butter', 'milk')
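
To see how far ahead the winning pair is, the same self-merge can report every pair's count rather than only the top one. This is a minimal sketch, assuming the frame from the question; `pair_counts` is a name introduced here for illustration, and the counts equal the number of orders only if no product is repeated within an order.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "milk", "butter",
                "corn", "cereal", "milk"],
})

# Self-merge on order_id pairs every product with every product in the same order.
pairs = df.merge(df, on="order_id")

# Keep each unordered pair once (drops self-pairs and mirrored duplicates).
pairs = pairs[pairs["product_x"] < pairs["product_y"]]

# Count the rows for each pair and list the most frequent pairs first.
pair_counts = (
    pairs.groupby(["product_x", "product_y"])
    .size()
    .sort_values(ascending=False)
)
print(pair_counts)
```

Calling `pair_counts.idxmax()` on this Series returns the same tuple as the chained expression above.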

Answer 2

`itertools.combinations` and `pandas.Series.mode`

from itertools import combinations

# build every product pair within each order, then take the most frequent one
pd.Series([
    t for _, d in df.groupby('order_id')['product']
    for t in combinations(d, 2)
]).mode()

0    (milk, butter)
dtype: object
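
One caveat: `combinations` keeps products in row order, so `('milk', 'butter')` and `('butter', 'milk')` count as different pairs when two orders list them differently. Below is a minimal sketch of one way around this, sorting each order's products before pairing. The `orders` frame and the `pair_mode` helper are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical data: both orders hold the same products, listed in opposite order.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "product": ["milk", "butter", "butter", "milk"],
})

def pair_mode(frame, normalize):
    """Return the most frequent pair(s); sort products first if normalize is True."""
    pairs = [
        t
        for _, d in frame.groupby("order_id")["product"]
        for t in combinations(sorted(d) if normalize else d, 2)
    ]
    return pd.Series(pairs).mode().tolist()

unsorted_result = pair_mode(orders, normalize=False)  # two tied, mirrored pairs
sorted_result = pair_mode(orders, normalize=True)     # a single pair
print(unsorted_result, sorted_result)
```

With the data from the question this makes no difference to the count, since milk precedes butter in both orders that contain them.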

`collections.Counter`

Similar to the above, but using Counter instead of pandas.Series.mode

from itertools import combinations
from collections import Counter

# count every product pair within each order and keep the most frequent one
Counter([
    t for _, d in df.groupby('order_id')['product']
    for t in combinations(d, 2)
]).most_common(1)

[(('milk', 'butter'), 2)]
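
The question asks for combinations without fixing their size in advance, whereas the snippets above only count pairs. Below is a minimal sketch of one possible extension of the Counter approach: count combinations of every size from 2 up to the size of each order. `most_common_combos` and `min_size` are hypothetical names, and the products are de-duplicated and sorted within each order so that row order does not matter. The number of combinations grows exponentially with order size, so this is only practical for small baskets.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "order_id": [1234, 1234, 4321, 4321, 4321, 1111, 1111, 8888],
    "product": ["milk", "butter", "bread", "milk", "butter",
                "corn", "cereal", "milk"],
})

def most_common_combos(frame, min_size=2):
    """Count product combinations of every size >= min_size within each order."""
    counts = Counter()
    for _, products in frame.groupby("order_id")["product"]:
        items = sorted(set(products))  # canonical order, no duplicates
        for size in range(min_size, len(items) + 1):
            counts.update(combinations(items, size))
    return counts

counts = most_common_combos(df)
print(counts.most_common(1))
```

A subset is always at least as frequent as any larger combination containing it, so the top result will usually be a pair. To look at larger baskets, filter `counts` by tuple length before ranking.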

How to find the most common item combinations in a grouped DataFrame?
