Question

I have a pandas DataFrame with about 40k entries in the following format:

invoiceNo | item

import pandas as pd
df = pd.DataFrame({'invoiceNo': ['123', '123', '124', '124'], 
                   'item': ['plant', 'grass', 'hammer', 'screwdriver']})

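Let's say a customer can buy multiple items under one invoice number.

Is there a way to check which items are most frequently bought together?

The first thing I tried was to get all the unique IDs to loop over: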

unique_invoice_id = df.invoiceNo.unique().tolist()

Thanks!

Answer 1

Without loss of generality, I'll use lists instead of a DataFrame. If necessary, you can easily extract the required lists from the DataFrame.
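As a minimal sketch of that extraction step, assuming the `df` from the question (the names `invoice_numbers`, `items`, and `baskets` are illustrative and not part of the original answer), the two parallel lists could be pulled out like this:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'invoiceNo': ['123', '123', '124', '124'],
                   'item': ['plant', 'grass', 'hammer', 'screwdriver']})

# Parallel lists: one invoice number and one item per row
invoice_numbers = df['invoiceNo'].tolist()
items = df['item'].tolist()

# Alternatively, build the invoice -> set-of-items mapping directly
baskets = df.groupby('invoiceNo')['item'].apply(set).to_dict()
```

The `groupby` variant skips the intermediate lists and yields the same kind of mapping that the loop over `zip(x, y)` builds by hand.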

from itertools import combinations
from collections import defaultdict

x = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # invoice number
y = ['a', 'b', 'c', 'a', 'c', 'e', 'a', 'c', 'd']  # item

z = defaultdict(set)
for i, j in zip(x, y):
    z[i].add(j)

print(z)

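# Count, for every combination of 2 or more items, how many invoices contain it.
# Note: this enumerates every combination of the distinct items, so the work
# grows exponentially with the number of items.
items = sorted(set(y))  # sorted so the tuple keys are deterministic
d = defaultdict(int)
for i in range(2, len(items) + 1):
    for comb in combinations(items, i):
        for k, v in z.items():
            if set(comb).issubset(v):
                d[comb] += 1

sorted(([v, k] for k, v in d.items()), reverse=True)

# [[3, ('a', 'c')],
#  [1, ('c', 'e')],
#  [1, ('c', 'd')],
#  [1, ('b', 'c')],
#  [1, ('a', 'e')],
#  [1, ('a', 'd')],
#  [1, ('a', 'c', 'e')],
#  [1, ('a', 'c', 'd')],
#  [1, ('a', 'b', 'c')],
#  [1, ('a', 'b')]]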

The interpretation is that 'c' and 'a' were bought together 3 times, and so on.
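With roughly 40k rows and many distinct items, enumerating every combination of items may become slow. A possible alternative, sketched here as an assumption rather than as part of the original answer, is to generate pairs only from the items that actually appear on each invoice and tally them with `collections.Counter`. The names `baskets` and `pair_counts` are illustrative.

```python
from collections import Counter
from itertools import combinations

x = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # invoice number
y = ['a', 'b', 'c', 'a', 'c', 'e', 'a', 'c', 'd']  # item

# Group the items of each invoice into a set
baskets = {}
for inv, item in zip(x, y):
    baskets.setdefault(inv, set()).add(item)

# Count each pair once per invoice; sorting makes ('a', 'c') and
# ('c', 'a') map to the same key
pair_counts = Counter()
for basket in baskets.values():
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(1))
# [(('a', 'c'), 3)]
```

This sketch only counts pairs. Changing the `2` passed to `combinations` would count larger groups, at a cost that grows with the size of each invoice rather than with the total number of distinct items.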

Python - Finding the most frequently bought items
