Question

I have this dataset.

user    Month   item
 A       Jan     X
 A       Jan     Y
 A       Feb     X
 B       Jan     Z
 B       Feb     X
 A       March   Z

I need the following result:

user   month Itemset  CumItemset   DistinctCount    CumDistinctCount
 A      Jan    X,Y       X,Y            2                 2
 A      Feb    X         X,Y            1                 2
 A      March  Z         X,Y,Z          1                 3
 B      Jan    Z         Z              1                 1
 B      Feb    X         Z,X            1                 2

I tried the code here, but I want the cumulative count to restart for each new user.

Any ideas?

Answer 1

No claims about this being fast:

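import pandas as pd

# A stable sort groups the users together while keeping each user's months in their original order
df = df.sort_values('user', kind='stable')

# One list of items per (user, Month), in order of appearance
g1 = df.groupby(['user', 'Month'], sort=False)['item'].apply(list)

# Per user, concatenate the lists cumulatively, then drop repeats while keeping order.
# group_keys=False keeps the index aligned with g1; dict.fromkeys avoids passing a
# plain list to pd.unique, which newer pandas versions deprecate.
g2 = (
    g1.groupby(level='user', group_keys=False)
      .apply(lambda s: s.cumsum())
      .apply(lambda items: list(dict.fromkeys(items)))
)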

pd.concat(
    [
        g1.apply(','.join), g2.apply(','.join),
        g1.str.len(), g2.str.len()
    ], axis=1, keys='Itemset CumItemset DistinctCount CumDistinctCount'.split()
).reset_index()

  user  Month Itemset CumItemset  DistinctCount  CumDistinctCount
0    A    Jan     X,Y        X,Y              2                 2
1    A    Feb       X        X,Y              1                 2
2    A  March       Z      X,Y,Z              1                 3
3    B    Jan       Z          Z              1                 1
4    B    Feb       X        Z,X              1                 2
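
To make the per-user reset explicit, here is a minimal sketch of the same logic in plain Python, without pandas. It is only an illustration, not the answer's method: the function name `cumulative_itemsets` and the tuple layout are hypothetical, and it assumes the rows already arrive in chronological order for each user. It also counts distinct items per month, whereas the answer's `g1.str.len()` counts every row in the month.

```python
from collections import defaultdict

# Rows copied from the question: (user, month, item), in the order given there
rows = [
    ("A", "Jan", "X"), ("A", "Jan", "Y"), ("A", "Feb", "X"),
    ("B", "Jan", "Z"), ("B", "Feb", "X"), ("A", "March", "Z"),
]

def cumulative_itemsets(rows):
    """Return one tuple per (user, month) with per-user running distinct items."""
    # Collect items per (user, month); dicts keep first-seen order
    per_month = defaultdict(list)
    for user, month, item in rows:
        per_month[(user, month)].append(item)

    # user -> insertion-ordered "set" of every item seen so far for that user
    seen = defaultdict(dict)
    out = []
    # sorted() is stable, so months stay in first-seen order within each user
    for (user, month), items in sorted(per_month.items(), key=lambda kv: kv[0][0]):
        distinct = list(dict.fromkeys(items))
        seen[user].update(dict.fromkeys(distinct))
        out.append((
            user, month,
            ",".join(distinct), ",".join(seen[user]),
            len(distinct), len(seen[user]),
        ))
    return out

result = cumulative_itemsets(rows)
for record in result:
    print(record)
```

Because `seen` is keyed by user, the running set starts empty for each new user, which is the reset the question asks for.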
