Question

I have a DataFrame like this:

 index   customerID    item_tag   orderID    Amount
   0         23            A         1        34.50
   1         55            B         2        11.22
   2         23            A         3         9.34
   3         55            D         4       123.44
   4         55            F         5       231.40

I also have a list of item tags, like this:

my_list = ['A', 'B', 'D']

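Now I want to check how many kinds of items from my_list each customer has ordered. For example, for customer 23 this number would be 1, because customer 23 ordered only items tagged A and none tagged B or D. Customer 55, however, ordered items B and D, so the indicator would be 2, since two item types from my_list appear in his orders. (He also ordered item F, but that item is not in my_list.)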

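So far I have tried groupby([customerId, item_tag], as_index = False).count(), but that requires creating a new DataFrame (maybe not necessarily?) and then using an if statement for every element of the list. I suspect there is a more elegant way, but I could not find one, either on Google or here. My DataFrame has millions of rows, so I am looking for the most efficient solution.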

As a result, I would like a DataFrame like this:

 index   customerID   if_A  if_B  if_D  sum_in_list
   0         23         1     0    0        1
   1         55         0     1    1        2
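For anyone who wants to run the answers below, the sample data from the question can be rebuilt roughly like this. This is only a sketch: the dtypes are assumed from the values shown, and the question's index column is taken to be the default RangeIndex.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "customerID": [23, 55, 23, 55, 55],
    "item_tag": ["A", "B", "A", "D", "F"],
    "orderID": [1, 2, 3, 4, 5],
    "Amount": [34.50, 11.22, 9.34, 123.44, 231.40],
})

# Tags of interest, as given in the question.
my_list = ["A", "B", "D"]
```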

Answer 1

Here is one approach using get_dummies + groupby; you get the counts for free:

res = pd.get_dummies(df[['customerID', 'item_tag']], columns=['item_tag'])\
        .groupby(['customerID'], as_index=False).sum()

print(res)

   customerID  item_tag_A  item_tag_B  item_tag_D  item_tag_F
0          23           2           0           0           0
1          55           0           1           1           1

If you want binary results restricted to specific tags, two more steps are needed:

L = ['A', 'B', 'D']

df_filtered = df.loc[df['item_tag'].isin(L), ['customerID', 'item_tag']] 

res = pd.get_dummies(df_filtered, columns=['item_tag'])\
        .groupby(['customerID']).any().astype(int).reset_index()

res['total_count'] = res.iloc[:, 1:].sum(axis=1)

print(res)

   customerID  item_tag_A  item_tag_B  item_tag_D  total_count
0          23           1           0           0            1
1          55           0           1           1            2
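If the exact column names from the question (if_A, if_B, if_D, sum_in_list) are wanted, the same idea can be sketched as below. The prefix argument and the reindex step are my additions, not part of the original answer; the reindex only guarantees a column for every tag in L, even one that never occurs in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": [23, 55, 23, 55, 55],
    "item_tag": ["A", "B", "A", "D", "F"],
    "orderID": [1, 2, 3, 4, 5],
    "Amount": [34.50, 11.22, 9.34, 123.44, 231.40],
})
L = ["A", "B", "D"]
flag_cols = [f"if_{tag}" for tag in L]

# Keep only rows whose tag is in L, then one-hot encode with an "if" prefix.
filtered = df.loc[df["item_tag"].isin(L), ["customerID", "item_tag"]]
res = (
    pd.get_dummies(filtered, columns=["item_tag"], prefix="if")
    .groupby("customerID").any().astype(int)   # 1 if the customer ever ordered the tag
    .reindex(columns=flag_cols, fill_value=0)  # add all-zero columns for unseen tags
    .reset_index()
)
res["sum_in_list"] = res[flag_cols].sum(axis=1)
```

Note that customers who ordered nothing from L are dropped by the filter, so they do not appear in res at all.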

Answer 2

My solution filters out the unwanted products and then groups:

wanted = df[df['item_tag'].isin(my_list)]
wanted.groupby(['customerID', 'item_tag'])\
      .count().unstack()['Amount'].fillna(0).astype(int)

#item_tag    A  B  D
#customerID         
#23          2  0  0
#55          0  1  1
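The table above holds order counts rather than 0/1 indicators (customer 23 shows 2 for A). One possible way to get from there to the indicators and the sum the question asks for is sketched below; using size() instead of count() and the name sum_in_list are my choices, not the answer's.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": [23, 55, 23, 55, 55],
    "item_tag": ["A", "B", "A", "D", "F"],
    "orderID": [1, 2, 3, 4, 5],
    "Amount": [34.50, 11.22, 9.34, 123.44, 231.40],
})
my_list = ["A", "B", "D"]

# Number of orders per (customer, tag), restricted to the wanted tags.
counts = (
    df[df["item_tag"].isin(my_list)]
    .groupby(["customerID", "item_tag"]).size()
    .unstack(fill_value=0)
)

# Turn counts into 0/1 indicators, then count distinct wanted tags per customer.
flags = (counts > 0).astype(int)
flags["sum_in_list"] = flags.sum(axis=1)
```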

Answer 3

This is a filtered cross-tabulation. Several options for doing it are shown under Question #9 of the answer here

Using `crosstab` and `clip_upper`

pd.crosstab(df.customerID, df.item_tag).clip_upper(1)[my_list]

item_tag    A  B  D
customerID         
23          1  0  0
55          0  1  1

Add assign to get the total, using a lambda to keep it inline

pd.crosstab(df.customerID, df.item_tag).clip_upper(1)[my_list].assign(
    Total=lambda d: d.sum(1))

item_tag    A  B  D  Total
customerID                
23          1  0  0      1
55          0  1  1      2
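On recent pandas versions clip_upper is no longer available (it was deprecated in 0.24 and removed in 1.0); clip(upper=1) does the same job. A sketch of the equivalent call follows. The reindex in place of [my_list] is my substitution, so that a tag missing from the data yields a zero column instead of a KeyError.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": [23, 55, 23, 55, 55],
    "item_tag": ["A", "B", "A", "D", "F"],
    "orderID": [1, 2, 3, 4, 5],
    "Amount": [34.50, 11.22, 9.34, 123.44, 231.40],
})
my_list = ["A", "B", "D"]

res = (
    pd.crosstab(df["customerID"], df["item_tag"])
    .clip(upper=1)                           # cap every count at 1
    .reindex(columns=my_list, fill_value=0)  # keep only the wanted tags
    .assign(Total=lambda d: d.sum(axis=1))
)
```

Because the crosstab is built over every tag before the wanted ones are selected, filtering df with isin(my_list) first may be cheaper when there are many distinct tags; I have not benchmarked this.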

`pandas.Series`

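An interesting alternative is to build a new Series object. I construct it so that item_tag sits in the first level of the MultiIndex, which makes it easy to use loc and slice out the tags I care about.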

s = pd.Series(1, set(zip(df.item_tag, df.customerID)))
s.loc[my_list].unstack(0, fill_value=0).assign(
    Total=lambda d: d.sum(1))

    A  B  D  Total
23  1  0  0      1
55  0  1  1      2
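A variant of the same idea, sketched under the assumption that a deterministic index order is preferred over a set: the unique (item_tag, customerID) pairs are taken with drop_duplicates and the tags are filtered up front instead of via loc. This is my rearrangement, not the answerer's code.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": [23, 55, 23, 55, 55],
    "item_tag": ["A", "B", "A", "D", "F"],
    "orderID": [1, 2, 3, 4, 5],
    "Amount": [34.50, 11.22, 9.34, 123.44, 231.40],
})
my_list = ["A", "B", "D"]

# Unique (item_tag, customerID) pairs for the wanted tags only.
pairs = df.loc[
    df["item_tag"].isin(my_list), ["item_tag", "customerID"]
].drop_duplicates()

# A Series of ones indexed by those pairs; item_tag is level 0.
s = pd.Series(1, index=pd.MultiIndex.from_frame(pairs))

# Move item_tag into the columns and add the per-customer total.
res = s.unstack(0, fill_value=0).assign(Total=lambda d: d.sum(axis=1))
```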
