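Create a new column with frequency-based categories in pandas

Time: 2020-06-26 19:18:28

Tags: python pandas dataframe

I need to create a new column as follows:

  • if an item's frequency is greater than or equal to 5, set "best seller";
  • if an item's frequency is between 2 (inclusive) and 5 (exclusive), set "ok";
  • if an item's frequency is below 2, set "bad".

Suppose my dataset looks like this: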

Items          Date 
calzini      2020/02/23
cintura      2020/02/21
maglietta    2020/02/23
maglietta    2020/02/22
cappello     2020/02/23
jeans        2020/02/23
cappello     2020/02/22
maglietta    2020/02/22
maglietta    2020/02/22
jeans        2020/02/22
jeans        2020/02/23
maglietta    2020/02/23
jeans        2020/02/22
jeans        2020/02/23

I would like to get:

Items         Category            
calzini        bad
cintura        bad
maglietta     best seller
maglietta     best seller
jeans         best seller
cappello       ok
jeans         best seller
cappello       ok
maglietta     best seller
maglietta     best seller
jeans         best seller
maglietta     best seller
jeans         best seller
jeans         best seller

I have already worked out how often each item occurs, like this:

sold_items=df.groupby(['Items'])['Date'].count().sort_values(ascending=False) # the items should be counted overall, not using a specific Date! It is about how many items were sold 

How can I create the new column from these values?
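One possible way to connect the counts in sold_items back to the individual rows, sketched under the assumption that the data lives in a DataFrame called df with the two columns shown above: because sold_items is indexed by item name, Series.map can look up the count for each row. The column name n_sold is made up for this illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Items": ["calzini", "cintura", "maglietta", "maglietta", "cappello",
              "jeans", "cappello", "maglietta", "maglietta", "jeans",
              "jeans", "maglietta", "jeans", "jeans"],
    "Date": ["2020/02/23", "2020/02/21", "2020/02/23", "2020/02/22",
             "2020/02/23", "2020/02/23", "2020/02/22", "2020/02/22",
             "2020/02/22", "2020/02/22", "2020/02/23", "2020/02/23",
             "2020/02/22", "2020/02/23"],
})

# Same counting step as in the question: one count per item.
sold_items = df.groupby(["Items"])["Date"].count().sort_values(ascending=False)

# sold_items is indexed by item name, so map() looks up each row's count.
# "n_sold" is a hypothetical column name used only for this sketch.
df["n_sold"] = df["Items"].map(sold_items)
```

With the per-row count in place, the remaining step is to turn each count into one of the three labels, which is what the answers do in different ways.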

6 answers:

Answer 0 (score: 3)

The code below should work.

-Append

Answer 1 (score: 2)

You can use GroupBy.transform together with np.select:

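import numpy as np

# Per-row count of how often that row's item occurs in the whole frame
vals = df['Items'].groupby(df['Items']).transform('count')
condlist = [vals.ge(5), vals.ge(2) & vals.lt(5), vals.lt(2)]
choicelist = ['best seller', 'ok', 'bad']
# A string default avoids mixing the str choices with np.select's default of 0
df.assign(category=np.select(condlist, choicelist, default=''))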

        Items        Date     category
0     calzini  2020/02/23          bad
1     cintura  2020/02/21          bad
2   maglietta  2020/02/23  best seller
3   maglietta  2020/02/22  best seller
4    cappello  2020/02/23           ok
5       jeans  2020/02/23  best seller
6    cappello  2020/02/22           ok
7   maglietta  2020/02/22  best seller
8   maglietta  2020/02/22  best seller
9       jeans  2020/02/22  best seller
10      jeans  2020/02/23  best seller
11  maglietta  2020/02/23  best seller
12      jeans  2020/02/22  best seller
13      jeans  2020/02/23  best seller
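np.select checks the conditions in order and uses the first one that matches, so the thresholds can also be written without explicit upper bounds. The following is a minimal self-contained sketch of that variant, assuming the sample data from the question; the "bad" label is supplied through the default argument instead of a third condition.

```python
import numpy as np
import pandas as pd

# Item column of the sample data from the question.
df = pd.DataFrame({
    "Items": ["calzini", "cintura", "maglietta", "maglietta", "cappello",
              "jeans", "cappello", "maglietta", "maglietta", "jeans",
              "jeans", "maglietta", "jeans", "jeans"],
})

# Broadcast each item's total count back onto every row of that item.
vals = df.groupby("Items")["Items"].transform("count")

# Conditions are tested in order and the first match wins, so ">= 2"
# only applies to rows that did not already match ">= 5".
condlist = [vals >= 5, vals >= 2]
choicelist = ["best seller", "ok"]

# Rows matching no condition (count below 2) get the default label.
df["category"] = np.select(condlist, choicelist, default="bad")
```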

Answer 2 (score: 2)

You can use cut on the value_counts:

pd.cut(df['Items'].value_counts(),bins=[0,1,4,10])

maglietta    (4, 10]
jeans        (4, 10]
cappello      (1, 4]
calzini       (0, 1]
cintura       (0, 1]
Name: Items, dtype: category
Categories (3, interval[int64]): [(0, 1] < (1, 4] < (4, 10]]

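Note that cut excludes the lower edge of each bin (hence the round bracket on the left) and includes the upper edge (hence the square bracket on the right). Now convert these interval labels into the ones you need: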

cats = pd.cut(df['Items'].value_counts(),bins=[0,1,4,10],labels=['bad','ok','best seller'])

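Then look up the category for each row's item and assign the result to the new column using .to_numpy(), so that the values are assigned by position instead of being aligned on the item-name index (thanks to @Ch3steR for pointing this out, see the comments):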

df['Category'] = cats[df['Items']].to_numpy()

df

    Items       Date        Category
0   calzini     2020/02/23  bad
1   cintura     2020/02/21  bad
2   maglietta   2020/02/23  best seller
3   maglietta   2020/02/22  best seller
4   cappello    2020/02/23  ok
5   jeans       2020/02/23  best seller
6   cappello    2020/02/22  ok
7   maglietta   2020/02/22  best seller
8   maglietta   2020/02/22  best seller
9   jeans       2020/02/22  best seller
10  jeans       2020/02/23  best seller
11  maglietta   2020/02/23  best seller
12  jeans       2020/02/22  best seller
13  jeans       2020/02/23  best seller

You can also use df['Category'] = df['Items'].map(cats)
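With bins=[0,1,4,10], any item sold more than 10 times falls outside the last bin and ends up as NaN. A possible variant, sketched here with the sample data from the question, uses an open-ended last bin and right=False so that the bin edges read exactly like the thresholds in the question (below 2, from 2 up to but not including 5, 5 and above).

```python
import numpy as np
import pandas as pd

# Item column of the sample data from the question.
df = pd.DataFrame({
    "Items": ["calzini", "cintura", "maglietta", "maglietta", "cappello",
              "jeans", "cappello", "maglietta", "maglietta", "jeans",
              "jeans", "maglietta", "jeans", "jeans"],
})

counts = df["Items"].value_counts()

# right=False makes each bin closed on the left: [0, 2), [2, 5), [5, inf).
# np.inf as the last edge means no count is too large for the last bin.
cats = pd.cut(
    counts,
    bins=[0, 2, 5, np.inf],
    right=False,
    labels=["bad", "ok", "best seller"],
)

# cats is indexed by item name, so map() finds the label for each row.
df["Category"] = df["Items"].map(cats)
```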

Answer 3 (score: 0)

Use groupby and transform. You also need to write a function that categorizes the items:

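def categorize(x):
    # x holds all rows of one item, so its length is that item's frequency
    num = len(x)
    if num >= 5:
        return 'best seller'
    elif num >= 2:
        return 'ok'
    else:
        return 'bad'

# Select a single column so that transform returns a Series, not a DataFrame
df['category'] = df.groupby('Items')['Items'].transform(categorize)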

Answer 4 (score: 0)

From what you describe, the category does not depend on the date (judging by the expected output).

You can simply use apply along axis 1:

def testfun(e):
    # Count how many rows share this row's item
    count = len(df[df["Items"] == e["Items"]])
    if count >= 5:
        return "best seller"
    if 2 <= count < 5:
        return "ok"
    return "bad"


df["count"] = df.apply(testfun, axis=1)

0   calzini bad
1   cintura bad
2   maglietta   best seller
3   maglietta   best seller
4   cappello    ok
5   jeans   best seller
6   cappello    ok
7   maglietta   best seller
8   maglietta   best seller
9   jeans   best seller
10  jeans   best seller
11  maglietta   best seller
12  jeans   best seller
13  jeans   best seller
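The row-wise apply above filters the whole DataFrame once per row, which can become slow on larger data. A possible alternative that keeps the same plain-function style, sketched here with the sample data from the question, counts each item once and then labels the counts. The function name label_for is hypothetical.

```python
import pandas as pd

# Item column of the sample data from the question.
df = pd.DataFrame({
    "Items": ["calzini", "cintura", "maglietta", "maglietta", "cappello",
              "jeans", "cappello", "maglietta", "maglietta", "jeans",
              "jeans", "maglietta", "jeans", "jeans"],
})


def label_for(count):
    """Turn an item's frequency into one of the three labels."""
    if count >= 5:
        return "best seller"
    if count >= 2:
        return "ok"
    return "bad"


# Count every item once, instead of re-filtering the frame for each row.
counts = df["Items"].value_counts()

# First map each row to its item's count, then each count to its label.
df["Category"] = df["Items"].map(counts).map(label_for)
```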

Answer 5 (score: 0)

You can also replace the values in value_counts according to the conditions and then map them onto the rows:

counts = df['Items'].value_counts()
counts = counts.replace(counts.values, ['best seller' if i >= 5 else ('ok' if i in [2,3,4] else 'bad') for i in counts])
df['category'] = df['Items'].map(counts)