I need to create a new column, as described below.
Suppose my dataset looks like this:
Items Date
calzini 2020/02/23
cintura 2020/02/21
maglietta 2020/02/23
maglietta 2020/02/22
cappello 2020/02/23
jeans 2020/02/23
cappello 2020/02/22
maglietta 2020/02/22
maglietta 2020/02/22
jeans 2020/02/22
jeans 2020/02/23
maglietta 2020/02/23
jeans 2020/02/22
jeans 2020/02/23
I would like to have:
Items Category
calzini bad
cintura bad
maglietta best seller
maglietta best seller
jeans best seller
cappello ok
jeans best seller
cappello ok
maglietta best seller
maglietta best seller
jeans best seller
maglietta best seller
jeans best seller
jeans best seller
I have already determined how often each item occurs, as follows:
sold_items=df.groupby(['Items'])['Date'].count().sort_values(ascending=False) # the items should be counted overall, not using a specific Date! It is about how many items were sold
How can I create a new column from these values?
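For anyone who wants to run the answers, the sample data and the frequency count above can be reproduced roughly as follows. This is a minimal sketch: the variable name `raw` and the use of `io.StringIO` are only illustrative, and the dates are left as plain strings.

```python
import io

import pandas as pd

# Sample table from the question, pasted as whitespace-separated text.
raw = """Items Date
calzini 2020/02/23
cintura 2020/02/21
maglietta 2020/02/23
maglietta 2020/02/22
cappello 2020/02/23
jeans 2020/02/23
cappello 2020/02/22
maglietta 2020/02/22
maglietta 2020/02/22
jeans 2020/02/22
jeans 2020/02/23
maglietta 2020/02/23
jeans 2020/02/22
jeans 2020/02/23
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Overall number of rows per item, regardless of the date.
sold_items = df.groupby(["Items"])["Date"].count().sort_values(ascending=False)
print(sold_items)
```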
Answer 0 (score: 3)
The code below should work.
Answer 1 (score: 2)
You can use GroupBy.transform and np.select:
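import numpy as np

vals = df['Items'].groupby(df['Items']).transform('count')
# Mutually exclusive conditions: 5 or more, 2 to 4, fewer than 2
condlist = [vals.ge(5), vals.ge(2) & vals.lt(5), vals.lt(2)]
choicelist = ['best seller', 'ok', 'bad']
# 'unknown' is a placeholder that is never used here (the conditions cover every count);
# a string default avoids mixing the implicit default 0 with string choices
df.assign(category=np.select(condlist, choicelist, default='unknown'))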
Items Date category
0 calzini 2020/02/23 bad
1 cintura 2020/02/21 bad
2 maglietta 2020/02/23 best seller
3 maglietta 2020/02/22 best seller
4 cappello 2020/02/23 ok
5 jeans 2020/02/23 best seller
6 cappello 2020/02/22 ok
7 maglietta 2020/02/22 best seller
8 maglietta 2020/02/22 best seller
9 jeans 2020/02/22 best seller
10 jeans 2020/02/23 best seller
11 maglietta 2020/02/23 best seller
12 jeans 2020/02/22 best seller
13 jeans 2020/02/23 best seller
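As a self-contained variant of the approach above, the following sketch rebuilds the item column from the question and writes the result back into the frame. The `between(2, 4)` call and the `"unknown"` default are my own choices, not part of the original answer; the thresholds are the ones used above.

```python
import numpy as np
import pandas as pd

# Item column from the question (dates omitted, they do not affect the result).
items = ["calzini", "cintura", "maglietta", "maglietta", "cappello", "jeans",
         "cappello", "maglietta", "maglietta", "jeans", "jeans", "maglietta",
         "jeans", "jeans"]
df = pd.DataFrame({"Items": items})

# Per-row count of how often that row's item occurs overall.
vals = df.groupby("Items")["Items"].transform("count")

# Conditions are checked in order; between() is inclusive on both ends.
condlist = [vals.ge(5), vals.between(2, 4), vals.lt(2)]
choicelist = ["best seller", "ok", "bad"]

# "unknown" is a hypothetical placeholder for rows that match no condition.
df["category"] = np.select(condlist, choicelist, default="unknown")
print(df)
```

Unlike `df.assign(...)`, which returns a new frame, this assigns the column in place.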
Answer 2 (score: 2)
You can use cut on value_counts:
pd.cut(df['Items'].value_counts(),bins=[0,1,4,10])
maglietta (4, 10]
jeans (4, 10]
cappello (1, 4]
calzini (0, 1]
cintura (0, 1]
Name: Items, dtype: category
Categories (3, interval[int64]): [(0, 1] < (1, 4] < (4, 10]]
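Note that cut excludes the lower edge of each bin and includes the upper edge, hence the round bracket on the left and the square bracket on the right. Now we convert these intervals into the labels you need: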
cats = pd.cut(df['Items'].value_counts(),bins=[0,1,4,10],labels=['bad','ok','best seller'])
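Then simply look up the category for each item and assign the result to a new column using .to_numpy() (thanks to @Ch3steR for pointing this out, see the comments):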
df['Category'] = cats[df['Items']].to_numpy()
df
Items Date Category
0 calzini 2020/02/23 bad
1 cintura 2020/02/21 bad
2 maglietta 2020/02/23 best seller
3 maglietta 2020/02/22 best seller
4 cappello 2020/02/23 ok
5 jeans 2020/02/23 best seller
6 cappello 2020/02/22 ok
7 maglietta 2020/02/22 best seller
8 maglietta 2020/02/22 best seller
9 jeans 2020/02/22 best seller
10 jeans 2020/02/23 best seller
11 maglietta 2020/02/23 best seller
12 jeans 2020/02/22 best seller
13 jeans 2020/02/23 best seller
You can also use df['Category'] = df['Items'].map(cats)
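Putting the cut and map steps together, a minimal end-to-end sketch could look like this. The open-ended last bin (`float("inf")` instead of 10) is my own assumption, chosen so that items sold more than 10 times do not end up as NaN; the labels and the lower edges are taken from the answer.

```python
import pandas as pd

# Item column from the question (dates omitted, they do not affect the result).
items = ["calzini", "cintura", "maglietta", "maglietta", "cappello", "jeans",
         "cappello", "maglietta", "maglietta", "jeans", "jeans", "maglietta",
         "jeans", "jeans"]
df = pd.DataFrame({"Items": items})

counts = df["Items"].value_counts()

# Bins are right-inclusive by default: (0, 1], (1, 4], (4, inf].
cats = pd.cut(counts, bins=[0, 1, 4, float("inf")],
              labels=["bad", "ok", "best seller"])

# cats is indexed by item name, so map() looks up each row's label.
df["Category"] = df["Items"].map(cats)
print(df)
```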
Answer 3 (score: 0)
Use groupby and transform. You also need to create a function that categorizes the items:
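def categorize(x):
    num = len(x)
    if num >= 5:
        return 'best seller'
    elif num >= 2:  # 2 to 4 sales count as 'ok', matching the expected output
        return 'ok'
    else:
        return 'bad'

# Select a single column so that transform returns a Series rather than a DataFrame
df['category'] = df.groupby('Items')['Items'].transform(categorize)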
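Answer 4 (score: 0)
Based on what you have defined, the category does not depend on the date (judging by the expected output).
You can simply use the apply function along axis 1: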
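def testfun(e):
    # Number of rows whose item matches the item of the current row
    count = len(df[df["Items"] == e["Items"]])
    if count >= 5:
        return "best seller"
    elif count >= 2:
        return "ok"
    else:
        return "bad"

# Despite its name, this column holds the category label
df["count"] = df.apply(testfun, axis=1)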
1 cintura bad
2 maglietta best seller
3 maglietta best seller
4 cappello ok
5 jeans best seller
6 cappello ok
7 maglietta best seller
8 maglietta best seller
9 jeans best seller
10 jeans best seller
11 maglietta best seller
12 jeans best seller
13 jeans best seller
Answer 5 (score: 0)
You can also conditionally replace the values in value_counts and then map them onto the column:
counts = df['Items'].value_counts()
counts = counts.replace(counts.values, ['best seller' if i >= 5 else ('ok' if i in [2,3,4] else 'bad') for i in counts])
df['category'] = df['Items'].map(counts)
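A variant of the same idea, sketched here under my own naming (`label` and `labels` are hypothetical helpers, not part of the answer), maps each count through a small function instead of passing two parallel lists to replace. The thresholds are the same as above.

```python
import pandas as pd

# Item column from the question (dates omitted, they do not affect the result).
items = ["calzini", "cintura", "maglietta", "maglietta", "cappello", "jeans",
         "cappello", "maglietta", "maglietta", "jeans", "jeans", "maglietta",
         "jeans", "jeans"]
df = pd.DataFrame({"Items": items})


def label(count):
    """Translate an overall sales count into a category label."""
    if count >= 5:
        return "best seller"
    if count >= 2:
        return "ok"
    return "bad"


# One label per distinct item, indexed by item name.
labels = df["Items"].value_counts().map(label)

# Look up each row's item in that Series.
df["category"] = df["Items"].map(labels)
print(df)
```

The labelling function runs once per distinct item rather than once per row, which keeps the thresholds in a single place.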