Question

I have a pandas DataFrame:

item_code    price
   1           15
   1           30
   1           60
   2           50
   3           90
   4           110
   5           130
   4           150

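We can see that the maximum price is 150. I want to split the prices into 5 bins of width 30 each (as new columns) and get the number of occurrences of each item code in each price bin.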

final df =

item_code    0-30    31-60    61-90    91-120    121-150
    1         2         1       0         0          0
    2         0         1       0         0          0
    3         0         0       1         0          0
    4         0         0       0         1          1
    5         0         0       0         0          1

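That is, item_code 1 falls in the 0-30 price range twice, so its count under the 0-30 column is 2. item_code 1 falls in the 31-60 range once, so that count is 1, and so on for the other item codes.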

I tried using pd.cut:

bins = [0, 30, 60, 90, 120,150]
df2 = pd.cut(df['price'], bins)

But it doesn't work.
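For context, pd.cut on its own only labels each row with the interval its price falls into; it does not count anything per item_code, which may be why the attempt looked like it did not work. A minimal sketch, assuming the frame from the question is rebuilt by hand as `df`:

```python
import pandas as pd

# Rebuild the frame from the question (values copied from the post).
df = pd.DataFrame({
    "item_code": [1, 1, 1, 2, 3, 4, 5, 4],
    "price": [15, 30, 60, 50, 90, 110, 130, 150],
})

bins = [0, 30, 60, 90, 120, 150]

# pd.cut returns one categorical value per row: the interval containing the price.
# Intervals are right-closed by default, so a price of 30 lands in (0, 30].
binned = pd.cut(df["price"], bins)
print(binned)
```

The result still has one entry per row, so a counting step (get_dummies plus sum, groupby plus size, or pivot_table) is needed on top of it.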

Answer 1

Setup

cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]

Option 1
Using pd.get_dummies and pd.DataFrame.join

df[['item_code']].join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))

   item_code  0-30  31-60  61-90  91-120  121-150
0          1     1      0      0       0        0
1          1     1      0      0       0        0
2          1     0      1      0       0        0
3          2     0      1      0       0        0
4          3     0      0      1       0        0
5          4     0      0      0       1        0
6          5     0      0      0       0        1
7          4     0      0      0       0        1
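The table above still has one row per original record. A rough sketch of collapsing it to one row per item_code, assuming the question's data is rebuilt as `df`; passing `dtype=int` is my addition, since pandas 2.0 and later return boolean indicator columns by default:

```python
import pandas as pd

df = pd.DataFrame({
    "item_code": [1, 1, 1, 2, 3, 4, 5, 4],
    "price": [15, 30, 60, 50, 90, 110, 130, 150],
})

cats = ["0-30", "31-60", "61-90", "91-120", "121-150"]
bins = [0, 30, 60, 90, 120, 150]

# One indicator column per bin; dtype=int keeps the 0/1 display used in this answer.
dummies = pd.get_dummies(pd.cut(df["price"], bins, labels=cats), dtype=int)
per_row = df[["item_code"]].join(dummies)

# Summing the indicators within each item_code turns them into counts.
counts = per_row.groupby("item_code").sum()
print(counts)
```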

Option 2
Using numpy's searchsorted and some string-array concatenation.

import numpy as np
import pandas as pd

add = np.char.add  # public alias of defchararray.add; numpy.core is private since NumPy 2.0

bins = np.arange(30, 121, 30)

b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

df[['item_code']].join(pd.get_dummies(cats[bins.searchsorted(df.price)]))

   item_code  0-30  120-150  30-60  60-90  90-120
0          1     1        0      0      0       0
1          1     1        0      0      0       0
2          1     0        0      1      0       0
3          2     0        0      1      0       0
4          3     0        0      0      1       0
5          4     0        0      0      0       1
6          5     0        1      0      0       0
7          4     0        1      0      0       0

If you want to sum the rows that share the same item_code, you can use groupby instead of join:

import numpy as np
add = np.char.add  # public alias of numpy's defchararray.add

bins = np.arange(30, 121, 30)

b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

pd.get_dummies(cats[bins.searchsorted(df.price)]).groupby(df.item_code).sum().reset_index()

   item_code  0-30  120-150  30-60  60-90  90-120
0          1     2        0      1      0       0
1          2     0        0      1      0       0
2          3     0        0      0      1       0
3          4     0        1      0      0       1
4          5     0        1      0      0       0
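Note that the columns above come out in lexicographic order (120-150 sorts before 30-60). A self-contained sketch of the same idea that restores the natural bin order with reindex; the name `edges` and the final reindex step are my own additions, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item_code": [1, 1, 1, 2, 3, 4, 5, 4],
    "price": [15, 30, 60, 50, 90, 110, 130, 150],
})

edges = np.arange(30, 121, 30)  # inner bin edges: 30, 60, 90, 120
b = edges.astype(str)
# np.char.add concatenates string arrays element-wise.
cats = np.char.add(np.char.add(np.append("0", b), "-"), np.append(b, "150"))

# searchsorted gives, for each price, the position of the bin it belongs to.
idx = edges.searchsorted(df["price"].to_numpy())

dummies = pd.get_dummies(cats[idx], dtype=int)
counts = (dummies.groupby(df["item_code"]).sum()
                 .reindex(columns=cats.tolist(), fill_value=0))
print(counts)
```

Because searchsorted uses the left side by default, a price equal to an edge (30, 60, 90, 120) lands in the lower bin, matching the right-closed intervals of pd.cut.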

Option 3
A fast approach using pd.factorize and np.bincount

import numpy as np
add = np.char.add  # public alias of numpy's defchararray.add

bins = np.arange(30, 121, 30)

b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

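j = bins.searchsorted(df.price)           # bin position of each price, aligned with cats
i, r = pd.factorize(df.item_code.values)  # row position and the unique item codes
n, m = r.size, cats.size                  # n rows (items), m columns (bins)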

pd.DataFrame(
    np.bincount(i * m + j, minlength=n * m).reshape(n, m),
    r, cats).rename_axis('item_code').reset_index()

   item_code  0-30  30-60  60-90  90-120  120-150
0          1     2      1      0       0        0
1          2     0      1      0       0        0
2          3     0      0      1       0        0
3          4     0      0      0       1        1
4          5     0      0      0       0        1
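The indexing trick in this option is easy to get wrong when the number of items differs from the number of bins, so here is a self-contained sketch with the shapes spelled out. It assumes rows are the distinct item codes and columns are the five bins, and it uses the bin position from searchsorted directly as the column index; the name `edges` is mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item_code": [1, 1, 1, 2, 3, 4, 5, 4],
    "price": [15, 30, 60, 50, 90, 110, 130, 150],
})

edges = np.arange(30, 121, 30)
b = edges.astype(str)
cats = np.char.add(np.char.add(np.append("0", b), "-"), np.append(b, "150"))

j = edges.searchsorted(df["price"].to_numpy())   # column position: bin index 0..4
i, r = pd.factorize(df["item_code"].to_numpy())  # row position and unique item codes
n, m = r.size, cats.size                         # rows = items, columns = bins

# Map each (row, column) pair to the flat index i * m + j, count, reshape to n x m.
counts = pd.DataFrame(
    np.bincount(i * m + j, minlength=n * m).reshape(n, m),
    index=r, columns=cats,
).rename_axis("item_code").reset_index()
print(counts)
```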

Answer 2

Using .reg and groupby + unstack

Answer 3

Add the labels parameter to cut, then group by item_code and the binned price and aggregate with size:

cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
         .size()
         .unstack(fill_value=0))
print (df2)
price      0-30  31-60  61-90  91-120  121-150
item_code                                     
1             2      1      0       0        0
2             0      1      0       0        0
3             0      0      1       0        0
4             0      0      0       1        1
5             0      0      0       0        1

EDIT: If you want a general solution that still shows bins containing no prices, add reindex:

print (df)
   item_code  price
0          1     15
1          1     30
2          1     60
3          2     50
4          3     90
5          4    110

cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=cats, fill_value=0))
print (df2)
price      0-30  31-60  61-90  91-120  121-150
item_code                                     
1             2      1      0       0        0
2             0      1      0       0        0
3             0      0      1       0        0
4             0      0      0       1        0
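A self-contained version of the reindex variant, as a sketch. Passing `observed=False` explicitly is my assumption about what is wanted here: it asks groupby to keep empty categories and avoids relying on a default that differs between pandas versions. The reindex then guarantees the column set either way.

```python
import pandas as pd

# Shortened frame from the edit above: no price falls in 121-150.
df = pd.DataFrame({
    "item_code": [1, 1, 1, 2, 3, 4],
    "price": [15, 30, 60, 50, 90, 110],
})

cats = ["0-30", "31-60", "61-90", "91-120", "121-150"]
bins = [0, 30, 60, 90, 120, 150]

binned = pd.cut(df["price"], bins, labels=cats)
df2 = (df.groupby(["item_code", binned], observed=False)
         .size()                                 # rows per (item_code, bin) pair
         .unstack(fill_value=0)                  # bins become columns
         .reindex(columns=cats, fill_value=0))   # make sure every bin is present
print(df2)
```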

Answer 4

Using cut + pivot_table:

bins = [0, 30, 60, 90, 120, 150]
labels = ['0-30', '31-60', '61-90', '91-120', '121-150']

# positional arguments are values, index, columns, aggfunc
df = df.assign(bins=pd.cut(df.price, bins, labels=labels))\
       .pivot_table('price', 'item_code', 'bins', 'count').fillna(0).astype(int)

print(df)
bins       0-30  31-60  61-90  91-120  121-150
item_code                                     
1             2      1      0       0        0
2             0      1      0       0        0
3             0      0      1       0        0
4             0      0      0       1        1
5             0      0      0       0        1

Create bins of a column and get the counts in pandas
