Question

这是我拥有的数据集，直觉上，公司代表唯一的公司标识符。 y1，y2，y3代表他们拥有的员工人数。 prob_y1，prob_y2，prob_y3是员工人数的概率。我需要根据y1，y2和y3的值对prob_y1，prob_y2和prob_y3进行分类。我已经附上了对这些功能进行分类的功能。

   Firm  y1    y2    y3   prob_y1   prob_y2   prob_y3
0     A   1     2     7  0.006897  0.000421  0.002729
1     B   2     3    45  0.013793  0.000632  0.017544
2     C   3     4    40  0.020690  0.000842  0.015595
3     D   4     7     3  0.027586  0.001474  0.001170
4     E   5     9     4  0.034483  0.001895  0.001559
5     F   6   400    12  0.041379  0.084211  0.004678
6     G   7    50    32  0.048276  0.010526  0.012476
7     H   8    70     0  0.055172  0.014737  0.000000
8     I   9    95    76  0.062069  0.020000  0.029630
9     J  10    98     1  0.068966  0.020632  0.000390
10    K  20     2    45  0.137931  0.000421  0.017544
11    L  30    10  2000  0.206897  0.002105  0.779727
12    M  40  4000   300  0.275862  0.842105  0.116959


def func(x):
    """this function is used to compute the bin"""
    if x < 5:
        return "binA"
    elif x >= 5 and x<10:
        return "binB"
    elif x >=10 and x<20:
        return "binC"
    elif x>=20 and x<30:
        return "binD"
    elif x>=30 and x<50:
        return "binE"
    elif x >=50 and x<100:

        return "binF"
    elif x>= 100 and x<200:
        return "binG"

    elif x>=200:
        return "binH"
    else:
        return 'binUnc'

当我运行以下代码时，得到的结果集是我期待的。

for i in df:
    if i !='Firm':

        df[i] = df[i].apply(func)

    print(df)

   Firm    y1    y2    y3 prob_y1 prob_y2 prob_y3
0     A  binA  binA  binB    binA    binA    binA
1     B  binA  binA  binE    binA    binA    binA
2     C  binA  binA  binE    binA    binA    binA
3     D  binA  binB  binA    binA    binA    binA
4     E  binB  binB  binA    binA    binA    binA
5     F  binB  binH  binC    binA    binA    binA
6     G  binB  binF  binE    binA    binA    binA
7     H  binB  binF  binA    binA    binA    binA
8     I  binB  binF  binF    binA    binA    binA
9     J  binC  binF  binA    binA    binA    binA
10    K  binD  binA  binE    binA    binA    binA
11    L  binE  binC  binH    binA    binA    binA
12    M  binE  binH  binH    binA    binA    binA

我想要的结果如下：

Firm y1 biny1   y2  binY2   y3  biny3   prob_y1 biny1   prob_y2 biny2   prob_y3 biny3
A   1   binA    2   binA    7   binB    0.0069  binA    0.0004  binA    0.0027  binB
B   2   binA    3   binA    45  binE    0.0138  binA    0.0006  binA    0.0175  binE
C   3   binA    4   binA    40  binE    0.0207  binA    0.0008  binA    0.0156  binE
D   4   …   7   ….  3   binA    0.0276  …   0.0015  …   0.0012  binA
E   5   ….  9   ….  4   ….  0.0345  …   0.0019  ..  0.0016  …

在症结所在，我正在根据y1，y2和y3的值对那些概率值（prob_y1，prob_y2和prob_y3）进行分类。新列可以添加到末尾，我仅出于说明目的添加。

user3483203提供的解决方案对我有用。

Answer 1

您可以使用pd.cut替换函数

bins=df[['y1','y2','y3']].apply(lambda x : pd.cut(x,[-np.inf,5,10,20,30,50,100,200,np.inf],labels=['binA','binB','binC','binD','binE','binF','binG','binH']))

然后我们使用reindex将新的数据框合并到原始df

newdf=pd.concat([df,bins.reindex(columns=df.columns[1:].str.split('_').str[-1])],axis=1)

Answer 2

您可以在此处使用numpy.digitize。

设置

b = np.array([5, 10, 20, 30, 50, 100, 200, np.inf])
l = np.array([f'bin{i}' for i in 'ABCDEFGH'])
# l = np.array(['bin{}'.format(i) for i in 'ABCDEFGH'])
cols = ('y1', 'y2', 'y3')

使用digitize创建一个辅助函数：

def foo(cols, bins, labels):
    for col in cols:
        yield (f'{col}_bin', labels[np.digitize(df[col], bins)])
        # yield ('{}_bin'.format(col), labels[np.digitize(df[col], bins)])

使用assign附加到DataFrame：

df.assign(**dict(foo(cols, b, l)))

   Firm  y1    y2    y3   prob_y1   prob_y2   prob_y3 y1_bin y2_bin y3_bin
0     A   1     2     7  0.006897  0.000421  0.002729   binA   binA   binB
1     B   2     3    45  0.013793  0.000632  0.017544   binA   binA   binE
2     C   3     4    40  0.020690  0.000842  0.015595   binA   binA   binE
3     D   4     7     3  0.027586  0.001474  0.001170   binA   binB   binA
4     E   5     9     4  0.034483  0.001895  0.001559   binB   binB   binA
5     F   6   400    12  0.041379  0.084211  0.004678   binB   binH   binC
6     G   7    50    32  0.048276  0.010526  0.012476   binB   binF   binE
7     H   8    70     0  0.055172  0.014737  0.000000   binB   binF   binA
8     I   9    95    76  0.062069  0.020000  0.029630   binB   binF   binF
9     J  10    98     1  0.068966  0.020632  0.000390   binC   binF   binA
10    K  20     2    45  0.137931  0.000421  0.017544   binD   binA   binE
11    L  30    10  2000  0.206897  0.002105  0.779727   binE   binC   binH
12    M  40  4000   300  0.275862  0.842105  0.116959   binE   binH   binH

根据其他列对列进行分类并追加到新列

2 个答案: