这是我拥有的数据集,直觉上,公司代表唯一的公司标识符。 y1,y2,y3代表他们拥有的员工人数。 prob_y1,prob_y2,prob_y3是员工人数的概率。我需要根据y1,y2和y3的值对prob_y1,prob_y2和prob_y3进行分类。我已经附上了对这些功能进行分类的功能。
Firm y1 y2 y3 prob_y1 prob_y2 prob_y3
0 A 1 2 7 0.006897 0.000421 0.002729
1 B 2 3 45 0.013793 0.000632 0.017544
2 C 3 4 40 0.020690 0.000842 0.015595
3 D 4 7 3 0.027586 0.001474 0.001170
4 E 5 9 4 0.034483 0.001895 0.001559
5 F 6 400 12 0.041379 0.084211 0.004678
6 G 7 50 32 0.048276 0.010526 0.012476
7 H 8 70 0 0.055172 0.014737 0.000000
8 I 9 95 76 0.062069 0.020000 0.029630
9 J 10 98 1 0.068966 0.020632 0.000390
10 K 20 2 45 0.137931 0.000421 0.017544
11 L 30 10 2000 0.206897 0.002105 0.779727
12 M 40 4000 300 0.275862 0.842105 0.116959
def func(x):
"""this function is used to compute the bin"""
if x < 5:
return "binA"
elif x >= 5 and x<10:
return "binB"
elif x >=10 and x<20:
return "binC"
elif x>=20 and x<30:
return "binD"
elif x>=30 and x<50:
return "binE"
elif x >=50 and x<100:
return "binF"
elif x>= 100 and x<200:
return "binG"
elif x>=200:
return "binH"
else:
return 'binUnc'
当我运行以下代码时,得到的结果集是我期待的。
for i in df:
if i !='Firm':
df[i] = df[i].apply(func)
print(df)
Firm y1 y2 y3 prob_y1 prob_y2 prob_y3
0 A binA binA binB binA binA binA
1 B binA binA binE binA binA binA
2 C binA binA binE binA binA binA
3 D binA binB binA binA binA binA
4 E binB binB binA binA binA binA
5 F binB binH binC binA binA binA
6 G binB binF binE binA binA binA
7 H binB binF binA binA binA binA
8 I binB binF binF binA binA binA
9 J binC binF binA binA binA binA
10 K binD binA binE binA binA binA
11 L binE binC binH binA binA binA
12 M binE binH binH binA binA binA
我想要的结果如下:
Firm y1 biny1 y2 binY2 y3 biny3 prob_y1 biny1 prob_y2 biny2 prob_y3 biny3
A 1 binA 2 binA 7 binB 0.0069 binA 0.0004 binA 0.0027 binB
B 2 binA 3 binA 45 binE 0.0138 binA 0.0006 binA 0.0175 binE
C 3 binA 4 binA 40 binE 0.0207 binA 0.0008 binA 0.0156 binE
D 4 … 7 …. 3 binA 0.0276 … 0.0015 … 0.0012 binA
E 5 …. 9 …. 4 …. 0.0345 … 0.0019 .. 0.0016 …
在症结所在,我正在根据y1,y2和y3的值对那些概率值(prob_y1,prob_y2和prob_y3)进行分类。新列可以添加到末尾,我仅出于说明目的添加。
user3483203提供的解决方案对我有用。
答案 0 :(得分:1)
您可以使用pd.cut
替换函数
bins=df[['y1','y2','y3']].apply(lambda x : pd.cut(x,[-np.inf,5,10,20,30,50,100,200,np.inf],labels=['binA','binB','binC','binD','binE','binF','binG','binH']))
然后我们使用reindex
将新的数据框合并到原始df
newdf=pd.concat([df,bins.reindex(columns=df.columns[1:].str.split('_').str[-1])],axis=1)
答案 1 :(得分:1)
您可以在此处使用numpy.digitize
。
设置
b = np.array([5, 10, 20, 30, 50, 100, 200, np.inf])
l = np.array([f'bin{i}' for i in 'ABCDEFGH'])
# l = np.array(['bin{}'.format(i) for i in 'ABCDEFGH'])
cols = ('y1', 'y2', 'y3')
使用digitize
创建一个辅助函数:
def foo(cols, bins, labels):
for col in cols:
yield (f'{col}_bin', labels[np.digitize(df[col], bins)])
# yield ('{}_bin'.format(col), labels[np.digitize(df[col], bins)])
使用assign
附加到DataFrame:
df.assign(**dict(foo(cols, b, l)))
Firm y1 y2 y3 prob_y1 prob_y2 prob_y3 y1_bin y2_bin y3_bin
0 A 1 2 7 0.006897 0.000421 0.002729 binA binA binB
1 B 2 3 45 0.013793 0.000632 0.017544 binA binA binE
2 C 3 4 40 0.020690 0.000842 0.015595 binA binA binE
3 D 4 7 3 0.027586 0.001474 0.001170 binA binB binA
4 E 5 9 4 0.034483 0.001895 0.001559 binB binB binA
5 F 6 400 12 0.041379 0.084211 0.004678 binB binH binC
6 G 7 50 32 0.048276 0.010526 0.012476 binB binF binE
7 H 8 70 0 0.055172 0.014737 0.000000 binB binF binA
8 I 9 95 76 0.062069 0.020000 0.029630 binB binF binF
9 J 10 98 1 0.068966 0.020632 0.000390 binC binF binA
10 K 20 2 45 0.137931 0.000421 0.017544 binD binA binE
11 L 30 10 2000 0.206897 0.002105 0.779727 binE binC binH
12 M 40 4000 300 0.275862 0.842105 0.116959 binE binH binH