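I have a dataset whose no_employees column holds str objects. What is the best way to create a new column (company_size) in the DataFrame and fill it based on the values in the no_employees column, as in the example below?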
mental_health_df = pd.read_csv("Mental Health.csv")
pd.set_option('display.max_columns', None)
mental_health_df.head(100)
no_employees |company_size
6-25 |Small
More than 1000 |Extremely Large
500-1000 |Very Large
26-100 |Medium
100-500 |Large
1-5 |Very Small
Answer 0 (score: 3)
Use pd.cut (cut is a top-level pandas function, not a DataFrame method):
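import numpy as np
import pandas as pd
# Category codes follow the sorted order of the distinct strings, so the labels are listed in that same order
df['company_size'] = pd.cut(df['no_employees'].astype('category').cat.codes * 10, [-np.inf, 9, 19, 29, 39, 49, np.inf], labels=['Very Small', 'Large', 'Medium', 'Very Large', 'Small', 'Extremely Large'])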
print(df)
no_employees company_size
0 6-25 Small
1 More than 1000 Extremely Large
2 500-1000 Very Large
3 26-100 Medium
4 100-500 Large
5 1-5 Very Small
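To reproduce the output above without the CSV file, here is a minimal sketch. It assumes a stand-in DataFrame built from only the six sample values in the question; the real dataset may contain other values or missing entries, which this sketch does not handle.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: only the six values shown in the question.
df = pd.DataFrame({
    "no_employees": ["6-25", "More than 1000", "500-1000",
                     "26-100", "100-500", "1-5"]
})

# Integer category codes (0..5), spaced out by ten so the bin edges are easy to write.
codes = df["no_employees"].astype("category").cat.codes * 10

# One bin per code; labels are given in the sorted order of the category strings.
df["company_size"] = pd.cut(
    codes,
    [-np.inf, 9, 19, 29, 39, 49, np.inf],
    labels=["Very Small", "Large", "Medium", "Very Large", "Small", "Extremely Large"],
)
print(df)
```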
How it works
# Convert no_employees to integer category codes; multiplying by ten only spaces the codes out so the bin edges are easy to define
df['no_employees'].astype('category').cat.codes * 10
# Bin the scaled codes with pd.cut; each bin holds exactly one code, and the labels follow the sorted order of the category strings
pd.cut(df['no_employees'].astype('category').cat.codes * 10,
       [-np.inf, 9, 19, 29, 39, 49, np.inf], labels=['Very Small', 'Large', 'Medium', 'Very Large', 'Small', 'Extremely Large'])
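The label order looks scrambled because category codes are assigned in the sorted (lexicographic) order of the distinct strings, not in order of company size. This also means the binning only lines up when all six values are present in the column. The sketch below first prints that code assignment, then shows a different technique that is not part of the answer above: an explicit dictionary lookup with Series.map, which does not depend on sort order. The sample DataFrame and the name size_by_range are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: the six values from the question.
df = pd.DataFrame({
    "no_employees": ["6-25", "More than 1000", "500-1000",
                     "26-100", "100-500", "1-5"]
})

# Inspect which integer code each string receives (sorted string order).
categories = list(df["no_employees"].astype("category").cat.categories)
code_by_value = {value: code for code, value in enumerate(categories)}
print(code_by_value)

# Alternative sketch: an explicit lookup table (size_by_range is an illustrative name).
size_by_range = {
    "1-5": "Very Small",
    "6-25": "Small",
    "26-100": "Medium",
    "100-500": "Large",
    "500-1000": "Very Large",
    "More than 1000": "Extremely Large",
}

# Values missing from the dictionary become NaN rather than landing in a wrong bin.
df["company_size"] = df["no_employees"].map(size_by_range)
print(df)
```

With the lookup table, adding or removing a range only requires editing the dictionary, whereas the pd.cut version needs its bin edges and label order rechecked.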