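Using get_dummies in pandas

Time: 2017-09-26 05:09:31

Tags: python pandas

I am reading the book Introduction to Machine Learning with Python. The author describes the following: for the workclass feature, the possible values are "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated".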

print("Original features:\n", list(data.columns), "\n")

data_dummies = pd.get_dummies(data)

print("Features after get_dummies:\n", list(data_dummies.columns))

Original features:
['age', 'workclass']

Features after get_dummies:
['age', 'workclass_ ?', 'workclass_ Government Employee', 'workclass_Private Employee', 'workclass_Self Employed', 'workclass_Self Employed Incorporated']

My question is: what is the new column workclass_ ?

1 answer:

Answer 0 (score: 3)

It is created from the string values of the workclass column:

import pandas as pd

data = pd.DataFrame({'age':[1,1,1,2,1,1],
                   'workclass':['Government Employee','Private Employee','Self Employed','Self Employed Incorpora ted','Self Employed Incorpora ted','?']})

print (data)
   age                    workclass
0    1          Government Employee
1    1             Private Employee
2    1                Self Employed
3    2  Self Employed Incorpora ted
4    1  Self Employed Incorpora ted
5    1                            ?

The workclass_ prefix is very useful when several columns contain the same values:

data_dummies = pd.get_dummies(data)
print (data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1   
1    1            0                              0   
2    1            0                              0   
3    2            0                              0   
4    1            0                              0   
5    1            1                              0   

   workclass_Private Employee  workclass_Self Employed  \
0                           0                        0   
1                           1                        0   
2                           0                        1   
3                           0                        0   
4                           0                        0   
5                           0                        0   

   workclass_Self Employed Incorpora ted  
0                                      0  
1                                      0  
2                                      0  
3                                      1  
4                                      1  
5                                      0  
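
One way to see where workclass_? comes from is to rebuild the expected column names by hand. The sketch below assumes the default separator '_' and a small frame similar to the one above; the names dummies and expected are illustrative only and are not part of the original answer.

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 2],
                     'workclass': ['Private Employee', 'Self Employed', '?']})

dummies = pd.get_dummies(data)

# Numeric columns pass through unchanged; each distinct string in
# 'workclass' yields one column named '<column>_<value>'.
# The '?' entry is an ordinary string value, so it gets a column too.
expected = ['age'] + ['workclass_' + v for v in sorted(data['workclass'].unique())]

print(list(dummies.columns))
print(expected)
```

Both lists should match, which suggests that workclass_? is simply the dummy column for rows whose workclass value is the string '?'.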

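If you do not need the prefix, you can pass parameters that replace it with an empty string. Start with a frame in which two columns share values: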

data = pd.DataFrame({'age':[1,1,3],
                   'workclass':['Government Employee','Private Employee','?'],
                   'workclass1':['Government Employee','Private Employee','Self Employed']})

print (data)
   age            workclass           workclass1
0    1  Government Employee  Government Employee
1    1     Private Employee     Private Employee
2    3                    ?        Self Employed

data_dummies = pd.get_dummies(data)
print (data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1   
1    1            0                              0   
2    3            1                              0   

   workclass_Private Employee  workclass1_Government Employee  \
0                           0                               1   
1                           1                               0   
2                           0                               0   

   workclass1_Private Employee  workclass1_Self Employed  
0                            0                         0  
1                            1                         0  
2                            0                         1  

With an empty prefix and separator, several columns may then share a name:

data_dummies = pd.get_dummies(data, prefix='', prefix_sep='')
print (data_dummies)
   age  ?  Government Employee  Private Employee  Government Employee  \
0    1  0                    1                 0                    1
1    1  0                    0                 1                    0
2    3  1                    0                 0                    0

   Private Employee  Self Employed
0                 0              0
1                 1              0
2                 0              1

The dummies can then be aggregated per unique column name using groupby with max.
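
A minimal sketch of the groupby/max aggregation follows. The exact call used in the original answer is not preserved here, so this is one possible way to do it: it transposes the frame and groups on the index, an assumption chosen because grouping along columns directly is deprecated in recent pandas releases. The name combined and the dtype=int argument are illustrative additions.

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 3],
                     'workclass': ['Government Employee', 'Private Employee', '?'],
                     'workclass1': ['Government Employee', 'Private Employee', 'Self Employed']})

# Empty prefix and separator: dummy columns are named after the raw values,
# so 'Government Employee' and 'Private Employee' each appear twice.
# dtype=int keeps 0/1 output instead of booleans on newer pandas versions.
data_dummies = pd.get_dummies(data, prefix='', prefix_sep='', dtype=int)

# Transpose so the duplicated column names become index labels,
# take the max per label, then transpose back.
combined = data_dummies.T.groupby(level=0).max().T
print(combined)
```

Each duplicated name collapses into a single column that is 1 wherever any of the source columns had that value. Note that the untouched age column goes through the same max step, which leaves it unchanged because its name is unique.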