Duplicate columns from Pandas get_dummies

Time: 2019-10-31 08:07:24

Tags: pandas

Take the following dataset (output of df.head()):

individual  states
1           Alaska, Hawaii 
2           Hawaii, Alaska
3           Kansas, Iowa, Maryland
4           New Jersey, Newada
5           Newada, New Jersey

If I run

df['states'].str.get_dummies(sep=',')

I get the following:

    Hawaii  Iowa    Maryland    New Jersey  Newada  Alaska  Hawaii  Kansas  New Jersey  Newada
0   1   0   0   0   0   1   0   0   0   0
1   0   0   0   0   0   1   1   0   0   0
2   0   1   1   0   0   0   0   1   0   0
3   0   0   0   0   1   0   0   0   1   0
4   0   0   0   1   0   0   0   0   0   1

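Note the duplicated columns. The values differ between the repeated occurrences of a column, so I can't simply drop the duplicates. What is causing this, and how can I fix it? Thanks in advance!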

1 Answer:

Answer 0: (score: 3)

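The problem is the separator: it needs to be ', ' (comma followed by a space). With sep=',' every value after the first keeps its leading space, so ' Hawaii' and 'Hawaii' are treated as different names and each gets its own column: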

df1 = df['states'].str.get_dummies(sep=',')

print (df1.columns)
Index([' Alaska', ' Hawaii', ' Iowa', ' Maryland', ' New Jersey', ' Newada',
       'Alaska', 'Hawaii', 'Kansas', 'New Jersey', 'Newada'],
      dtype='object')

print (df1)
    Alaska   Hawaii   Iowa   Maryland   New Jersey   Newada  Alaska  Hawaii  \
0        0        1      0          0            0        0       1       0   
1        1        0      0          0            0        0       0       1   
2        0        0      1          1            0        0       0       0   
3        0        0      0          0            0        1       0       0   
4        0        0      0          0            1        0       0       0   

   Kansas  New Jersey  Newada  
0       0           0       0  
1       0           0       0  
2       1           0       0  
3       0           1       0  
4       0           0       1  

df2 = df['states'].str.get_dummies(sep=', ')
print (df2)
   Alaska  Hawaii  Iowa  Kansas  Maryland  New Jersey  Newada
0       1       1     0       0         0           0       0
1       1       1     0       0         0           0       0
2       0       0     1       1         1           0       0
3       0       0     0       0         0           1       1
4       0       0     0       0         0           1       1
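
If the spacing around the commas is not consistent (the first row above appears to carry a trailing space, for example), a fixed sep=', ' may still leave stray columns. The following is a minimal sketch, not part of the original answer, of two ways to handle that: normalizing the whitespace before calling get_dummies, or merging the duplicate columns afterwards. The helper name dummies_ignoring_whitespace and the rebuilt sample DataFrame are assumptions made for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about the real df).
# The trailing space in the first row is kept on purpose.
df = pd.DataFrame({
    "individual": [1, 2, 3, 4, 5],
    "states": [
        "Alaska, Hawaii ",
        "Hawaii, Alaska",
        "Kansas, Iowa, Maryland",
        "New Jersey, Newada",
        "Newada, New Jersey",
    ],
})


def dummies_ignoring_whitespace(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode comma-separated values,
    ignoring any whitespace around the commas and at the string ends."""
    cleaned = (
        series.str.strip()                          # drop leading/trailing spaces
        .str.replace(r"\s*,\s*", ",", regex=True)   # "a , b" -> "a,b"
    )
    return cleaned.str.get_dummies(sep=",")


# Option 1: clean the strings first, then encode.
result = dummies_ignoring_whitespace(df["states"])
print(result)

# Option 2: repair an already-built frame that has duplicate columns.
raw = df["states"].str.get_dummies(sep=",")
raw.columns = raw.columns.str.strip()       # ' Hawaii' and 'Hawaii' now share a name
merged = raw.T.groupby(level=0).max().T     # combine same-named columns with max (logical OR)
print(merged)
```

Option 1 is usually simpler because the duplicates never appear. Option 2 can help when the dummy frame has already been built and only its column names need repairing; taking the maximum works here because every cell is 0 or 1.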