Take the following dataset (output of df.head()):
individual states
1 Alaska, Hawaii
2 Hawaii, Alaska
3 Kansas, Iowa, Maryland
4 New Jersey, Newada
5 Newada, New Jersey
If I run
df['states'].str.get_dummies(sep=',')
I get the following:
   Hawaii  Iowa  Maryland  New Jersey  Newada  Alaska  Hawaii  Kansas  New Jersey  Newada
0       1     0         0           0       0       1       0       0           0       0
1       0     0         0           0       0       1       1       0           0       0
2       0     1         1           0       0       0       0       1           0       0
3       0     0         0           0       1       0       0       0           1       0
4       0     0         0           1       0       0       0       0           0       1
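Note the duplicated columns. The values differ between the repeated columns, so I can't simply drop them. What is causing this, and how do I fix it? Thanks in advance!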
Answer 0 (score: 3)
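The problem is the separator: it needs to be ', ' (a comma followed by a space). With sep=',' alone, every state after the first keeps its leading space, so ' Hawaii' and 'Hawaii' are treated as different names and separate columns are created: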
df1 = df['states'].str.get_dummies(sep=',')
print (df1.columns)
Index([' Alaska', ' Hawaii', ' Iowa', ' Maryland', ' New Jersey', ' Newada',
'Alaska', 'Hawaii', 'Kansas', 'New Jersey', 'Newada'],
dtype='object')
print (df1)
    Alaska   Hawaii   Iowa   Maryland   New Jersey   Newada  Alaska  Hawaii  \
0        0        1      0          0            0        0       1       0
1        1        0      0          0            0        0       0       1
2        0        0      1          1            0        0       0       0
3        0        0      0          0            0        1       0       0
4        0        0      0          0            1        0       0       0

   Kansas  New Jersey  Newada
0       0           0       0
1       0           0       0
2       1           0       0
3       0           1       0
4       0           0       1
df2 = df['states'].str.get_dummies(sep=', ')
print (df2)
   Alaska  Hawaii  Iowa  Kansas  Maryland  New Jersey  Newada
0       1       1     0       0         0           0       0
1       1       1     0       0         0           0       0
2       0       0     1       1         1           0       0
3       0       0     0       0         0           1       1
4       0       0     0       0         0           1       1
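
If the real data does not always use exactly one space after each comma (an assumption; the sample shown is consistent), sep=', ' can still leave stray columns. One possible workaround is to normalize the separator first and then split on a bare comma. A minimal sketch, with the DataFrame rebuilt by hand from the sample above and a default 0-based index assumed:

```python
import pandas as pd

# Hand-built copy of the sample data (assumed; only the 'states' column matters here).
df = pd.DataFrame({
    "states": [
        "Alaska, Hawaii",
        "Hawaii, Alaska",
        "Kansas, Iowa, Maryland",
        "New Jersey, Newada",
        "Newada, New Jersey",
    ]
})

# Splitting on ',' alone keeps the leading space, so ' Hawaii' != 'Hawaii'.
naive = df["states"].str.get_dummies(sep=",")

# Collapse any whitespace around commas, trim the ends, then split on ','.
normalized = (
    df["states"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.strip()
)
dummies = normalized.str.get_dummies(sep=",")

print(naive.columns.tolist())
print(dummies)
```

This should give the same seven columns as sep=', ' on the sample, while also tolerating inputs such as 'Alaska,Hawaii' or 'Alaska ,  Hawaii'.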