I label-encoded my info.venue column as shown below, but when I try to apply one-hot encoding I get an error: ValueError: Expected 2D array, got 1D array instead.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
df['info.venue'] = labelencoder.fit_transform(df['info.venue'])
onehotencoder = OneHotEncoder()
var1 = onehotencoder.fit_transform(df['info.venue'])  # this line raises the ValueError
My column looks like this:
info.venue
Adelaide Oval
Brabourne Stadium
Kensington Oval, Bridgetown
Kingsmead
Melbourne Cricket Ground
Melbourne Cricket Ground
Melbourne Cricket Ground
Punjab Cricket Association IS Bindra Stadium, Mohali
R Premadasa Stadium
Saurashtra Cricket Association Stadium
Shere Bangla National Stadium
Stadium Australia
Sydney Cricket Ground
I want to encode these stadium names, but I get a ValueError.
Answer 0 (score: 2)
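OneHotEncoder expects a 2D array, but you passed a 1D one (a Series holding the result of labelencoder.fit_transform). This is easy to fix: use df[['info.venue']] instead of df['info.venue'] (note the double square brackets), like this: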
df['info.venue']=labelencoder.fit_transform(df['info.venue'])
R = onehotencoder.fit_transform(df[['info.venue']])
where R is a sparse 2D matrix:
In [155]: R
Out[155]:
<13x11 sparse matrix of type '<class 'numpy.float64'>'
with 13 stored elements in Compressed Sparse Row format>
In [156]: R.A
Out[156]:
array([[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
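The difference between single and double brackets can be checked directly. The following is a minimal sketch, assuming scikit-learn 0.20 or newer, where OneHotEncoder accepts string columns so the LabelEncoder step can be skipped. The shortened venue list is illustrative only, and toarray() is used in place of the .A shortcut, which newer SciPy releases may not provide.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative subset of the venues from the question (assumed sample data).
venues = [
    "Adelaide Oval",
    "Kingsmead",
    "Melbourne Cricket Ground",
    "Melbourne Cricket Ground",
    "Stadium Australia",
]
df = pd.DataFrame({"info.venue": venues})

# Single brackets select a 1D Series; double brackets select a 2D DataFrame.
series_shape = df["info.venue"].shape    # one dimension
frame_shape = df[["info.venue"]].shape   # two dimensions

# Pass the 2D selection; strings are encoded directly, one column per venue.
encoder = OneHotEncoder()
R = encoder.fit_transform(df[["info.venue"]])

dense = R.toarray()                         # dense NumPy array of 0.0 / 1.0
categories = list(encoder.categories_[0])   # column order of the encoding

print(series_shape, frame_shape)
print(categories)
print(dense)
```

Each row of the dense array contains exactly one 1, and the two Melbourne Cricket Ground rows are identical, mirroring the output shown above.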
Alternatively, you can use LabelBinarizer to get one-hot encoded values directly from the strings:
Source DF:
In [121]: df
Out[121]:
info.venue
0 Adelaide Oval
1 Brabourne Stadium
2 Kensington Oval, Bridgetown
3 Kingsmead
4 Melbourne Cricket Ground
.. ...
8 R Premadasa Stadium
9 Saurashtra Cricket Association Stadium
10 Shere Bangla National Stadium
11 Stadium Australia
12 Sydney Cricket Ground
[13 rows x 1 columns]
Solution:
In [122]: from sklearn.preprocessing import LabelBinarizer
In [123]: lb = LabelBinarizer()
In [124]: r = pd.SparseDataFrame(lb.fit_transform(df['info.venue']),
...: df.index,
...: lb.classes_,
...: default_fill_value=0)
...:
In [125]: r
Out[125]:
Adelaide Oval Brabourne Stadium Kensington Oval, Bridgetown Kingsmead Melbourne Cricket Ground \
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 0 0 0 0 1
.. ... ... ... ... ...
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
11 0 0 0 0 0
12 0 0 0 0 0
Punjab Cricket Association IS Bindra Stadium, Mohali R Premadasa Stadium Saurashtra Cricket Association Stadium \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
8 0 1 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
Shere Bangla National Stadium Stadium Australia Sydney Cricket Ground
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
8 0 0 0
9 0 0 0
10 1 0 0
11 0 1 0
12 0 0 1
[13 rows x 11 columns]
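Note that pd.SparseDataFrame was removed in pandas 1.0, so the session above only runs on older versions. A rough equivalent for current pandas might look like the sketch below. It assumes pandas 0.25 or newer (for DataFrame.sparse.from_spmatrix) and uses a shortened, illustrative venue list. pd.get_dummies is shown as a pandas-only alternative rather than as part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Illustrative subset of the venues from the question (assumed sample data).
df = pd.DataFrame({"info.venue": [
    "Adelaide Oval",
    "Kingsmead",
    "Melbourne Cricket Ground",
    "Melbourne Cricket Ground",
    "Stadium Australia",
]})

# sparse_output=True makes fit_transform return a SciPy sparse matrix.
lb = LabelBinarizer(sparse_output=True)
encoded = lb.fit_transform(df["info.venue"])

# Wrap the sparse matrix in a DataFrame backed by sparse columns,
# labelled with the original index and the learned class names.
r = pd.DataFrame.sparse.from_spmatrix(encoded, index=df.index, columns=lb.classes_)

# pandas-only alternative: one indicator column per distinct value.
dummies = pd.get_dummies(df["info.venue"], dtype=int)

print(r)
print(dummies)
```

Keep in mind that LabelBinarizer produces a single column when the data has only two distinct classes, so for a general-purpose one-hot encoding OneHotEncoder or pd.get_dummies may be the more predictable choice.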