ValueError: One Hot Encoder

Time: 2018-01-17 12:08:35

Tags: python scikit-learn one-hot-encoding

I label-encoded my info.venue column as shown below, but when I try to apply One Hot Encoding it fails with ValueError: Expected 2D array, got 1D array instead.

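from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder = LabelEncoder()
df['info.venue'] = labelencoder.fit_transform(df['info.venue'])
onehotencoder = OneHotEncoder()
# Raises ValueError: Expected 2D array, got 1D array instead
var1 = onehotencoder.fit_transform(df['info.venue'])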

My column looks like this:

info.venue
Adelaide Oval
Brabourne Stadium
Kensington Oval, Bridgetown
Kingsmead
Melbourne Cricket Ground
Melbourne Cricket Ground
Melbourne Cricket Ground
Punjab Cricket Association IS Bindra Stadium, Mohali
R Premadasa Stadium
Saurashtra Cricket Association Stadium
Shere Bangla National Stadium
Stadium Australia
Sydney Cricket Ground

I want to encode these stadium names, but I get a ValueError.

1 Answer:

Answer 0: (score: 2)

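OneHotEncoder expects a 2D array, but you passed a 1D one (a Series holding the result of labelencoder.fit_transform). This is easy to fix: use df[['info.venue']] instead of df['info.venue'] (note the double square brackets), as follows: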

df['info.venue']=labelencoder.fit_transform(df['info.venue'])
R = onehotencoder.fit_transform(df[['info.venue']])

where R is a sparse 2D matrix:

In [155]: R
Out[155]:
<13x11 sparse matrix of type '<class 'numpy.float64'>'
        with 13 stored elements in Compressed Sparse Row format>

In [156]: R.A
Out[156]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])
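
For anyone who wants to reproduce the fix end to end, here is a minimal self-contained sketch. It rebuilds the column from the question, shows the shape difference between single and double brackets, and then encodes the column. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly, so the LabelEncoder step is skipped here. The names venues, encoder, and dense are illustrative and not taken from the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Venue names copied from the question (13 rows, 11 distinct values).
venues = [
    "Adelaide Oval",
    "Brabourne Stadium",
    "Kensington Oval, Bridgetown",
    "Kingsmead",
    "Melbourne Cricket Ground",
    "Melbourne Cricket Ground",
    "Melbourne Cricket Ground",
    "Punjab Cricket Association IS Bindra Stadium, Mohali",
    "R Premadasa Stadium",
    "Saurashtra Cricket Association Stadium",
    "Shere Bangla National Stadium",
    "Stadium Australia",
    "Sydney Cricket Ground",
]
df = pd.DataFrame({"info.venue": venues})

# Single brackets select a 1D Series; double brackets select a
# 2D one-column DataFrame, which is what OneHotEncoder expects.
series_shape = df["info.venue"].shape    # (13,)
frame_shape = df[["info.venue"]].shape   # (13, 1)

# Assumption: scikit-learn >= 0.20, so string input needs no LabelEncoder.
encoder = OneHotEncoder()
R = encoder.fit_transform(df[["info.venue"]])  # SciPy sparse matrix

# toarray() converts the sparse result to a dense NumPy array.
dense = R.toarray()
print(series_shape, frame_shape, R.shape)
```

Each row of dense contains exactly one 1, in the column of that row's venue. R.toarray() is used here in place of the R.A shorthand from the transcript above, since toarray() works across SciPy versions.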

Alternatively, you can use LabelBinarizer to get One Hot Encoded values directly from the strings:

Source DF:

In [121]: df
Out[121]:
                                info.venue
0                            Adelaide Oval
1                        Brabourne Stadium
2              Kensington Oval, Bridgetown
3                                Kingsmead
4                 Melbourne Cricket Ground
..                                     ...
8                      R Premadasa Stadium
9   Saurashtra Cricket Association Stadium
10           Shere Bangla National Stadium
11                       Stadium Australia
12                   Sydney Cricket Ground

[13 rows x 1 columns]

Solution:

In [122]: from sklearn.preprocessing import LabelBinarizer

In [123]: lb = LabelBinarizer()

In [124]: r = pd.SparseDataFrame(lb.fit_transform(df['info.venue']),
     ...:                        df.index,
     ...:                        lb.classes_,
     ...:                        default_fill_value=0)
     ...:

In [125]: r
Out[125]:
    Adelaide Oval  Brabourne Stadium  Kensington Oval, Bridgetown  Kingsmead  Melbourne Cricket Ground  \
0               1                  0                            0          0                         0
1               0                  1                            0          0                         0
2               0                  0                            1          0                         0
3               0                  0                            0          1                         0
4               0                  0                            0          0                         1
..            ...                ...                          ...        ...                       ...
8               0                  0                            0          0                         0
9               0                  0                            0          0                         0
10              0                  0                            0          0                         0
11              0                  0                            0          0                         0
12              0                  0                            0          0                         0

    Punjab Cricket Association IS Bindra Stadium, Mohali  R Premadasa Stadium  Saurashtra Cricket Association Stadium  \
0                                                   0                       0                                       0
1                                                   0                       0                                       0
2                                                   0                       0                                       0
3                                                   0                       0                                       0
4                                                   0                       0                                       0
..                                                ...                     ...                                     ...
8                                                   0                       1                                       0
9                                                   0                       0                                       1
10                                                  0                       0                                       0
11                                                  0                       0                                       0
12                                                  0                       0                                       0

    Shere Bangla National Stadium  Stadium Australia  Sydney Cricket Ground
0                               0                  0                      0
1                               0                  0                      0
2                               0                  0                      0
3                               0                  0                      0
4                               0                  0                      0
..                            ...                ...                    ...
8                               0                  0                      0
9                               0                  0                      0
10                              1                  0                      0
11                              0                  1                      0
12                              0                  0                      1

[13 rows x 11 columns]
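
Note that pd.SparseDataFrame was deprecated in pandas 0.25 and removed in pandas 1.0, so the snippet above only runs on older pandas versions. The following is a rough sketch of how the same labelled result could be built on current pandas. It uses a shortened, hypothetical sample of the venue column, and the variable names (encoded, r, r_sparse, dummies) are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Shortened sample of the column from the question (3 distinct venues).
df = pd.DataFrame({"info.venue": [
    "Adelaide Oval",
    "Kingsmead",
    "Melbourne Cricket Ground",
    "Melbourne Cricket Ground",
]})

# Dense variant: wrap the binarized array in a regular DataFrame,
# using the fitted class names as column labels.
lb = LabelBinarizer()
encoded = lb.fit_transform(df["info.venue"])
r = pd.DataFrame(encoded, index=df.index, columns=lb.classes_)

# Sparse variant: ask LabelBinarizer for a SciPy sparse matrix and
# build a DataFrame with sparse columns from it.
lb_sparse = LabelBinarizer(sparse_output=True)
r_sparse = pd.DataFrame.sparse.from_spmatrix(
    lb_sparse.fit_transform(df["info.venue"]),
    index=df.index,
    columns=lb_sparse.classes_,
)

# pandas-only alternative that skips scikit-learn entirely.
dummies = pd.get_dummies(df["info.venue"])
print(r)
```

One caveat: with only two distinct classes LabelBinarizer returns a single column rather than two, so the columns=lb.classes_ labelling only lines up when the column has three or more distinct values, as it does here and in the question.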