OneHotEncoding raises IndexError: arrays used as indices must be of integer (or boolean) type

Asked: 2017-05-04 20:20:53

Tags: python scikit-learn dummy-variable one-hot-encoding index-error

I have a data frame named data with the following properties:

[880 rows x 10 columns]
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 880 entries, (123, 456) to (789, 890)
Data columns (total 10 columns):
Date_Diff            880 non-null float64
Response             880 non-null category
Len1                 880 non-null int64
Type1                877 non-null category
Len2                 880 non-null int64
Type2                880 non-null category
Len_Diff             880 non-null int64
Same_Institution     880 non-null category
Same_Type            880 non-null category
Score                880 non-null float64
dtypes: category(5), float64(2), int64(3)
memory usage: 82.0+ KB
None

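Note: the index of the data frame is built from the string columns ID1 and ID2. This is how I set the MultiIndex: data = data.set_index(['ID1','ID2'], drop = True). Because drop = True, they no longer appear among the columns of the data frame above.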

I am trying to encode the categorical variables Type1 and Type2 using LabelEncoder and OneHotEncoder. Here is my code:

# Encoding function
def encode(data):
    global cat_columns
    cat_columns = list(data.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    ohe = OneHotEncoder(categorical_features = cat_columns)
    for col in cat_columns:
        data[col] = le.fit_transform(data[col])
    data = ohe.fit_transform(data)
    return data

# Use encoding function
encode(data)

When I run this code, I get an IndexError. The error is:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-xxx> in <module>()
     14     return data
     15 
---> 16 encode(data)

<ipython-input-xxx> in encode(data)
---> 13     data = ohe.fit_transform(data)
     14     return data
     15 

/Users/username/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
   1900         """
   1901         return _transform_selected(X, self._fit_transform,
-> 1902                                    self.categorical_features, copy=True)
   1903 
   1904     def _transform(self, X):

/Users/username/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
   1706     ind = np.arange(n_features)
   1707     sel = np.zeros(n_features, dtype=bool)
-> 1708     sel[np.asarray(selected)] = True
   1709     not_sel = np.logical_not(sel)
   1710     n_selected = np.sum(sel)

IndexError: arrays used as indices must be of integer (or boolean) type

What causes this error?
I also tried removing the IDs from the index and running it again; it still throws the same error.

  

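Edit: adding a subset of the data frame here, as an HTML table.
Some columns' data types were changed later; the data types in the data frame summary above have been updated.
   Response is the target variable and is categorical.
   Same_Institution and Same_Type were changed from integers to categorical binary variables.
   Type1 and Type2 were changed from pandas object to category.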

<table><tbody>
<tr><th>ID1</th><th>ID2</th><th>Len1</th><th>Type1</th><th>Len2</th><th>Type2</th><th>Len_Diff</th><th>Date_Diff</th><th>Same_Institution</th><th>Same_Type</th><th>Score</th><th>Response</th></tr>
<tr><td>121</td><td>977</td><td>10185</td><td>PR</td><td>10185</td><td>MR</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>214</td><td>753</td><td>5039</td><td>MR</td><td>4926</td><td>MR</td><td>113</td><td>9.266666667</td><td>0</td><td>1</td><td>0.997031978</td><td>1</td></tr>
<tr><td>378</td><td>919</td><td>45404</td><td>PR</td><td>45404</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>283</td><td>685</td><td>821076</td><td>40-F</td><td>412353</td><td>AR</td><td>408723</td><td>0.35</td><td>0</td><td>0</td><td>0.888266653</td><td>0</td></tr>
<tr><td>452</td><td>837</td><td>16343</td><td>PR</td><td>16343</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>333</td><td>726</td><td>22204</td><td>PR</td><td>20897</td><td>6-K</td><td>1307</td><td>11.3</td><td>0</td><td>0</td><td>0.99251128</td><td>1</td></tr>
<tr><td>107</td><td>960</td><td>9781</td><td>6-K</td><td>6073</td><td>MR</td><td>3708</td><td>0.483333333</td><td>0</td><td>0</td><td>0.933646747</td><td>0</td></tr>
<tr><td>236</td><td>768</td><td>3375</td><td>PR</td><td>2945</td><td>MR</td><td>430</td><td>46.58333333</td><td>0</td><td>0</td><td>0.239269675</td><td>0</td></tr>
<tr><td>419</td><td>829</td><td>81247</td><td>MR</td><td>81247</td><td>MR</td><td>0</td><td>0.016666667</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>184</td><td>991</td><td>51474</td><td>PR</td><td>51474</td><td>ER</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>217</td><td>868</td><td>23714</td><td>ER</td><td>26633</td><td>8-K</td><td>2919</td><td>1.716666667</td><td>0</td><td>0</td><td>0.980611207</td><td>1</td></tr>
<tr><td>202</td><td>622</td><td>4638</td><td>MR</td><td>4638</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>308</td><td>883</td><td>73476</td><td>ER</td><td>404584</td><td>6-K</td><td>331108</td><td>12.58333333</td><td>0</td><td>0</td><td>0.825482503</td><td>0</td></tr>
<tr><td>186</td><td>880</td><td>291279</td><td>FIN SUPP</td><td>320893</td><td>6-K</td><td>29614</td><td>4.483333333</td><td>0</td><td>0</td><td>0.991668299</td><td>1</td></tr>
<tr><td>305</td><td>896</td><td>22988</td><td>PR</td><td>28554</td><td>6-K</td><td>5566</td><td>22.1</td><td>0</td><td>0</td><td>0.941192693</td><td>0</td></tr>
</tbody></table>

1 Answer:

Answer 0 (score: 1)

I ran into exactly the same error when using OneHotEncoder.

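The core problem is that the categorical_features parameter cannot handle column names; it expects integer positions or a boolean mask. From the OneHotEncoder documentation: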

categorical_features : "all" or array of indices or mask
    Specify what features are treated as categorical.

    - 'all' (default): All features are treated as categorical.
    - array of indices: Array of categorical feature indices.
    - mask: Array of length n_features and with dtype=bool.
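
The failing line in the traceback, sel[np.asarray(selected)] = True, can be reproduced with NumPy alone. The following is a minimal sketch, assuming four features and the two column names from the question; the variable names other than sel are made up for illustration.

```python
import numpy as np

n_features = 4
selected_names = ["Type1", "Type2"]  # column names, as passed in the question

# Same pattern as _transform_selected in the traceback: index a boolean
# array with whatever was given as categorical_features.
sel = np.zeros(n_features, dtype=bool)
try:
    sel[np.asarray(selected_names)] = True
    error_message = None
except IndexError as exc:
    error_message = str(exc)
print(error_message)

# Integer positions are accepted as an index.
sel_by_index = np.zeros(n_features, dtype=bool)
sel_by_index[np.asarray([1, 3])] = True

# A boolean mask of length n_features is accepted as well.
sel_by_mask = np.zeros(n_features, dtype=bool)
sel_by_mask[np.asarray([False, True, False, True])] = True
```

An array of strings is neither an integer nor a boolean index, so NumPy rejects it before scikit-learn ever looks at the data.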

What worked for me was to first build a boolean mask with a snippet like this:

cat_columns = list(data.select_dtypes(include=['category','object']))
column_mask = []
for column_name in list(data.columns.values):
    column_mask.append(column_name in cat_columns)

# And then pass the column_mask into the OneHotEncoder
ohe = OneHotEncoder(categorical_features = column_mask)
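
The same mask can probably be built without an explicit loop. This is only a sketch on a small made-up frame (the values are borrowed from the sample table in the question), using pandas' Index.isin and Index.get_loc; column_indices is a name introduced here to show the "array of indices" form the documentation also allows.

```python
import pandas as pd

# Toy frame standing in for the question's data; only four columns are kept.
data = pd.DataFrame({
    "Len1": [10185, 5039, 45404],
    "Type1": pd.Categorical(["PR", "MR", "PR"]),
    "Len2": [10185, 4926, 45404],
    "Type2": pd.Categorical(["MR", "MR", "PR"]),
})

cat_columns = list(data.select_dtypes(include=["category", "object"]))

# Boolean mask with one entry per column, in column order.
column_mask = data.columns.isin(cat_columns).tolist()

# Alternative: integer positions of the categorical columns.
column_indices = [data.columns.get_loc(name) for name in cat_columns]

print(column_mask)
print(column_indices)
```

Either column_mask or column_indices should be usable wherever the answer passes column_mask.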

So your original function would become:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encoding function
def encode(data):
    global cat_columns
    cat_columns = list(data.select_dtypes(include=['category','object']))
    column_mask = []
    for column_name in list(data.columns.values):
        column_mask.append(column_name in cat_columns)
    le = LabelEncoder()
    ohe = OneHotEncoder(categorical_features = column_mask)
    for col in cat_columns:
        data[col] = le.fit_transform(data[col])
    data = ohe.fit_transform(data)
    return data

# Use encoding function
encode(data)
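
Note that categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22, so the function above only runs on older releases. On newer releases, one possible substitute (a different technique from the one in this answer) is ColumnTransformer, which accepts column names directly when given a DataFrame. The sketch below assumes a small made-up frame with values taken from the sample table; sparse_threshold=0 is set only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the question's data.
data = pd.DataFrame({
    "Len1": [10185, 5039, 45404, 821076],
    "Type1": pd.Categorical(["PR", "MR", "PR", "40-F"]),
    "Type2": pd.Categorical(["MR", "MR", "PR", "AR"]),
})

cat_columns = list(data.select_dtypes(include=["category", "object"]))

# One-hot encode the categorical columns by name; pass the rest through.
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_columns)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array in this sketch
)
encoded = transformer.fit_transform(data)

# 3 categories in Type1 + 3 categories in Type2 + 1 passthrough column
print(encoded.shape)
```

With this approach the LabelEncoder step is not needed, because OneHotEncoder in these newer releases handles string categories itself.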