I have a dataframe named data with the following attributes:
[880 rows x 10 columns]
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 880 entries, (123, 456) to (789, 890)
Data columns (total 10 columns):
Date_Diff           880 non-null float64
Response            880 non-null category
Len1                880 non-null int64
Type1               877 non-null category
Len2                880 non-null int64
Type2               880 non-null category
Len_Diff            880 non-null int64
Same_Institution    880 non-null category
Same_Type           880 non-null category
Score               880 non-null float64
dtypes: category(5), float64(2), int64(3)
memory usage: 82.0+ KB
None
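Note: the index of the dataframe consists of the string columns ID1 and ID2. This is how I set the MultiIndex: data = data.set_index(['ID1','ID2'], drop = True). Because drop = True, they do not appear among the columns of the dataframe above.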
I am trying to encode the categorical variables Type1 and Type2 using LabelEncoder and OneHotEncoder. Here is my code:
# Encoding function
def encode(data):
    global cat_columns
    cat_columns = list(data.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    ohe = OneHotEncoder(categorical_features = cat_columns)
    for col in cat_columns:
        data[col] = le.fit_transform(data[col])
    data = ohe.fit_transform(data)
    return data

# Use encoding function
encode(data)
When I run this code, I get an IndexError. The error is:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-xxx> in <module>()
14 return data
15
---> 16 encode(data)
<ipython-input-xxx> in encode(data)
---> 13 data = ohe.fit_transform(data)
14 return data
15
/Users/username/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
1900 """
1901 return _transform_selected(X, self._fit_transform,
-> 1902 self.categorical_features, copy=True)
1903
1904 def _transform(self, X):
/Users/username/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
1706 ind = np.arange(n_features)
1707 sel = np.zeros(n_features, dtype=bool)
-> 1708 sel[np.asarray(selected)] = True
1709 not_sel = np.logical_not(sel)
1710 n_selected = np.sum(sel)
IndexError: arrays used as indices must be of integer (or boolean) type
What is causing this error?
I also tried removing the IDs from the index and running it again; it still throws the same error.
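Edit: adding a subset of the dataframe below as an HTML table.
Some columns' data types have been changed since then. The data types in the dataframe attributes above have been updated accordingly.
Response is the target variable and is categorical.
Same_Institution and Same_Type were changed from integer to categorical binary variables.
Type1 and Type2 were changed from pandas object to category.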
<table><tbody>
<tr><th>ID1</th><th>ID2</th><th>Len1</th><th>Type1</th><th>Len2</th><th>Type2</th><th>Len_Diff</th><th>Date_Diff</th><th>Same_Institution</th><th>Same_Type</th><th>Score</th><th>Response</th></tr>
<tr><td>121</td><td>977</td><td>10185</td><td>PR</td><td>10185</td><td>MR</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>214</td><td>753</td><td>5039</td><td>MR</td><td>4926</td><td>MR</td><td>113</td><td>9.266666667</td><td>0</td><td>1</td><td>0.997031978</td><td>1</td></tr>
<tr><td>378</td><td>919</td><td>45404</td><td>PR</td><td>45404</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>283</td><td>685</td><td>821076</td><td>40-F</td><td>412353</td><td>AR</td><td>408723</td><td>0.35</td><td>0</td><td>0</td><td>0.888266653</td><td>0</td></tr>
<tr><td>452</td><td>837</td><td>16343</td><td>PR</td><td>16343</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>333</td><td>726</td><td>22204</td><td>PR</td><td>20897</td><td>6-K</td><td>1307</td><td>11.3</td><td>0</td><td>0</td><td>0.99251128</td><td>1</td></tr>
<tr><td>107</td><td>960</td><td>9781</td><td>6-K</td><td>6073</td><td>MR</td><td>3708</td><td>0.483333333</td><td>0</td><td>0</td><td>0.933646747</td><td>0</td></tr>
<tr><td>236</td><td>768</td><td>3375</td><td>PR</td><td>2945</td><td>MR</td><td>430</td><td>46.58333333</td><td>0</td><td>0</td><td>0.239269675</td><td>0</td></tr>
<tr><td>419</td><td>829</td><td>81247</td><td>MR</td><td>81247</td><td>MR</td><td>0</td><td>0.016666667</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>184</td><td>991</td><td>51474</td><td>PR</td><td>51474</td><td>ER</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>217</td><td>868</td><td>23714</td><td>ER</td><td>26633</td><td>8-K</td><td>2919</td><td>1.716666667</td><td>0</td><td>0</td><td>0.980611207</td><td>1</td></tr>
<tr><td>202</td><td>622</td><td>4638</td><td>MR</td><td>4638</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>308</td><td>883</td><td>73476</td><td>ER</td><td>404584</td><td>6-K</td><td>331108</td><td>12.58333333</td><td>0</td><td>0</td><td>0.825482503</td><td>0</td></tr>
<tr><td>186</td><td>880</td><td>291279</td><td>FIN SUPP</td><td>320893</td><td>6-K</td><td>29614</td><td>4.483333333</td><td>0</td><td>0</td><td>0.991668299</td><td>1</td></tr>
<tr><td>305</td><td>896</td><td>22988</td><td>PR</td><td>28554</td><td>6-K</td><td>5566</td><td>22.1</td><td>0</td><td>0</td><td>0.941192693</td><td>0</td></tr>
</tbody></table>
Answer 0 (score: 1)
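I ran into exactly the same error when using OneHotEncoder.
The core problem is that the categorical_features parameter cannot handle column names. From the OneHotEncoder documentation: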
categorical_features : "all" or array of indices or mask
    Specify what features are treated as categorical.
    - 'all' (default): All features are treated as categorical.
    - array of indices: Array of categorical feature indices.
    - mask: Array of length n_features and with dtype=bool.
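To see why column names trigger the error, here is a minimal sketch that reproduces the failing line from the traceback (sel[np.asarray(selected)] = True) with NumPy alone. The names Type1 and Type2 and the positions 3 and 5 are taken from the column order in the question's dataframe summary; everything else is illustrative.

```python
import numpy as np

n_features = 10
sel = np.zeros(n_features, dtype=bool)

# Column names, as passed in the question: NumPy cannot index with strings.
selected_names = ["Type1", "Type2"]
error = None
try:
    sel[np.asarray(selected_names)] = True
except IndexError as exc:
    error = exc

print(type(error).__name__)

# Integer positions of the same columns are accepted.
sel[np.asarray([3, 5])] = True
print(sel.sum())
```

The string array is rejected as an index, while integer positions (or a boolean mask of length n_features) are accepted, which is exactly what the documentation excerpt requires.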
What worked for me was to first build a boolean mask with a snippet like this:
cat_columns = list(data.select_dtypes(include=['category','object']))
column_mask = []
for column_name in list(data.columns.values):
    column_mask.append(column_name in cat_columns)

# And then pass the column_mask into the OneHotEncoder
ohe = OneHotEncoder(categorical_features = column_mask)
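The same mask can probably be built without an explicit loop. The sketch below uses a small hypothetical frame (four columns, made up for illustration from the first rows of the question's table) and shows both forms the documentation allows: a boolean mask via Index.isin and integer positions via Index.get_loc.

```python
import pandas as pd

# Hypothetical miniature of the question's dataframe (illustration only).
data = pd.DataFrame({
    "Len1": [10185, 5039, 45404],
    "Type1": pd.Categorical(["PR", "MR", "PR"]),
    "Len2": [10185, 4926, 45404],
    "Type2": pd.Categorical(["MR", "MR", "PR"]),
})

cat_columns = list(data.select_dtypes(include=["category", "object"]))

# Boolean mask with one entry per column, in column order.
column_mask = data.columns.isin(cat_columns)

# Equivalent integer positions of the categorical columns.
column_indices = [data.columns.get_loc(name) for name in cat_columns]

print(column_mask.tolist())
print(column_indices)
```

Either column_mask or column_indices could then be passed where the loop-built mask is used.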
So your original function would become:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encoding function
def encode(data):
    global cat_columns
    cat_columns = list(data.select_dtypes(include=['category','object']))
    # Boolean mask: True for each column that is categorical
    column_mask = []
    for column_name in list(data.columns.values):
        column_mask.append(column_name in cat_columns)
    le = LabelEncoder()
    ohe = OneHotEncoder(categorical_features = column_mask)
    # OneHotEncoder expects integer codes here, so label-encode first
    for col in cat_columns:
        data[col] = le.fit_transform(data[col])
    data = ohe.fit_transform(data)
    return data

# Use encoding function
encode(data)
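Note that categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22, so the function above only runs on older versions. On newer versions, one possible alternative is ColumnTransformer, which selects columns by name and lets OneHotEncoder handle string categories directly, so the LabelEncoder step is not needed. The following is a sketch on a small hypothetical frame (made up from the first rows of the question's table), not a drop-in replacement for the original function.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the question's dataframe (illustration only).
data = pd.DataFrame({
    "Len1": [10185, 5039, 45404, 821076],
    "Type1": pd.Categorical(["PR", "MR", "PR", "40-F"]),
    "Len2": [10185, 4926, 45404, 412353],
    "Type2": pd.Categorical(["MR", "MR", "PR", "AR"]),
})

cat_columns = ["Type1", "Type2"]

encoder = ColumnTransformer(
    transformers=[
        # One-hot encode only the named categorical columns.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_columns),
    ],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = encoder.fit_transform(data)
print(encoded.shape)
```

The one-hot columns come first in the output, followed by the passed-through numeric columns.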