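IndexError: arrays used as indices must be of integer (or boolean) type, raised during OneHotEncoding

Time: 2019-11-14 14:31:02

Tags: python dataframe scikit-learn categorical-data one-hot-encoding

I have a dataframe with categorical variables and I want to apply OneHotEncoder to it. I can work around my problem by running LabelEncoder before OneHotEncoder, but that does not make sense to me, because since the latest updates OneHotEncoder accepts strings for categorical variables.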

This is what I tried. I attempted to get past the error using both positional indexes and column names.

Sample dataframe you can use to test the code:

import pandas as pd

data = pd.DataFrame({'col1': {0: 'ab321', 1: 'ab568', 2: 'mkld78'},
 'col2': {0: 'Red', 1: 'Blue', 2: 'Green'},
 'col3': {0: 'First', 1: 'Second', 2: 'Third'},
 'col4': {0: 'Wisconsin', 1: 'California', 2: 'Portland'},
 'col5': {0: 'a', 1: 'f', 2: 'g'},
 'col6': {0: 1, 1: 2, 2: 3},
 'target': {0: 0, 1: 0, 2: 1}})

#Index
# OneHotEncoding

from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

#Load data
train = pd.read_csv("data_train.csv")
test = pd.read_csv("data_test.csv")

X= train.drop(["target"], axis = 1)
y= train["target"]
# Filter categorical columns
categorical_columns = ["col1","col2","col3","col4","col5"]
categorical_indexes = np.where(X.dtypes == 'object')[0]

# OHE
ohe = OneHotEncoder(categorical_features = categorical_columns)
# reshape data
for index in categorical_indexes:
    X.iloc[:,index] = ohe.fit_transform(X.iloc[:,index].values.reshape(-1,1))

Error:

IndexError: arrays used as indices must be of integer (or boolean) type

#Column Names

# OneHotEncoding

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("data_train.csv")
test = pd.read_csv("data_test.csv")

X= train.drop(["target"], axis = 1)
y= train["target"]

# Filter categorical columns
categorical_columns = ["col1","col2","col3","col4","col5"]
categorical_indexes = np.where(X.dtypes == 'object')[0]

# OHE
ohe = OneHotEncoder(categorical_features = categorical_columns)
# reshape data
for column in categorical_columns:
    X[column] = ohe.fit_transform(X[column].values.reshape(-1,1))

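1 answer:

Answer 0 (score: 1)

You are misunderstanding how OneHotEncoder is meant to be used. Rather than fitting it column by column inside a loop and writing the result back into the dataframe, fit it once on the whole training set.

Use this: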

# OneHotEncoding

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'col1': {0: 'ab321', 1: 'ab568', 2: 'mkld78'},
 'col2': {0: 'Red', 1: 'Blue', 2: 'Green'},
 'col3': {0: 'First', 1: 'Second', 2: 'Third'},
 'col4': {0: 'Wisconsin', 1: 'California', 2: 'Portland'},
 'col5': {0: 'a', 1: 'f', 2: 'g'},
 'col6': {0: 1, 1: 2, 2: 3},
 'target': {0: 0, 1: 0, 2: 1}})

train = data.iloc[0:2,:]
test = data.iloc[2:,:]

X= train.drop(["target"], axis = 1)
y= train["target"]

# Filter categorical columns
categorical_columns = ["col1","col2","col3","col4","col5"]
categorical_indexes = np.where(X.dtypes == 'object')[0]

# OHE
ohe = OneHotEncoder()
X_ = ohe.fit_transform(X)

X_
# <2x12 sparse matrix of type '<type 'numpy.float64'>'
#  with 12 stored elements in Compressed Sparse Row format>
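
Note that the snippet above one-hot encodes every column, including the numeric col6. If you only want the string columns encoded, one possible approach is to wrap the encoder in a ColumnTransformer, fit it on the training rows, and reuse the fitted categories on the test rows. The following is a minimal sketch under those assumptions, not necessarily what the original poster needs: the names preprocessor, to_dense, X_train_enc and X_test_enc are made up for illustration, and handle_unknown="ignore" is my own choice so that categories not seen during fitting become all-zero rows instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same sample data as in the question.
data = pd.DataFrame({
    "col1": ["ab321", "ab568", "mkld78"],
    "col2": ["Red", "Blue", "Green"],
    "col3": ["First", "Second", "Third"],
    "col4": ["Wisconsin", "California", "Portland"],
    "col5": ["a", "f", "g"],
    "col6": [1, 2, 3],
    "target": [0, 0, 1],
})

train, test = data.iloc[:2], data.iloc[2:]
X_train = train.drop(columns=["target"])
X_test = test.drop(columns=["target"])

categorical_columns = ["col1", "col2", "col3", "col4", "col5"]

# Encode only the string columns and pass col6 through unchanged.
# handle_unknown="ignore" (an assumption for this sketch) encodes
# categories unseen during fit as all zeros instead of raising.
preprocessor = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore"), categorical_columns)],
    remainder="passthrough",
)


def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


X_train_enc = to_dense(preprocessor.fit_transform(X_train))  # fit on training data only
X_test_enc = to_dense(preprocessor.transform(X_test))        # reuse the fitted categories

print(X_train_enc.shape)  # (2, 11): 5 columns x 2 categories each, plus col6
print(X_test_enc.shape)   # (1, 11)
```

With only two training rows, every categorical value in the test row is unseen, so its ten encoded columns are all zero and only the passed-through col6 value survives. On real data you would expect most test categories to have been seen during fitting.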