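I am working on the Titanic problem on Kaggle (https://www.kaggle.com/c/titanic). I am trying to use the LabelEncoder and OneHotEncoder classes from sklearn.preprocessing to encode the "Sex" column as a categorical feature. Here is my code: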
# Importing data analysis libraries
import pandas as pd
import numpy as np
import random as rnd
# Importing data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Importing Machine Learning Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# Getting the datasets
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
combine = [train, test]
# Feature visualizations
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age', bins=20)
grid = sns.FacetGrid(train, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Parch', bins=20)
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'SibSp', bins=20)
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Fare', bins=20)
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Sex', bins=20)
# taking care of missing values
train.fillna(train.median(), inplace = True)
# Categorising Embarked and Sex features
# train['Embarked'] = train['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} )
# train['Sex'] = train['Sex'].map( {'male': 0, 'female': 1} )
# Data preprocessing
X_train = train.iloc[:, [0, 2, 4, 5, 6, 7, 9]].values
y_train = train.iloc[:, [1]].values
X_test = test.iloc[:, [1, 3, 4, 5, 6, 8]].values
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, StandardScaler
labelencoder_X=LabelEncoder()
X_train[:, 0]=labelencoder_X.fit_transform(X_train[:, 0])
onehotencoder=OneHotEncoder(categorical_features=[0])
X_train=onehotencoder.fit_transform(X_train).toarray()
When I run the last 5 lines, I get the following error:
Traceback (most recent call last):
File "<ipython-input-58-770fc19a6644>", line 5, in <module>
X_train=onehotencoder.fit_transform(X_train).toarray()
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 2019, in fit_transform
self.categorical_features, copy=True)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 1809, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'male'
What is my mistake? Is there an alternative technique for encoding categorical data effectively?
Answer 0 (score: 1)
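OneHotEncoder (in the scikit-learn version shown in your traceback) expects integer values, which is why it complains about the string value 'male'.
You can either encode the non-numeric values as integers with LabelEncoder first and then apply OneHotEncoder, or use LabelBinarizer to one-hot encode a non-numeric column directly.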
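Here is a minimal sketch of both options. It uses a small made-up array in place of the real "Sex" column, and the variable names are only illustrative. Note also that, assuming the standard Kaggle column order, the iloc selection in the question puts "Sex" at index 2 of X_train, not index 0 (which is PassengerId), so the question's code label-encodes the wrong column.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Toy stand-in for the "Sex" column (made-up values, not the Kaggle data).
sex = np.array(['male', 'female', 'female', 'male'])

# Option 1: LabelEncoder turns the strings into integers
# (classes are sorted, so female -> 0 and male -> 1) ...
sex_int = LabelEncoder().fit_transform(sex)

# ... and OneHotEncoder then expands the integer column into one
# indicator column per class. It expects a 2-D input, hence the reshape.
sex_onehot = OneHotEncoder().fit_transform(sex_int.reshape(-1, 1)).toarray()

# Option 2: LabelBinarizer works on the string column directly.
# For a two-class column it returns a single 0/1 column.
sex_binary = LabelBinarizer().fit_transform(sex)

print(sex_int)
print(sex_onehot)
print(sex_binary)
```

For a feature with only two classes, the single 0/1 column from LabelBinarizer is usually enough; the two one-hot columns carry the same information twice.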