ValueError when encoding categorical string data

Asked: 2018-01-14 20:27:35

Tags: python python-3.x machine-learning scikit-learn data-analysis

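I am trying to solve the Titanic problem on Kaggle (https://www.kaggle.com/c/titanic). I am trying to use the LabelEncoder and OneHotEncoder classes from sklearn.preprocessing to encode the "Sex" column as a categorical feature. Here is my code: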

# Importing data analysis libraries
import pandas as pd
import numpy as np
import random as rnd

# Importing data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Importing Machine Learning Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# Getting the datasets
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
combine = [train, test]

# Feature visualizations
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age', bins=20)

grid = sns.FacetGrid(train, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Parch', bins=20)

g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'SibSp', bins=20)

g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Fare', bins=20)

g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Sex', bins=20)

# taking care of missing values
train.fillna(train.median(), inplace = True)

# Categorising Embarked and Sex features
# train['Embarked'] = train['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} )
# train['Sex'] = train['Sex'].map( {'male': 0, 'female': 1} )

# Data preprocessing
X_train = train.iloc[:, [0, 2, 4, 5, 6, 7, 9]].values
y_train = train.iloc[:, [1]].values
X_test  = test.iloc[:, [1, 3, 4, 5, 6, 8]].values

from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, StandardScaler
labelencoder_X=LabelEncoder()
X_train[:, 0]=labelencoder_X.fit_transform(X_train[:, 0])
onehotencoder=OneHotEncoder(categorical_features=[0])
X_train=onehotencoder.fit_transform(X_train).toarray()

When I execute the last 5 lines, I get the following error:

Traceback (most recent call last):

  File "<ipython-input-58-770fc19a6644>", line 5, in <module>
    X_train=onehotencoder.fit_transform(X_train).toarray()

  File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)

  File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "C:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 'male'

What is my mistake? Are there any alternative techniques for encoding categorical data efficiently?

1 answer:

Answer 0 (score: 1)

OneHotEncoder expects integer values here, which is why it complains about the string value 'male'.
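It may also help to check which column actually sits at index 0 of X_train. The sketch below assumes the standard column order of Kaggle's train.csv and uses two made-up rows instead of the real file. Under that assumption, index 0 of the selected columns is PassengerId and "Sex" sits at index 2, so the LabelEncoder call in the question never touches the "Sex" strings.

```python
import pandas as pd

# Assumed column order of Kaggle's Titanic train.csv
columns = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
           'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

# Two made-up rows standing in for the real data
train = pd.DataFrame(
    [[1, 0, 3, 'Passenger A', 'male', 22.0, 1, 0, 'T1', 7.25, None, 'S'],
     [2, 1, 1, 'Passenger B', 'female', 38.0, 1, 0, 'T2', 71.28, 'C85', 'C']],
    columns=columns,
)

# Same positional selection as in the question
selected = train.columns[[0, 2, 4, 5, 6, 7, 9]].tolist()

# Position of 'Sex' inside the selected feature matrix
sex_position = selected.index('Sex')

print(selected)
print(sex_position)
```

If this matches the real file, the encoders in the question would need to target index 2 rather than index 0.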

You can first use LabelEncoder to encode the non-numeric values as numbers, and then apply OneHotEncoder.
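A minimal sketch of that two-step approach follows, using a small made-up array rather than the Titanic data. It avoids the categorical_features argument from the question, which was removed in later scikit-learn releases, and instead one-hot encodes the single column and stacks it back with the rest. Note that from scikit-learn 0.20 onward OneHotEncoder can take string columns directly, so the LabelEncoder step matters mostly on older versions like the one in the traceback.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up rows with three features: Pclass, Sex, Age
X = np.array([[3, 'male', 22.0],
              [1, 'female', 38.0],
              [3, 'female', 26.0]], dtype=object)

sex_col = 1  # position of the string column in this toy array

# Step 1: map the strings to integer codes (classes are sorted: female=0, male=1)
label_encoder = LabelEncoder()
sex_codes = label_encoder.fit_transform(X[:, sex_col])

# Step 2: one-hot encode the integer codes; the encoder needs a 2D input
one_hot_encoder = OneHotEncoder()
sex_one_hot = one_hot_encoder.fit_transform(sex_codes.reshape(-1, 1)).toarray()

# Put the encoded columns next to the remaining numeric features
other_features = np.delete(X, sex_col, axis=1).astype(float)
X_encoded = np.hstack([sex_one_hot, other_features])

print(X_encoded)
```

The first two columns of X_encoded are the one-hot indicators for female and male, followed by Pclass and Age.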

Alternatively, use LabelBinarizer to one-hot encode a single non-numeric column.
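As a rough illustration of the LabelBinarizer route, the sketch below encodes two made-up columns. One detail worth knowing: for a column with only two classes, such as "Sex", LabelBinarizer returns a single 0/1 column rather than two indicator columns, while a column with three or more classes gets one column per class.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up values standing in for the 'Sex' and 'Embarked' columns
sex = np.array(['male', 'female', 'female', 'male'])
embarked = np.array(['S', 'C', 'Q', 'S'])

# Two classes: a single column, female=0 and male=1
sex_binary = LabelBinarizer().fit_transform(sex)

# Three classes: one indicator column per class, ordered C, Q, S
embarked_one_hot = LabelBinarizer().fit_transform(embarked)

print(sex_binary.shape)
print(embarked_one_hot)
```

Because a two-class feature carries the same information in one column as in two, the single-column output is usually enough as model input.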