"Could not convert string to float" error in the Titanic competition

Time: 2018-06-22 23:20:42

Tags: python pandas numpy machine-learning scikit-learn

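I'm trying to solve the Titanic survival problem from Kaggle. This is my first real step in learning machine learning. I've run into a problem where the gender column causes an error. The stack trace says could not convert string to float: 'female'. How do you deal with this kind of problem? I don't want a ready-made solution. I just want a practical approach to handling it, because I do need the gender column to build my model.

Here is my code: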

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)

x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)

val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

1 answer:

Answer 0: (score: 8)

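There are a couple of ways to deal with this, and which one fits depends on what you're looking for:

  1. You can encode your categories as numeric values, i.e. transform each level of the category into a distinct number, or

  2. dummy code your category, i.e. turn each level of the category into a separate column whose value is 0 or 1.

In many machine learning applications, factors are better handled as dummy codes.

Note that in the case of a 2-level category, encoding to numeric with the methods outlined below is essentially equivalent to dummy coding: all values that are not level 0 are necessarily level 1. In fact, the dummy code example given below contains redundant information, because I've given each of the 2 classes its own column. That is only to illustrate the concept. Typically, one would create only n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).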
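The equivalence described above can be checked on a small example. This is a minimal sketch on a made-up four-row column (not Titanic data); the names `codes`, `levels` and `dummies` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical toy column, not taken from the Titanic data
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

# Numeric encoding: levels are numbered in order of first appearance
codes, levels = pd.factorize(df["gender"])

# Dummy coding with one column dropped: n - 1 = 1 column for n = 2 levels
dummies = pd.get_dummies(df["gender"], drop_first=True, dtype=int)

print(list(levels))               # which label got which code
print(codes.tolist())             # numeric encoding
print(dummies["Male"].tolist())   # the single remaining dummy column
```

In this example the two encodings coincide because Female both appears first (so pd.factorize gives it 0) and sorts first (so drop_first removes it). With a different row order the codes could come out flipped, which still carries the same information.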

Encoding categories to numeric:

Method 1: pd.factorize

pd.factorize is a simple, fast way of encoding to numeric.

For example, if your column gender looks like this:

>>> df
   gender
0  Female
1    Male
2    Male
3    Male
4  Female
5  Female
6    Male
7  Female
8  Female
9  Female

df['gender_factor'] = pd.factorize(df.gender)[0]

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0

Method 2: categorical dtype

Another way is to use the category dtype:

df['gender_factor'] = df['gender'].astype('category').cat.codes

This results in the same output.

Method 3: sklearn.preprocessing.LabelEncoder()

This method comes with some bonuses, such as easy back-transformation:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0

# Easy to back transform:

df['gender_factor'] = le.inverse_transform(df.gender_factor)

>>> df
   gender gender_factor
0  Female        Female
1    Male          Male
2    Male          Male
3    Male          Male
4  Female        Female
5  Female        Female
6    Male          Male
7  Female        Female
8  Female        Female
9  Female        Female

Dummy coding:

Method 1: pd.get_dummies

df.join(pd.get_dummies(df.gender))

   gender  Female  Male
0  Female       1     0
1    Male       0     1
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8  Female       1     0
9  Female       1     0

Note that if you want to omit one column to get a non-redundant dummy code (see the note at the beginning of this answer), you can use:

df.join(pd.get_dummies(df.gender, drop_first=True))
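To tie the encodings back to the question, here is a minimal sketch of how a dummy-coded Sex column might be fed to a tree model. The rows are invented and only mimic the column layout of train.csv. The sketch uses DecisionTreeClassifier rather than the DecisionTreeRegressor from the question, on the assumption that a 0/1 Survived label is a classification target. It also leaves Survived out of the feature columns. Both choices are mine, not part of the original answer.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that only mimic the layout of train.csv; not real passenger data
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Age":      [29.0, 35.0, 41.0, 22.0, 54.0, 18.0],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Replace the string column with a single 0/1 dummy column (Sex_male)
features = pd.get_dummies(
    toy[["Pclass", "Sex", "Age"]], columns=["Sex"], drop_first=True, dtype=int
)
target = toy["Survived"]

# Every feature column is now numeric, so fit() has no strings to convert
model = DecisionTreeClassifier(random_state=0)
model.fit(features, target)

print(features.columns.tolist())
print(model.predict(features).tolist())
```

Passing dtype=int keeps the dummy column as 0/1 integers. Recent pandas versions otherwise return boolean columns, which scikit-learn also accepts.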