I am working with the Titanic dataset. The data contains categorical values, so I used LabelEncoder to convert the text to numbers. Before:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 male 22.00 1 0 7.2500 S
1 2 1 1 female 38.00 1 0 71.2833 C
2 3 1 3 female 26.00 0 0 7.9250 S
After:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked
0 1 0 3 1 22.00 1 0 7.2500 2
1 2 1 1 0 38.00 1 0 71.2833 0
2 3 1 3 0 26.00 0 0 7.9250 2
Here is the code:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
data['Embarked'] = labelencoder_X.fit_transform(data['Embarked'])
data['Sex'] = labelencoder_X.fit_transform(data['Sex'])
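For reference, the integer codes in the "After" table follow the sorted order of the labels. A minimal sketch for inspecting that mapping; the toy Series is an assumption standing in for data['Embarked'], assumed to contain C, Q and S:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for data['Embarked'] (assumption: the column holds C, Q and S)
embarked = pd.Series(['S', 'C', 'S', 'Q'])

encoder = LabelEncoder()
codes = encoder.fit_transform(embarked)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Under that assumption S maps to 2 and C to 0, which matches the table above. The codes carry an implied order (C < Q < S) that has no meaning for these categories.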
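Now, since the two values of a passenger's sex are equally important (neither should rank above the other), I want to use OneHotEncoder. As I understand it, the data should look like this: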
PassengerId Survived Pclass Male Female Age SibSp Parch Fare Embarked
0 1 0 3 1 0 22.00 1 0 7.2500 2
1 2 1 1 0 1 38.00 1 0 71.2833 0
2 3 1 3 0 1 26.00 0 0 7.9250 2
How do I write the code to do this? I tried a similar approach with OneHotEncoder:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
data['Embarked'] = labelencoder_X.fit_transform(data['Embarked'])
data['Sex'] = labelencoder_X.fit_transform(data['Sex'])
onehotencoder = OneHotEncoder()
data['Embarked'] = onehotencoder.fit_transform(data['Embarked'].values.reshape(-1,1))
But it just returns the same result. How can I fix this? I am new to scikit-learn and ML, and I hope I am doing this correctly.
Answer 0 (score: 1)
Here is how you can do it.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample data (Sex already label-encoded: 0 = female, 1 = male)
df = pd.DataFrame({'Sex': [1, 0, 0, 1]})
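# OneHotEncoder expects 2-D input, so select the column as a one-column DataFrame
# (a pandas Series has no reshape method); toarray() converts the sparse result to dense
result = OneHotEncoder().fit_transform(df[['Sex']]).toarray()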
# Appending columns
df[['Female', 'Male']] = pd.DataFrame(result, index = df.index)
# Resulting dataframe
df
Sex Female Male
0 1 0.0 1.0
1 0 1.0 0.0
2 0 1.0 0.0
3 1 0.0 1.0
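The LabelEncoder step is not strictly required, since OneHotEncoder also accepts string columns. The sketch below encodes Sex and Embarked in one pass and joins the result back to the frame. The toy DataFrame and the `Sex_female`-style column names are assumptions for illustration, not part of the original data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic frame (assumption: only the columns needed here)
data = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
    'Age': [22.0, 38.0, 26.0],
})

categorical = ['Sex', 'Embarked']
encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(data[categorical]).toarray()

# Build names such as 'Sex_female' from the categories learned for each column
names = [
    f'{col}_{cat}'
    for col, cats in zip(categorical, encoder.categories_)
    for cat in cats
]

# Replace the original categorical columns with their one-hot columns
onehot = pd.DataFrame(encoded, columns=names, index=data.index)
result = pd.concat([data.drop(columns=categorical), onehot], axis=1)
print(result)
```

The encoder produces one column per category, so its output cannot be assigned back to a single column such as data['Embarked'], which is what the attempt in the question tried to do. If only pandas is needed, pd.get_dummies(data, columns=['Sex', 'Embarked']) should give a comparable layout.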