Here is what I got from the tutorial:
# Data Preprocessing
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
This is the X matrix with the encoded dummy variables:
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 4.400000000000000000e+01 7.200000000000000000e+04
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00 2.700000000000000000e+01 4.800000000000000000e+04
0.000000000000000000e+00 1.000000000000000000e+00 0.000000000000000000e+00 3.000000000000000000e+01 5.400000000000000000e+04
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00 3.800000000000000000e+01 6.100000000000000000e+04
0.000000000000000000e+00 1.000000000000000000e+00 0.000000000000000000e+00 4.000000000000000000e+01 6.377777777777778101e+04
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 3.500000000000000000e+01 5.800000000000000000e+04
0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00 3.877777777777777857e+01 5.200000000000000000e+04
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 4.800000000000000000e+01 7.900000000000000000e+04
0.000000000000000000e+00 1.000000000000000000e+00 0.000000000000000000e+00 5.000000000000000000e+01 8.300000000000000000e+04
1.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 3.700000000000000000e+01 6.700000000000000000e+04
The problem is that there are no column labels. I tried
something = pd.get_dummies(X)
but I got the following exception:
Exception: Data must be 1-dimensional
Answer 0 (score: 4)
Most sklearn methods don't care about column names, because they are mainly concerned with the math behind the ML algorithms they implement. If you can determine the label encoding ahead of time, you can add the column names back to the output of OneHotEncoder's fit_transform().
First, grab the column names of the predictor variables from the original dataset, excluding the first one (the one we reserved for LabelEncoder):
X_cols = dataset.columns[1:-1]
X_cols
# Index(['Age', 'Salary'], dtype='object')
Now get the order of the encoded labels. In this particular case, it looks like LabelEncoder() organizes its integer mapping alphabetically:
labels = labelencoder_X.fit(X[:, 0]).classes_
labels
# ['France' 'Germany' 'Spain']
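(A quick check I'd add here, not part of the original answer: the encoded integer for each label is simply its index in classes_, so you can build the mapping explicitly to confirm the alphabetical ordering. The name mapping below is hypothetical.)
# Hypothetical sanity check: label -> encoded integer
mapping = {label: code for code, label in enumerate(labels)}
mapping
# {'France': 0, 'Germany': 1, 'Spain': 2}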
Combine these column names, and add them when converting X to a DataFrame:
# X gets re-used, so make sure to define encoded_cols after this line
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
encoded_cols = np.append(labels, X_cols)
# ...
X = onehotencoder.fit_transform(X).toarray()
encoded_df = pd.DataFrame(X, columns=encoded_cols)
encoded_df
France Germany Spain Age Salary
0 1.0 0.0 0.0 44.000000 72000.000000
1 0.0 0.0 1.0 27.000000 48000.000000
2 0.0 1.0 0.0 30.000000 54000.000000
3 0.0 0.0 1.0 38.000000 61000.000000
4 0.0 1.0 0.0 40.000000 63777.777778
5 1.0 0.0 0.0 35.000000 58000.000000
6 0.0 0.0 1.0 38.777778 52000.000000
7 1.0 0.0 0.0 48.000000 79000.000000
8 0.0 1.0 0.0 50.000000 83000.000000
9 1.0 0.0 0.0 37.000000 67000.000000
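As an aside on the pd.get_dummies attempt (my own note, not part of this answer): pd.get_dummies expects a Series or DataFrame, so passing the 2-D NumPy array X is what raises "Data must be 1-dimensional". Called on the original DataFrame it works, although the numeric columns would still need imputing separately:
# Dummy-encode only the first (categorical) column of the original DataFrame;
# the numeric columns pass through unchanged (NaNs included)
dummies_df = pd.get_dummies(dataset.iloc[:, :-1], columns=[dataset.columns[0]])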
NB: For the example I'm using data from this dataset, which looks very similar (or identical) to the one the OP used. Note how the output matches the OP's X matrix.
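(One more note from me rather than the original answer: Imputer and the categorical_features argument of OneHotEncoder have since been deprecated and removed from scikit-learn, so on recent versions the preprocessing above needs something along these lines, assuming the same Data.csv layout as in the question.)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Fill missing Age/Salary values with the column means
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

# One-hot encode the first (country) column and pass the rest through;
# the dummy columns come first in the output, as before
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)

# Encode the dependent variable
y = LabelEncoder().fit_transform(y)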