How do I combine preprocessed data produced by two different transformers?

Time: 2019-06-07 22:32:52

Tags: python python-3.x one-hot-encoding

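I am currently working on a machine learning project in Python. The data contains both categorical and numeric columns. To avoid data snooping and to keep the training and test sets on the same scale, I have to split the preprocessing into two steps: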

Before the train/test split, I one-hot encode the categorical features.

After the train/test split, I apply a standard scaler to the numeric features.
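For comparison, the same two steps can also be written as a single `ColumnTransformer` that is fitted on the training split only. This is a minimal sketch, not the approach used in the code that follows: the toy DataFrame and its column names (`duration`, `pages`, `month`) are hypothetical stand-ins for the real dataset, and `handle_unknown="ignore"` is an assumption about how categories that appear only in the test set should be treated.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "duration": rng.normal(size=20),
    "pages": rng.normal(size=20),
    "month": rng.choice(["Feb", "Mar", "May"], size=20),
    "Revenue": [0, 1] * 10,
})
num_attribs = ["duration", "pages"]
cat_attribs = ["month"]

# Stratified split on the target, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    data[num_attribs + cat_attribs], data["Revenue"],
    test_size=0.2, stratify=data["Revenue"], random_state=42)

preprocess = ColumnTransformer(
    [("num", StandardScaler(), num_attribs),
     ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs)],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transformer on the test split.
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
print(X_train_prep.shape, X_test_prep.shape)
```

Because the encoder and the scaler are both fitted on the training rows, no statistics from the test set leak into preprocessing, and the two blocks come back already joined column-wise.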

However, I ran into a problem when I tried to carry out these two steps.


from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1,
                               test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Revenue"]):
    train_set = data.loc[train_index]
    test_set = data.loc[test_index] 

cat_attribs = list(data.columns)[10:17]
num_attribs = list(data.columns)[0:5] + list(data.columns)[5:10]
features = cat_attribs+num_attribs
X_raw = data[features]
#one hot encoding pre-split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

onehot_encoder =  OneHotEncoder(sparse=False)
X_cat_1hot = onehot_encoder.fit_transform(X_raw[cat_attribs])

#standard scaling after train-test split
X_num_train = X_raw[num_attribs].loc[train_index]
sc = StandardScaler()
X_num_train1 = sc.fit_transform(X_num_train)


#error occurs, the dimensions do not conform 

X_train1 = X_cat_1hot[train_index]+X_train_num

ValueError                                Traceback (most recent call last)
----> 1 X_train1 = X_cat_1hot[train_index] + X_train_num

ValueError: operands could not be broadcast together with shapes (9864,65) (9864,10)
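The message points at the `+` operator: on NumPy arrays it performs element-wise addition, which requires broadcast-compatible shapes, so a (9864, 65) array cannot be added to a (9864, 10) array. What the code appears to want is column-wise concatenation. A minimal sketch with small hypothetical stand-in arrays (the sizes 6, 4 and 3 are made up for illustration):

```python
import numpy as np

# Stand-ins for the arrays in the question (hypothetical small sizes).
X_cat = np.ones((6, 4))    # plays the role of X_cat_1hot[train_index]
X_num = np.zeros((6, 3))   # plays the role of the scaled numeric block

# `+` is element-wise addition, so shapes (6, 4) and (6, 3) cannot broadcast.
try:
    X_cat + X_num
    add_failed = False
except ValueError:
    add_failed = True

# np.hstack concatenates along the column axis; only the row counts must match.
X_train = np.hstack([X_cat, X_num])
print(X_train.shape)
```

In the question's terms this would be `np.hstack([X_cat_1hot[train_index], X_num_train1])`, assuming `X_train_num` refers to the scaled array `X_num_train1` (the name `X_train_num` is not defined in the code shown). The result would then have 65 + 10 = 75 columns.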

0 answers:

No answers yet