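I'm currently working on a machine learning project in Python. The data contains both categorical and numerical columns. To avoid data snooping and to keep the ranges of the training and test sets consistent, I have to split the preprocessing into two steps:
Before the train/test split, I one-hot encode the categorical features.
After the train/test split, I apply a standard scaler to the numerical features.
However, when I tried to carry out these steps, I ran into a problem.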
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1,
                               test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Revenue"]):
    train_set = data.loc[train_index]
    test_set = data.loc[test_index]
cat_attribs = list(data.columns)[10:17]
num_attribs = list(data.columns)[0:5] + list(data.columns)[5:10]
features = cat_attribs+num_attribs
X_raw = data[features]
#one hot encoding pre-split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
# note: on scikit-learn >= 1.2 this parameter is named sparse_output
onehot_encoder = OneHotEncoder(sparse=False)
X_cat_1hot = onehot_encoder.fit_transform(X_raw[cat_attribs])
#standard scaling after train-test split
X_num_train = X_raw[num_attribs].loc[train_index]
sc = StandardScaler()
X_train_num = sc.fit_transform(X_num_train)
#error occurs, the dimensions do not conform
X_train1 = X_cat_1hot[train_index]+X_train_num
ValueError                                Traceback (most recent call last)
----> 1 X_train1 = X_cat_1hot[train_index]+X_train_num
ValueError: operands could not be broadcast together with shapes (9864,65) (9864,10)
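
One likely reading of this error, offered as a sketch rather than a confirmed diagnosis: on NumPy arrays `+` means element-wise addition, not concatenation, so a 65-column block and a 10-column block cannot be combined that way. The example below reproduces the situation on small made-up arrays (`X_cat_part` and `X_num_part` are hypothetical stand-ins, not the real data) and joins them column-wise with `np.hstack` instead.

```python
import numpy as np

# Hypothetical stand-ins for the arrays in the question (shapes shrunk for illustration).
rng = np.random.default_rng(0)
X_cat_part = rng.integers(0, 2, size=(6, 4)).astype(float)  # plays the role of X_cat_1hot[train_index]
X_num_part = rng.normal(size=(6, 2))                        # plays the role of the scaled numeric block

# `+` on NumPy arrays is element-wise addition, so differing column counts cannot broadcast.
try:
    X_cat_part + X_num_part
    add_failed = False
except ValueError:
    add_failed = True

# Column-wise concatenation keeps the row count and places the two blocks side by side.
X_combined = np.hstack([X_cat_part, X_num_part])
print(add_failed, X_combined.shape)
```

With the shapes from the traceback, the same call would presumably give a (9864, 75) training matrix, assuming both blocks are in the same row order.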