I have a DataFrame that is ready for modeling. It contains continuous variables and one-hot encoded variables.
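All the numeric variables are 'int64', while the one-hot encoded variables are 'uint8'. The binary outcome variable is DEFAULT_PAYMT.
I have used the usual train/test split here, but I would like to know whether I can apply StandardScaler only to the int64 variables (i.e. the ones that are not one-hot encoded)?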
ID Limit Bill_Sep Bill_Aug Payment_Sep Payment_Aug Gender_M Gender_F Edu_Uni DEFAULT_PAYMT
1 10000 2000 350 1000 350 1 0 1 1
2 30000 3000 5000 500 500 0 1 0 0
3 20000 8000 10000 8000 5000 1 0 1 1
4 45000 450 250 450 250 0 1 0 1
5 60000 700 1000 700 1000 1 0 1 1
6 8000 300 5000 300 2000 1 0 1 0
7 30000 3000 10000 1000 5000 0 1 1 1
8 15000 1000 1250 500 1750 0 1 1 1
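I am trying the code below and it seems to work, but I am not sure how to merge the unscaled categorical variables back into the X_scaled_tr and X_scaled_t arrays. Any help is appreciated, thanks!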
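from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: every column except the identifier and the target
X = df.drop(['ID', 'DEFAULT_PAYMT'], axis=1)
y = df['DEFAULT_PAYMT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Note: this scales every column, including the one-hot encoded ones
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)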
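Answer 0 (score: 0)
I managed to solve this with the code below, where StandardScaler is applied only to the continuous variables and not to the one-hot encoded ones: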
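from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale only the listed continuous columns; remainder='passthrough' keeps the
# one-hot columns unscaled. Names must match the DataFrame's columns exactly.
ct = ColumnTransformer([('X_train', StandardScaler(), ['LIMIT','BILL_SEP','BILL_AUG','PAYMENT_SEP','PAYMENT_AUG'])], remainder='passthrough')
X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)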
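
For anyone who would rather not hard-code the column list, here is a minimal, self-contained sketch of the same idea that picks the columns by dtype. It rebuilds the sample rows from the question and casts the dtypes explicitly to mirror the description (int64 for continuous columns, uint8 for one-hot columns); that cast is an assumption about the real data. The names `cont_cols`, `other_cols`, `X_train_df` and `X_test_df` are only illustrative. Note that ColumnTransformer returns the transformed columns first and the passthrough columns after them, which is why the column labels are reassembled in that order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Limit": [10000, 30000, 20000, 45000, 60000, 8000, 30000, 15000],
    "Bill_Sep": [2000, 3000, 8000, 450, 700, 300, 3000, 1000],
    "Bill_Aug": [350, 5000, 10000, 250, 1000, 5000, 10000, 1250],
    "Payment_Sep": [1000, 500, 8000, 450, 700, 300, 1000, 500],
    "Payment_Aug": [350, 500, 5000, 250, 1000, 2000, 5000, 1750],
    "Gender_M": [1, 0, 1, 0, 1, 1, 0, 0],
    "Gender_F": [0, 1, 0, 1, 0, 0, 1, 1],
    "Edu_Uni": [1, 0, 1, 0, 1, 1, 1, 1],
    "DEFAULT_PAYMT": [1, 0, 1, 1, 1, 0, 1, 1],
})

# Assumption: mirror the dtypes described in the question.
onehot_cols = ["Gender_M", "Gender_F", "Edu_Uni"]
df = df.astype("int64").astype({c: "uint8" for c in onehot_cols})

X = df.drop(["ID", "DEFAULT_PAYMT"], axis=1)
y = df["DEFAULT_PAYMT"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Pick the columns to scale by dtype instead of listing them by hand.
cont_cols = X_train.select_dtypes(include="int64").columns.tolist()
other_cols = [c for c in X_train.columns if c not in cont_cols]

ct = ColumnTransformer(
    [("scale", StandardScaler(), cont_cols)],
    remainder="passthrough",  # one-hot columns are passed through unscaled
)
X_train_scaled = ct.fit_transform(X_train)  # fit on the training split only
X_test_scaled = ct.transform(X_test)

# Output order is: scaled columns first, then the passthrough columns.
out_cols = cont_cols + other_cols
X_train_df = pd.DataFrame(X_train_scaled, columns=out_cols, index=X_train.index)
X_test_df = pd.DataFrame(X_test_scaled, columns=out_cols, index=X_test.index)
```

Wrapping the result back into a DataFrame is optional; most estimators accept the NumPy arrays directly. It just makes it easier to check that the one-hot columns came through untouched.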