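I'm very new to machine learning, and I'm trying to understand the whole process of preparing data for the modeling step. I'm using pipelines from sklearn.
Say I have the following dataset: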
df.head()
Out[16]:
   school  sex  age  address  famsize  Pstatus  Medu  Fedu     Mjob      Fjob  \
0      GP    F   18        U      GT3        A     4     4  at_home   teacher
1      GP    F   17        U      GT3        T     1     1  at_home     other
2      GP    F   15        U      LE3        T     1     1  at_home     other
3      GP    F   15        U      GT3        T     4     2   health  services
4      GP    F   16        U      GT3        T     3     3    other     other

   reason  guardian  traveltime  studytime  failures  schoolsup  famsup  paid  \
0  course    mother           2          2         0        yes      no    no
1  course    father           1          2         0         no     yes    no
2   other    mother           1          2         3        yes      no   yes
3    home    mother           1          3         0         no     yes   yes
4    home    father           1          2         0         no     yes   yes

   activities  nursery  higher  internet  romantic  famrel  freetime  goout  Dalc  \
0          no      yes     yes        no        no       4         3      4     1
1          no       no     yes       yes        no       5         3      3     1
2          no      yes     yes       yes        no       4         3      2     2
3         yes      yes     yes       yes       yes       3         2      2     1
4          no      yes     yes        no        no       4         3      2     1

   Walc  health  absences  G1  G2  G3
0     1       3         6   5   6   6
1     1       3         4   5   5   6
2     3       3        10   7   8  10
3     1       5         2  15  14  15
4     2       5         4   6  10  10
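As you can see, there are a lot of categorical variables. I know you can use pd.get_dummies for one-hot encoding, but I want to achieve the same thing with a pipeline. Here is what I've tried so far: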
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
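# All categorical columns (kept for reference; not used below)
cat_cols = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian',
            'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
# Yes/No columns
nom_cols = ['schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
# Categorical columns with other (possibly multiple) categories
# (renamed from `ord` so the Python builtin ord() is not shadowed)
ord_cols = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian']
# Numeric columns
num_cols = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime',
            'goout', 'Dalc', 'Walc', 'health', 'absences']
### Pipeline applied to ord_cols (categorical variables with multiple categories): one-hot encoding
ord_pipeline = Pipeline([
    ("onehot", OneHotEncoder()),
])
### Pipeline applied to nom_cols (categorical variables which are Yes/No): integer codes
nom_pipeline = Pipeline([
    ("ordinal", OrdinalEncoder()),
])
### Pipeline applied to num_cols: standardization
num_pipeline = Pipeline([
    ("scaler", StandardScaler()),
])
# Columns not listed (G1, G2, G3) are dropped, since remainder='drop' is the default
ct = ColumnTransformer(transformers=[
    ("ord", ord_pipeline, ord_cols),
    ("nom", nom_pipeline, nom_cols),
    ("num", num_pipeline, num_cols)])
Xprep = ct.fit_transform(df)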
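The result of the above is now a plain array with no column information. I'd like to relate the input variables to the target variable, and I'm not sure how to do that...
Does anyone have any ideas? Or have I misunderstood something?
Thanks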