Sklearn pipelines and removing weakly correlated variables

Time: 2019-08-14 19:06:11

Tags: python pandas encoding scikit-learn categorical-data

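I am very new to machine learning, and I am trying to understand the whole process of preparing data for the machine learning step. I am using pipelines from sklearn.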

Say I have the following dataset:

df.head()
Out[16]: 
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

   reason guardian  traveltime  studytime  failures schoolsup famsup paid  \
0  course   mother           2          2         0       yes     no   no   
1  course   father           1          2         0        no    yes   no   
2   other   mother           1          2         3       yes     no  yes   
3    home   mother           1          3         0        no    yes  yes   
4    home   father           1          2         0        no    yes  yes   

  activities nursery higher internet romantic  famrel  freetime  goout  Dalc  \
0         no     yes    yes       no       no       4         3      4     1   
1         no      no    yes      yes       no       5         3      3     1   
2         no     yes    yes      yes       no       4         3      2     2   
3        yes     yes    yes      yes      yes       3         2      2     1   
4         no     yes    yes       no       no       4         3      2     1   

   Walc  health  absences  G1  G2  G3  
0     1       3         6   5   6   6  
1     1       3         4   5   5   6  
2     3       3        10   7   8  10  
3     1       5         2  15  14  15  
4     2       5         4   6  10  10  

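As you can see, there are a lot of categorical variables. I do know that you can use pd.get_dummies for one-hot encoding, but I want to use a pipeline to achieve the same thing. This is what I have tried so far: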

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

cat = ['school', 'sex','address', 'famsize', 'Pstatus','Mjob', 'Fjob', 'reason', 'guardian','schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic']
nom = ['schoolsup', 'famsup', 'paid', 'activities', 'nursery','higher', 'internet', 'romantic']
ord = ['school', 'sex', 'address', 'famsize', 'Pstatus','Mjob', 'Fjob', 'reason', 'guardian']

num = ['age','Medu', 'Fedu','traveltime', 'studytime','failures','famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences']

### Pipeline for the columns listed in `ord` (despite the name, these are one-hot encoded)
ord_pipeline = Pipeline([
    ("onehot",OneHotEncoder()),
])

### Pipeline for the Yes/No columns listed in `nom` (OrdinalEncoder maps them to 0/1)
nom_pipeline = Pipeline([
    ("ordinal",OrdinalEncoder()),
])

num_pipeline = Pipeline([
                        ("scaler", StandardScaler()),
                        ])

ct = ColumnTransformer(transformers = [

    ("ord",ord_pipeline,ord),

    ("nom",nom_pipeline,nom),

    ("num",num_pipeline,num)])

Xprep = ct.fit_transform(df)

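Now the result above is an array, with no column information. I want to compute the correlation between the input variables and the target variable, and I am not sure how to do that...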

Does anyone have any ideas? Or am I misunderstanding something?

Thanks

0 answers:

No answers