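I want to use GridSearchCV with a pipeline in sklearn to select not only the best hyperparameters for a chosen classifier, but also the best categorical encoding strategy. Take the Titanic dataset ([https://www.kaggle.com/c/titanic][1]): using sklearn-pandas, I can define a few DataFrameMappers that select and encode certain features, and then cross-validate a RandomForestClassifier() to search for its best hyperparameters.
Consider the following code: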
from __future__ import division
import csv as csv
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, StandardScaler
from category_encoders import BinaryEncoder, LeaveOneOutEncoder
from sklearn_pandas import DataFrameMapper
df_train = pd.read_csv('train.csv', header = 0, index_col = 'PassengerId')
df_test = pd.read_csv('test.csv', header = 0, index_col = 'PassengerId')
df = pd.concat([df_train, df_test], keys=["train", "test"])
df['Title'] = df['Name'].apply(lambda c: c[c.index(',') + 2 : c.index('.')])
df['LastName'] = df['Name'].apply(lambda n: n[0:n.index(',')])
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].mode()[0]
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].mode()[0]
df['FamilyID'] = df['LastName'] + ':' + df['FamilySize'].apply(str)
df.loc[df['FamilySize'] <= 2, 'FamilyID'] = 'Small_Family'
df['AgeOriginallyNaN'] = df['Age'].isnull().astype(int)
medians_by_title = pd.DataFrame(df.groupby('Title')['Age'].median()).rename(columns = {'Age': 'AgeFilledMedianByTitle'})
df = df.merge(medians_by_title, left_on = 'Title', right_index = True).sort_index(level = 0).sort_index(level = 1)
df_train = df.loc['train']
df_test = df.loc['test']
y_train = df_train['Survived']
X_train = df_train[df_train.columns.drop('Survived')]
mapper1 = DataFrameMapper([
    ('Embarked', BinaryEncoder()),
    (['AgeFilledMedianByTitle'], StandardScaler()),
    ('Pclass', LeaveOneOutEncoder())
])
mapper2 = DataFrameMapper([
    ('Embarked', LeaveOneOutEncoder()),
    (['AgeFilledMedianByTitle'], StandardScaler()),
    ('Pclass', LeaveOneOutEncoder())
])
pipe = Pipeline([('featurize', mapper1),
                 ('forest', RandomForestClassifier(n_estimators=10))])
param_grid = dict(forest__n_estimators=[2, 16, 32, 64],
                  forest__criterion=['gini', 'entropy'])
grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy')
best_pipeline = grid_search.fit(X_train, y_train).best_estimator_
best_pipeline.get_params()['forest']
grid_search.best_score_
Is it possible to use a Pipeline inside GridSearchCV to also choose the best mapper (mapper1 or mapper2)? If so, how?
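One direction that may work, sketched here under assumptions: a scikit-learn Pipeline exposes each step name as a parameter, so the whole 'featurize' step can be listed in the grid alongside the forest's hyperparameters. To keep the sketch self-contained, it uses synthetic data and two ColumnTransformer objects (`encode_onehot` and `encode_ordinal`, both hypothetical names) as stand-ins for mapper1 and mapper2. I have not verified that DataFrameMapper objects holding category_encoders transformers behave the same way when GridSearchCV clones them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in for the Titanic frame (made-up values, not the real data).
rng = np.random.RandomState(0)
n = 120
X_demo = pd.DataFrame({
    'Embarked': rng.choice(['S', 'C', 'Q'], size=n),
    'Pclass': rng.choice([1, 2, 3], size=n),
    'AgeFilledMedianByTitle': rng.uniform(1, 80, size=n),
})
y_demo = rng.randint(0, 2, size=n)

# Two candidate encoding strategies, standing in for mapper1 and mapper2.
encode_onehot = ColumnTransformer([
    ('embarked', OneHotEncoder(handle_unknown='ignore'), ['Embarked']),
    ('age', StandardScaler(), ['AgeFilledMedianByTitle']),
    ('pclass', OneHotEncoder(handle_unknown='ignore'), ['Pclass']),
])
encode_ordinal = ColumnTransformer([
    ('embarked', OrdinalEncoder(), ['Embarked']),
    ('age', StandardScaler(), ['AgeFilledMedianByTitle']),
    ('pclass', OrdinalEncoder(), ['Pclass']),
])

pipe_demo = Pipeline([
    ('featurize', encode_onehot),  # placeholder; the grid swaps this step
    ('forest', RandomForestClassifier(random_state=0)),
])

# The key 'featurize' (no double underscore) replaces the entire step,
# while 'forest__...' keys tune parameters inside the forest step.
param_grid_demo = {
    'featurize': [encode_onehot, encode_ordinal],
    'forest__n_estimators': [2, 16],
    'forest__criterion': ['gini', 'entropy'],
}

search_demo = GridSearchCV(pipe_demo, param_grid=param_grid_demo,
                           scoring='accuracy', cv=3)
search_demo.fit(X_demo, y_demo)

# Which encoder won, and with what cross-validated accuracy.
print(search_demo.best_params_['featurize'])
print(search_demo.best_score_)
```

If this carries over to the code above, the equivalent change would be adding `featurize=[mapper1, mapper2]` to `param_grid`. The grid then holds 2 x 2 x 2 = 8 candidates in this sketch, each evaluated with the same cross-validation splits.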