How to cross-validate the best categorical encoding with GridSearchCV and a Pipeline

Time: 2018-02-09 10:02:13

Tags: python pandas scikit-learn pipeline grid-search

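I want to use GridSearchCV with a Pipeline in sklearn to choose not only the best hyperparameters for a given classifier but also the best categorical encoding strategy. Taking the Titanic dataset ([https://www.kaggle.com/c/titanic][1]) and using sklearn-pandas, I can define several DataFrameMappers that select and encode certain features, then cross-validate a RandomForestClassifier() to search for its best hyperparameters.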

Consider the following code:

from __future__ import division
import csv as csv
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, StandardScaler
from category_encoders import BinaryEncoder, LeaveOneOutEncoder

from sklearn_pandas import DataFrameMapper

df_train = pd.read_csv('train.csv', header = 0, index_col = 'PassengerId')
df_test = pd.read_csv('test.csv', header = 0, index_col = 'PassengerId')
df = pd.concat([df_train, df_test], keys=["train", "test"])

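# Title: the text between ", " and "." in Name (e.g. "Mr")
df['Title'] = df['Name'].apply(lambda c: c[c.index(',') + 2 : c.index('.')])

# LastName: everything before the comma in Name
df['LastName'] = df['Name'].apply(lambda n: n[0:n.index(',')])

# FamilySize: siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Fill missing Embarked values with the most frequent port
df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].mode()[0]

# Fill missing Fare values with the most frequent fare
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].mode()[0]

# FamilyID: last name combined with family size
df['FamilyID'] = df['LastName'] + ':' + df['FamilySize'].apply(str)

# Collapse families of one or two people into a single category
df.loc[df['FamilySize'] <= 2, 'FamilyID'] = 'Small_Family'

# Flag rows whose Age was missing in the raw data
df['AgeOriginallyNaN'] = df['Age'].isnull().astype(int)

# Median age per title, stored under the column name AgeFilledMedianByTitle
medians_by_title = pd.DataFrame(df.groupby('Title')['Age'].median()).rename(columns = {'Age': 'AgeFilledMedianByTitle'})

# Attach the per-title median to every row (Age itself is left unchanged), then re-sort the index
df = df.merge(medians_by_title, left_on = 'Title', right_index = True).sort_index(level = 0).sort_index(level = 1)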

df_train = df.loc['train']
df_test  = df.loc['test']

y_train = df_train['Survived']
X_train = df_train[df_train.columns.drop('Survived')]

# Candidate 1: binary-encode Embarked, leave-one-out-encode Pclass
mapper1 = DataFrameMapper([
    ('Embarked', BinaryEncoder()),
    (['AgeFilledMedianByTitle'], StandardScaler()),
    ('Pclass', LeaveOneOutEncoder())
])

# Candidate 2: leave-one-out-encode both Embarked and Pclass
mapper2 = DataFrameMapper([
    ('Embarked', LeaveOneOutEncoder()),
    (['AgeFilledMedianByTitle'], StandardScaler()),
    ('Pclass', LeaveOneOutEncoder())
])



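# Pipeline: featurize with one mapper, then fit a random forest
pipe = Pipeline([('featurize', mapper1),
                 ('forest', RandomForestClassifier(n_estimators=10))])

# Only the forest is tuned here; the featurizer is fixed to mapper1
param_grid = dict(forest__n_estimators=[2, 16, 32, 64],
                  forest__criterion=['gini', 'entropy'])

grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy')

# Run the search and keep the pipeline refit with the best parameters
best_pipeline = grid_search.fit(X_train, y_train).best_estimator_
best_pipeline.get_params()['forest']
grid_search.best_score_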

Is it possible to use a Pipeline inside GridSearchCV to also select the best mapper (mapper1 or mapper2)? If so, how?
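One approach that may work: scikit-learn treats each named pipeline step as a parameter, so a whole step can be listed in the grid next to the classifier's hyperparameters. The sketch below is a minimal illustration under stated assumptions, not a tested answer for the exact setup above. To stay self-contained it swaps in a small synthetic DataFrame (not real Titanic data), and uses ColumnTransformer with OneHotEncoder and OrdinalEncoder as stand-ins for the two DataFrameMappers and the category_encoders classes. The names X_demo, y_demo, encoder_a, encoder_b, pipe_demo, param_grid_demo and search are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in for the training frame (assumption: NOT real Titanic data).
rng = np.random.RandomState(0)
n = 120
X_demo = pd.DataFrame({
    'Embarked': rng.choice(['S', 'C', 'Q'], size=n),
    'Pclass': rng.choice([1, 2, 3], size=n),
    'AgeFilledMedianByTitle': rng.uniform(1, 80, size=n),
})
y_demo = rng.randint(0, 2, size=n)

# Two alternative featurizers, playing the role of mapper1 and mapper2.
# The encoders are stand-ins, not the ones used in the question.
encoder_a = ColumnTransformer([
    ('embarked', OneHotEncoder(handle_unknown='ignore'), ['Embarked']),
    ('age', StandardScaler(), ['AgeFilledMedianByTitle']),
    ('pclass', OneHotEncoder(handle_unknown='ignore'), ['Pclass']),
])
encoder_b = ColumnTransformer([
    ('embarked', OrdinalEncoder(), ['Embarked']),
    ('age', StandardScaler(), ['AgeFilledMedianByTitle']),
    ('pclass', OrdinalEncoder(), ['Pclass']),
])

pipe_demo = Pipeline([('featurize', encoder_a),
                      ('forest', RandomForestClassifier(random_state=0))])

# The step name 'featurize' is itself a searchable parameter:
# each candidate replaces the whole step before the pipeline is fit.
param_grid_demo = {
    'featurize': [encoder_a, encoder_b],
    'forest__n_estimators': [2, 16],
    'forest__criterion': ['gini', 'entropy'],
}

search = GridSearchCV(pipe_demo, param_grid=param_grid_demo,
                      scoring='accuracy', cv=3)
search.fit(X_demo, y_demo)

# Which featurizer won, and the cross-validated accuracy of the best combination.
print(search.best_params_['featurize'])
print(search.best_score_)
```

In the question's own code, the analogous change would presumably be to add featurize=[mapper1, mapper2] to param_grid. Whether DataFrameMapper and the category_encoders classes clone and refit cleanly inside the search has not been verified here, so treat that as an assumption to check.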

0 answers:

No answers yet