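Using GridSearchCV to tune hyperparameters that change a pandas DataFrame

Time: 2018-04-06 00:22:04

Tags: python pandas scikit-learn

I want to use GridSearchCV to tune hyperparameters in a user-defined estimator that operates on a pandas DataFrame: for example, imputing the median, choosing which columns to pass to the estimator, and so on. Below I show a column selector as an example, but the idea is to be able to tune parameters in more complex ways. I keep getting cryptic messages that I cannot decipher, for example: 'list' object has no attribute 'flags'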

from sklearn.datasets import california_housing
from sklearn.linear_model import Ridge
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

cal_house = california_housing.fetch_california_housing()
data      = cal_house['data']
names     = cal_house['feature_names']

df = pd.DataFrame(data, columns=names)
df['houseval'] = cal_house['target']

class ColumnSelector(BaseEstimator):
    def __init__(self, columns_for_x = ['MedInc','HouseAge']):

        self.columns     = columns_for_x
        #self.lags        = lags
        #self.grouper_col = grouper_col

    def fit(self, X, y):
        return self


    def transform(self, X, y):

        X = X.loc[:,self.columns].values

        return X, y


pipe       = Pipeline([('colselect', ColumnSelector()),
                        ('Ridge', Ridge())])

gridsearch  = GridSearchCV(cv=5, scoring='r2',
                          param_grid= {'colselect__columns_for_x':[['MedInc','HouseAge'],
                                                                   ['MedInc','Population','Latitude'],
                                                                   ['MedInc','AveRooms','AveOccup']],
                                       'Ridge__alpha':[0.001,0.01,0.1,1,10]}, estimator=pipe)

X = df.drop('houseval', axis = 1).values
y = df.loc[:,'houseval'].values
# gridsearch.fit(X=X,y=y)

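1 Answer:

Answer 0 (score: 1)

I deviated from your original code, mainly because using a custom estimator to implement the column-selection transform was more overhead than I thought it was worth.

Here is my solution: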

# The sklearn.datasets.california_housing module path is gone in newer
# scikit-learn releases; the fetch function is imported directly instead.
from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

cal_house = fetch_california_housing()
data = cal_house['data']
names = cal_house['feature_names']

df = pd.DataFrame(data, columns=names)
df['houseval'] = cal_house['target']


def keep_columns(X, columns=("MedInc", "HouseAge")):
    column_indices = [
        names.index(name) for name in columns
    ]

    return X[:, column_indices]

pipe = Pipeline([
    ("colselect", FunctionTransformer(keep_columns)),
    ("Ridge", Ridge()),
])

gridsearch = GridSearchCV(
    cv=5, scoring='r2',
    param_grid={
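        # kw_args is forwarded to keep_columns on transform;
        # inv_kw_args would only reach inverse_func and leave the columns unchanged.
        'colselect__kw_args': [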
            {"columns": columns}
            for columns in [
                ['MedInc', 'HouseAge'],
                ['MedInc', 'Population', 'Latitude'],
                ['MedInc', 'AveRooms', 'AveOccup']
            ]
        ],
        'Ridge__alpha': [0.001, 0.01, 0.1, 1, 10]
    },
    estimator=pipe
)

X = df.drop('houseval', axis=1).values
y = df.loc[:, 'houseval'].values
gridsearch.fit(X=X, y=y)
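
To check that the grid really varies the selected columns, the same pattern can be run on a small synthetic array and the winning parameters inspected. This is only a sketch: the feature names a, b, c and the target are made up for illustration, and it assumes the column list is handed to the function through FunctionTransformer's kw_args parameter.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical feature names for a small synthetic data set.
names = ["a", "b", "c"]


def keep_columns(X, columns=("a", "b")):
    # Translate column names into positional indices for NumPy indexing.
    column_indices = [names.index(name) for name in columns]
    return X[:, column_indices]


# The target depends on columns a and c only, so that pair should win.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=100)

pipe = Pipeline([
    ("colselect", FunctionTransformer(keep_columns)),
    ("ridge", Ridge()),
])

search = GridSearchCV(
    estimator=pipe,
    cv=5,
    scoring="r2",
    param_grid={
        "colselect__kw_args": [
            {"columns": cols}
            for cols in (["a", "b"], ["a", "c"], ["b", "c"])
        ],
        "ridge__alpha": [0.1, 1.0],
    },
)
search.fit(X, y)

print(search.best_params_)  # chosen column set and alpha
print(search.best_score_)   # mean cross-validated R^2 of the best candidate
```

If every column set scored the same, that would suggest the keyword arguments are not reaching keep_columns.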

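The main problems with your code:

  • Using a custom class in the pipeline for this transform is overkill: it is essentially just an array access. The transform code in your class assumes a DataFrame, but because you call .values, the grid search passes numpy.array objects. In a NumPy array, columns have to be accessed by index, so my code computes those indices from your list of feature names, names.

  • Custom estimators require you to provide methods for copying the estimator, and these must ensure that all parameters are copied correctly; otherwise the parameters will look as if they were None, because sklearn.model_selection.GridSearchCV tries to copy the estimator without deep copying it. BaseEstimator looks each parameter up under the name used in __init__, so an argument called columns_for_x that is stored as self.columns cannot be found.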
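
If a custom class is still preferred, both points above can be addressed by following scikit-learn's estimator conventions. The following is a minimal sketch, not the asker's original class: it stores the constructor argument under the same name, defines transform(X) without y, and is fitted on the DataFrame itself rather than on .values. The column names a, b, c and the synthetic target are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Select named columns from a DataFrame (illustrative sketch)."""

    def __init__(self, columns=("a", "b")):
        # Store the argument unchanged, under the SAME name as the parameter,
        # so that get_params(), set_params() and clone() can find it.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection depends only on the parameter.
        return self

    def transform(self, X):
        # A Pipeline calls transform(X) only; y is never passed here.
        return X.loc[:, list(self.columns)].to_numpy()


# Synthetic stand-in data with hypothetical columns a, b, c.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = 2.0 * df["a"] - df["c"] + rng.normal(scale=0.1, size=100)

pipe = Pipeline([("colselect", ColumnSelector()), ("ridge", Ridge())])

search = GridSearchCV(
    estimator=pipe,
    cv=5,
    scoring="r2",
    param_grid={
        "colselect__columns": [["a", "b"], ["a", "c"], ["b", "c"]],
        "ridge__alpha": [0.1, 1.0],
    },
)
search.fit(df, y)  # pass the DataFrame, not df.values

print(search.best_params_)
```

Keeping the DataFrame intact up to the selector is what lets X.loc work inside transform; the conversion to a NumPy array happens only after the columns have been picked.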