How to make a GridSearchCV with a proper FunctionTransformer in a pipeline?

Date: 2019-08-12 13:15:05

Tags: python-3.x machine-learning scikit-learn deep-learning

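I am trying to build a pipeline with GridSearchCV that filters the data (with an isolation forest, iForest) and then performs a regression with StandardScaler + MLPRegressor.

I wrote a FunctionTransformer to include my iForest filter in the pipeline. I also defined a parameter grid for the iForest filter (through kw_args).

Everything seems fine, but when I run the fit, nothing happens... no error message. Nothing.

Afterwards, when I try to predict, I get the message: "This RandomizedSearchCV instance is not fitted yet"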

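# The two scipy aliases below are assumed; the original post did not show these imports
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_rand
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler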

#Definition of the function auto_filter using the iForest algo
def auto_filter(DF, conta=0.1):
    #iForest made on the DF dataframe
    iforest = IsolationForest(behaviour='new', n_estimators=300, max_samples='auto', contamination=conta)
    iforest = iforest.fit(DF)

    # The DF (dataframe in input) is filtered taking into account only the inlier observations

    data_filtered = DF[iforest.predict(DF) == 1]

    # Only few variables are kept for the next step (regression by MLPRegressor)
    # this function delivers X_filtered and y
    X_filtered = data_filtered[['SessionTotalTime','AverageHR','MaxHR','MinHR','EETotal','EECH','EEFat','TRIMP','BeatByBeatRMSSD','BeatByBeatSD','HFAverage','LFAverage','LFHFRatio','Weight']]
    y = data_filtered['MaxVO2']
    return (X_filtered, y)

#Pipeline definition ('auto_filter' --> 'scaler' --> 'MLPRegressor')    
pipeline_steps = [('auto_filter', FunctionTransformer(auto_filter)), ('scaler', StandardScaler()), ('MLPR', MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000))]

#Grid search definition with different values of 'conta' for the first stage of the pipeline ('auto_filter')
parameters = {'auto_filter__kw_args': [{'conta': 0.1}, {'conta': 0.2}, {'conta': 0.3}], 'MLPR__hidden_layer_sizes':[(sp_randint.rvs(1, nb_features, 1),), (sp_randint.rvs(1, nb_features, 1), sp_randint.rvs(1, nb_features, 1))], 'MLPR__alpha':sp_rand.rvs(0, 1, 1)}   

pipeline = Pipeline(pipeline_steps)

estimator = RandomizedSearchCV(pipeline, parameters, cv=5, n_iter=10)
estimator.fit(X_train, y_train)

2 Answers:

Answer 0 (score: 1)

You can try running the steps manually, one at a time, to locate the problem:

auto_filter_transformer = FunctionTransformer(auto_filter)
X_train = auto_filter_transformer.fit_transform(X_train)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

MLPR = MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000)
MLPR.fit(X_train, y_train)

If each step works fine, build a pipeline and check it. If the pipeline works fine too, try RandomizedSearchCV.
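One way to apply this step-by-step check is to look at what the first stage actually hands to the next one. The sketch below is only an illustration: it uses synthetic data and a hypothetical filter_rows function standing in for auto_filter (neither is the asker's data or code). Like auto_filter, it returns an (X_filtered, y) tuple, which is not the single array a following step such as StandardScaler expects.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import FunctionTransformer


def filter_rows(X, conta=0.1):
    """Hypothetical stand-in for auto_filter: returns an (X_filtered, y) tuple."""
    forest = IsolationForest(n_estimators=50, contamination=conta, random_state=0)
    keep = forest.fit(X).predict(X) == 1
    # The last column plays the role of the target in this synthetic example.
    return X[keep, :-1], X[keep, -1]


rng = np.random.RandomState(0)
data = rng.normal(size=(100, 4))  # synthetic data, an assumption for illustration

# Run only the first stage by hand and inspect its output.
step_output = FunctionTransformer(filter_rows).fit_transform(data)

# A transformer step is expected to pass one 2-D array to the next step.
is_single_array = isinstance(step_output, np.ndarray)
print(type(step_output).__name__, is_single_array)
```

Checking the type and shape after each stage like this usually points at the stage that breaks the pipeline contract before any search is involved.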

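Answer 1 (score: 0)

The func parameter of FunctionTransformer should be a callable that accepts the same arguments as the transform method (an array-like X of shape (n_samples, n_features), plus kwargs for func) and returns a transformed X of the same shape. Your function auto_filter does not meet these requirements.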
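For contrast, here is a minimal sketch of a function that does fit this contract, searched over kw_args in the same style as the question's grid. The clip_features function, the synthetic data, and the use of Ridge instead of MLPRegressor (only to keep the run short) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def clip_features(X, limit=3.0):
    """Hypothetical shape-preserving func: clips values and keeps every row."""
    return np.clip(X, -limit, limit)


X, y = make_regression(n_samples=60, n_features=3, noise=1.0, random_state=0)

pipe = Pipeline([
    ("clip", FunctionTransformer(clip_features)),
    ("scale", StandardScaler()),
    ("model", Ridge()),  # stand-in regressor, chosen only for speed
])

# Each candidate for kw_args is a whole dict, as in the question's grid.
param_grid = {"clip__kw_args": [{"limit": 1.0}, {"limit": 2.0}, {"limit": 3.0}]}

search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_)
```

Because clip_features returns an array with the same number of rows as its input, X and y stay aligned through the pipeline. A function that drops rows, as an outlier filter does, cannot keep that alignment.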

Additionally, the outlier/anomaly detection techniques from scikit-learn cannot be used as intermediate steps in a scikit-learn pipeline, because a pipeline assembles one or more transformers and an optional final estimator. IsolationForest or, for example, OneClassSVM is not a transformer: it implements fit and predict. A possible solution is therefore to cut off the likely outliers separately and build a pipeline composed of transformers and a regressor:

>>> import warnings
>>> from sklearn.exceptions import ConvergenceWarning
>>> warnings.filterwarnings(category=ConvergenceWarning, action='ignore')
>>> import numpy as np
>>> from scipy import stats
>>> from sklearn.datasets import make_regression
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.neural_network import MLPRegressor
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_regression(n_samples=50, n_features=2, n_informative=2)
>>> # behaviour='new' and iid=True from the 2019 original are omitted here:
>>> # both arguments were removed in scikit-learn 0.24.
>>> detect = IsolationForest(contamination=0.1)
>>> inliers_mask = detect.fit_predict(X) == 1
>>> pipe = Pipeline([('scale', StandardScaler()),
...                  ('estimate', MLPRegressor(max_iter=500, tol=1e-5))])
>>> param_distributions = dict(estimate__alpha=stats.uniform(0, 0.1))
>>> search = RandomizedSearchCV(pipe, param_distributions,
...                             n_iter=2, cv=3)
>>> search = search.fit(X[inliers_mask], y[inliers_mask])

The problem is that you will not be able to optimize the hyperparameters of IsolationForest this way. One way to handle that is to define a hyperparameter space for the forest, sample hyperparameters with ParameterSampler or ParameterGrid, predict the outliers, and fit the randomized search.
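The idea of sampling forest hyperparameters with ParameterGrid and fitting a randomized search for each setting could be sketched as follows. This is an illustration under assumptions, not the answerer's code: the contamination values, the synthetic data, and the rule of keeping the setting with the highest cross-validated score are all choices made for this example.

```python
import warnings

from scipy import stats
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings(action="ignore", category=ConvergenceWarning)

X, y = make_regression(n_samples=80, n_features=2, n_informative=2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("estimate", MLPRegressor(max_iter=200, random_state=0)),
])
param_distributions = {"estimate__alpha": stats.uniform(0, 0.1)}

# Outer loop over forest settings; the values are illustrative only.
forest_grid = ParameterGrid({"contamination": [0.1, 0.2, 0.3]})

results = []
for forest_params in forest_grid:
    # Filter outliers outside the pipeline, then search on the inliers only.
    detect = IsolationForest(random_state=0, **forest_params)
    inliers_mask = detect.fit_predict(X) == 1
    search = RandomizedSearchCV(pipe, param_distributions,
                                n_iter=2, cv=3, random_state=0)
    search.fit(X[inliers_mask], y[inliers_mask])
    results.append((search.best_score_, forest_params, search))

# Keep the forest setting whose search reached the highest CV score.
best_score, best_forest_params, best_search = max(results, key=lambda r: r[0])
print(best_forest_params, best_search.best_params_)
```

Note that each forest setting removes a different set of rows, so the cross-validated scores are computed on different subsets and are not strictly comparable. Scoring every fitted search on one common held-out set would be a fairer comparison.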