Subclassing a classifier in sklearn

Date: 2019-07-25 09:37:48

Tags: python scikit-learn

I am trying to modify sklearn's VotingClassifier for use in RandomizedSearchCV. The idea is that with a larger number of classifiers the number of possible weight combinations explodes, and the search space is better expressed by sampling each weight independently than by enumerating many distinct tuples. Likewise, since the weight variation carries information, this would allow switching to smarter hyperparameter-tuning methods.

So how do I subclass VotingClassifier correctly? The code below ends up passing None for the weights (or falling back to the defaults), and the search complains that the weights are not controlled by the parameters (they are parameters).

from scipy.stats import uniform
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RandomizedSearchCV


class VotingClassifier2(VotingClassifier):
    def __init__(self, estimators, w1, w2, voting='soft', weights=None, n_jobs=None, flatten_transform=True):
        super().__init__(estimators, voting, weights, n_jobs, flatten_transform)
        if w1:
            tot = w1 + w2
        else:
            breakpoint()
        # normalize the two raw weights into a tuple that sums to 1
        self.weights = (w1 / tot, w2 / tot)


pipe = Pipeline(
    [
        [
            "vc",
            VotingClassifier2(
                estimators=[
                    ("xgb", XGBClassifier()),
                    ('lr', LogisticRegression(fit_intercept=True, max_iter=300, solver='lbfgs')),
                ],
                voting="soft",
                weights=None,
                w1=1,
                w2=0,
            ),
        ]
    ]
)


opt = RandomizedSearchCV(
    pipe,
    { 
        "vc__w1": uniform(0.1, 1),   
        "vc__w2": uniform(0.1, 1)
    },
    n_iter=5,
    cv=5,
    n_jobs=25,
    return_train_score=False,
    error_score='raise' 
)

On the initial call, w1 and w2 come back as None, even though the weights were computed from the inputs as intended. The search then runs and fails to set them:

RuntimeError: Cannot clone object VotingClassifier2(estimators=[('xgb', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objectiv...alty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]))],
         flatten_transform=True, n_jobs=None, voting='soft', w1=None,
         w2=None, weights=(1.0, 0.0)), as the constructor either does not set or modifies parameter weights

3 Answers:

Answer 0 (score: 2)

RandomizedSearchCV changes the estimator's parameters through attributes, so if you want w1 and w2 to modify the weights attribute, you can wrap them with the property decorator. Another option is to wrap weights directly, e.g.:

import scipy as sp
from dask_ml.xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RandomizedSearchCV


class VotingClassifier2(VotingClassifier):
    @property
    def weights(self):
        return self._weights

    @weights.setter
    def weights(self, value):
        # the search samples a single float; expand it into a complementary pair
        if isinstance(value, float):
            value = [value, 1 - value]
        self._weights = value


# setup a client based on your environment 
client = ...

pipe = Pipeline(
    [
        [
            "vc",
            VotingClassifier2(
                estimators=[
                    ("xgb", XGBClassifier(sheduler=client)),
                    ('lr', LogisticRegression(fit_intercept=True, max_iter=300, solver='lbfgs'))

                ],
                voting="soft",
                weights=[.5, .5],
            ),
        ]
    ]
)


opt = RandomizedSearchCV(
    pipe,
    { 
        "vc__weights": sp.stats.uniform(0.1, 1),   
    },
    n_iter=5,
    cv=5,
    n_jobs=25,
    return_train_score=False,
    error_score='raise' 
)
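
A quick sanity check of the setter with illustrative values: assigning a single float through set_params, which is what RandomizedSearchCV does internally, gets expanded into a complementary pair.

vc = VotingClassifier2(
    estimators=[('lr', LogisticRegression())],
    voting='soft',
    weights=[.5, .5],
)
vc.set_params(weights=0.7)   # this is what the search does under the hood
print(vc.weights)            # [0.7, 0.30000000000000004]

One caveat: sp.stats.uniform(0.1, 1) samples from the interval [0.1, 1.1], so 1 - value can come out slightly negative; a distribution bounded above by 1, e.g. sp.stats.uniform(0.1, 0.9), avoids that.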

Edit: If you really need to use w1 and w2, you should bind them to weights and remove weights from the __init__ method's parameters:


class VotingClassifier2(VotingClassifier):
    def __init__(self, estimators, w1, w2, voting='soft', n_jobs=None, flatten_transform=True):
        super().__init__(estimators, voting, [w1, w2], n_jobs, flatten_transform)
        self.w1 = w1
        self.w2 = w2

    # w1 and w2 look like plain constructor parameters to get_params/clone,
    # but reading or writing them goes through weights.
    @property
    def w1(self):
        return self.weights[0]

    @w1.setter
    def w1(self, value):
        if value is not None:
            self.weights[0] = value

    @property
    def w2(self):
        return self.weights[1]

    @w2.setter
    def w2(self, value):
        if value is not None:
            self.weights[1] = value
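
With w1 and w2 now ordinary constructor parameters that are mirrored into weights, clone can rebuild the estimator, so the original vc__w1 / vc__w2 search space works. A quick check with illustrative values:

from sklearn.base import clone

vc = VotingClassifier2(
    estimators=[('lr', LogisticRegression())],
    w1=0.6,
    w2=0.4,
)
print(clone(vc).weights)  # [0.6, 0.4] -- no RuntimeError this time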

Answer 1 (score: 1)

There may be a simpler, more scalable solution to your problem. Instead of searching for the best parameter combination with a grid search, you could try a stacking approach. For stacking, the base learners are your xgb and lr, and you can pick a linear regression as the meta-learner. You then simply let the meta-learner figure out the coefficient (weight) of each base learner, as in the sketch below.
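
A minimal sketch of that stacking idea, using sklearn's StackingClassifier (available from scikit-learn 0.22 onward); since the final estimator must be a classifier here, LogisticRegression stands in for the linear meta-learner:

from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),
        ("lr", LogisticRegression(max_iter=300)),
    ],
    # the meta-learner learns how much to trust each base learner
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(stack.final_estimator_.coef_)  # the learned weights on the base learners' predictions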

Answer 2 (score: 0)

Rather than incorporating this aspect into the class, I would suggest creating the weight tuples beforehand using ParameterSampler (which is what RandomizedSearchCV uses internally).

The user can then feed the normalized output of ParameterSampler to the weights parameter of GridSearchCV:

from sklearn.model_selection import ParameterSampler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from scipy.stats.distributions import uniform

# sample raw weights, then normalize each draw so it sums to 1
params = {'w1': uniform(), 'w2': uniform()}
params_order = list(params)
param_list = list(ParameterSampler(params, n_iter=2, random_state=1))

weights = []
for d in param_list:
    total = sum(d.values())
    weights.append(tuple(d[k] / total for k in params_order))

weights
# [(0.36666223123984704, 0.633337768760153),
#  (0.00037816489242002753, 0.99962183510758)]

pipe = Pipeline(
    [
        [
            "vc",
            VotingClassifier(
                estimators=[
                    ("xgb", XGBClassifier()),
                    ('lr', LogisticRegression(fit_intercept=True, max_iter=3, solver='lbfgs')),
                ],
                voting="soft",
                weights=None,
            ),
        ]
    ]
)

opt = GridSearchCV(
    pipe,
    {
        "vc__weights": weights,
    },
    cv=5,
    n_jobs=25,
    return_train_score=False,
    error_score='raise',
)

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
opt.fit(X, y)

Keep in mind that this does not reduce the search space of the weights; we are only normalizing them, so the computational complexity stays the same.

I would suggest Bayesian optimization instead of RandomizedSearchCV / GridSearchCV.

Try BayesSearchCV or Hyperopt.

Update:

We can move away from RandomizedSearchCV / GridSearchCV once you try Bayesian optimization.

If you can use hyperopt, the solution is quite simple and can handle this single-weight tuning.

Look at the example below:

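A minimal sketch of what such a hyperopt search could look like, reusing the pipe defined in the GridSearchCV snippet above; the fmin/tpe usage follows hyperopt's API, and the bounds and max_evals are illustrative:

from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

def objective(w1):
    # one free parameter suffices: the pair is normalized, so w2 = 1 - w1
    pipe.set_params(vc__weights=(w1, 1 - w1))
    # hyperopt minimizes, so negate the cross-validated accuracy
    return -cross_val_score(pipe, X, y, cv=5).mean()

trials = Trials()
best = fmin(
    fn=objective,
    space=hp.uniform('w1', 0.0, 1.0),
    algo=tpe.suggest,
    max_evals=25,
    trials=trials,
)
print(best)  # e.g. {'w1': 0.63}

Unlike the random or grid search, TPE concentrates later samples where earlier weight choices scored well, which is the smarter tuning the question was after.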