Put customized functions in an Sklearn pipeline

Time: 2015-07-07 04:44:07

Tags: machine-learning scikit-learn pipeline cross-validation feature-selection

In my classification scheme, there are several steps, including:

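  1. SMOTE (Synthetic Minority Over-sampling Technique)
  2. Fisher criteria for feature selection
  3. Standardization (Z-score normalisation)
  4. SVC (Support Vector Classifier)

The main parameters to be tuned in the scheme above are the percentile (step 2) and the hyperparameters of the SVC (step 4), and I want to tune them through grid search.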

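The current solution builds a "partial" pipeline that includes steps 3 and 4 of the scheme, clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))]), and breaks the scheme into two parts:

1) Tune the percentile of features to keep through a first grid search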

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for percentile in percentiles:
            # Fisher returns the indices of the selected features specified by the parameter 'percentile'
            selected_ind = Fisher(X_train, y_train, percentile)
            # Features are columns, so index the second axis
            X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_test, y_predict)
    

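The f1 scores are stored and then averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The point of making the loop over percentiles the inner loop is to allow fair competition: within each fold partition, every percentile sees the same training data (including the synthesized data).

2) After determining the percentile, tune the hyperparameters through a second grid search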
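
The bookkeeping described here (store one F1 score per fold and percentile, average per percentile, return the best) can be sketched as follows. This is only an illustration under several assumptions: it uses the newer `sklearn.model_selection` API, synthetic data from `make_classification`, and two hypothetical stand-ins, `oversample` (plain random duplication of minority rows, not SMOTE) and `fisher_indices` (a compact version of the question's ranking criterion). It indexes columns, which assumes samples are stored in rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def oversample(X, y, rng):
    """Hypothetical stand-in for SMOTE: duplicates random minority (y == 1) rows."""
    pos = np.flatnonzero(y == 1)
    n_extra = max(int(np.sum(y == 0)) - pos.size, 0)
    extra = rng.choice(pos, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def fisher_indices(X, y, percentile):
    """Hypothetical helper: indices of the top `percentile` percent of features."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    X_mean = X.mean(axis=0)
    deno = X_pos.var(axis=0) / (len(X_pos) - 1) + X_neg.var(axis=0) / (len(X_neg) - 1)
    num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
    n_keep = int(np.ceil(percentile / 100.0 * X.shape[1]))
    return np.argsort(num / deno)[::-1][:n_keep]


# Synthetic imbalanced data, used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
clf = Pipeline([("normal", StandardScaler()), ("svc", SVC(class_weight="balanced"))])
percentiles = [10, 25, 50, 100]
scores = {p: [] for p in percentiles}  # one list of fold scores per percentile
rng = np.random.RandomState(0)

for train_ind, test_ind in StratifiedKFold(n_splits=3).split(X, y):
    # Oversample the training fold only; the test fold stays intact
    X_train, y_train = oversample(X[train_ind], y[train_ind], rng)
    X_test, y_test = X[test_ind], y[test_ind]
    for percentile in percentiles:
        selected = fisher_indices(X_train, y_train, percentile)
        clf.fit(X_train[:, selected], y_train)
        y_predict = clf.predict(X_test[:, selected])
        scores[percentile].append(f1_score(y_test, y_predict))

# Average over folds and pick the percentile with the best mean F1
mean_scores = {p: float(np.mean(s)) for p, s in scores.items()}
best_percentile = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_percentile)
```

Because the oversampling happens once per fold, outside the percentile loop, every percentile is evaluated on identical training data, which is the "fair competition" property described in the text.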

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for parameters in parameter_comb:
            # Select the features based on the tuned percentile
            selected_ind = Fisher(X_train, y_train, best_percentile)
            # Features are columns, so index the second axis
            X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
            clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_test, y_predict)
    

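It is done in a very similar way, except that we tune the hyperparameters of the SVC rather than the percentile of features to select.

My questions are:

I) In the current solution, I only involve steps 3 and 4 in clf and do steps 1 and 2 somewhat "manually" in the two nested loops described above. Is there any way to include all four steps in a pipeline and do the whole process at once?

II) If it is okay to keep the first nested loop, is it possible (and how) to simplify the second nested loop using a single pipeline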

    clf_all = Pipeline([('smote', SMOTE()),
                        ('fisher', Fisher(percentile=best_percentile)),
                        ('normal',preprocessing.StandardScaler()),
                        ('svc',svm.SVC(class_weight='auto'))])
    

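and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

Please note that both SMOTE and Fisher (the ranking criterion) have to be done only on the training data in each fold partition.

Any comment would be greatly appreciated.

EDIT: SMOTE and Fisher are shown below: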

    from numpy import shape, argsort, ceil

    def Fscore(X, y, percentile=None):
        X_pos, X_neg = X[y==1], X[y==0]
        X_mean = X.mean(axis=0)
        X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
        deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
        num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
        F = num/deno
        # Rank features by decreasing score and keep the top `percentile` percent
        sort_F = argsort(F)[::-1]
        n_feature = (float(percentile)/100)*shape(X)[1]
        ind_feature = sort_F[:int(ceil(n_feature))]
        return(ind_feature)
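
Read as a formula, the criterion computed by Fscore scores each feature j as shown below. This is only a transcription of the code, with notation introduced here for illustration: the bars denote per-feature means over all samples, over the positive class (+), and over the negative class (-); n_+ and n_- are the class sizes; Var is the variance as returned by NumPy's `var`, which divides by n by default; d is the number of features and p the percentile.

```latex
F_j = \frac{\left(\bar{x}_j^{(+)} - \bar{x}_j\right)^2 + \left(\bar{x}_j^{(-)} - \bar{x}_j\right)^2}
           {\dfrac{1}{n_+ - 1}\,\mathrm{Var}\!\left(x_j^{(+)}\right) + \dfrac{1}{n_- - 1}\,\mathrm{Var}\!\left(x_j^{(-)}\right)},
\qquad
\text{features kept} = \left\lceil \frac{p}{100}\, d \right\rceil
```

The function returns the indices of the features with the largest F_j, so the result is meant to index the columns of X.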
    

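SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with the labels of both.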

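    def smote(X, y):
        # Class counts: positives (y == 1) are the class being oversampled
        n_pos, n_neg = sum(y==1), sum(y==0)
        n_syn = (n_neg-n_pos)/float(n_pos)
        X_pos = X[y==1]
        # Synthesize new samples from the positive class only
        X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
        y_syn = np.ones(shape(X_syn)[0])
        # Stack the original data with the synthesized data and labels
        X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
        return(X, y)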
    

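2 answers:

Answer 0: (score: 3)

I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. In order to do so you will need to write a wrapper class around those functions. The easiest way to do this is to inherit sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

If this isn't making sense to you, post the details of at least one of your functions (the library it comes from, or your code if you wrote it yourself) and we can go from there.

EDIT:

I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do them beforehand, as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.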

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
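
The ValueError at the end of this session appears to come from two details of the class rather than from the pipeline mechanism itself. First, transform indexes rows (x[self.ind_feature,:]) where the feature indices call for columns. Second, the grid values 0.75 and 0.50 are divided by 100 again inside fit, so ceil(...) keeps a single index, which is why X ends up with 1 sample. A minimal corrected sketch follows, under these assumptions: a newer scikit-learn (`sklearn.model_selection`, `class_weight='balanced'` in place of `'auto'`), percentiles on a 0 to 100 scale, the binary `load_breast_cancer` dataset as example data, and the hypothetical class name `FisherSelector`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class FisherSelector(TransformerMixin, BaseEstimator):
    """Keep the top `percentile` percent of features (0-100 scale), binary y in {0, 1}."""

    def __init__(self, percentile=50):
        self.percentile = percentile

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        X_pos, X_neg = X[y == 1], X[y == 0]
        X_mean = X.mean(axis=0)
        deno = (X_pos.var(axis=0) / (X_pos.shape[0] - 1)
                + X_neg.var(axis=0) / (X_neg.shape[0] - 1))
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        n_keep = int(np.ceil(self.percentile / 100.0 * X.shape[1]))
        # Indices of the highest-scoring features, learned from the training data only
        self.ind_feature_ = np.argsort(num / deno)[::-1][:n_keep]
        return self

    def transform(self, X):
        # Select columns (features), not rows (samples)
        return np.asarray(X)[:, self.ind_feature_]


X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("fisher", FisherSelector()),
    ("normal", StandardScaler()),
    ("svm", SVC(class_weight="balanced")),
])
grid = {"fisher__percentile": [25, 50, 75], "svm__C": [1, 2]}

model = GridSearchCV(estimator=pipeline, param_grid=grid, cv=3, scoring="f1")
model.fit(X, y)
print(model.best_params_)
```

Because the selector is fitted inside the pipeline, GridSearchCV refits it on the training part of every fold, so the feature ranking never sees the held-out data. The oversampling step still has to happen outside a plain scikit-learn Pipeline, since it changes y.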

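Answer 1: (score: 1)

scikit-learn introduced FunctionTransformer as part of the preprocessing module in version 0.17. It can be used in a similar way to David's implementation of the Fisher class in the answer above, but with less flexibility. If the input/output of the function is configured properly, the transformer can implement the fit/transform/fit_transform methods for the function and thus allow it to be used in a scikit pipeline.

For example, if the input to a pipeline is a series, the transformer would be as follows: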

def trans_func(input_series):
    # output_series stands for whatever the function computes from input_series
    return output_series

from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(trans_func)

sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)

where vect is a tf_idf transformer, clf is a classifier, and train is the training dataset. "train.desc" is the series of text input to the pipeline.
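
A self-contained version of this pattern might look like the sketch below. The toy documents and labels, the `clean_text` function, and the choice of `TfidfVectorizer` and `LogisticRegression` for the vect and clf steps are all assumptions made for illustration. `validate=False` is passed explicitly so that the transformer hands the raw strings to the function without trying to convert them to a numeric array.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_text(docs):
    """Stateless cleanup: lowercase each document and strip surrounding whitespace."""
    return [doc.lower().strip() for doc in docs]


# Toy data standing in for train.desc and train.tag
docs = ["  Cheap pills ONLINE ", "Meeting at noon",
        "cheap ONLINE offer  ", "Lunch meeting tomorrow"]
tags = [1, 0, 1, 0]

sk_pipe = Pipeline([
    ("trans", FunctionTransformer(clean_text, validate=False)),
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
sk_pipe.fit(docs, tags)
predictions = sk_pipe.predict(["CHEAP online pills"])
print(predictions)
```

FunctionTransformer is stateless: it learns nothing in fit, so it suits fixed transformations like this one but not a step such as Fisher selection, which has to remember which features were chosen on the training data.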