How do I write a fit_transformer with two inputs and include it in a Python sklearn pipeline?

Asked: 2016-09-19 01:56:35

Tags: python scikit-learn pipeline transformer

Given some fake data:

import numpy as np
import pandas as pd

X = pd.DataFrame( np.random.randint(1,10,28).reshape(14,2) )
y = pd.Series( np.repeat([0,1], [10,4]) ) # imbalanced with more 0s than 1s

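I wrote an sklearn fit-transformer that undersamples the majority class of y so that it matches the size of the minority class. I would like to use it in a pipeline.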

import random

from sklearn.base import BaseEstimator, TransformerMixin

class UnderSampling(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state  # seed for reproducible sampling

    def fit(self, X, y): # I don't need fit to do anything
        return self

    def transform(self, X, y):
        is_pos = y == 1
        idx_pos = y[is_pos].index
        random.seed(self.random_state)
        # sample as many negative rows as there are positive rows
        idx_neg = random.sample(list(y[~is_pos].index), int(is_pos.sum()))
        idx = sorted(list(idx_pos) + list(idx_neg))
        X_resampled = X.loc[idx]
        y_resampled = y.loc[idx]
        return X_resampled, y_resampled

    def fit_transform(self, X, y):
        return self.transform(X,y)

Unfortunately, I cannot use it in a pipeline.

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

us = UnderSampling()
rfc = RandomForestClassifier()
model = make_pipeline(us, rfc)
model.fit(X,y)  # this call fails

How can I make this pipeline work?

1 answer:

Answer 0: (score: 0)

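You are not meant to call estimator methods directly on the class; call them on an instance of the class instead. This is because estimators usually hold some kind of stored state (for example, model coefficients):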

u = UnderSampling()
u.fit(X, y)  # returns the instance itself, not resampled data
a,b = u.fit_transform(X, y)  # returns (X_resampled, y_resampled)
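
Calling the methods on an instance still does not make the pipeline work. A scikit-learn Pipeline feeds each transformer's output to the next step as X and passes y along unchanged, so a transformer has no way to return a resampled y. One possible workaround, sketched below under stated assumptions, is to move the undersampling into the fit method of a wrapper around the final classifier. The class name UnderSampledClassifier, its n_train_samples_ attribute, and the StandardScaler step are hypothetical choices made for illustration; the sketch also assumes binary labels with 1 as the minority class, as in the question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class UnderSampledClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: undersample the majority class during fit only."""

    def __init__(self, estimator, random_state=None):
        self.estimator = estimator
        self.random_state = random_state

    def _positions_to_keep(self, y):
        # Assumption: labels are 0/1 and label 1 is the minority class.
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y != 1)
        rng = np.random.RandomState(self.random_state)
        # Draw as many majority rows as there are minority rows.
        keep_neg = rng.choice(neg, size=len(pos), replace=False)
        return np.sort(np.concatenate([pos, keep_neg]))

    def fit(self, X, y):
        X_arr, y_arr = np.asarray(X), np.asarray(y)
        keep = self._positions_to_keep(y_arr)
        self.n_train_samples_ = len(keep)
        # Fit a fresh copy of the wrapped estimator on the balanced subset.
        self.estimator_ = clone(self.estimator).fit(X_arr[keep], y_arr[keep])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # Prediction uses every row; no sampling happens here.
        return self.estimator_.predict(np.asarray(X))


# Same shape of fake data as in the question, with a fixed seed.
X = pd.DataFrame(np.random.RandomState(0).randint(1, 10, 28).reshape(14, 2))
y = pd.Series(np.repeat([0, 1], [10, 4]))

model = make_pipeline(
    StandardScaler(),  # illustrative transformer step; it only touches X
    UnderSampledClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0),
        random_state=0,
    ),
)
model.fit(X, y)
predictions = model.predict(X)
```

With this layout the sampling happens only while fitting, so predict still scores every row it is given. The separate imbalanced-learn package is also worth a look, since it is aimed at resampling inside pipelines; it is not shown here.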