Custom sklearn pipeline with TF-IDF

Time: 2020-02-07 09:20:58

Tags: python pandas scikit-learn tfidfvectorizer

I am trying to build a sklearn pipeline that combines my own FeatureSelector with a TF-IDF vectorizer, but it does not work.

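from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, make_pipeline

class FeatureSelector(BaseEstimator, TransformerMixin):
    # Selects the given columns from a pandas DataFrame.

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # With a list of names this returns a 2-D DataFrame, not a 1-D Series.
        return X[self.feature_names]

tfidf_vect = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))

feature_pipeline = make_pipeline(FeatureSelector(['text']))

# SVM, train_x and y_train are defined elsewhere in my code (not shown here).
full_pipeline = Pipeline(steps=[('feature_pipeline', feature_pipeline), ('tfidf', tfidf_vect), ('clf', SVM)])

full_pipeline.fit(train_x, y_train)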


It gives me the following error.

"ValueError: Found input variables with inconsistent numbers of samples: [1, 30597]"
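
A possible explanation, offered as a sketch and not a confirmed diagnosis: the `[1, 30597]` in the message suggests the vectorizer produced a single row while `y_train` has 30597 labels. That is what happens when `TfidfVectorizer` receives a DataFrame, because iterating over a DataFrame yields its column labels instead of its rows, so `X[['text']]` looks like one document. The example below uses made-up toy data, a hypothetical `ColumnSelector` class that returns a single column as a Series, and `LinearSVC` as a stand-in for the undefined `SVM` object; none of these names come from the original post.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector: returns ONE column as a 1-D pandas Series."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # A plain string key (not a list) gives a Series of documents.
        return X[self.column]


# Toy data standing in for train_x / y_train (invented for illustration).
train_x = pd.DataFrame(
    {"text": ["good movie", "bad movie", "great film", "awful film"]}
)
y_train = [1, 0, 1, 0]

# A DataFrame iterates over column labels, so only one "document" is seen.
n_docs_from_frame = TfidfVectorizer().fit_transform(train_x[["text"]]).shape[0]

# A Series iterates over its values, so every row becomes a document.
n_docs_from_series = TfidfVectorizer().fit_transform(train_x["text"]).shape[0]

full_pipeline = Pipeline(steps=[
    ("select", ColumnSelector("text")),
    ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1, 2))),
    ("clf", LinearSVC()),  # assumption: any sklearn classifier could go here
])
full_pipeline.fit(train_x, y_train)
predictions = full_pipeline.predict(train_x)
```

If this is indeed the cause, passing `'text'` instead of `['text']` to the original `FeatureSelector` should have the same effect, since `X['text']` is also a Series.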

0 answers:

No answers yet