Error when extracting a sub-pipeline from an sklearn Pipeline by index

Time: 2020-05-11 09:49:38

Tags: python python-2.7 scikit-learn pipeline

I have a machine learning pipeline:

logreg = Pipeline([('vect', CountVectorizer(ngram_range=(1,1))),
                   ('tfidf', TfidfTransformer(sublinear_tf=True, use_idf=True)),
                   ('clf', LogisticRegression(n_jobs=-1, C=1e2, multi_class='ovr', 
                                              solver='lbfgs', max_iter=1000))])

logreg.fit(X_train, y_train)

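I want to extract the feature matrix produced by the first two steps of the pipeline, so I tried to slice a sub-pipeline containing just those two steps out of the original pipeline. The following code raises an error: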

logreg[:-1].fit(X)

TypeError: 'Pipeline' object has no attribute '__getitem__'

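How can I extract the first two steps of the Pipeline without building a new pipeline for the data transformation?

2 answers:

Answer 0: (score: 1)

If you only want to run some of the steps, you can create a pipeline from them at runtime.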

partial_pipe = Pipeline(logreg.steps[:-1])
partial_pipe.fit(data)

The pipeline's steps are available in the steps attribute of the Pipeline object.
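Since the steps attribute holds the same estimator objects as the original pipeline, a sub-pipeline built from it should be able to reuse the already-fitted transformers without calling fit again. Below is a minimal sketch of that idea. The toy corpus and labels are made up for illustration, and the classifier arguments from the question are trimmed for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for this example only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the stock market fell",
    "investors sold shares",
]
labels = [0, 0, 1, 1]

logreg = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 1))),
    ("tfidf", TfidfTransformer(sublinear_tf=True, use_idf=True)),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
logreg.fit(docs, labels)

# Build the sub-pipeline from the fitted steps; the estimators are shared, not copied.
partial_pipe = Pipeline(logreg.steps[:-1])

# The transformers are already fitted, so transform() can be called directly.
features = partial_pipe.transform(docs)
print(features.shape)
```

Because the estimators are shared, calling partial_pipe.fit(...) on different data would also refit the transformers inside logreg, so transform is the safer call when the original pipeline is still in use.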

Answer 1: (score: 0)

I think you are using an older version of sklearn. With a newer version, it should be possible to index the pipeline the way you did.

You can see the release notes here.

Example:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

categories = ['alt.atheism', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y)

logreg = Pipeline([('vect', CountVectorizer(ngram_range=(1, 1))),
                   ('tfidf', TfidfTransformer(sublinear_tf=True, use_idf=True)),
                   ('clf', LogisticRegression(n_jobs=-1, C=1e2, multi_class='ovr',
                                              solver='lbfgs', max_iter=1000))])

logreg.fit(X_train, y_train)

logreg[:-1].fit_transform(X_train)

# <599x15479 sparse matrix of type '<class 'numpy.float64'>'
#   with 107539 stored elements in Compressed Sparse Row format>
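If the same code has to run on both old and new installations, one option is to try slicing first and fall back to rebuilding the pipeline from its steps. The sketch below assumes that a Pipeline without slicing support raises TypeError, as in the question. The helper name without_last_step and the toy data are hypothetical. As far as I know, slicing arrived in scikit-learn 0.21, which requires Python 3, so on the python-2.7 setup from the question the fallback branch is likely the one that runs.

```python
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def without_last_step(pipe):
    """Hypothetical helper: a Pipeline with every step of `pipe` except the last."""
    try:
        # Works on versions where Pipeline implements slicing.
        return pipe[:-1]
    except TypeError:
        # Older releases raise TypeError because Pipeline has no __getitem__.
        return Pipeline(pipe.steps[:-1])


# Toy corpus and labels, invented for this example only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the stock market fell",
    "investors sold shares",
]
labels = [0, 0, 1, 1]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
pipe.fit(docs, labels)

print(sklearn.__version__)  # shows which branch of the helper is likely in use

sub = without_last_step(pipe)
features = sub.transform(docs)  # transformers are already fitted
print(features.shape[0])  # one row per document
```

Either branch returns a Pipeline that shares its estimators with the original, so the two results should behave the same for transform.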
