How do I analyze the intermediate steps of an sklearn Pipeline?

Time: 2019-01-23 17:30:52

Tags: python python-3.x machine-learning scikit-learn

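I am using sklearn to classify text. I build the sparse matrix with CountVectorizer and TfidfTransformer.

Several preprocessing steps are applied to the strings in a custom tokenize_and_stem function, which I pass to CountVectorizer as its tokenizer.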

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# tokenize_and_stem is the custom tokenizer described above (defined elsewhere)
SVM = Pipeline([
    ('vect', CountVectorizer(max_features=100000,
                             ngram_range=(1, 2),
                             stop_words='english',
                             tokenizer=tokenize_and_stem)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf-svm', LinearSVC(C=1)),
])

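My question: is there a simple way to view or store the output of steps 1 and 2 of the Pipeline, so I can analyze what kind of array is being fed into the SVM?

2 answers:

Answer 0: (score: 0)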

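From the docs:

named_steps : bunch object, a dictionary with attribute access. Read-only attribute to access any step parameter by user given name. Keys are step names and values are steps parameters.

You should be able to access the elements just as you would with a dictionary: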

SVM.named_steps['vect'] 
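
Note that named_steps returns the step object itself rather than its output. As a minimal sketch, assuming a small toy corpus with made-up labels and leaving out the custom tokenizer from the question, the fitted steps could be applied by hand to see what reaches the SVM:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy documents and labels, made up purely for illustration.
docs = ["I want to test this document", "let us see how it works", "I am okay and you ?"]
labels = [0, 1, 1]

svm = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf-svm', LinearSVC(C=1)),
])
svm.fit(docs, labels)

# Output of step 1: raw term counts as a sparse matrix.
counts = svm.named_steps['vect'].transform(docs)
# Output of step 2: TF-IDF weights, i.e. the matrix the SVM is trained on.
tfidf = svm.named_steps['tfidf'].transform(counts)

print(counts.shape, tfidf.shape)
print(tfidf.toarray())
```

Both matrices are SciPy sparse matrices, so calling .toarray() is only sensible on small inputs like this one.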

Answer 1: (score: 0)

You can get the output of the intermediate steps as follows.

Based on the source code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('clf-svm', LinearSVC(C=1)),
])
X = ["I want to test this document", "let us see how it works", "I am okay and you ?"]

pipeline.fit(X, [0, 1, 1])

# On scikit-learn >= 1.0 use get_feature_names_out() instead;
# get_feature_names() was removed in 1.2.
print(pipeline.named_steps['vect'].get_feature_names())

['document', 'let', 'let works', 'okay', 'test', 'test document', 'want', 'want test', 'works']    

# Here is where you can get the output of the intermediate steps
Xt = X

for name, transform in pipeline.steps[:-1]:
    if transform is not None:
        Xt = transform.transform(Xt)

print(Xt)



  (0, 7)    0.4472135954999579
  (0, 6)    0.4472135954999579
  (0, 5)    0.4472135954999579
  (0, 4)    0.4472135954999579
  (0, 0)    0.4472135954999579
  (1, 8)    0.5773502691896257
  (1, 2)    0.5773502691896257
  (1, 1)    0.5773502691896257
  (2, 3)    1.0
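
On newer releases the same result can probably be had without the explicit loop. The sketch below assumes scikit-learn 0.21 or later, where a Pipeline can be sliced like a list; it reuses the toy documents from this answer and is meant as an illustration, not as the only way to do it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X = ["I want to test this document", "let us see how it works", "I am okay and you ?"]

pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('clf-svm', LinearSVC(C=1)),
])
pipeline.fit(X, [0, 1, 1])

# pipeline[:-1] is a new Pipeline holding every step except the final
# classifier; it reuses the already fitted transformers.
features = pipeline[:-1].transform(X)

print(features.shape)
print(features)
```

The sliced pipeline only transforms the data; it does not refit anything, so it shows exactly the matrix that the classifier receives at predict time.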