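I am using sklearn to classify text. I build the sparse matrix with CountVectorizer and TfidfTransformer.
Several preprocessing steps are applied to the strings in a custom tokenize_and_stem function,
which I pass to CountVectorizer as its tokenizer.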
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
SVM = Pipeline([
    ('vect', CountVectorizer(max_features=100000, ngram_range=(1, 2),
                             stop_words='english', tokenizer=tokenize_and_stem)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf-svm', LinearSVC(C=1)),
])
My question: is there an easy way to view or store the output of pipeline steps 1 and 2, so that I can analyze what kind of array is fed into the SVM?
Answer 0 (score: 0)
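From the docs:
named_steps: Bunch object, a dictionary with attribute access. Read-only attribute to access any step parameter by user given name. Keys are step names and values are steps parameters.
You should be able to access the elements the same way you would access a dictionary: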
SVM.named_steps['vect']
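A minimal sketch of how that lookup can be used to inspect what reaches the classifier, assuming the pipeline has already been fitted. The toy documents and labels are made up for illustration, and the custom tokenize_and_stem tokenizer from the question is left out so the snippet runs on its own.

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data, only for illustration
docs = ["I want to test this document", "let us see how it works", "I am okay and you ?"]
labels = [0, 1, 1]

svm = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf-svm', LinearSVC(C=1)),
])
svm.fit(docs, labels)

# Each fitted step can be looked up by name and applied by hand
counts = svm.named_steps['vect'].transform(docs)     # output of step 1 (raw counts)
tfidf = svm.named_steps['tfidf'].transform(counts)   # output of step 2 (input to the SVM)

print(type(counts), counts.shape)
print(type(tfidf), tfidf.shape)
```

Both results are SciPy sparse matrices, so calling .toarray() on them (for small inputs) gives a dense array that is easier to read or save.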
Answer 1 (score: 0)
You can get the output of the intermediate steps as follows.
Based on the source code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('clf-svm', LinearSVC(C=1)),
])
X= ["I want to test this document", "let us see how it works", "I am okay and you ?"]
pipeline.fit(X,[0,1,1])
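# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(pipeline.named_steps['vect'].get_feature_names_out().tolist())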
['document', 'let', 'let works', 'okay', 'test', 'test document', 'want', 'want test', 'works']
# Here is where you can get the output of the intermediate steps
Xt = X
for name, transform in pipeline.steps[:-1]:
    if transform is not None:
        Xt = transform.transform(Xt)
print(Xt)
(0, 7) 0.4472135954999579
(0, 6) 0.4472135954999579
(0, 5) 0.4472135954999579
(0, 4) 0.4472135954999579
(0, 0) 0.4472135954999579
(1, 8) 0.5773502691896257
(1, 2) 0.5773502691896257
(1, 1) 0.5773502691896257
(2, 3) 1.0
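As an alternative sketch, assuming scikit-learn 0.21 or later, where Pipeline supports slicing: pipeline[:-1] gives a sub-pipeline holding every step except the final classifier, and its transform should return the same matrix as the loop above. The variable name features is only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

X = ["I want to test this document", "let us see how it works", "I am okay and you ?"]

pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('clf-svm', LinearSVC(C=1)),
])
pipeline.fit(X, [0, 1, 1])

# Slice off the classifier; the slice reuses the already fitted transformers
features = pipeline[:-1].transform(X)

print(features.shape)
print(features.toarray().round(3))  # dense view, fine for a tiny example
```

Converting to a dense array is only sensible for small inputs like this one; for a real corpus the sparse matrix can be stored as-is, for example with scipy.sparse.save_npz.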