Is there a convenient mechanism for locking steps in a scikit-learn pipeline so that they are not refit by pipeline.fit()? For example:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')
firsttwoclasses = data.target<=1
y = data.target[firsttwoclasses]
X = np.array(data.data)[firsttwoclasses]
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC())
])
# fit initial step on subset of data, perhaps an entirely different subset
# this particular example would not be very useful in practice
pipeline.named_steps["vectorizer"].fit(X[:400])
X2 = pipeline.named_steps["vectorizer"].transform(X)
# fit estimator on all data without refitting vectorizer
pipeline.named_steps["estimator"].fit(X2, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))
# fitting entire pipeline refits vectorizer
# is there a convenient way to lock the vectorizer without doing the above?
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))
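Without the intermediate transformation, the only way I can think of to do this is to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method is the transform of the pre-fit transformer. Is this the only way?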
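The wrapper idea described above could look roughly like the following. This is only a minimal sketch: the class name `FrozenTransformer`, its `fitted_transformer` parameter, and the toy documents are hypothetical and chosen for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that delegates transform() to an already fitted transformer."""

    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # Intentionally a no-op, so Pipeline.fit() leaves the wrapped step alone.
        return self

    def transform(self, X):
        return self.fitted_transformer.transform(X)


docs = ["red apple", "green apple", "blue sky", "grey sky"]  # toy data (assumption)
labels = [0, 0, 1, 1]

vectorizer = CountVectorizer().fit(docs[:2])  # fitted on a subset only
frozen_pipeline = Pipeline([
    ("vectorizer", FrozenTransformer(vectorizer)),
    ("estimator", LinearSVC()),
])
frozen_pipeline.fit(docs, labels)

# The vocabulary still reflects only the first two documents.
vocabulary_after_fit = sorted(vectorizer.vocabulary_)
```

One caveat with this sketch: anything that clones the pipeline (for example a grid search, or a Pipeline created with caching enabled) would, as far as I can tell, also clone the wrapped transformer and discard its fitted state.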
Answer 0: (score: 2)
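Looking at the code, there does not seem to be anything in the Pipeline object that provides this functionality: calling .fit() on the pipeline results in a fit on every stage.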
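The refitting behaviour can be checked on a small example. The sketch below uses made-up toy documents instead of the 20 newsgroups data, so the numbers are specific to this toy input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["red apple", "green apple", "blue sky", "grey sky"]  # toy data (assumption)
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC()),
])

# Pre-fit the vectorizer on the first two documents only.
pipeline.named_steps["vectorizer"].fit(docs[:2])
size_before = len(pipeline.named_steps["vectorizer"].vocabulary_)

# Fitting the whole pipeline refits the vectorizer, so the vocabulary is rebuilt.
pipeline.fit(docs, labels)
size_after = len(pipeline.named_steps["vectorizer"].vocabulary_)

print(size_before, size_after)
```

Here the vocabulary grows from the three words of the subset to all six words, which shows that the pre-fit state was overwritten.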
The best quick-and-dirty solution I can think of is to monkey-patch the stage's fit functionality:
vectorizer = pipeline.named_steps["vectorizer"]
vectorizer.fit(X[:400])
# disable fitting on the vectorizer step; attributes set on an instance are
# not bound methods, so the replacements take no `self` argument
vectorizer.fit = lambda X, y=None, **fit_params: vectorizer
vectorizer.fit_transform = lambda X, y=None, **fit_params: vectorizer.transform(X)
pipeline.fit(X, y)
Answer 1: (score: 0)
You can take a part of the pipeline, e.g.
preprocess_pipeline = Pipeline(pipeline.best_estimator_.steps[:-1])  # exclude the last step
and then
tmp = preprocess_pipeline.fit(x_train)
normalized_x = tmp.fit_transform(x_train)
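Note that fit() and fit_transform() refit the sliced steps. If the goal is to reuse steps that are already fitted, calling transform() on the sliced pipeline avoids that. A minimal sketch, assuming a plain fitted Pipeline named `full_pipeline` (hypothetical; in the snippet above the fitted pipeline comes from `best_estimator_` of a search object) and toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

x_train = ["red apple", "green apple", "blue sky", "grey sky"]  # toy data (assumption)
y_train = [0, 0, 1, 1]

full_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC()),
]).fit(x_train, y_train)

# New Pipeline object that shares the same, already fitted step instances,
# minus the final estimator.
preprocess_pipeline = Pipeline(full_pipeline.steps[:-1])

# transform() reuses the fitted state; fit() or fit_transform() would refit the steps.
features = preprocess_pipeline.transform(["red sky"])
```

Because the step objects are shared rather than copied, refitting `preprocess_pipeline` later would also change the steps inside `full_pipeline`.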