What is the recommended way to parallelize the predict() method of a scikit-learn pipeline?
Here is a minimal working example that illustrates the problem by attempting a parallel predict() on the iris data, using an SVM pipeline and 5 parallel jobs:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.externals.joblib import Parallel, delayed
from sklearn import datasets
# Load Iris data
iris = datasets.load_iris()
# Create a pipeline with 2 steps: scaler and SVM; train
pipe = make_pipeline(StandardScaler(), SVC()).fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(pipe.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: fails
results = parallel(jobs)
This code fails with the following error:
PicklingError: Can't pickle <function Pipeline.predict at 0x000000001746B730>: it's not the same object as sklearn.pipeline.Pipeline.predict
However, applying the parallelization directly to the SVM classifier instead of the pipeline does work:
# Load Iris data
iris = datasets.load_iris()
# Create SVM classifier, train
svc = SVC().fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(svc.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: works
results = parallel(jobs)
I can essentially work around the problem by unpacking the pipeline: first apply the scaler to the whole array, then split it into chunks and parallelize svc.predict() as above. However, this is inconvenient and largely defeats the purpose of the pipeline: I have to keep track of intermediate results, and if an extra transformer step is added to the pipeline, the code has to change.
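For reference, the unpacked workaround I describe looks roughly like this (a sketch, not my exact code; it uses the top-level joblib import rather than sklearn.externals.joblib, and refits the estimators separately):

```python
from joblib import Parallel, delayed
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()

# Fit the scaler and the SVM separately instead of as a pipeline
scaler = StandardScaler().fit(iris.data)
svc = SVC().fit(scaler.transform(iris.data), iris.target)

# Apply the scaler to the whole array first, then split into chunks
scaled = scaler.transform(iris.data)
n_chunks = 5
n_samples = scaled.shape[0]
slices = [(int(n_samples * i / n_chunks), int(n_samples * (i + 1) / n_chunks))
          for i in range(n_chunks)]
data_chunks = [scaled[start:stop] for start, stop in slices]

# Parallelize only the final estimator's predict()
results = Parallel(n_jobs=n_chunks)(
    delayed(svc.predict)(chunk) for chunk in data_chunks
)
```

This works because the fitted SVC (unlike the bound pipeline method) pickles cleanly, but every added transformer would force another manual transform step before the split.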
Is there a way to parallelize using the pipeline directly?
Many thanks,
Aleksey