Parallelizing a scikit-learn pipeline's predict() with joblib

Time: 2018-07-05 21:58:39

Tags: python scikit-learn joblib

What is the recommended way to parallelize the predict() method of a scikit-learn pipeline?

Here is a minimal working example that illustrates the problem by attempting a parallel predict() on the iris data, using an SVM pipeline and 5 parallel jobs:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.externals.joblib import Parallel, delayed
from sklearn import datasets

# Load Iris data
iris = datasets.load_iris()
# Create a pipeline with 2 steps: scaler and SVM; train
pipe = make_pipeline(StandardScaler(), SVC()).fit(X=iris.data, y=iris.target)

# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]

# Setup 5 parallel jobs 
jobs = (delayed(pipe.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)

# Run jobs: fails
results = parallel(jobs)

This code fails with the following message:

PicklingError: Can't pickle <function Pipeline.predict at 0x000000001746B730>: it's not the same object as sklearn.pipeline.Pipeline.predict 
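A likely cause: in the scikit-learn versions of that era, Pipeline.predict was wrapped by the if_delegate_has_method decorator, so the bound method pipe.predict is not the same object as the one on the class and cannot be pickled by joblib's default pickler. One commonly suggested workaround (a sketch, not part of the original question) is to pass the pipeline itself through a module-level helper function, which pickles fine:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn import datasets
from joblib import Parallel, delayed  # standalone joblib; sklearn.externals.joblib is deprecated


def predict_chunk(model, array):
    # A plain module-level function is picklable, unlike the decorated bound method
    return model.predict(array)


# Load Iris data and fit the pipeline, as in the question
iris = datasets.load_iris()
pipe = make_pipeline(StandardScaler(), SVC()).fit(X=iris.data, y=iris.target)

# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples * i / n_chunks), int(n_samples * (i + 1) / n_chunks))
          for i in range(n_chunks)]
data_chunks = [iris.data[a:b] for a, b in slices]

# Pickle the pipeline (an ordinary estimator object), not its predict method
results = Parallel(n_jobs=n_chunks)(
    delayed(predict_chunk)(pipe, chunk) for chunk in data_chunks
)
```

Note also that recent joblib versions use cloudpickle via the loky backend, which can serialize many objects the standard pickler cannot, so the original failure may not reproduce on current installations.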

However, applying the parallelization directly to the SVM classifier rather than to the pipeline works:

# Load Iris data
iris = datasets.load_iris()
# Create SVM classifier, train
svc = SVC().fit(X=iris.data, y=iris.target)

# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]

# Setup 5 parallel jobs 
jobs = (delayed(svc.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)

# Run jobs: works
results = parallel(jobs)

I can essentially work around the problem by taking the pipeline apart: first apply the scaling to the whole array, then split it into chunks and parallelize svc.predict() as above. However, this is inconvenient and largely defeats the advantages a pipeline provides: I have to keep track of intermediate results, and if an additional transformer step is added to the pipeline, the code has to change.
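The unbundled workaround described above can be sketched as follows, fitting the scaler and the SVM separately instead of as a pipeline (a sketch of the approach, not code from the original question):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn import datasets
from joblib import Parallel, delayed

# Load Iris data; fit scaler and classifier as separate steps
iris = datasets.load_iris()
scaler = StandardScaler().fit(iris.data)
svc = SVC().fit(scaler.transform(iris.data), iris.target)

# Scale the whole array up front (the intermediate result that must be tracked)
scaled = scaler.transform(iris.data)

# Split the scaled data in 5 chunks
n_chunks = 5
n_samples = scaled.shape[0]
slices = [(int(n_samples * i / n_chunks), int(n_samples * (i + 1) / n_chunks))
          for i in range(n_chunks)]
chunks = [scaled[a:b] for a, b in slices]

# Parallelize only the final estimator's predict, which pickles without issue
results = Parallel(n_jobs=n_chunks)(delayed(svc.predict)(c) for c in chunks)
```

The drawback is visible in the code: every transformer added to the model means another explicit transform call and another intermediate array before the parallel step.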

Is there a way to parallelize using the pipeline directly?

Many thanks,

Aleksey

0 Answers:

There are no answers yet