PicklingError with WordPunctTokenizer and sklearn GridSearchCV when n_jobs = -1

Date: 2017-11-17 19:47:26

Tags: scikit-learn nltk python-multiprocessing

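I am trying to use scikit-learn's GridSearchCV on a problem, and I get a PicklingError when n_jobs is set to -1 in GridSearchCV. I searched the forums for a solution but could not find anything useful. The error seems to come from the multiprocessing module, which fails to serialize an object when the work is sent to parallel worker processes. I am not using any custom classes here, only standard sklearn transformers and estimators, plus the tokenize method of NLTK's WordPunctTokenizer as the vectorizer's tokenizer. Any help resolving this would be much appreciated. Thanks.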

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from nltk.tokenize import WordPunctTokenizer

X = np.asarray(['This is a sample text',
                'Here is another one',
                'How about this?',
                'Some random text again',
                'Need to make this work',
                'sklearn is awesome!',
                'Adding more train data',
                'and more and more data'           
               ])

y = np.asarray([1, 2, 2, 1, 1, 1, 2, 1])

pipeline = Pipeline([
               ('vectorizer', TfidfVectorizer(ngram_range=(1, 3), tokenizer=WordPunctTokenizer().tokenize, stop_words='english')),
               ('estimator', LogisticRegression(class_weight='balanced'))
           ])

param_grid = dict()
param_grid['vectorizer__sublinear_tf'] = [True, False]
param_grid['vectorizer__smooth_idf'] = [True, False]
param_grid['vectorizer__norm'] = ['l1', 'l2']
param_grid['estimator__penalty'] = ['l1', 'l2']

grid_clf = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring='f1_micro')
grid_clf.fit(X, y)

print "\nBest parameters:", grid_clf.best_params_
print "Best score:", grid_clf.best_score_, "\n"

Error:

Fitting 3 folds for each of 16 candidates, totalling 48 fits
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-17-04b1aa29a4ee> in <module>()
     30 
     31 grid_clf = GridSearchCV(pipeline, param_grid, verbose=2, n_jobs=-1, scoring='f1_micro')
---> 32 grid_clf.fit(X, y)
     33 
     34 print "\nBest parameters:", grid_clf.best_params_
....

PicklingError: Can't pickle <type 'instancemethod'>: it's not found as __builtin__.instancemethod
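
A note on the likely cause, offered as a sketch rather than a confirmed fix: `<type 'instancemethod'>` is how Python 2 names a bound method, and the only bound method in the pipeline above is `WordPunctTokenizer().tokenize`. Python 2's pickle cannot serialize bound methods by default, while a plain module-level function is pickled by reference (module name plus function name). The example below assumes that wrapping the tokenizer in such a function is enough for the parallel workers. To stay runnable without NLTK it uses the regular expression that NLTK documents for WordPunctTokenizer; the names `word_punct_tokenize` and `_WORD_PUNCT` are hypothetical and not part of NLTK or scikit-learn.

```python
import pickle
import re

# Pattern documented by NLTK for WordPunctTokenizer: runs of word
# characters, or runs of characters that are neither word nor whitespace.
_WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Module-level tokenizer (hypothetical helper).

    Because it is a top-level function, pickle stores only a reference
    to it, unlike the bound method WordPunctTokenizer().tokenize.
    """
    return _WORD_PUNCT.findall(text)


# Round-trip through pickle, which is roughly what happens when the
# pipeline is shipped to worker processes.
restored = pickle.loads(pickle.dumps(word_punct_tokenize))
tokens = restored("How about this?")  # ['How', 'about', 'this', '?']
```

Under this assumption, the vectorizer in the question would be built with `tokenizer=word_punct_tokenize` instead of `tokenizer=WordPunctTokenizer().tokenize`. With NLTK installed, the function body could equally call the `tokenize` method of a tokenizer instance created at module level. Placing the function in an importable module, rather than in an interactive session, is the safer choice, since worker processes must be able to look it up by name.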

0 answers:

No answers yet