I'm using Parallel Python (the pp library) to run a fairly large (and embarrassingly parallel) job. My code:
# Parallel fit with pp
def fit_label(label, trainX, trainY, label_index):
    # print 'Fitting', label, 'out of', len(label_index), 'labels'
    from sklearn.linear_model import SGDClassifier
    import numpy as np
    # One-vs-rest: build a binary target for this label and fit a linear SVM
    clf = SGDClassifier(loss='hinge', shuffle=True, alpha=0.000001, verbose=0, n_iter=5)
    temp_y = np.zeros(trainY.shape)
    temp_y[label_index[label]] = 1
    clf.fit(trainX, temp_y)
    return clf
ppservers = ()
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"

# Submit one fitting job per label (only the first 8 labels while testing)
jobs = [(label, job_server.submit(fit_label,
                                  args=(label, trainX, trainY, label_index),
                                  modules=('sklearn.linear_model',)))
        for label in label_index.keys()[0:8]]
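(The collection step isn't shown above: each pp job object is callable and blocks until its result is ready, so I gather the fitted classifiers roughly like this.)

# Each submitted job is a callable; calling it waits for the worker
# and returns its result (here, the fitted classifier).
classifiers = dict((label, job()) for label, job in jobs)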
This runs fine on a small dataset (trainX and trainY with 10,000 rows), but when I run it on my full dataset (4 million rows, roughly 4 GB), I get this error:
/Users/mc/.virtualenvs/kaggle/lib/python2.7/site-packages/pp.pyc in submit(self, func, args, depfuncs, modules, callback, callbackargs, group, globals)
    458
    459         sfunc = self.__dumpsfunc((func, ) + depfuncs, modules)
--> 460         sargs = pickle.dumps(args, self.__pickle_proto)
    461
    462         self.__queue_lock.acquire()

SystemError: error return without exception set
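The failure is inside pp's submit(), at the point where it pickles the job arguments, so I can check whether plain pickle chokes on the arrays by themselves, independent of pp (a minimal check; protocol 2 is Python 2's highest, and I assume it's what pp's self.__pickle_proto resolves to):

import cPickle as pickle

# Try to serialize the job arguments the same way pp does.
# If this raises the same SystemError, the limit is in pickle
# itself, not in pp.
try:
    blob = pickle.dumps((trainX, trainY), 2)
    print "pickled payload:", len(blob), "bytes"
except SystemError as e:
    print "pickle failed:", e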
I think I've hit the pickle bug that can't handle very large objects. Is there anything I can do to work around it? Would upgrading to Python 3 fix it? I also spent hours trying the multiprocessing library instead and never got it working, and I'm fairly sure I'd run into the same pickle problem there (a sketch of what I've been attempting follows).
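For reference, this is roughly the multiprocessing variant I've been trying (only fit_label comes from my code above; the other names are illustrative). It relies on fork() on OS X/Linux so the workers inherit the big arrays as module-level globals, and only the small label argument and the returned classifier pass through pickle:

import multiprocessing as mp

# trainX, trainY, label_index must be module-level globals so the
# forked worker processes inherit them without pickling.
def fit_one(label):
    return label, fit_label(label, trainX, trainY, label_index)

pool = mp.Pool(mp.cpu_count())
results = pool.map(fit_one, label_index.keys()[0:8])  # first 8 labels
pool.close()
pool.join()
classifiers = dict(results)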
For reference, my Python version:

In [5]: os.sys.version
Out[5]: '2.7.5 (default, Aug 25 2013, 00:04:04) \n[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]'