Parallel Python (pp) error: SystemError: error return without exception set

Asked: 2013-12-07 10:36:15

Tags: python parallel-processing pickle parallel-python

I'm using Parallel Python (the pp library) to run a fairly large (and embarrassingly parallel) job. My code:

# Parallel fit with pp
import pp

def fit_label(label, trainX, trainY, label_index):
    # Imports live inside the function so each pp worker has them available
    from sklearn.linear_model import SGDClassifier
    import numpy as np
    clf = SGDClassifier(loss='hinge', shuffle=True, alpha=0.000001, verbose=0, n_iter=5)
    # Build a one-vs-rest target: 1 for rows carrying this label, 0 otherwise
    temp_y = np.zeros(trainY.shape)
    temp_y[label_index[label]] = 1

    clf.fit(trainX, temp_y)
    return clf

ppservers = ()
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"
# Submit one job per label (first 8 labels only); args are pickled on submit
jobs = [(label, job_server.submit(fit_label,
                                  args=(label, trainX, trainY, label_index),
                                  modules=('sklearn.linear_model',)))
        for label in label_index.keys()[0:8]]

This runs fine on a small dataset (i.e. trainX and trainY with 10,000 rows), but when I run it on my full dataset (4 million rows, about 4 GB), I get this error:

/Users/mc/.virtualenvs/kaggle/lib/python2.7/site-packages/pp.pyc in submit(self, func, args, depfuncs, modules, callback, callbackargs, group, globals)
    458 
    459         sfunc = self.__dumpsfunc((func, ) + depfuncs, modules)
--> 460         sargs = pickle.dumps(args, self.__pickle_proto)
    461 
    462         self.__queue_lock.acquire()

SystemError: error return without exception set

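I think I've hit a pickle bug where it can't handle data this large. Is there anything I can do to work around this? I spent several hours trying the multiprocessing library but never got it to work, and I'm fairly sure I would hit the same pickle problem there too. Would upgrading to Python 3 fix this?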

In [5]: os.sys.version
Out[5]: '2.7.5 (default, Aug 25 2013, 00:04:04) \n[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]'
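
One possible workaround, sketched here rather than tested against pp: the traceback shows the failure inside pickle.dumps(args), so the idea is to keep the large arrays out of args entirely. The sketch assumes trainX and trainY are dense NumPy arrays (not confirmed above) and that all workers can read the same local disk. The helper names save_training_data and fit_label_from_disk and the file names are hypothetical, made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np


def save_training_data(trainX, trainY, directory):
    """Write both arrays to .npy files once and return their paths.

    Hypothetical helper: the file names are arbitrary.
    """
    x_path = os.path.join(directory, "trainX.npy")
    y_path = os.path.join(directory, "trainY.npy")
    np.save(x_path, trainX)
    np.save(y_path, trainY)
    return x_path, y_path


def fit_label_from_disk(label, x_path, y_path, label_index):
    """Worker that opens the arrays from disk instead of receiving them pickled.

    Assumption: every worker process can read x_path and y_path.
    """
    import numpy as np  # imported here so a worker process has it available

    # mmap_mode="r" maps the file read-only rather than copying it into memory
    trainX = np.load(x_path, mmap_mode="r")
    trainY = np.load(y_path, mmap_mode="r")

    # Same one-vs-rest target construction as in fit_label
    temp_y = np.zeros(trainY.shape)
    temp_y[label_index[label]] = 1

    # The SGDClassifier fit from the question would go here, e.g.
    # clf.fit(trainX, temp_y). This sketch returns only the target vector
    # so that it runs without scikit-learn installed.
    return temp_y
```

With this shape, the submit call would pass args=(label, x_path, y_path, label_index), so only two short strings and label_index get pickled. If label_index is itself very large, it would presumably need the same treatment.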

0 answers:

No answers yet.