Question

I have a classifier object that is larger than 2 GiB and I want to pickle it, but I get this:

cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)

OverflowError: cannot serialize a string larger than 2 GiB

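I found this question, which describes the same problem and suggests:

using Python 3 with pickle protocol 4 - not acceptable, because I need to use Python 2
using from pyocser import ocdumps, ocloads - not acceptable, because I can't use other (non-trivial) modules
breaking the object into bytes and pickling each fragment

Is there a way to do this with my classifier? That is, turn it into bytes, split them, pickle the pieces, unpickle them, concatenate the bytes, and use the classifier?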

My code:

import time
import cPickle

from sklearn.svm import SVC

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time

After this, I want to unpickle it and use it:

with open(clf_name, 'rb') as fo:
     clf, load_time = cPickle.load(fo), time.time()

Answer 1

If the model is large, you can use sklearn.externals.joblib, which automatically splits the model file into pickled NumPy array files:

from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')

You can load it again later with:

clf = joblib.load('filename.pkl')
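
For completeness, the byte-splitting idea from the question can be sketched with the standard library alone. This is only a rough illustration, not what joblib does internally: the names dump_in_chunks, load_chunks and CHUNK_SIZE are hypothetical, and it assumes you already have the large payload as a byte string (for example, the raw buffer of one big array). Calling cPickle.dumps on the whole classifier first would most likely hit the same OverflowError, so the split has to happen before pickling.

```python
import io
import pickle  # on Python 2, cPickle can be used the same way

# Hypothetical chunk size, tiny here for demonstration only;
# in practice something well below 2 GiB (e.g. 2 ** 30) would be used.
CHUNK_SIZE = 4

def dump_in_chunks(data, fileobj, chunk_size=CHUNK_SIZE):
    """Write a byte string to fileobj as a sequence of small pickles."""
    n_chunks = 0
    for start in range(0, len(data), chunk_size):
        # Each slice is pickled on its own, so no single pickled
        # string exceeds chunk_size bytes.
        pickle.dump(data[start:start + chunk_size], fileobj, protocol=2)
        n_chunks += 1
    return n_chunks

def load_chunks(fileobj):
    """Read pickles until end of file and join them back together."""
    parts = []
    while True:
        try:
            parts.append(pickle.load(fileobj))
        except EOFError:
            # pickle.load raises EOFError once the file is exhausted
            break
    return b"".join(parts)

# Round trip through an in-memory file
payload = b"0123456789"
buf = io.BytesIO()
n_written = dump_in_chunks(payload, buf, chunk_size=4)
buf.seek(0)
restored = load_chunks(buf)
```

With a real file you would open it with "wb" for dump_in_chunks and "rb" for load_chunks. Turning the restored bytes back into a usable classifier is a separate step that this sketch does not cover.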
