I have a classifier object that is larger than 2 GiB, and I want to pickle it, but I get this:

cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
OverflowError: cannot serialize a string larger than 2 GiB
I found this question, which has the same problem and suggests
from pyocser import ocdumps, ocloads
- not acceptable, because I can't use any other (non-trivial) modules. Is there a way to do this with my classifier? I.e., convert it to bytes, split the bytes, pickle, unpickle, join the bytes, and use the classifier? (A rough sketch of what I mean appears after my code below.)
My code:

from sklearn.svm import SVC
import cPickle
import time

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time
After this, I want to unpickle the classifier and use it:
with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()
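In other words, something like this sketch, using only the standard library (chunked_dump and chunked_load are just illustrative names I made up). Though I suspect this only helps if cPickle.dumps itself succeeds; if the OverflowError is raised during serialization, as my traceback suggests, splitting the output bytes afterwards won't avoid it:

import cPickle

CHUNK = 2 ** 30  # write/read 1 GiB pieces, comfortably under the 2 GiB limit

def chunked_dump(obj, fname):
    # serialize to a single byte string, then write it piece by piece
    data = cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)
    with open(fname, 'wb') as fo:
        for i in xrange(0, len(data), CHUNK):
            fo.write(data[i:i + CHUNK])

def chunked_load(fname):
    # read the pieces back, join them, and unpickle the result
    parts = []
    with open(fname, 'rb') as fo:
        while True:
            piece = fo.read(CHUNK)
            if not piece:
                break
            parts.append(piece)
    return cPickle.loads(''.join(parts))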
Answer 0 (score: 2)
If the model is large, you can use sklearn.externals.joblib, which automatically splits the model into separate files of pickled numpy arrays:
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')
It can later be loaded back with:
clf = joblib.load('filename.pkl')
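For completeness, a sketch using joblib's optional compress argument (an integer from 0 to 9); with compression everything is stored in one file rather than a .pkl plus sidecar .npy files, though the exact file layout depends on your joblib version:

from sklearn.externals import joblib

# with compress=0 (the default), joblib writes filename.pkl plus
# companion .npy files for the large arrays; keep them in one directory.
# with compress > 0, everything goes into a single compressed file.
joblib.dump(clf, 'filename.pkl', compress=3)

# reload in a later session
clf = joblib.load('filename.pkl')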