What are the different use cases of joblib versus pickle?

Time: 2012-09-27 06:39:46

Tags: python pickle scikit-learn

Background: I'm just getting started with scikit-learn, and at the bottom of the page I read this about joblib versus pickle:

  

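it may be more interesting to use joblib's replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string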

I read this Q&A on pickle, Common use-cases for pickle in Python, and wonder whether the community here can share the differences between joblib and pickle. When should one be used over the other?

4 answers:

Answer 0 (score: 42)

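joblib is usually significantly faster on large numpy arrays because it has special handling for the array buffers of the numpy data structure. For the implementation details, have a look at the source code. It can also compress that data on the fly while pickling, using zlib or lz4.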
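As a minimal sketch of the compression option mentioned above: the array contents, file names, and compression level below are assumptions chosen for illustration. Only zlib is shown because it needs no extra dependency, whereas lz4 requires the optional lz4 package to be installed.

```python
import os
import tempfile

import joblib
import numpy as np

# Highly compressible example data (all zeros); real data will compress less.
arr = np.zeros((1000, 1000), dtype=np.float64)

tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "arr_raw.joblib")    # hypothetical file name
zlib_path = os.path.join(tmp_dir, "arr_zlib.joblib")  # hypothetical file name

# Uncompressed dump.
joblib.dump(arr, raw_path)

# Compressed dump: the (method, level) tuple selects zlib at level 3.
joblib.dump(arr, zlib_path, compress=("zlib", 3))

restored = joblib.load(zlib_path)

raw_size = os.path.getsize(raw_path)
zlib_size = os.path.getsize(zlib_path)
print("uncompressed bytes:", raw_size)
print("zlib-compressed bytes:", zlib_size)
```

Compression trades CPU time for disk space, so whether it helps depends on the data and the storage involved.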

joblib can also memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which makes it possible to share memory between processes.

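Note that if you are not pickling large numpy arrays, regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the standard library's pickle module is implemented in C while joblib is pure Python.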
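One way to check this claim on your own machine is a rough timing sketch like the one below. The dict size, file names, and the `timed` helper are assumptions made for illustration, and the measured numbers will vary with hardware, Python version, and joblib version.

```python
import os
import pickle
import tempfile
import time

import joblib

# Hypothetical workload: many small str objects and no numpy arrays.
data = {"key_%d" % i: "value_%d" % i for i in range(50_000)}

tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "data.pkl")
joblib_path = os.path.join(tmp_dir, "data.joblib")


def timed(func):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def dump_with_pickle():
    with open(pickle_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_with_pickle():
    with open(pickle_path, "rb") as f:
        return pickle.load(f)


_, pickle_dump_s = timed(dump_with_pickle)
_, joblib_dump_s = timed(lambda: joblib.dump(data, joblib_path))
from_pickle, pickle_load_s = timed(load_with_pickle)
from_joblib, joblib_load_s = timed(lambda: joblib.load(joblib_path))

print("pickle dump/load seconds:", pickle_dump_s, pickle_load_s)
print("joblib dump/load seconds:", joblib_dump_s, joblib_load_s)
```

A single run like this is noisy; repeating each measurement several times gives a more trustworthy comparison.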

Note that with PEP 574 (pickle protocol 5), which landed in Python 3.8, pickling large numpy arrays with the standard library is now much more efficient.
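A minimal sketch of what protocol 5 allows, assuming Python 3.8+ and a numpy version that supports protocol 5 (1.16 or later). The array below is made up for illustration. With `buffer_callback`, the large data buffer is handed over out-of-band instead of being copied into the pickle stream.

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# Collect out-of-band buffers instead of embedding them in the pickle stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The stream itself now holds only metadata; the array data lives in `buffers`.
print("pickle stream bytes:", len(payload))
print("out-of-band buffers:", len(buffers))

# The same buffers must be supplied again when loading.
restored = pickle.loads(payload, buffers=buffers)
```

How the buffers are transported (shared memory, a socket, a file) is left to the caller, which is what makes this useful for passing large arrays between processes.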

joblib might still be useful for loading objects that contain nested numpy arrays in memory-mapped mode with mmap_mode="r".
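The memory-mapped loading described in this answer can be sketched as follows. The container object, its keys, and the file name are hypothetical. The dump must be uncompressed for memory mapping to apply.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical object with numpy arrays nested inside plain Python containers.
model_like = {
    "name": "demo",
    "weights": np.arange(10_000, dtype=np.float64),
    "nested": {"bias": np.ones(100)},
}

path = os.path.join(tempfile.mkdtemp(), "model_like.joblib")
joblib.dump(model_like, path)  # no compress argument, so the dump is uncompressed

# mmap_mode="r" maps the array buffers read-only instead of copying them into memory.
loaded = joblib.load(path, mmap_mode="r")

print(type(loaded["weights"]))
print(loaded["name"], float(loaded["weights"][:5].sum()))
```

Several processes that load the same file this way read the array data from the same mapped file rather than each holding a private copy.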

Answer 1 (score: 10)

Thanks to Gunjan for giving us this script! I modified it to get Python 3 results:

# Compare pickle loaders (Python 3)
import os
import pickle
import _pickle as cPickle  # C accelerator; Python 3's pickle already uses it when available
from time import time

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')
# Note: os.path.getsize returns bytes, so the "KB" label in the output is a misnomer.
size = os.path.getsize(file)

# Caveat: the first load also pays any cold disk-cache cost, which can skew the comparison.
t1 = time()
with open(file, "rb") as f:
    d = pickle.load(f)
print("time for loading file size with pickle", size, "KB =>", time() - t1)

t1 = time()
with open(file, "rb") as f:
    cPickle.load(f)
print("time for loading file size with cpickle", size, "KB =>", time() - t1)

t1 = time()
joblib.load(file)
print("time for loading file size joblib", size, "KB =>", time() - t1)

time for loading file size with pickle 79708 KB => 0.16768312454223633
time for loading file size with cpickle 79708 KB => 0.0002372264862060547
time for loading file size joblib 79708 KB => 0.0006849765777587891

Answer 2 (score: 5)

I came across the same question, so I tried this (with Python 2.7), as I needed to load a large pickle file:

# Compare pickle loaders (Python 2.7)
from time import time
import pickle
import os
try:
    import cPickle
except ImportError:
    print "Cannot import cPickle"
import joblib

# Note: os.path.getsize returns bytes, so the "KB" label in the output is a misnomer.
size = os.path.getsize("classi.pickle")

t1 = time()
with open("classi.pickle", "rb") as f:  # binary mode is the safe choice for pickle files
    d = pickle.load(f)
print "time for loading file size with pickle", size, "KB =>", time() - t1

t1 = time()
with open("classi.pickle", "rb") as f:
    cPickle.load(f)
print "time for loading file size with cpickle", size, "KB =>", time() - t1

t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", size, "KB =>", time() - t1

The output is:

time for loading file size with pickle 1154320653 KB => 6.75876188278
time for loading file size with cpickle 1154320653 KB => 52.6876490116
time for loading file size joblib 1154320653 KB => 6.27503800392

According to this, joblib performs better than the cPickle and pickle modules among these 3 modules. Thanks

Answer 3 (score: 0)

Just a humble note... pickle is better suited to fitted scikit-learn estimators / trained models. In ML applications, trained models are saved and loaded back mainly for prediction.
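The save-then-predict workflow this answer refers to might look like the sketch below. The toy data and the choice of LogisticRegression are assumptions made for illustration, not something the answer specifies.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for this example.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)

# Serialize the fitted estimator to bytes, then restore it.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model is used only for prediction.
print(restored.predict(X))
```

Pickled models should only be loaded from trusted sources, since unpickling can execute arbitrary code, and a pickle is generally only reliable with the same scikit-learn version that produced it.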