I'm having trouble pickling a vectorizer after customizing it.
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
tfidf_vectorizer = TfidfVectorizer(analyzer=str.split)
pickle.dump(tfidf_vectorizer, open('test.pkl', "wb"))
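This results in "TypeError: can't pickle method_descriptor objects".
However, if I don't customize the analyzer, it pickles fine. Any ideas on how to work around this? I need to persist the vectorizer if I'm going to use it more widely.
By the way, I've found that using a simple string-split analyzer, and pre-processing the corpus to remove out-of-vocabulary words and stop words, is essential for decent run speed. Otherwise, most of the vectorizer's run time is spent in "text.py:114(_word_ngrams)". The same goes for HashingVectorizer.
This is related to Persisting data in sklearn and http://scikit-learn.org/0.10/tutorial.html#model-persistence (by the way, sklearn.externals.joblib.dump doesn't help either).
Thanks!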
Answer 0 (score: 3)
This is not so much a scikit-learn problem as a general Python problem:
>>> pickle.dumps(str.split)
Traceback (most recent call last):
  File "<ipython-input-7-7d3648c78b22>", line 1, in <module>
    pickle.dumps(str.split)
  File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle method_descriptor objects
The solution is to use a picklable analyzer:
>>> def split(s):
...     return s.split()
...
>>> pickle.dumps(split)
'c__main__\nsplit\np0\n.'
>>> tfidf_vectorizer = TfidfVectorizer(analyzer=split)
>>> type(pickle.dumps(tfidf_vectorizer))
<type 'str'>
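To check a candidate analyzer before handing it to the vectorizer, a small helper along these lines may be useful. This is only a sketch using the standard library: `is_picklable` and the `candidates` dict are hypothetical names introduced here, not part of scikit-learn. The traceback above is from Python 2.7; on newer Python versions the result for `str.split` may differ, which is why the helper reports what happens instead of assuming it.

```python
import pickle


def split(s):
    # Module-level function: pickle stores it by reference (module name plus
    # function name), so it must be importable again when unpickling.
    return s.split()


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Hypothetical candidates for the `analyzer` argument.
candidates = {
    "str.split": str.split,
    "split": split,
    "lambda": lambda s: s.split(),
}
for name, func in candidates.items():
    print(name, is_picklable(func))
```

Because the function is pickled by reference, the process that loads the vectorizer also needs `split` defined under the same module path. Putting it in a small importable module is usually safer than defining it in an interactive session.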