How do I pickle a customized vectorizer?

Time: 2014-02-12 01:55:13

Tags: python scikit-learn pickle

I am having trouble pickling a vectorizer after customizing it.

from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
tfidf_vectorizer = TfidfVectorizer(analyzer=str.split)
pickle.dump(tfidf_vectorizer, open('test.pkl', "wb"))

This results in "TypeError: can't pickle method_descriptor objects".

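However, if I don't customize the analyzer, it pickles fine. Any ideas on how to solve this? I need to persist the vectorizer if I am going to use it more widely.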

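By the way, I have found that using a simple string-split analyzer, and preprocessing the corpus to remove out-of-vocabulary words and stop words, is essential for decent run speed. Otherwise, most of the vectorizer's run time is spent in "text.py:114(_word_ngrams)". The same goes for HashingVectorizer.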

This is related to "Persisting data in sklearn" and http://scikit-learn.org/0.10/tutorial.html#model-persistence (by the way, sklearn.externals.joblib.dump did not help either).

Thanks!

1 Answer:

Answer 0: (score: 3)

This is not so much a scikit-learn problem as a general Python problem:

>>> pickle.dumps(str.split)
Traceback (most recent call last):
  File "<ipython-input-7-7d3648c78b22>", line 1, in <module>
    pickle.dumps(str.split)
  File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle method_descriptor objects

The solution is to use a picklable analyzer:

>>> def split(s):
...     return s.split()
... 
>>> pickle.dumps(split)
'c__main__\nsplit\np0\n.'
>>> tfidf_vectorizer = TfidfVectorizer(analyzer=split)
>>> type(pickle.dumps(tfidf_vectorizer))
<type 'str'>
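
The reason the plain function works is that pickle stores a module-level function by reference (its module and name), not by value. The following is a minimal Python 3 sketch of that behaviour using only the standard library. The names whitespace_analyzer and anonymous_analyzer are hypothetical, chosen for illustration, and the lambda is included only as a contrasting callable that has no importable name.

```python
import pickle


def whitespace_analyzer(text):
    """Hypothetical analyzer: split a document on runs of whitespace."""
    return text.split()


# A module-level function is pickled by reference (module name + function
# name), so it survives a dumps/loads round trip.
payload = pickle.dumps(whitespace_analyzer)
restored = pickle.loads(payload)
tokens = restored("the quick  brown fox")

# A lambda has no importable name, so pickle cannot store a reference to it.
anonymous_analyzer = lambda text: text.split()
try:
    pickle.dumps(anonymous_analyzer)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```

A function like this can be passed as TfidfVectorizer(analyzer=whitespace_analyzer), as in the session above. Because only the reference is stored, the module that defines the analyzer must be importable under the same name wherever the pickle is loaded. If the vectorizer is to be reused from other scripts, it is probably safer to define the analyzer in a shared module than in __main__.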