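I have trained an instance of scikit-learn's TfidfVectorizer and I want to persist it to disk. I save the IDF vector (the idf_ attribute) to disk as a NumPy array, and the vocabulary (vocabulary_) as a JSON object (I am avoiding pickle for security and other reasons). This is what I am trying: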
import json
from idf import idf # numpy array with the pre-computed IDFs
from sklearn.feature_extraction.text import TfidfVectorizer
# dirty trick so I can plug my pre-computed IDFs
# necessary because "vectorizer.idf_ = idf" doesn't work,
# it returns "AttributeError: can't set attribute."
class MyVectorizer(TfidfVectorizer):
    TfidfVectorizer.idf_ = idf
# instantiate vectorizer
vectorizer = MyVectorizer(lowercase=False,
                          min_df=2,
                          norm='l2',
                          smooth_idf=True)
# plug vocabulary
vocabulary = json.load(open('vocabulary.json', mode = 'rb'))
vectorizer.vocabulary_ = vocabulary
# test it
vectorizer.transform(['foo bar'])
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1314, in transform
    return self._tfidf.transform(X, copy=False)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1014, in transform
    check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: idf vector is not fitted
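So, what am I doing wrong? I am failing to fool the vectorizer object: somehow it knows that I am cheating (i.e., passing it pre-computed data instead of training it on actual text). I inspected the attributes of the vectorizer object, but I could not find anything like 'istrained', 'isfitted', etc. So, how do I fool the vectorizer?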
Answer 0 (score: 1)
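OK, I think I figured it out: the vectorizer instance has an attribute _tfidf, which in turn must have an attribute _idf_diag. The transform method calls the check_is_fitted function to check whether _idf_diag exists. (I had missed it because it is an attribute of an attribute.) So I checked the TfidfVectorizer source code to see how _idf_diag is created, and then added it to the _tfidf attribute: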
import scipy.sparse as sp
# ... code ...
vectorizer._tfidf._idf_diag = sp.spdiags(idf,
                                         diags=0,
                                         m=len(idf),
                                         n=len(idf))
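For context on what that assignment builds, here is a minimal sketch of the diagonal matrix that sp.spdiags produces, and of why multiplying a term-count matrix by it applies the IDF weights column by column. The tf matrix and the idf values are made-up toy numbers, not data from the question. That scikit-learn uses the matrix this way internally is my reading of the source for the version in the traceback, and it may differ in other releases.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical IDF weights for a three-term vocabulary (illustrative only).
idf = np.array([1.0, 2.0, 3.0])

# Same construction as in the answer: a sparse matrix with idf on the main diagonal.
idf_diag = sp.spdiags(idf, diags=0, m=len(idf), n=len(idf), format="csr")

# Hypothetical term counts: 2 documents x 3 terms.
tf = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                             [0.0, 1.0, 1.0]]))

# Right-multiplying by the diagonal matrix scales column j by idf[j].
weighted = (tf @ idf_diag).toarray()
print(weighted)
# [[1. 0. 6.]
#  [0. 2. 3.]]
```

This is also why the length of idf has to match the vocabulary size: the diagonal matrix needs one entry per column of the count matrix.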
With _idf_diag set on vectorizer._tfidf, the transform call now works.
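On newer scikit-learn releases, reaching into private attributes such as _tfidf._idf_diag may not be necessary, and those private names can change between versions. As far as I know, recent versions expose a setter for idf_ and accept a fixed vocabulary through the constructor's vocabulary parameter. The sketch below assumes such a version. The toy corpus, the temporary file paths, and the variable names are made up for illustration.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely illustrative.
docs = ["foo bar", "foo baz", "bar baz qux"]
params = dict(lowercase=False, norm="l2", smooth_idf=True)

fitted = TfidfVectorizer(**params).fit(docs)

# Persist without pickle: IDF weights as .npy, vocabulary as JSON.
workdir = tempfile.mkdtemp()
idf_path = os.path.join(workdir, "idf.npy")
vocab_path = os.path.join(workdir, "vocabulary.json")
np.save(idf_path, fitted.idf_, allow_pickle=False)
with open(vocab_path, "w") as f:
    # Cast indices to plain int, since NumPy integers are not JSON-serializable.
    json.dump({term: int(i) for term, i in fitted.vocabulary_.items()}, f)

# Restore: fixed vocabulary via the constructor, IDF weights via the setter.
with open(vocab_path) as f:
    vocabulary = json.load(f)
restored = TfidfVectorizer(vocabulary=vocabulary, **params)
# Assumption: this scikit-learn version provides a public idf_ setter.
restored.idf_ = np.load(idf_path, allow_pickle=False)

# The restored vectorizer should reproduce the fitted one's output.
same = np.allclose(restored.transform(docs).toarray(),
                   fitted.transform(docs).toarray())
print(same)
```

Note that min_df is left out here: with a fixed vocabulary no terms are pruned at transform time, so it would have no effect on the restored vectorizer.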