I'm doing some document classification work using sklearn's HashingVectorizer followed by a tfidf transform. With the Tfidf parameters left at their defaults I have no problem, but if I set sublinear_tf=True, the following error is raised:
ValueError Traceback (most recent call last)
<ipython-input-16-137f187e99d8> in <module>()
----> 5 tfidf.transform(test)
D:\Users\DB\Anaconda\lib\site-packages\sklearn\feature_extraction\text.pyc in transform(self, X, copy)
1020
1021 if self.norm:
-> 1022 X = normalize(X, norm=self.norm, copy=False)
1023
1024 return X
D:\Users\DB\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in normalize(X, norm, axis, copy)
533 raise ValueError("'%d' is not a supported axis" % axis)
534
--> 535 X = check_arrays(X, sparse_format=sparse_format, copy=copy)[0]
536 warn_if_not_float(X, 'The normalize function')
537 if axis == 0:
D:\Users\DB\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
272 if not allow_nans:
273 if hasattr(array, 'data'):
--> 274 _assert_all_finite(array.data)
275 else:
276 _assert_all_finite(array.values())
D:\Users\DB\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X)
41 and not np.isfinite(X).all()):
42 raise ValueError("Input contains NaN, infinity"
---> 43 " or a value too large for %r." % X.dtype)
44
45
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I found a minimal sample of text that triggers the error and tried some diagnostics:
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

hv_stops = HashingVectorizer(ngram_range=(1,2), preprocessor=neg_preprocess, stop_words='english')
tfidf = TfidfTransformer(sublinear_tf=True).fit(hv_stops.transform(X))
test = hv_stops.transform(X[4:6])
print np.any(np.isnan(test.todense()))    # False
print np.any(np.isinf(test.todense()))    # False
print np.all(np.isfinite(test.todense())) # True
tfidf.transform(test)  # Raises the ValueError
Any ideas what is causing the error? Let me know if more information is needed. Thanks in advance!
Edit:
This single text item triggers the error for me:
hv_stops = HashingVectorizer(ngram_range=(1,3), stop_words='english', non_negative=True)
item = u'b number b number b number conclusion no product_neg was_neg returned_neg for_neg evaluation_neg review of the medd history records did not find_neg any_neg deviations_neg or_neg anomalies_neg it is not suspected_neg that_neg the_neg product_neg failed_neg to_neg meet_neg specifications_neg the investigation could not verify_neg or_neg identify_neg any_neg evidence_neg of_neg a_neg medd_neg deficiency_neg causing_neg or_neg contributing_neg to_neg the_neg reported_neg problem_neg based on the investigation the need for corrective action is not indicated_neg should additional information be received that changes this conclusion an amended medd report will be filed zimmer considers the investigation closed this mdr is being submitted late as this issue was identified during a retrospective review of complaint files '
li = [item]
fail = hv_stops.transform(li)
TfidfTransformer(sublinear_tf=True).fit_transform(fail)
Answer 0 (score: 4)
I found the cause. TfidfTransformer assumes that the sparse matrix it receives is canonical, i.e. that its data
member contains no explicitly stored zeros. HashingVectorizer, however,
produces a sparse matrix that does contain stored zeros. The log transform then yields -inf
for those entries, which in turn makes the normalization fail because the matrix has an infinite norm.
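To see the mechanism concretely, here is a minimal sketch (mine, not part of the original answer): it builds a one-row CSR matrix with an explicitly stored zero and applies roughly the log transform that sublinear_tf applies to the stored values.

import numpy as np
import scipy.sparse as sp

# A one-row CSR matrix with an explicitly stored zero in its data array
# (i.e. not in canonical form).
m = sp.csr_matrix((np.array([0.0, 2.0]),   # data: note the stored 0.0
                   np.array([0, 2]),       # column indices
                   np.array([0, 2])),      # indptr: one row, two stored entries
                  shape=(1, 3))
print m.toarray()         # [[ 0.  0.  2.]] -- looks perfectly finite as a dense array
print np.log(m.data) + 1  # [-inf  1.693...] -- the stored zero becomes -inf

This also explains why the isnan/isinf checks in the question pass: todense() hides the difference between a stored zero and an absent entry, so the input looks finite even though the transform will hit log(0).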
This is a bug in scikit-learn; I filed a report, but I'm not yet sure what the fix will be.
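Until the bug is fixed, one possible workaround that follows from the explanation above (again, not part of the original answer) is to strip the explicitly stored zeros from the hashed matrix before fitting the transformer; scipy.sparse matrices provide eliminate_zeros() for exactly this:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

hv_stops = HashingVectorizer(ngram_range=(1, 3), stop_words='english',
                             non_negative=True)
fail = hv_stops.transform(li)   # li is the single-item list from the question above
fail.eliminate_zeros()          # drop explicitly stored zeros in place
out = TfidfTransformer(sublinear_tf=True).fit_transform(fail)  # should no longer raise

This only removes the stored zeros; it does not change how HashingVectorizer handles hash collisions, so it is a stopgap rather than a real fix.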