I'm new to scikit-learn. I'm trying to run TF-IDF vectorization on a 1 × M numpy.array, tot_data (in the code below), consisting of English sentences. Here `word` is a numpy.array (1 × 173) holding a list of stop words, and I need to set the `stop_words` parameter explicitly. If I omit `stop_words`, the code runs fine, but the line below raises an error:
word = numpy.array(['a','about',...])
>>> vectorizer = TfidfVectorizer(max_df=.95,stop_words=word).fit(tot_data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1203, in fit
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 710, in _count_vocab
analyze = self.build_analyzer()
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 225, in build_analyzer
stop_words = self.get_stop_words()
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 208, in get_stop_words
return _check_stop_list(self.stop_words)
File "/usr/local/python2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 85, in _check_stop_list
if stop == "english":
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Answer 0 (score: 3)
Cause: the error occurs because a numpy array broadcasts the comparison to its elements:
>>> word == 'english'
array([False, False, False], dtype=bool)
and the `if` statement cannot convert the resulting array into a single boolean:
>>> if word == 'english': pass
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
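As the error message hints, `.any()` and `.all()` are the way to reduce such a boolean array to one truth value when an aggregate test is actually what you want (a minimal sketch, not part of the original answer):

```python
import numpy as np

word = np.array(['one', 'two', 'three'])

# Element-wise comparison yields a boolean array, not a single bool
mask = (word == 'english')
print(mask)        # → [False False False]

# .any()/.all() reduce the array to a single truth value,
# which is what an `if` statement can handle
print(mask.any())  # → False
print(mask.all())  # → False
```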
Solution: convert the array into a plain list: words = list(words).
Demo:
>>> import numpy as np
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> word = np.array(['one','two','three'])
>>> tot_data = np.array(['one two three', 'who do I see', 'I see two girls'])
>>> v = TfidfVectorizer(max_df=.95,stop_words=list(word))
>>> v.fit(tot_data)
TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
...
tokenizer=None, use_idf=True, vocabulary=None)
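To confirm the fix worked, you can inspect the fitted vocabulary — the stop words should be gone (a quick sketch of the same demo; `vocabulary_` is a standard attribute of fitted scikit-learn vectorizers):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

word = np.array(['one', 'two', 'three'])
tot_data = np.array(['one two three', 'who do I see', 'I see two girls'])

# Passing list(word) instead of the raw array avoids the
# ambiguous-truth-value error inside _check_stop_list
v = TfidfVectorizer(max_df=.95, stop_words=list(word)).fit(tot_data)

# None of the stop words appear in the learned vocabulary
# ('I' is also dropped by the default token pattern, which
# only keeps tokens of two or more characters)
print(sorted(v.vocabulary_))  # → ['do', 'girls', 'see', 'who']
```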