Question

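I am running into a problem with sklearn's CountVectorizer when a document consists of a single word, 'one'. I have worked out that the error occurs when a document contains only words with the POS tag CD (cardinal number). The following documents all raise the empty vocabulary error: ['one', 'two'] and ['hundred']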

from sklearn.feature_extraction.text import CountVectorizer

ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])

This raises: ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does not raise an error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']

Interestingly, in this case only 'people' is added to the vocabulary; 'one' and 'two' are not:

cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}

As another example of a single-word document, ['hello'] works fine because it is not a cardinal number:

cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}

Since words like 'one' and 'two' are not stop words, I would expect CountVectorizer to process them. How can I handle these words?

Also: I get the same error with the word 'system'. Why does this word fail?

cv_array = cv.fit_transform(['system'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

Answer 1

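You get an empty vocabulary because these words are in the English stop word list that sklearn uses, so they are removed before the vocabulary is built. This has nothing to do with POS tags. You can inspect the list or test membership directly: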

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

>>> 'one' in ENGLISH_STOP_WORDS 
True

>>> 'two' in ENGLISH_STOP_WORDS 
True

>>> 'system' in ENGLISH_STOP_WORDS 
True
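To see in advance which tokens of a document would be filtered out, you can compare them against the same list. This is only a rough sketch: the helper name `dropped_tokens` is made up for illustration, and the regex simply mirrors the `token_pattern` from the question rather than calling CountVectorizer's own tokenizer.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def dropped_tokens(doc, token_pattern=r"[\w']+"):
    """Return the tokens of doc that appear in sklearn's English stop word list.

    Hypothetical helper: lowercases and tokenizes with the regex from the
    question, then keeps only the tokens that the stop word filter would remove.
    """
    tokens = re.findall(token_pattern, doc.lower())
    return [t for t in tokens if t in ENGLISH_STOP_WORDS]


print(dropped_tokens("one two people"))  # tokens that would be filtered out
print(dropped_tokens("hello"))           # empty list: nothing is filtered
```

If every token of every document ends up in the returned list, fitting with stop_words='english' leaves nothing for the vocabulary, which is the situation the ValueError describes.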

If you want these words to be processed, just initialize your CountVectorizer like this:

cv = CountVectorizer(stop_words=None, ...)
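Spelled out with the parameters from the question, that might look like the sketch below. The second variant is an assumption about what you may want and not part of the fix above: if you would still like the rest of the English list filtered, you can pass your own list with the words you care about removed. The `keep` set and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Option 1: disable stop word filtering entirely.
cv_all = CountVectorizer(stop_words=None, analyzer='word', lowercase=True,
                         token_pattern=r"[\w']+", ngram_range=(1, 1))
cv_all.fit(['one', 'two'])
vocab_all = cv_all.vocabulary_

# Option 2 (assumption: you still want most English stop words removed).
# Build a custom list without the words you want to keep; `keep` is illustrative.
keep = {'one', 'two', 'system'}
custom_stop_words = sorted(ENGLISH_STOP_WORDS - keep)

cv_custom = CountVectorizer(stop_words=custom_stop_words, analyzer='word',
                            lowercase=True, token_pattern=r"[\w']+",
                            ngram_range=(1, 1))
cv_custom.fit(['one', 'two', 'the system'])
vocab_custom = cv_custom.vocabulary_

print(sorted(vocab_all))
print(sorted(vocab_custom))
```

With the first option every token is kept, including words like 'the'. The second keeps 'one', 'two' and 'system' while still dropping the remaining entries of the built-in list.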
