Question

I have a list named dictionary1. I use the following code to get a sparse count matrix of the text:

import sklearn.feature_extraction.text

cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None)
cv1.fit_transform(dictionary1)

However, I noticed that

list(set(dictionary1)-set(cv1.get_feature_names()))

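yields ['i']. So "i" is in my dictionary, but CountVectorizer ignores it (presumably some default setting discards one-character words). I could not find such an option in the documentation. Can someone point out what the problem is? I really do want to keep "i" in the analysis, because it may indicate more personal language.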

Answer 1

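One workaround that works is to pass the dictionary directly as the vocabulary (actually, I don't know why I didn't do that in the first place). That is: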

cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=[], vocabulary=dictionary1)
cv1._validate_vocabulary()

list(set(dictionary1)-set(cv1.get_feature_names())) then returns [].

In my original post, I should have mentioned that dictionary1 is already a list of unique tokens.
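One caveat with this workaround: a fixed vocabulary only decides which columns exist, while the text is still split by the default tokenizer. The sketch below, which uses hypothetical stand-in data (the three-word dictionary1 and the two docs strings are assumptions, not the asker's data), suggests that the column for "i" is present but stays at zero unless the tokenization is changed as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the asker's data.
dictionary1 = ["i", "love", "cats"]
docs = ["i love cats", "i love i"]

# With an explicit vocabulary, column order follows the list order:
# column 0 -> "i", column 1 -> "love", column 2 -> "cats".
cv = CountVectorizer(vocabulary=dictionary1)
counts = cv.transform(docs).toarray()

print(counts)
# The "i" column exists, but the default token pattern never emits
# one-character tokens, so nothing is ever counted in it.
print(counts[:, cv.vocabulary_["i"]])
```

So the set difference in the answer comes back empty because the feature name is registered, not because "i" is actually being counted.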

Answer 2

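The default configuration tokenizes strings by extracting words of at least 2 letters (more precisely, the default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of 2 or more alphanumeric characters).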
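To see the effect in isolation, the default pattern can be applied directly with Python's re module. This is only an illustration of the regular expression, not of the full CountVectorizer pipeline (which also lowercases the text first by default); the sample sentence is made up.

```python
import re

# Default value of CountVectorizer's token_pattern parameter.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# The text is already lowercase here, mimicking the vectorizer's
# default lowercasing step.
tokens = re.findall(DEFAULT_TOKEN_PATTERN, "i received a cat")
print(tokens)  # ['received', 'cat']: "i" and "a" are dropped
```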

See this link for more details on the sklearn vectorizers.

In this case, you should use a different tokenizer rather than a different analyzer. For example, you can use TweetTokenizer from the nltk library:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer

corpus = [...some_texts...]

tk = TweetTokenizer()
vectorizer = CountVectorizer(tokenizer=tk.tokenize)
x = vectorizer.fit_transform(corpus)

For example, if corpus is defined as follows, you will get:

corpus = ['I love ragdolls',
          'I received a cat',
          'I take it as my best friend']

vectorizer.get_feature_names()

> ['a', 'as', 'best', 'cat', 'friend', 'i', 'it', 'love', 'my', 'ragdolls', 'received', 'take']
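If pulling in nltk is not desirable, a similar result can probably be reached by loosening the token_pattern parameter so that one-character tokens are kept. This is a minimal sketch of that alternative, not part of the original answer; it reads the learned terms from the vocabulary_ attribute, since recent scikit-learn releases replaced get_feature_names() with get_feature_names_out().

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love ragdolls',
          'I received a cat',
          'I take it as my best friend']

# \w+ instead of the default \w\w+ keeps single-character tokens.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
x = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sorting the keys
# gives the terms in column order.
features = sorted(vectorizer.vocabulary_)
print(features)
```

Unlike TweetTokenizer, this pattern only matches runs of word characters, so punctuation and emoticons are not kept as tokens.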

Python: CountVectorizer ignores the one-letter word "I"
