I have the following code:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
Now I am trying to compute the word frequencies like this:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
Next I want to print the vocabulary, so I do this:
vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary)
The output I get is None, although I was expecting something like:
{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
Any idea what is going wrong?
Answer 0: (score: 4)
CountVectorizer does not support what you are looking for.
You can use the Counter class:
from collections import Counter
train_set = ("The sky is blue.", "The sun is bright.")
word_counter = Counter()
for s in train_set:
word_counter.update(s.split())
print(word_counter)
which gives
Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1})
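Note that the counts above keep capitalization and trailing punctuation ('The', 'blue.'), because s.split() only splits on whitespace. A minimal sketch of one way to normalize the tokens first, assuming you want lowercase words with punctuation stripped; the tokenize helper and its regular expression are illustrative choices of mine, not part of the answer above:

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")

def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep runs of
    # letters, digits and apostrophes, which drops the trailing periods.
    return re.findall(r"[a-z0-9']+", text.lower())

word_counter = Counter()
for s in train_set:
    word_counter.update(tokenize(s))

print(word_counter)
```

With this change 'The' and 'the' are counted together, and 'blue.' becomes 'blue'. Whether that is desirable depends on the task, so treat the pattern as a starting point.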
Alternatively, you can use FreqDist from nltk:
from nltk import FreqDist
train_set = ("The sky is blue.", "The sun is bright.")
word_dist = FreqDist()
for s in train_set:
word_dist.update(s.split())
print(dict(word_dist))
which gives
{'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
Answer 1: (score: 3)
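I think you can try this (note the trailing underscore: the vocabulary learned during fitting is stored in vocabulary_, not vocabulary):
print(vectorizer.vocabulary_)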
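To make the difference between the two attributes concrete, here is a small sketch, assuming scikit-learn and NumPy are installed. The variable names X, totals and word_counts are my own, and the per-term totals at the end are an extra step that is not in the original answers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_set)  # sparse document-term matrix

# Constructor parameter: stays None unless you pass a vocabulary yourself.
print(vectorizer.vocabulary)

# Fitted attribute: maps each term to its column index in X.
print(vectorizer.vocabulary_)

# Total count per term: sum each column over all documents.
totals = np.asarray(X.sum(axis=0)).ravel()
word_counts = {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}
print(word_counts)
```

With default settings CountVectorizer lowercases the text and keeps common words such as 'the' and 'is', so the printed vocabulary has six entries rather than the four in the question. Passing stop_words='english' to the constructor should remove those common words if that is what you want.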