Creating a vocabulary dictionary for text mining

Time: 2017-01-17 13:59:20

Tags: python nlp text-mining

I have the following code:

    train_set = ("The sky is blue.", "The sun is bright.")
    test_set = ("The sun in the sky is bright.",
        "We can see the shining sun, the bright sun.")

Now I am trying to compute the word frequencies like this:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()

Next I want to print the vocabulary, so I do this:

    vectorizer.fit_transform(train_set)
    print(vectorizer.vocabulary)

The output I get is `None`, although I was expecting something like:

    {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

Any idea what is going wrong?

2 Answers:

Answer 0 (score: 4)

CountVectorizer doesn't support what you are looking for.

You can use the Counter class:

    from collections import Counter

    train_set = ("The sky is blue.", "The sun is bright.")
    word_counter = Counter()
    for s in train_set:
        word_counter.update(s.split())

    print(word_counter)

which gives

    Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1})

Alternatively, you can use FreqDist from nltk:

    from nltk import FreqDist

    train_set = ("The sky is blue.", "The sun is bright.")
    word_dist = FreqDist()
    for s in train_set:
        word_dist.update(s.split())

    print(dict(word_dist))

which gives

    {'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
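
Both snippets above split on whitespace only, so punctuation stays attached to the tokens (`'blue.'`) and `The` keeps its capital letter. If cleaner keys are wanted, one possible approach is to normalize each sentence before counting. The sketch below is only an illustration: the `tokenize` helper and its regular expression are assumptions, not part of the original answer, and real text may well need a more careful tokenizer.

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")


def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep runs of
    # ASCII letters, digits and apostrophes. Punctuation is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


word_counter = Counter()
for sentence in train_set:
    # Count the normalized tokens of every sentence
    word_counter.update(tokenize(sentence))

print(word_counter.most_common())
```

With this normalization, `The` and `the` would be counted as the same word, and `blue.` becomes `blue`.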

Answer 1 (score: 3)

I think you can try this:

    print(vectorizer.vocabulary_)
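
To illustrate why the trailing underscore matters: in scikit-learn, `vocabulary` (no underscore) is the constructor argument, which stays `None` unless you pass your own mapping, while `vocabulary_` is the term-to-column-index mapping learned during fitting. The following is a minimal sketch along those lines, not a definitive recipe. The names `doc_term_matrix`, `totals`, `word_counts` and `filtered_vocab` are illustrative, and `stop_words="english"` is just one option for dropping words such as "the" and "is".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(train_set)

# Constructor argument: still None, because no vocabulary was passed in
print(vectorizer.vocabulary)

# Fitted attribute: maps each term to its column index in the matrix
vocab = vectorizer.vocabulary_
print(vocab)

# Sum every column to get the total count of each term across documents
totals = np.asarray(doc_term_matrix.sum(axis=0)).ravel()
word_counts = {term: int(totals[index]) for term, index in vocab.items()}
print(word_counts)

# Optional: drop common English words before building the vocabulary
filtered_vocab = CountVectorizer(stop_words="english").fit(train_set).vocabulary_
print(filtered_vocab)
```

Note that the values in `vocabulary_` are column indices, not frequencies. The per-word totals come from summing the columns of the document-term matrix.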