Extending a dictionary to include word frequencies

Time: 2017-10-20 17:22:04

Tags: python

I have a Python dictionary that I am building for NLTK sentiment analysis.

Note: the input is plain-text email content.

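from nltk.corpus import stopwords

def word_feats(words):
    # Keep the stop words in a set for fast membership tests
    stopset = set(stopwords.words('english'))

    words_split = words.split()

    # Map every word that is not a stop word to True
    result = dict([(word, True) for word in words_split if word not in stopset])

    return result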

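I would like to extend this so that the dictionary holds each word's frequency as well as the unique words themselves.

This is what I currently get: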

'To' (4666843744) = {bool} True
'ensure' (4636385096) = {bool} True
'email' (4636383752) = {bool} True
'updates' (4636381960) = {bool} True
'delivered' (4667509936) = {bool} True
'inbox,' (4659135800) = {bool} True
'please' (4659137368) = {bool} True
'add' (4659135016) = {bool} True

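I would like something like the following, where the number at the end is the frequency. It does not have to look exactly like this, but I want to be able to access the frequency of each word.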

'To' (4666843744) = {bool} True, 100
'ensure' (4636385096) = {bool} True, 3
'email' (4636383752) = {bool} True, 40
'updates' (4636381960) = {bool} True, 3
'delivered' (4667509936) = {bool} True, 4
'inbox,' (4659135800) = {bool} True, 20
'please' (4659137368) = {bool} True, 150
'add' (4659135016) = {bool} True, 10

1 answer:

Answer 0: (score: 3)

Python's Counter should do the trick:

from collections import Counter
result = dict(Counter(word for word in words_split if word not in stopset))
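For completeness, here is a minimal self-contained sketch of how the Counter line might slot into word_feats. The small STOPSET and the sample sentence are assumptions made up for illustration so that the snippet runs without NLTK data; in the question the stop words come from stopwords.words('english').

```python
from collections import Counter

# Hypothetical stand-in stop word list (an assumption for this sketch);
# the question builds it from nltk.corpus.stopwords.words('english').
STOPSET = {"to", "the", "your", "a", "of"}

def word_feats(words, stopset=STOPSET):
    """Map each non-stop word in the text to the number of times it occurs."""
    words_split = words.split()
    # Counter tallies the occurrences; dict() turns it into a plain dictionary
    return dict(Counter(word for word in words_split if word not in stopset))

feats = word_feats("please add email updates to your inbox, please")
print(feats["please"])   # frequency of a single word -> 2
print("email" in feats)  # presence check, replaces the old True flag -> True
```

Since every key in the result has a count of at least 1, a membership test such as "email" in feats carries the same information as the original True value, and the count itself gives the frequency.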