I need to read the file mbox.txt and use nltk.FreqDist() to find its word frequency distribution, then return a list of the ten most common words. But first I need to lemmatize the words.
The expected output is:
[('received', 16176), ('id', 12609), ('source', 10792), ('tue', 4498), ('mon', 3686), ('date', 3612), ('sakai', 3611), ('murder', 3594), ('cyrus', 3594), ('postfix', 3594)]
The code I wrote is:
import nltk, re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize
tokens = nltk.word_tokenize(open('mbox.txt').read())
lmtzr = nltk.WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in word_tokenize(t)]
for t in tokens]
fdist1 = nltk.FreqDist(tokens)
fdist1.most_common(10)
My output is:
[(':', 67406), ('--', 43761), (')', 40168), ('(', 40160), ('2007', 22447), ('@', 22019), (';', 21582), (',', 18632), ('from', 16328), ('by', 16231)]
I really don't know what I'm doing wrong. Can someone tell me what I'm missing?
Answer 0 (score: 1)
Try the following code. The problem is that your FreqDist is built from the raw tokens, which still contain punctuation and stopwords, instead of from the lemmatized list, and nothing filters out non-word tokens:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize

# Regex for tokens that start with lowercase English letters
ENGLISH_RE = re.compile(r'[a-z]+')

tokens = nltk.word_tokenize(open('mbox.txt').read())
lmtzr = WordNetLemmatizer()
# Build the stopword set once, instead of re-reading the list on every token
stop_words = set(stopwords.words('english'))

# Collect the lowercased, filtered, lemmatized tokens
lemmatized = []
for word in tokens:
    # Lowercase so the regex and stopword checks behave correctly
    w = word.lower()
    # Drop punctuation, numbers, and anything else that isn't an English word
    if not ENGLISH_RE.match(w):
        continue
    # Drop stopwords like 'from' and 'by'
    if w in stop_words:
        continue
    lemmatized.append(lmtzr.lemmatize(w))

# Build the frequency distribution from the cleaned list, not the raw tokens
fdist1 = nltk.FreqDist(lemmatized)
fdist1.most_common(10)
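If NLTK raises a LookupError about missing resources the first time this runs, the tokenizer models, stopword list, and WordNet data need to be downloaded once. A minimal sketch, assuming a standard NLTK install:

import nltk
nltk.download('punkt')      # models used by word_tokenize
nltk.download('stopwords')  # the English stopword list
nltk.download('wordnet')    # data for WordNetLemmatizer

With the filtering in place, fdist1.most_common(10) should come out close to the expected output in the question, though exact counts can vary with tokenizer and corpus versions.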