Frequency distribution after lemmatization and stopword removal

Time: 2019-05-01 02:44:51

Tags: python nltk

I need to read the file mbox.txt, find its word frequency distribution using nltk.FreqDist(), and then return a list of the ten most common words. First, however, I need to:

  1. Lemmatize the words
  2. Remove stopwords
  3. Keep only English terms
  4. Keep only terms belonging to the ten most frequent parts of speech.

The sample output is:

[('received', 16176), ('id', 12609), ('source', 10792), ('tue', 4498), ('mon', 3686), ('date', 3612), ('sakai', 3611), ('murder', 3594), ('cyrus', 3594), ('postfix', 3594)]

The code I wrote is:

import nltk, re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize

tokens = nltk.word_tokenize(open('mbox.txt').read())

lmtzr = nltk.WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in word_tokenize(t)]
              for t in tokens]

fdist1 = nltk.FreqDist(tokens)
fdist1.most_common(10)

My output is:

[(':', 67406), ('--', 43761), (')', 40168), ('(', 40160), ('2007', 22447), ('@', 22019), (';', 21582), (',', 18632), ('from', 16328), ('by', 16231)]

I really don't know what I'm doing wrong. Can someone tell me what I'm missing?

1 answer:

Answer 0 (score: 1)

  1. You are not removing stopwords or non-English terms
  2. You are building the FreqDist over the raw tokens, not the lemmas (see the toy example below)
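
Point 2 is why your output is dominated by punctuation. A toy illustration (the token list here is hypothetical): a FreqDist built over raw tokens counts punctuation and case variants as separate entries.

import nltk

# Hypothetical token list: punctuation and unnormalized case are
# counted just like ordinary words
tokens = ['From', ':', 'received', '--', 'Received', ':']
print(nltk.FreqDist(tokens).most_common(3))
# [(':', 2), ('From', 1), ('received', 1)]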

Try this code:

import nltk, re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Regex for tokens that start with English letters
ENGLISH_RE = re.compile(r'[a-z]+')

tokens = nltk.word_tokenize(open('mbox.txt').read())

lmtzr = nltk.WordNetLemmatizer()

# Build the stopword set once instead of re-reading it for every token
stop_words = set(stopwords.words('english'))

# Collect the lemmatized tokens here
lemmatized = []
for word in tokens:
    # Lowercase so the regex and the stopword lookup both work
    w = word.lower()
    # Keep only English terms
    if not ENGLISH_RE.match(w):
        continue
    # Drop stopwords
    if w in stop_words:
        continue
    lemmatized.append(lmtzr.lemmatize(w))

fdist1 = nltk.FreqDist(lemmatized)
fdist1.most_common(10)
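
Note that this still does not handle requirement 4 (keeping only terms belonging to the ten most frequent parts of speech). Below is a minimal sketch of one way to do it, assuming the lemmatized list built above and NLTK's averaged_perceptron_tagger resource; tagging already-lemmatized tokens out of sentence context is only approximate.

import nltk
from collections import Counter

# Tag each lemma with a part-of-speech tag
tagged = nltk.pos_tag(lemmatized)

# Count how often each POS tag occurs and keep the ten most frequent
tag_counts = Counter(tag for _, tag in tagged)
top_tags = {tag for tag, _ in tag_counts.most_common(10)}

# Restrict the distribution to words carrying one of those tags
filtered = [word for word, tag in tagged if tag in top_tags]

fdist2 = nltk.FreqDist(filtered)
print(fdist2.most_common(10))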