Question

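If I am working with the news category,

nltk.corpus.brown.tagged_words(categories="news")

how do I find the most frequent word and the most frequent word class? I am also not allowed to use FreqDist, which is what makes this hard.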

Answer 1

First, make use of namespaces; see https://docs.python.org/3.5/tutorial/modules.html#importing-from-a-package. For example:

# We are not Java ;P
# Try not to do nltk.corpus.brown.tagged_words()
# Instead do this:
from nltk.corpus import brown
words_with_tags = brown.tagged_words()

Next, nltk.probability.FreqDist is essentially a subclass of Python's native collections.Counter; see "Difference between Python's collections.Counter and nltk.probability.FreqDist".

If you cannot use FreqDist, you can use Counter directly:

from collections import Counter
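
As a rough illustration of the Counter interface, here is a minimal sketch on a tiny hand-made list rather than the Brown corpus itself. The `sample` list and its contents are made up for the example; only `Counter` and `most_common` are real library calls.

```python
from collections import Counter

# Hypothetical stand-in for a handful of corpus tokens (not real Brown data).
sample = ["the", "jury", "said", "the", "election", "the", "jury"]

# Counter maps each distinct item to its number of occurrences.
counts = Counter(sample)

# most_common(1) returns a one-element list of (item, count) pairs.
top_item, top_count = counts.most_common(1)[0]
print(top_item, top_count)
```

The same two calls should work on any iterable of hashable items, so they apply equally to a sequence of words or a sequence of tags.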

brown.tagged_words() returns a sequence that behaves like a list of (word, tag) tuples:

>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words_with_tags[0]
(u'The', u'AT')
>>> words_with_tags[:10]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]

To split the list of tuples, see "Unpacking a list / tuple of pairs into two lists / tuples":

>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words, tags = zip(*words_with_tags)
>>> words[:10]
(u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of')
>>> tags[:10]
(u'AT', u'NP-TL', u'NN-TL', u'JJ-TL', u'NN-TL', u'VBD', u'NR', u'AT', u'NN', u'IN')
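
If Counter turns out to be off-limits as well (an assumption; the question only rules out FreqDist), the counting step can be sketched with a plain dict. The helper name `most_frequent` and the `sample` data below are hypothetical and are not part of NLTK or the Brown corpus.

```python
def most_frequent(items):
    """Return (item, count) for the most common item, using only a dict."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    # Pick the key whose count is largest.
    best = max(counts, key=counts.get)
    return best, counts[best]


# Hypothetical sample in the same (word, tag) shape as brown.tagged_words().
sample = [("The", "AT"), ("jury", "NN"), ("said", "VBD"),
          ("the", "AT"), ("election", "NN"), ("was", "BEDZ"),
          ("the", "AT")]

# Split the pairs into one tuple of words and one tuple of tags.
words, tags = zip(*sample)

print(most_frequent(words))
print(most_frequent(tags))
```

Note that this counts "The" and "the" as different words; whether to lower-case the words first is a choice the question leaves open.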

Since this is a homework question, there will be no full code answer =)

Answer 2

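from collections import Counter
from nltk.corpus import brown

tagged = brown.tagged_words(categories="news")
# Count words and tags separately; flattening the pairs would mix the two.
word_counts = Counter(word for word, tag in tagged)
tag_counts = Counter(tag for word, tag in tagged)
# the most frequent word
print(word_counts.most_common(1))
# the most frequent word class
print(tag_counts.most_common(1))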

How to find the most frequent word and word class in a specific Brown corpus category?
