How do I find the most common words and word classes in a specific Brown corpus category?

Time: 2016-03-16 22:34:19

Tags: python nltk

Suppose I am working with the news category:

nltk.corpus.brown.tagged_words(categories="news")

How do I find the most frequent words and word classes (tags)? I am also not allowed to use FreqDist, which is what makes this hard.

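2 answers:

Answer 0: (score: 0)

First, import from the package namespace instead of spelling out the full dotted path every time; see https://docs.python.org/3.5/tutorial/modules.html#importing-from-a-package. For example: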

# We are not Java ;P
# Try not to do nltk.corpus.brown.tagged_words()
# Instead do this:
from nltk.corpus import brown
words_with_tags = brown.tagged_words()

Next, nltk.probability.FreqDist is essentially a subclass of Python's native collections.Counter; see Difference between Python's collections.Counter and nltk.probability.FreqDist

If you cannot use FreqDist, you can use:

from collections import Counter
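As a minimal sketch of what Counter gives you in place of FreqDist, here it is on a small hand-written token list. The list is made up for illustration and is not corpus data:

```python
from collections import Counter

# Toy token list, invented for illustration; not taken from the corpus.
tokens = ["the", "jury", "said", "the", "election", "the", "jury"]

counts = Counter(tokens)         # maps each token to how often it occurs
top_two = counts.most_common(2)  # list of (token, count), highest count first
print(top_two)                   # [('the', 3), ('jury', 2)]
```

Any iterable of hashable items can be passed to Counter, so the same call works on a sequence of words or a sequence of tags.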

brown.tagged_words() returns a sequence that behaves like a list of (word, tag) tuples:

>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words_with_tags[0]
(u'The', u'AT')
>>> words_with_tags[:10]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]

To split the list of tuples into separate sequences, see Unpacking a list / tuple of pairs into two lists / tuples

>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words, tags = zip(*words_with_tags)
>>> words[:10]
(u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of')
>>> tags[:10]
(u'AT', u'NP-TL', u'NN-TL', u'JJ-TL', u'NN-TL', u'VBD', u'NR', u'AT', u'NN', u'IN')

Since this is a homework question, there will be no complete code answer =)

Answer 1: (score: 0)

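import nltk
from collections import Counter
# Each item is a (word, tag) tuple
tagged = nltk.corpus.brown.tagged_words(categories="news")
# Count words and tags separately; flattening the tuples would mix them
word_counts = Counter(word for word, tag in tagged)
tag_counts = Counter(tag for word, tag in tagged)
# the most frequent word
print(word_counts.most_common(1))
# the most frequent word class
print(tag_counts.most_common(1))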
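If collections.Counter turns out to be off limits as well (an assumption; the question only rules out FreqDist), the same tally can be sketched with a plain dict. The sample pairs are the first ten from the tagged_words() output shown in the first answer, standing in for the full news category, and `most_frequent` is a hypothetical helper name:

```python
# First ten (word, tag) pairs from the Brown corpus output shown earlier,
# used as a stand-in so the sketch runs without downloading the corpus.
sample = [("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
          ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD"),
          ("Friday", "NR"), ("an", "AT"), ("investigation", "NN"),
          ("of", "IN")]


def most_frequent(items):
    """Return (item, count) for the most frequent item (hypothetical helper)."""
    counts = {}
    for item in items:
        # dict.get supplies 0 the first time an item is seen
        counts[item] = counts.get(item, 0) + 1
    # Pick the key with the highest count; ties go to the first one seen
    best = max(counts, key=counts.get)
    return best, counts[best]


top_word = most_frequent(word for word, tag in sample)
top_tag = most_frequent(tag for word, tag in sample)
print(top_word, top_tag)
```

On the real data, `sample` would be replaced by brown.tagged_words(categories="news"); whether to lowercase words before counting is a separate choice left out here.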