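If I am working with the news category,
nltk.corpus.brown.tagged_words(categories="news")
how do I find the most common words and word classes? I am also not allowed to use FreqDist, which is why this is hard.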
Answer 0 (score: 0)
First, use namespaces; see https://docs.python.org/3.5/tutorial/modules.html#importing-from-a-package, for example:
# We are not Java ;P
# Try not to do nltk.corpus.brown.tagged_words()
# Instead do this:
from nltk.corpus import brown
words_with_tags = brown.tagged_words()
Next, nltk.probability.FreqDist is essentially a subclass of Python's built-in collections.Counter; see "Difference between Python's collections.Counter and nltk.probability.FreqDist".
If you cannot use FreqDist, you can use Counter instead:
from collections import Counter
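As a minimal sketch of how Counter can stand in for FreqDist, the snippet below counts a small hand-made list of tokens. The sample list is hypothetical illustration data, not taken from the Brown corpus.

```python
from collections import Counter

# Hypothetical sample tokens (not from the Brown corpus), for illustration only.
sample_tokens = ["the", "jury", "said", "the", "county", "the", "jury"]

token_counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = token_counts.most_common(2)
print(top_two)
```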
The return value of brown.tagged_words() behaves like a list of (word, tag) tuples:
>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words_with_tags[0]
(u'The', u'AT')
>>> words_with_tags[:10]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]
To split a list of tuples, see "Unpacking a list / tuple of pairs into two lists / tuples":
>>> from nltk.corpus import brown
>>> words_with_tags = brown.tagged_words()
>>> words, tags = zip(*words_with_tags)
>>> words[:10]
(u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of')
>>> tags[:10]
(u'AT', u'NP-TL', u'NN-TL', u'JJ-TL', u'NN-TL', u'VBD', u'NR', u'AT', u'NN', u'IN')
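The same unpacking idiom can be tried on a small hand-written list of (word, tag) pairs without loading the corpus. The pairs below are hypothetical illustration data that only mimic the shape of the output above.

```python
# Hypothetical (word, tag) pairs mimicking the shape of brown.tagged_words().
sample_pairs = [("The", "AT"), ("jury", "NN"), ("said", "VBD"), ("an", "AT")]

# zip(*pairs) transposes the list of pairs into one tuple per column.
sample_words, sample_tags = zip(*sample_pairs)

print(sample_words)
print(sample_tags)
```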
Since this is a homework question, there won't be a full code answer =)
Answer 1 (score: 0)
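from collections import Counter
from nltk.corpus import brown
tagged = brown.tagged_words(categories="news")
# Count words and tags separately; flattening the (word, tag) pairs would mix them.
word_counts = Counter(word for word, tag in tagged)
tag_counts = Counter(tag for word, tag in tagged)
# the most frequent word
print(word_counts.most_common(1))
# the most frequent word class
print(tag_counts.most_common(1))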