Question

I know NLTK can tell you the likelihood of a word in a given context: nltk language model (ngram) calculate the prob of a word from context

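But can it tell you the count (or likelihood) of a given ngram in the Brown corpus? For example, can it tell you how many times the phrase "chocolate milkshake" occurs in the Brown corpus?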

I know you can do this with Google Ngrams, but the data is somewhat unwieldy. I'm wondering whether there is a way to do this with plain NLTK.

Answer 1

from collections import Counter

from nltk.corpus import brown
from nltk.util import ngrams

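n = 2  # bigrams; use 3 for trigrams, and so on
bigrams = ngrams(brown.words(), n)
bigrams_freq = Counter(bigrams)

# Count for one specific bigram (a Counter returns 0 for unseen keys)
print(bigrams_freq[('chocolate', 'milkshake')])
# The entry at index 2000 when bigrams are sorted by descending count
print(bigrams_freq.most_common()[2000])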

[OUT]:

0
(('beginning', 'of'), 42)
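
The question also asks about likelihood, not only counts. A minimal sketch, assuming an unsmoothed maximum-likelihood estimate P(second | first) = count(first, second) / count(first), is given below. The helper name bigram_probability and the toy token list are hypothetical and only stand in for brown.words(); this is one simple way to turn counts into a probability, not NLTK's own language-model API.

```python
from collections import Counter


def bigram_probability(tokens, first, second):
    """Unsmoothed estimate of P(second | first) from a token sequence."""
    tokens = list(tokens)
    # Count each adjacent pair, and each token that starts a pair.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])
    if first_counts[first] == 0:
        return 0.0  # `first` never starts a bigram, so no estimate is possible
    return pair_counts[(first, second)] / first_counts[first]


# Toy data (hypothetical) standing in for brown.words()
toy_tokens = "the cat sat on the mat near the cat".split()
print(bigram_probability(toy_tokens, "the", "cat"))
```

With the real corpus, the same function could be called with brown.words() as the first argument. Because nothing is smoothed, any bigram that never occurs gets probability 0.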

Answer 2

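With nltk.bigrams(<tokenizedtext>), counting them is easy. Make an empty dictionary, iterate over the list of bigrams, and add or update the count for each bigram (the dictionary will have the form {<bigram>: <count>}). Once you have this dictionary, simply use dict[<bigram>] to look up any bigram you are interested in.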

An example, assuming the Brown bigrams are in the list brown_bigrams:

frequencies = {}
for ngram in brown_bigrams:
    if ngram in frequencies:
        frequencies[ngram] += 1
    else:
        frequencies[ngram] = 1

# frequency of ('chocolate', 'milkshake'); .get() returns 0 if the bigram never occurs
print(frequencies.get(('chocolate', 'milkshake'), 0))
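
The same counting idea extends from bigrams to any n. The sketch below is an illustration under stated assumptions: count_ngrams and phrase_count are hypothetical helper names, the token list is toy data, and the sliding window is built with plain zip rather than an NLTK call so the example runs without the corpus installed.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Return a Counter mapping each n-gram (a tuple of n tokens) to its count."""
    tokens = list(tokens)
    # Zipping n staggered views of the list yields every window of length n.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)


def phrase_count(counts, phrase):
    """Look up a whitespace-separated phrase; unseen n-grams count as 0."""
    return counts[tuple(phrase.split())]


# Toy data (hypothetical) standing in for the corpus tokens
toy_tokens = "a chocolate milkshake and a chocolate bar".split()
bigram_counts = count_ngrams(toy_tokens, 2)
print(phrase_count(bigram_counts, "a chocolate"))
print(phrase_count(bigram_counts, "chocolate milkshake"))
print(phrase_count(bigram_counts, "vanilla milkshake"))
```

To restrict the counts to the news portion of the corpus, brown.words(categories='news') could be passed as tokens. Lookups are case-sensitive, so lowercasing the tokens first is an optional choice if "Chocolate" and "chocolate" should be counted together.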

Ngram counts in the Brown news corpus?
