Getting the occurrence count of trigrams in NLTK

Asked: 2016-07-26 18:36:35

Tags: python nltk

I want to extract "common phrases" from a text, defined as trigrams that occur more than once. So far I have this:

import nltk

def get_words(string):
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(string)

string = "Hello, world. This is a dog. This is a cat."

words = get_words(string)

finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
scored = finder.score_ngrams(nltk.collocations.TrigramAssocMeasures().raw_freq)

The resulting scored is:

[(('This', 'is', 'a'), 0.2), (('Hello', 'world', 'This'), 0.1), (('a', 'dog', 'This'), 0.1), (('dog', 'This', 'is'), 0.1), (('is', 'a', 'cat'), 0.1), (('is', 'a', 'dog'), 0.1), (('world', 'This', 'is'), 0.1)]

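I noticed that the number in each element of scored is the trigram's occurrence count divided by the total word count (10 in this example). Is there a way to get the occurrence count directly, without having to "post-multiply" by the word count?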
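
The relationship described above can be checked by hand. The following is a minimal sketch that does not use NLTK at all: it assumes re.findall and collections.Counter as stand-ins for the tokenizer and the finder, and the names counts and total_words are illustrative only.

```python
import re
from collections import Counter

string = "Hello, world. This is a dog. This is a cat."

# Same pattern as the RegexpTokenizer(r'\w+') used above
words = re.findall(r'\w+', string)
total_words = len(words)

# Count every run of three consecutive words
counts = Counter(zip(words, words[1:], words[2:]))

# Raw count, and the count divided by the total number of words
count = counts[('This', 'is', 'a')]
print(count, count / total_words)
```

Here the trigram ('This', 'is', 'a') is counted twice among 10 words, which matches the 0.2 reported by raw_freq.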

3 answers:

Answer 0 (score: 1)

You can use finder.ngram_fd.items() to get the occurrence counts:
# Get the trigrams together with their occurrence counts
trigrams = list(finder.ngram_fd.items())
print(trigrams)

# Same, sorted by count (descending), then alphabetically by trigram
trigrams = sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))
print(trigrams)

You can find more related examples here: NLTK Collocations
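
Applied to the original goal, the counts from ngram_fd can be filtered directly. This is only a sketch built on the question's example text; the variable name common_phrases is borrowed from the asker's own code, and the threshold of "more than once" comes from the question.

```python
import nltk

string = "Hello, world. This is a dog. This is a cat."
words = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(string)
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

# ngram_fd maps each trigram to its raw occurrence count,
# so no multiplication by the word count is needed
common_phrases = [
    " ".join(trigram)
    for trigram, count in finder.ngram_fd.items()
    if count > 1
]
print(common_phrases)
```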

Answer 1 (score: 0)

To get the non-normalized frequencies (the raw counts), you can simply access the ngram_fd attribute. In your case:

trigram_freqs = finder.ngram_fd
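
Since ngram_fd is an nltk.FreqDist, it can be queried like a dictionary of counts. A small sketch, assuming the question's example text; the names this_is_a and top are illustrative:

```python
import nltk

string = "Hello, world. This is a dog. This is a cat."
words = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(string)
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

trigram_freqs = finder.ngram_fd  # FreqDist holding raw counts

# Look up the count of a single trigram
this_is_a = trigram_freqs[('This', 'is', 'a')]

# The most frequent trigram together with its count
top = trigram_freqs.most_common(1)
print(this_is_a, top)
```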

Answer 2 (score: 0)

In the end, I went with "post-multiplying" the raw_freq scores, because the result is already sorted. Here is my implementation:

import nltk

def get_words(string):
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(string)

string = "Hello, world. This is a dog. This is a cat."

words = get_words(string)
word_count = len(words)

finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
scored = finder.score_ngrams(nltk.collocations.TrigramAssocMeasures().raw_freq)
scored_common = filter(lambda score: round(score[1] * word_count) > 1, scored)
common_phrases = [" ".join(score[0]) for score in scored_common]

For this example, common_phrases comes out as ['This is a'].
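
As a possible alternative that avoids the multiplication, the finder's apply_freq_filter method can drop rare trigrams before anything is scored. This is a sketch rather than the approach used in the answer above; the minimum count of 2 encodes "more than once" from the question, and the name common is illustrative.

```python
import nltk

string = "Hello, world. This is a dog. This is a cat."
words = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(string)
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

# Remove every trigram that occurs fewer than 2 times
finder.apply_freq_filter(2)

# What remains in ngram_fd are the common trigrams with integer counts,
# listed from most to least frequent
common = finder.ngram_fd.most_common()
common_phrases = [" ".join(trigram) for trigram, count in common]
print(common_phrases)
```

Because the counts stay integers, this also sidesteps any floating-point rounding when comparing against the threshold.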