I want to extract "common phrases" from a text, defined as trigrams that occur more than once. So far I have this:
import nltk
def get_words(string):
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(string)
string = "Hello, world. This is a dog. This is a cat."
words = get_words(string)
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
scored = finder.score_ngrams(nltk.collocations.TrigramAssocMeasures().raw_freq)
The resulting scored is:
[(('This', 'is', 'a'), 0.2), (('Hello', 'world', 'This'), 0.1), (('a', 'dog', 'This'), 0.1), (('dog', 'This', 'is'), 0.1), (('is', 'a', 'cat'), 0.1), (('is', 'a', 'dog'), 0.1), (('world', 'This', 'is'), 0.1)]
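I noticed that the number in each element of scored is the trigram's occurrence count divided by the total word count (10 in this example). Is there a way to get the occurrence counts directly, without "post-multiplying" by the word count?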
Answer 0 (score: 1)
You can use finder.ngram_fd.items() to get the occurrence counts:
# To get trigrams with their occurrence counts
trigrams = finder.ngram_fd.items()
print(list(trigrams))
# To get trigrams with their occurrence counts, in descending order of count
trigrams = sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))
print(trigrams)
You can find more related examples in the NLTK Collocations documentation.
Answer 1 (score: 0)
To get the raw (non-normalized) counts, you only need to read the finder's ngram_fd attribute. In your case:
trigram_freqs = finder.ngram_fd
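As a minimal sketch of how this could answer the original question, the counts in ngram_fd can be filtered directly for trigrams that occur more than once. It assumes ngram_fd behaves like a collections.Counter (NLTK's FreqDist is a Counter subclass); the name common_phrases is borrowed from the question for illustration.

```python
import nltk

def get_words(text):
    # Split on runs of word characters, dropping punctuation.
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(text)

text = "Hello, world. This is a dog. This is a cat."
finder = nltk.collocations.TrigramCollocationFinder.from_words(get_words(text))

# ngram_fd maps each trigram tuple to its integer occurrence count,
# so no multiplication by the word count is needed.
common_phrases = [
    " ".join(trigram)
    for trigram, count in finder.ngram_fd.most_common()
    if count > 1
]
print(common_phrases)
```

Because the counts are integers, the comparison count > 1 does not depend on floating-point rounding.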
Answer 2 (score: 0)
In the end, I went with "post-multiplying" the raw_freq scores, because that list is already sorted. Here is my implementation:
import nltk
def get_words(string):
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(string)
string = "Hello, world. This is a dog. This is a cat."
words = get_words(string)
word_count = len(words)
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
scored = finder.score_ngrams(nltk.collocations.TrigramAssocMeasures().raw_freq)
scored_common = filter(lambda score: score[1]*word_count > 1, scored)
common_phrases = [" ".join(score[0]) for score in scored_common]
For this example, common_phrases is ['This is a'].
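For comparison, the relationship between the counts and the raw_freq scores can be checked with the standard library alone. This is only a sketch, not part of NLTK: it assumes re.findall(r'\w+', ...) tokenizes this input the same way as the RegexpTokenizer above, and the names trigram_counts and relative_freq are made up for illustration.

```python
import re
from collections import Counter

text = "Hello, world. This is a dog. This is a cat."
words = re.findall(r'\w+', text)  # assumed equivalent to the tokenizer above

# Slide a window of three words over the text and count each trigram.
trigram_counts = Counter(zip(words, words[1:], words[2:]))

# Dividing a count by the total number of words gives the kind of
# value reported by raw_freq in the question.
relative_freq = {t: c / len(words) for t, c in trigram_counts.items()}

# Keep only trigrams seen more than once, most frequent first.
common_phrases = [" ".join(t) for t, c in trigram_counts.most_common() if c > 1]
print(common_phrases)
```

Working on the integer counts avoids the multiply-then-compare step on floating-point scores.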