Question

I am trying to find collocations in a text with NLTK by using its built-in methods.

Now I have the following example text (test and foo follow each other, but there is a sentence boundary in between):

content_part = """test. foo 0 test. foo 1 test. 
               foo 2 test. foo 3 test. foo 4 test. foo 5"""

The tokenization and the result of collocations() are as follows:

import nltk

print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']

# collocations() prints its result itself and returns None
nltk.Text(nltk.word_tokenize(content_part)).collocations()
# test. foo

How can I prevent NLTK from:

including the dot in my tokens
finding collocations() across sentence boundaries?

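So in this example it should not print any collocation at all, but I guess you can imagine more complicated texts where there are also collocations within sentences.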

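I guess I need to use the Punkt sentence segmenter, but then I don't know how to put the sentences back together to find collocations with NLTK (collocations() seems to be more powerful than just counting things myself).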

Answer 1

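You can use WordPunctTokenizer to separate the punctuation from the words, and then use apply_word_filter() to filter out the bigrams that contain punctuation.

The same can be done for trigrams, so that no collocations are found across sentence boundaries.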

from nltk import FreqDist, bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# Split punctuation into separate tokens: 'test.' becomes 'test', '.'
tokens = WordPunctTokenizer().tokenize(content_part)

bigram_measures = BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)

# Drop every bigram that contains a punctuation token
finder.apply_word_filter(lambda w: w in ('.', ','))

# All remaining bigrams with their scores (not used below)
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2), reverse=True))

Output:

['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.']
[('foo', '4'), ('4', 'test')]
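
To make the idea behind the filter easier to see, here is a minimal sketch in plain Python (no NLTK) that counts word pairs only inside punctuation-delimited segments, so a pair can never span a boundary. The helper name bigrams_within_boundaries and the naive regex split are assumptions made for illustration; they are not part of NLTK, and a real sentence segmenter such as Punkt would handle abbreviations and similar cases far better.

```python
import re
from collections import Counter


def bigrams_within_boundaries(text):
    """Count adjacent word pairs without pairing words across punctuation.

    Hypothetical helper for illustration only; it treats . , ; : ! ?
    as boundaries, mirroring the ('.', ',') filter used above.
    """
    counts = Counter()
    # Naive split on punctuation (an assumption, not a real segmenter)
    for segment in re.split(r"[.,;:!?]+", text):
        words = re.findall(r"\w+", segment)
        # Pair each word with its right-hand neighbour in the same segment
        counts.update(zip(words, words[1:]))
    return counts


content_part = """test. foo 0 test. foo 1 test.
               foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

counts = bigrams_within_boundaries(content_part)
print(counts.most_common(2))
```

This only produces raw counts. If you want NLTK's association measures on top, the collocation finders also seem to offer a from_documents() constructor that takes one token list per sentence, which may be worth checking in the documentation for your NLTK version.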
