Question

How do I use finder.apply_ngram_filter in nltk.collocations to select certain ngrams rather than remove certain ngrams? Is it something like finder.apply_ngram_filter(lambda *w: w not in myngrams), or is there another way to do this?

Can anyone help?

Answer 1

The most common use of a collocation finder such as BigramCollocationFinder is to find the top-ranked ngrams, e.g.:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures, TrigramCollocationFinder, TrigramAssocMeasures
from nltk.corpus import genesis

# Initialize the association measures for bigrams and trigrams.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Load the corpus into a bigram finder and a trigram finder.
# Now you can search for bigrams and trigrams in the corpus.
finder2 = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
finder3 = TrigramCollocationFinder.from_words(genesis.words('english-web.txt'))

# Rank the ngrams by their PMI score and output the top 10 of each.
print(finder2.nbest(bigram_measures.pmi, 10))
print('')
print(finder3.nbest(trigram_measures.pmi, 10))

[OUT]:

[(u'Allon', u'Bacuth'), (u'Ashteroth', u'Karnaim'), (u'Ben', u'Ammi'), (u'En', u'Mishpat'), (u'Jegar', u'Sahadutha'), (u'Salt', u'Sea'), (u'Whoever', u'sheds'), (u'appoint', u'overseers'), (u'aromatic', u'resin'), (u'cutting', u'instrument')]

[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Sea', u').', u'Twelve'), (u'Their', u'hearts', u'failed'), (u'Valley', u').', u'Melchizedek'), (u'doing', u'forced', u'labor')]
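The top of both lists is dominated by rare names and phrases because PMI rewards words that occur almost exclusively together. As a rough illustration, here is a minimal sketch of the textbook PMI formula, log2(P(x, y) / (P(x) * P(y))), on a toy token list. The function name `bigram_pmi` and the sample sentence are made up for this example, and NLTK's own counting may differ in detail:

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {bigram: PMI} using log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Simplification: the token count normalizes both distributions.
    n = len(tokens)
    scores = {}
    for (w1, w2), joint in bigrams.items():
        scores[(w1, w2)] = math.log2(joint * n / (unigrams[w1] * unigrams[w2]))
    return scores


tokens = "the salt sea and the land and the sky".split()
scores = bigram_pmi(tokens)

# Print the bigrams from highest to lowest PMI.
for bigram in sorted(scores, key=scores.get, reverse=True):
    print(bigram, round(scores[bigram], 3))
```

In this toy corpus "salt" and "sea" each occur once and only next to each other, so that pair scores highest, while the more frequent pair ("and", "the") scores lower even though it occurs twice.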

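Now that we have seen how the finder works, we want something more sophisticated to clean up the results. Let's try to get rid of annoying trigrams like (u'Sea', u').', u'Twelve') and (u'Valley', u').', u'Melchizedek').

A ). in the middle of a trigram rarely gives a linguistically interesting ngram, so let's remove those trigrams before we rank: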

finder3.apply_ngram_filter(lambda w1, w2, w3: w2 == u').')
print(finder3.nbest(trigram_measures.pmi, 10))

[OUT]:

[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Their', u'hearts', u'failed'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El'), (u'own', u'droves', u'apart')]

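It looks like we cleaned out the trigrams we did not want, but that annoying u').' still shows up in third position, so let's get rid of it once and for all:

finder3.apply_ngram_filter(lambda w1, w2, w3: u').' in [w1, w2, w3])
print(finder3.nbest(trigram_measures.pmi, 10))

[OUT]: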

[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Their', u'hearts', u'failed'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El'), (u'own', u'droves', u'apart'), (u'sandal', u'strap', u'nor')]

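Yes, now the annoying ngrams are gone. It seems we only need to give a condition in the lambda function, and the finder removes the ngrams that satisfy it before ranking.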
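The same idea generalizes beyond one specific token. As a sketch, the predicate below drops any ngram that contains a token that is not purely alphabetic. The name `contains_non_word` is made up for this example, and the commented-out call assumes the finder passes the tokens as separate positional arguments, as the lambdas above do:

```python
def contains_non_word(*ngram):
    """Return True (meaning: drop it) if any token is not purely alphabetic."""
    return any(not token.isalpha() for token in ngram)


# Hypothetical usage with the finder from above:
# finder3.apply_ngram_filter(contains_non_word)

# Plain-Python check of the predicate on a few candidate trigrams.
candidates = [
    ("Salt", "Sea", ")."),
    ("olive", "leaf", "plucked"),
    ("Sea", ").", "Twelve"),
]
kept = [ng for ng in candidates if not contains_non_word(*ng)]
print(kept)
```

Be aware that `str.isalpha` also rejects tokens such as hyphenated words or contractions, so a real filter may need a looser test.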

That is indeed how apply_ngram_filter is implemented; see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L83:

def apply_ngram_filter(self, fn):
    """Removes candidate ngrams (w1, w2, ...) where fn(w1, w2, ...)
    evaluates to True.
    """
    self._apply_filter(lambda ng, f: fn(*ng))

And here, https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L68:

def _apply_filter(self, fn=lambda ngram, freq: False):
    """Generic filter removes ngrams from the frequency distribution
    if the function returns True when passed an ngram tuple.
    """
    tmp_ngram = FreqDist()
    for ngram, freq in iteritems(self.ngram_fd):
        if not fn(ngram, freq):
            tmp_ngram[ngram] = freq
    self.ngram_fd = tmp_ngram

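The lambda looks a bit complicated, but it really just does something like this (not exactly the following, but you can think of it this way):

def apply_filter(trigrams, condition):
    # Keep only the trigrams for which the condition is False.
    return [ng for ng in trigrams if not condition(*ng)]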
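To make the "True means removed" rule concrete, here is a small self-contained sketch that mimics the filtering on a toy frequency table. It uses `collections.Counter` as a stand-in for NLTK's FreqDist, and this `apply_filter` is a simplified illustration rather than NLTK's code:

```python
from collections import Counter


def apply_filter(ngram_fd, fn):
    """Return a new Counter without the ngrams for which fn(*ngram) is True."""
    return Counter({ng: freq for ng, freq in ngram_fd.items() if not fn(*ng)})


# Toy frequency table standing in for finder.ngram_fd.
ngram_fd = Counter({
    ("olive", "leaf", "plucked"): 1,
    ("Sea", ").", "Twelve"): 1,
    ("Salt", "Sea", ")."): 1,
})

# Drop every trigram whose middle token is ').'.
filtered = apply_filter(ngram_fd, lambda w1, w2, w3: w2 == ").")
print(sorted(filtered))
```

Note that in the NLTK source quoted above the finder replaces its own ngram_fd with the filtered table, so filters accumulate: an ngram removed by one filter stays removed unless you create a new finder.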

So, back to your question. Let's say our blacklist is:

blacklist = ["olive leaf plucked", "during mating searson"]

First you have to turn the entries into tuples (if they are not tuples already):

>>> blacklist = ["olive leaf plucked", "during mating searson"]
>>> blacklist = [tuple(b.split()) for b in blacklist]
>>> blacklist
[('olive', 'leaf', 'plucked'), ('during', 'mating', 'searson')]

Then:

blacklist = ["olive leaf plucked", "during mating searson"]
blacklist = [tuple(b.split()) for b in blacklist]
finder3.apply_ngram_filter(lambda w1, w2, w3: (w1, w2, w3) in blacklist)

print(finder3.nbest(trigram_measures.pmi, 10))

[OUT]:

[(u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Sea', u').', u'Twelve'), (u'Their', u'hearts', u'failed'), (u'Valley', u').', u'Melchizedek'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El')]

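Voilà! 'olive leaf plucked' is gone from the ranking. ('during mating season' is still listed because the blacklist entry is spelled 'searson' and therefore never matches.)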

Alternatively, if you do not want to turn the ngrams into tuples, you can use the following, which produces the same output:

blacklist = ["olive leaf plucked", "during mating searson"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1, w2, w3]) in blacklist)

print(finder3.nbest(trigram_measures.pmi, 10))

So here is the full script:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures, TrigramCollocationFinder, TrigramAssocMeasures
from nltk.corpus import genesis


# Initialize the association measures for bigrams and trigrams.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Load the corpus into a bigram finder and a trigram finder.
# Now you can search for bigrams and trigrams in the corpus.
finder2 = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
finder3 = TrigramCollocationFinder.from_words(genesis.words('english-web.txt'))


blacklist = ["olive leaf plucked", "during mating searson"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1, w2, w3]) in blacklist)

print(finder3.nbest(trigram_measures.pmi, 10))

If you want the opposite, which is usually called a whitelist, just do this:

whitelist = ["olive leaf plucked", "during mating searson"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) not in whitelist)
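The variadic form from the question, lambda *w: w not in myngrams, should work as well, provided the whitelist holds tuples, because *w collects the tokens into a tuple. Here is a minimal sketch; the names `make_whitelist_filter` and `keep_only` are made up for this example, and a set is used only to make lookups fast:

```python
def make_whitelist_filter(phrases):
    """Build a predicate that is True (drop) for every ngram NOT whitelisted."""
    # Convert "w1 w2 w3" strings into tuples and store them in a set.
    allowed = {tuple(p.split()) for p in phrases}
    # *w collects the positional tokens into a tuple, so it compares directly.
    return lambda *w: w not in allowed


keep_only = make_whitelist_filter(["olive leaf plucked", "sewed fig leaves"])

# Hypothetical usage with the finder from above:
# finder3.apply_ngram_filter(keep_only)

print(keep_only("olive", "leaf", "plucked"))    # False: this trigram stays
print(keep_only("rider", "falls", "backward"))  # True: this trigram is removed
```

Because the predicate does not fix the number of arguments, the same function can be passed to a bigram or a trigram finder, as long as the whitelist entries have the matching length.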
