Using finder.apply_ngram_filter in nltk.collocations to select certain ngrams

Asked: 2014-11-29 13:56:49

Tags: python-2.7 python-3.x nltk

How can I use finder.apply_ngram_filter in nltk.collocations to select certain ngrams rather than remove them? Would it be something like finder.apply_ngram_filter(lambda *w: w not in myngrams), or is there another way to do this?

Can anyone help?

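1 answer:

Answer 0 (score: 0)

The most common use of a collocation finder such as BigramCollocationFinder or TrigramCollocationFinder is to find the top-ranked ngrams, e.g.: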

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder, TrigramAssocMeasures
from nltk.corpus import genesis

# Initialize the association measures for bigrams and trigrams.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Load the corpus into a bigram finder and a trigram finder.
# Now you can search for bigrams and trigrams in the corpus.
finder2 = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
finder3 = TrigramCollocationFinder.from_words(genesis.words('english-web.txt'))

# Rank the bigrams and the trigrams by their PMI score and output the top 10 of each.
print finder2.nbest(bigram_measures.pmi, 10)
print
print finder3.nbest(trigram_measures.pmi, 10)

[OUT]:

[(u'Allon', u'Bacuth'), (u'Ashteroth', u'Karnaim'), (u'Ben', u'Ammi'), (u'En', u'Mishpat'), (u'Jegar', u'Sahadutha'), (u'Salt', u'Sea'), (u'Whoever', u'sheds'), (u'appoint', u'overseers'), (u'aromatic', u'resin'), (u'cutting', u'instrument')]

[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Sea', u').', u'Twelve'), (u'Their', u'hearts', u'failed'), (u'Valley', u').', u'Melchizedek'), (u'doing', u'forced', u'labor')]

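Now that we have seen how the finder works, we want something more sophisticated to clean up the results. Let's try to get rid of annoying trigrams such as (u'Sea', u').', u'Twelve') and (u'Valley', u').', u'Melchizedek').

A ). in the middle of a trigram usually does not yield a linguistically interesting ngram, so let's remove those trigrams before ranking: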

# Drop every trigram whose middle token is ').'.
finder3.apply_ngram_filter(lambda w1, w2, w3: w2 == u').')
print finder3.nbest(trigram_measures.pmi, 10)

[OUT]:

[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Their', u'hearts', u'failed'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El'), (u'own', u'droves', u'apart')]

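It looks like we cleaned out the trigrams we did not want, but that annoying u').' still shows up in third position, so let's get rid of it once and for all:

# Drop every trigram that contains ').' in any position.
finder3.apply_ngram_filter(lambda w1, w2, w3: u').' in [w1,w2,w3])
print finder3.nbest(trigram_measures.pmi, 10)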

[OUT]:

[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Their', u'hearts', u'failed'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El'), (u'own', u'droves', u'apart'), (u'sandal', u'strap', u'nor')]

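Yes, now the annoying ngrams are gone. It seems that all we need to do is give a condition in the lambda function, and every ngram for which the condition is true is removed before ranking.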
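The two filters above target one specific token. As a sketch of a more general condition (not part of the original answer), a predicate could flag any ngram that contains a token made up only of punctuation. The name has_punct_token and the ASCII-only notion of punctuation are assumptions made for illustration, and a plain list stands in for the finder's candidates so the snippet runs without NLTK or the corpus:

```python
import string

# Characters treated as punctuation (ASCII only; an assumption of this sketch).
PUNCT = set(string.punctuation)

def has_punct_token(*words):
    """Return True if any word in the ngram consists solely of punctuation."""
    return any(w and all(ch in PUNCT for ch in w) for w in words)

# Stand-in for the candidate trigrams a finder would hold.
candidates = [
    ("olive", "leaf", "plucked"),
    ("Salt", "Sea", ")."),
    ("Sea", ").", "Twelve"),
]

# Keep only the trigrams for which the condition is False.
kept = [ng for ng in candidates if not has_punct_token(*ng)]
print(kept)
```

Because apply_ngram_filter calls the function with the words of each ngram as separate arguments, a predicate like this could presumably be passed directly, as in finder3.apply_ngram_filter(has_punct_token).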

The source confirms that ngrams for which the function returns True are removed, see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L83

def apply_ngram_filter(self, fn):
    """Removes candidate ngrams (w1, w2, ...) where fn(w1, w2, ...)
    evaluates to True.
    """
    self._apply_filter(lambda ng, f: fn(*ng))

and this, https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L68

def _apply_filter(self, fn=lambda ngram, freq: False):
    """Generic filter removes ngrams from the frequency distribution
    if the function returns True when passed an ngram tuple.
    """
    tmp_ngram = FreqDist()
    for ngram, freq in iteritems(self.ngram_fd):
        if not fn(ngram, freq):
            tmp_ngram[ngram] = freq
    self.ngram_fd = tmp_ngram

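The lambda looks a bit complicated, but it really just does something like this (it does not do exactly the following, but you can think of it this way):

def apply_filter(trigrams, condition):
    # Keep only the trigrams for which the condition is False.
    return [ng for ng in trigrams if not condition(*ng)]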
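To make the simplified picture concrete, here is a small stand-in that also keeps the frequencies, closer to what _apply_filter does with the frequency distribution. It is only a sketch: filter_ngrams is a hypothetical helper, a collections.Counter replaces NLTK's FreqDist, and the counts are made up for illustration.

```python
from collections import Counter

def filter_ngrams(ngram_counts, fn):
    """Keep the ngrams for which fn(*ngram) is False, together with their counts."""
    kept = Counter()
    for ngram, freq in ngram_counts.items():
        if not fn(*ngram):
            kept[ngram] = freq
    return kept

# Made-up counts standing in for a finder's trigram frequency distribution.
counts = Counter({
    ("Sea", ").", "Twelve"): 1,
    ("Salt", "Sea", ")."): 1,
    ("olive", "leaf", "plucked"): 1,
})

# Same condition as the first filter above: drop trigrams whose middle token is ').'.
result = filter_ngrams(counts, lambda w1, w2, w3: w2 == ").")
print(sorted(result))
```

Only ("Sea", ").", "Twelve") is dropped here. ("Salt", "Sea", ").") survives because its ). is in third position, which mirrors what happened with the first filter on the real corpus.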

So, back to your question. Let's say our blacklist is:

blacklist = ["olive leaf plucked", "during mating searson"]

First you have to turn the entries into tuples (if they are not tuples already):

>>> blacklist = ["olive leaf plucked", "during mating searson"]
>>> blacklist = [tuple(b.split()) for b in blacklist]
>>> blacklist
[('olive', 'leaf', 'plucked'), ('during', 'mating', 'searson')]

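Then apply the filter. (The output below comes from a freshly created finder3, which is why the ). trigrams are back.)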

blacklist = ["olive leaf plucked", "during mating searson"]
blacklist = [tuple(b.split()) for b in blacklist]
finder3.apply_ngram_filter(lambda w1, w2, w3: (w1,w2,w3) in blacklist)

print finder3.nbest(trigram_measures.pmi, 10)

[OUT]:

[(u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Sea', u').', u'Twelve'), (u'Their', u'hearts', u'failed'), (u'Valley', u').', u'Melchizedek'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El')]

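Voilà! (u'olive', u'leaf', u'plucked') is gone. Note that (u'during', u'mating', u'season') is still listed, because the blacklist entry is misspelled as "searson"; spelled correctly, it would be filtered out as well.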

Alternatively, if you do not want to turn the blacklist entries into tuples, the following produces the same output:

blacklist = ["olive leaf plucked", "during mating searson"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) in blacklist)

print finder3.nbest(trigram_measures.pmi, 10)
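Both variants above look each candidate up in a list. As a minor variation (not part of the original answer), the blacklist could be stored as a set of tuples, which keeps the lookup cheap when the blacklist is long and avoids joining the words for every candidate. The helper name is_blacklisted is hypothetical, and the phrase is spelled "season" here so that it matches the corpus trigram:

```python
# Phrases to exclude, written as space-separated strings.
blacklist_phrases = ["olive leaf plucked", "during mating season"]

# Convert each phrase to a tuple of words and store the tuples in a set.
blacklist = set(tuple(phrase.split()) for phrase in blacklist_phrases)

def is_blacklisted(*words):
    """Return True if the ngram, given as separate words, is in the blacklist."""
    return words in blacklist

print(is_blacklisted("olive", "leaf", "plucked"))
print(is_blacklisted("rider", "falls", "backward"))
```

Since the words arrive as separate arguments and are collected into a tuple, the same function works for bigrams or trigrams, and it could presumably be passed as finder3.apply_ngram_filter(is_blacklisted).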

So here is the complete script (with the blacklist spelling corrected):

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures, TrigramCollocationFinder, TrigramAssocMeasures
from nltk.corpus import genesis


# Initialize the association measures for bigrams and trigrams.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Load the corpus into a bigram finder and a trigram finder.
# Now you can search for bigrams and trigrams in the corpus.
finder2 = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
finder3 = TrigramCollocationFinder.from_words(genesis.words('english-web.txt'))


# Remove the blacklisted trigrams before ranking.
blacklist = ["olive leaf plucked", "during mating season"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) in blacklist)

print finder3.nbest(trigram_measures.pmi, 10)

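If you want the opposite, which is usually called a whitelist, just negate the condition so that everything not in the list is removed:

whitelist = ["olive leaf plucked", "during mating season"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) not in whitelist)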
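The form proposed in the question, lambda *w: w not in myngrams, follows the same idea: *w collects the words into a tuple, so the test works as long as the whitelist holds tuples rather than strings. The sketch below illustrates this without NLTK. A collections.Counter with made-up counts stands in for the finder's frequency distribution, and not_whitelisted is a hypothetical name.

```python
from collections import Counter

# The ngrams to keep, stored as tuples of words.
whitelist = {("olive", "leaf", "plucked"), ("during", "mating", "season")}

def not_whitelisted(*words):
    """Return True for every ngram that should be removed."""
    return words not in whitelist

# Made-up counts standing in for a finder's trigram frequency distribution.
trigram_counts = Counter({
    ("olive", "leaf", "plucked"): 1,
    ("rider", "falls", "backward"): 1,
    ("during", "mating", "season"): 2,
})

# Remove every trigram for which the condition is True, keeping the counts.
kept = Counter({ng: n for ng, n in trigram_counts.items() if not not_whitelisted(*ng)})
print(sorted(kept.items()))
```

After such a filter only the whitelisted ngrams remain as candidates, so a later ranking would order just those.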