How can I use finder.apply_ngram_filter in nltk.collocations to select certain ngrams, rather than delete certain ngrams? Is it something like this: finder.apply_ngram_filter(lambda *w: w not in myngrams), or is there another way to do it?
Can anyone help?
Answer 0 (score: 0)
The most common use of BigramCollocationFinder is to find the top-ranked ngrams, e.g.:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder, TrigramAssocMeasures
from nltk.corpus import genesis
# Initialize the association measures for bigrams and trigrams.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()
# Put the corpus into collocation finders.
# Now you can search for bigrams and trigrams in the corpus.
finder2 = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
finder3 = TrigramCollocationFinder.from_words(genesis.words('english-web.txt'))
# Let's rank the ngrams by their PMI score and output the top 10.
print(finder2.nbest(bigram_measures.pmi, 10))
print('')
print(finder3.nbest(trigram_measures.pmi, 10))
[OUT]:
[(u'Allon', u'Bacuth'), (u'Ashteroth', u'Karnaim'), (u'Ben', u'Ammi'), (u'En', u'Mishpat'), (u'Jegar', u'Sahadutha'), (u'Salt', u'Sea'), (u'Whoever', u'sheds'), (u'appoint', u'overseers'), (u'aromatic', u'resin'), (u'cutting', u'instrument')]
[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Sea', u').', u'Twelve'), (u'Their', u'hearts', u'failed'), (u'Valley', u').', u'Melchizedek'), (u'doing', u'forced', u'labor')]
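Now that we have seen how the finder works, we want finer control to clean up the results. Let's try to get rid of annoying trigrams such as (u'Sea', u').', u'Twelve') and (u'Valley', u').', u'Melchizedek'). A trigram with ). in the middle rarely yields a linguistically interesting ngram, so let's remove those before ranking: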
finder3.apply_ngram_filter(lambda w1, w2, w3: w2 == u').')
print(finder3.nbest(trigram_measures.pmi, 10))
[OUT]:
[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Their', u'hearts', u'failed'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El'), (u'own', u'droves', u'apart')]
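It seems we have cleared out the trigrams we did not want, but that annoying u').' still shows up in the third position, so let's get rid of it once and for all:
finder3.apply_ngram_filter(lambda w1, w2, w3: u').' in [w1,w2,w3])
print(finder3.nbest(trigram_measures.pmi, 10))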
[OUT]:
[(u'olive', u'leaf', u'plucked'), (u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Their', u'hearts', u'failed'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El'), (u'own', u'droves', u'apart'), (u'sandal', u'strap', u'nor')]
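Yes, now the annoying ngrams are gone. It seems we only need to give a condition in the lambda function, and every ngram that satisfies it is removed before ranking.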
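The same idea generalizes beyond the single token `).`. As a minimal sketch, written for Python 3, a predicate can flag any ngram in which some token consists entirely of punctuation characters. The helper name `contains_punct_token` is hypothetical and not part of NLTK; it relies only on the standard library's `string.punctuation`:

```python
import string

# ASCII punctuation characters from the standard library.
PUNCT = set(string.punctuation)

def contains_punct_token(*words):
    """Hypothetical helper (not part of NLTK): True when any token in the
    ngram is non-empty and made up solely of punctuation characters."""
    return any(bool(w) and all(ch in PUNCT for ch in w) for w in words)

# Quick check on plain tuples, independent of any corpus.
print(contains_punct_token('Sea', ').', 'Twelve'))       # True
print(contains_punct_token('olive', 'leaf', 'plucked'))  # False
```

Because the function takes the tokens as positional arguments, it has the same shape as the lambdas above and could be passed in their place, for example `finder3.apply_ngram_filter(contains_punct_token)`.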
The NLTK source confirms that the filter removes every ngram for which the function returns True; see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L83:
def apply_ngram_filter(self, fn):
    """Removes candidate ngrams (w1, w2, ...) where fn(w1, w2, ...)
    evaluates to True.
    """
    self._apply_filter(lambda ng, f: fn(*ng))
And this, https://github.com/nltk/nltk/blob/develop/nltk/collocations.py#L68:
def _apply_filter(self, fn=lambda ngram, freq: False):
    """Generic filter removes ngrams from the frequency distribution
    if the function returns True when passed an ngram tuple.
    """
    tmp_ngram = FreqDist()
    for ngram, freq in iteritems(self.ngram_fd):
        if not fn(ngram, freq):
            tmp_ngram[ngram] = freq
    self.ngram_fd = tmp_ngram
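The lambda looks a bit complicated, but it is really just doing something like the following (this is not exactly what it does, but you can think of it this way):
def apply_filter(trigrams, condition):
    return [ng for ng in trigrams if not condition(*ng)]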
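To make the semantics concrete, here is a small self-contained sketch that mimics this behaviour on a plain `collections.Counter` instead of an NLTK finder. The function name `filter_ngram_counts` and the toy counts are made up for illustration; the only thing taken from the NLTK source above is the rule that an ngram is dropped when the function returns True.

```python
from collections import Counter

def filter_ngram_counts(ngram_counts, fn):
    """Return a new Counter without the ngrams for which fn(*ngram) is True."""
    kept = Counter()
    for ngram, freq in ngram_counts.items():
        if not fn(*ngram):
            kept[ngram] = freq
    return kept

# Toy counts, invented for this example.
counts = Counter({
    ('olive', 'leaf', 'plucked'): 1,
    ('Sea', ').', 'Twelve'): 1,
    ('Salt', 'Sea', ').'): 2,
})

# Same condition as in the answer: drop trigrams containing the token ").".
cleaned = filter_ngram_counts(counts, lambda w1, w2, w3: ').' in (w1, w2, w3))
print(sorted(cleaned))  # [('olive', 'leaf', 'plucked')]
```

The point to take away is the direction of the test: returning True means "remove", so a condition that describes what to keep has to be negated.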
So, back to your question. Let's say our blacklist is:
blacklist = ["olive leaf plucked", "during mating searson"]
First you have to turn the entries into tuples (if they are not tuples already):
>>> blacklist = ["olive leaf plucked", "during mating searson"]
>>> blacklist = [tuple(b.split()) for b in blacklist]
>>> blacklist
[('olive', 'leaf', 'plucked'), ('during', 'mating', 'searson')]
With that in place:
blacklist = ["olive leaf plucked", "during mating searson"]
blacklist = [tuple(b.split()) for b in blacklist]
finder3.apply_ngram_filter(lambda w1, w2, w3: (w1,w2,w3) in blacklist)
print(finder3.nbest(trigram_measures.pmi, 10))
[OUT]:
[(u'rider', u'falls', u'backward'), (u'sewed', u'fig', u'leaves'), (u'yield', u'royal', u'dainties'), (u'during', u'mating', u'season'), (u'Salt', u'Sea', u').'), (u'Sea', u').', u'Twelve'), (u'Their', u'hearts', u'failed'), (u'Valley', u').', u'Melchizedek'), (u'doing', u'forced', u'labor'), (u'El', u'Beth', u'El')]
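Voila! The blacklisted (u'olive', u'leaf', u'plucked') is gone. Note that (u'during', u'mating', u'season') is still listed, because the blacklist entry is spelled "searson" and therefore never matches.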
Alternatively, if you do not want to turn the ngrams into tuples, you can use the following, which produces the same output:
blacklist = ["olive leaf plucked", "during mating searson"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) in blacklist)
print(finder3.nbest(trigram_measures.pmi, 10))
So here is the full script:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures, TrigramCollocationFinder, TrigramAssocMeasures
from nltk.corpus import genesis
# Initialize an association measure for bigrams.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()
# Puts the corpus into a BigramCollocationFinder class.
# Now you can search for bigrams in the corpus
finder2 = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
finder3 = TrigramCollocationFinder.from_words(genesis.words('english-web.txt'))
blacklist = ["olive leaf plucked", "during mating searson"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) in blacklist)
print(finder3.nbest(trigram_measures.pmi, 10))
If you want the opposite, which is usually called a whitelist, just do this:
whitelist = ["olive leaf plucked", "during mating season"]
finder3.apply_ngram_filter(lambda w1, w2, w3: " ".join([w1,w2,w3]) not in whitelist)
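The form asked about in the question, `lambda *w: w not in myngrams`, also works, provided `myngrams` holds tuples: with `*w` the tokens arrive packed as a tuple. A minimal sketch, written for Python 3 and assuming a hypothetical helper `make_whitelist_filter` (not an NLTK function) that accepts space-separated strings or tuples and uses a set for fast lookup:

```python
def make_whitelist_filter(whitelist):
    """Build a predicate that is True for ngrams NOT in the whitelist.

    Hypothetical helper for illustration. Entries may be space-separated
    strings or tuples of tokens; both are normalised to tuples.
    """
    allowed = set(
        tuple(entry.split()) if isinstance(entry, str) else tuple(entry)
        for entry in whitelist
    )
    # *words packs the tokens into a tuple, so it can be looked up directly.
    return lambda *words: words not in allowed

not_whitelisted = make_whitelist_filter(["olive leaf plucked", ("Salt", "Sea")])
print(not_whitelisted('olive', 'leaf', 'plucked'))    # False: kept by the filter
print(not_whitelisted('rider', 'falls', 'backward'))  # True: removed by the filter
```

Passing the result as `finder3.apply_ngram_filter(not_whitelisted)` would then leave only whitelisted trigrams among the candidates. Only ngrams that actually occur in the corpus can survive; the whitelist does not add new ones.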