Python NLTK: filtering ngrams

Time: 2014-06-22 16:41:07

Tags: python nltk

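I have the lists below and I would like to know how to filter the ngrams in them. I have already tried many approaches without success.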

>>> bigrams_list = [('Hi', 'ya'), ('See', 'you'), ('My', 'name'), ...]
>>> trigrams_list = [('It', 'is', 'fine'), ('See', 'you', 'tomorrow'), ('My', 'surname', 'is')]
>>> fourgrams_list = [('It', 'is', 'fine', 'thanks'), ('Bla', 'bla', 'bla', 'bla'), ('Attention', 'to', 'the', 'words'), ...]

So from the new trigrams list I can exclude "('See', 'you', 'tomorrow')", from the new fourgrams list I can exclude "('It', 'is', 'fine', 'thanks')", and so on. Any suggestions?

1 answer:

Answer 0 (score: 3)

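Just slice each trigram with [:-1] and check whether that prefix appears in the bigrams (and likewise check each fourgram against the trigrams):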

>>> bigrams = [('hi', 'ya'), ('see', 'you'), ('my', 'name')]
>>> trigrams = [('it', 'is', 'fine'), ('see', 'you', 'tomorrow'), ('my', 'name', 'is')]
>>> fourgrams = [('it', 'is', 'fine', 'thanks'), ('blah', 'blah', 'blah', 'blah'), ('what', 'sort', 'of', 'question'), ('is', 'this', 'any', 'ways'), ('please', 'read', 'SO', 'FAQ'), ('before', 'posting', 'questions', 'here')]
>>> filtered_trigrams = [i for i in trigrams if i[:-1] not in bigrams]
>>> filtered_trigrams
[('it', 'is', 'fine')]
>>> filtered_fourgrams = [i for i in fourgrams if i[:-1] not in trigrams]
>>> filtered_fourgrams
[('blah', 'blah', 'blah', 'blah'), ('what', 'sort', 'of', 'question'), ('is', 'this', 'any', 'ways'), ('please', 'read', 'SO', 'FAQ'), ('before', 'posting', 'questions', 'here')]
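The same check can be wrapped in a small helper so it works for any n. This is only a sketch: the name filter_by_prefix is made up for illustration and is not part of NLTK, and the fourgrams list is shortened to two entries. Converting the shorter list to a set is an optional change that makes each membership test cheaper on large lists.

```python
def filter_by_prefix(ngrams, shorter_ngrams):
    """Keep only the n-grams whose first n-1 tokens are NOT in shorter_ngrams.

    Hypothetical helper for illustration; not an NLTK function.
    """
    known = set(shorter_ngrams)  # set lookup instead of scanning a list
    return [gram for gram in ngrams if gram[:-1] not in known]


bigrams = [('hi', 'ya'), ('see', 'you'), ('my', 'name')]
trigrams = [('it', 'is', 'fine'), ('see', 'you', 'tomorrow'), ('my', 'name', 'is')]
fourgrams = [('it', 'is', 'fine', 'thanks'), ('blah', 'blah', 'blah', 'blah')]

# Trigrams are checked against the bigrams, fourgrams against the
# original (unfiltered) trigrams, exactly as in the session above.
filtered_trigrams = filter_by_prefix(trigrams, bigrams)
filtered_fourgrams = filter_by_prefix(fourgrams, trigrams)

print(filtered_trigrams)   # [('it', 'is', 'fine')]
print(filtered_fourgrams)  # [('blah', 'blah', 'blah', 'blah')]
```

Tuples are hashable, so they can go into a set directly. The order of the surviving n-grams is unchanged.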

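Unless the bigrams and trigrams come from different corpora, though, this kind of filtering is not useful. Every trigram extracted from a text begins with a bigram from that same text, and the same holds for any N-grams and their (N-1)-grams, so the filter removes everything and the result is always empty: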

>>> from nltk import word_tokenize
>>> from nltk.util import ngrams
>>> text = """hi ya. see you tomorrow. it is fine, thank you. my name is blah blah blah. attention to the words..."""
>>> list(ngrams(word_tokenize(text), 2))
[('hi', 'ya.'), ('ya.', 'see'), ('see', 'you'), ('you', 'tomorrow.'), ('tomorrow.', 'it'), ('it', 'is'), ('is', 'fine'), ('fine', ','), (',', 'thank'), ('thank', 'you.'), ('you.', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'blah'), ('blah', 'blah'), ('blah', 'blah.'), ('blah.', 'attention'), ('attention', 'to'), ('to', 'the'), ('the', 'words'), ('words', '...')]
>>> bigrams = list(ngrams(word_tokenize(text), 2))
>>> trigrams = list(ngrams(word_tokenize(text), 3))
>>> fourgrams = list(ngrams(word_tokenize(text), 4))
>>> [i for i in trigrams if i[:-1] not in bigrams]
[]
>>> [i for i in fourgrams if i[:-1] not in trigrams]
[]
>>> [i for i in fourgrams if i[:-2] not in bigrams]
[]
>>> len([i for i in trigrams if i[:-1] in bigrams]) == len(trigrams)
True