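Extracting nouns from a corpus with NLTK

Time: 2018-03-31 17:20:57

Tags: nlp nltk

Can anyone tell me how to retrieve the nouns with this code? If possible, please correct the code. Thanks for your help :)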

import nltk
from nltk.corpus import state_union
from textblob import TextBlob
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import PunktSentenceTokenizer

sample_text=state_union.raw("2006-GWBush.txt")
train_text= state_union.raw("2005-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words=nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            if(pos =='NN' or pos == 'NNP' or pos =='NNS' or pos=='NNPS'):
                print(tagged)
    except Exception as e:
        print(str(e))

process_content()

Note: the original code comes from https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/

1 answer:

Answer 0 (score: 0)

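For each sentence, tagged = nltk.pos_tag(words) gives you a list of (word, tag) pairs; let's call the tag "pos". In your code the name pos is never defined, so the if test raises a NameError, which the except block catches and prints. For example, for the first sentence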

  "PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION January 31, 2006 THE PRESIDENT: Thank you all."

you get:

[(u'PRESIDENT', 'NNP'), (u'GEORGE', 'NNP'), (u'W.', 'NNP'), (u'BUSH','NNP'), 
(u"'S", 'POS'), (u'ADDRESS', 'NNP'), (u'BEFORE', 'IN'), (u'A', 'NNP'), (u'JOINT', 'NNP'), 
(u'SESSION', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'CONGRESS', 'NNP'), (u'ON', 'NNP'), 
(u'THE', 'NNP'), (u'STATE', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'UNION', 'NNP'), 
(u'January', 'NNP'), (u'31', 'CD'), (u',', ','), (u'2006', 'CD'), (u'THE', 'NNP'), 
(u'PRESIDENT', 'NNP'), (u':', ':'), (u'Thank', 'NNP'), (u'you', 'PRP'), (u'all', 'DT'),
 (u'.', '.')]    

If you want to retrieve all words whose tag satisfies pos =='NN' or pos == 'NNP' or pos =='NNS' or pos=='NNPS', you can do

nouns = [word for (word, pos) in tagged if pos in ['NN','NNP','NNS','NNPS']]

Then you get a list of nouns for each sentence:

[u'PRESIDENT', u'GEORGE', u'W.', u'BUSH', u'ADDRESS', u'A', u'JOINT', u'SESSION', u'THE', u'CONGRESS', u'ON', u'THE', u'STATE', u'THE', u'UNION', u'January', u'THE', u'PRESIDENT', u'Thank']
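
Putting the pieces together, here is a minimal sketch of how the loop from the question could be restructured. It assumes Python 3, so the strings print without the u prefix seen above. The names NOUN_TAGS, extract_nouns and nouns_per_sentence are illustrative helpers, not part of NLTK, and the sample pairs are copied from the end of the tagged output above.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_nouns(tagged):
    """Return the words from a list of (word, tag) pairs whose tag is a noun tag."""
    return [word for word, pos in tagged if pos in NOUN_TAGS]


def nouns_per_sentence(sentences):
    """Yield one list of nouns per sentence (hypothetical helper)."""
    # Imported lazily so extract_nouns can be used without NLTK installed.
    # Calling this function requires NLTK's tokenizer and tagger data.
    import nltk

    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        yield extract_nouns(nltk.pos_tag(words))


# The tail of the tagged output shown above, used as a self-contained check.
sample_tagged = [
    ("THE", "NNP"), ("PRESIDENT", "NNP"), (":", ":"),
    ("Thank", "NNP"), ("you", "PRP"), ("all", "DT"), (".", "."),
]
print(extract_nouns(sample_tagged))  # ['THE', 'PRESIDENT', 'Thank']
```

With the variables from the question, something like `for nouns in nouns_per_sentence(tokenized): print(nouns)` should print one list per sentence. The filter only trusts the tagger: words such as 'A', 'THE' and 'ON' appear in the list above because they were tagged NNP in the upper-case heading.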