Can anyone tell me how to retrieve the nouns in this code? If possible, please correct the code. Thanks for your help :)
import nltk
from nltk.corpus import state_union
from textblob import TextBlob
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import PunktSentenceTokenizer

sample_text = state_union.raw("2006-GWBush.txt")
train_text = state_union.raw("2005-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            if(pos =='NN' or pos == 'NNP' or pos =='NNS' or pos=='NNPS'):
                print(tagged)
    except Exception as e:
        print(str(e))

process_content()
Note: the original code comes from https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
Answer 0 (score: 0)
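For each sentence, tagged = nltk.pos_tag(words) gives you a list of (word, tag) pairs (let's call the tag "pos"). For example, for the first sentence, "PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION January 31, 2006 THE PRESIDENT: Thank you all.", you get: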
[(u'PRESIDENT', 'NNP'), (u'GEORGE', 'NNP'), (u'W.', 'NNP'), (u'BUSH','NNP'),
(u"'S", 'POS'), (u'ADDRESS', 'NNP'), (u'BEFORE', 'IN'), (u'A', 'NNP'), (u'JOINT', 'NNP'),
(u'SESSION', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'CONGRESS', 'NNP'), (u'ON', 'NNP'),
(u'THE', 'NNP'), (u'STATE', 'NNP'), (u'OF', 'IN'), (u'THE', 'NNP'), (u'UNION', 'NNP'),
(u'January', 'NNP'), (u'31', 'CD'), (u',', ','), (u'2006', 'CD'), (u'THE', 'NNP'),
(u'PRESIDENT', 'NNP'), (u':', ':'), (u'Thank', 'NNP'), (u'you', 'PRP'), (u'all', 'DT'),
(u'.', '.')]
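Note that the name pos is never assigned anywhere in the question's code, so the if line raises a NameError on the first sentence and the except block prints that message instead of any nouns. The tag has to be unpacked from each (word, tag) pair first. A minimal sketch of that unpacking, using a few pairs copied by hand from the output above rather than a live nltk.pos_tag call (the names tagged_sample, NOUN_TAGS and is_noun_tag are only illustrative):

```python
# The four Penn Treebank noun tags the question is testing for.
NOUN_TAGS = ("NN", "NNS", "NNP", "NNPS")


def is_noun_tag(pos):
    """Return True if the tag is one of the four noun tags."""
    return pos in NOUN_TAGS


# A few (word, tag) pairs copied from the tagged output above;
# in real use this list would come from nltk.pos_tag(words).
tagged_sample = [
    ("PRESIDENT", "NNP"),
    ("BEFORE", "IN"),
    ("31", "CD"),
    ("you", "PRP"),
]

# Unpacking each pair is what makes `pos` a defined name inside the loop.
for word, pos in tagged_sample:
    print(word, pos, is_noun_tag(pos))
```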
If you want to retrieve all the words with pos =='NN' or pos == 'NNP' or pos =='NNS' or pos=='NNPS', you can do
nouns = [word for (word, pos) in tagged if pos in ['NN','NNP','NNS','NNPS']]
Then you get a list of nouns for each sentence:
[u'PRESIDENT', u'GEORGE', u'W.', u'BUSH', u'ADDRESS', u'A', u'JOINT', u'SESSION', u'THE', u'CONGRESS', u'ON', u'THE', u'STATE', u'THE', u'UNION', u'January', u'THE', u'PRESIDENT', u'Thank']
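Putting this together, one possible shape for a corrected process_content is sketched below. It is an assumption about how you might restructure the function, not the tutorial's own code: the tokenizer and tagger are passed in as arguments (tokenize and tag are hypothetical parameter names) so the sketch runs without the NLTK corpora, and toy_tokenize, toy_tag and TOY_TAGS are stand-ins invented purely for the demonstration.

```python
# The four Penn Treebank noun tags used in the answer above.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_nouns(tagged):
    """Keep the words whose tag is one of the four noun tags."""
    return [word for word, pos in tagged if pos in NOUN_TAGS]


def process_content(sentences, tokenize, tag):
    """Return one list of nouns per sentence.

    `tokenize` and `tag` are injected so this sketch runs without NLTK data;
    with NLTK they would be nltk.word_tokenize and nltk.pos_tag.
    """
    return [extract_nouns(tag(tokenize(sentence))) for sentence in sentences]


# Stand-in tokenizer and tagger (hypothetical, for demonstration only).
def toy_tokenize(sentence):
    return sentence.split()


TOY_TAGS = {"Thank": "NNP", "you": "PRP", "all": "DT", "January": "NNP", "31": "CD"}


def toy_tag(words):
    # Unknown words default to "NN" in this toy tagger.
    return [(word, TOY_TAGS.get(word, "NN")) for word in words]


print(process_content(["Thank you all", "January 31"], toy_tokenize, toy_tag))
# prints [['Thank'], ['January']]
```

With NLTK installed and its data downloaded, the call in the question would become process_content(tokenized, nltk.word_tokenize, nltk.pos_tag).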