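I hope everyone is doing well.
I have been following SentDex's YouTube tutorials on NLTK, with the goal of building a name recognition program. As you can see from the code below, I have managed to 'chunk' the names. What I want to do now is put all of the 'chunked' names into an array so that I can easily pick them out. Is this possible? If not, is there another way?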
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

namedEnt = ""

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()
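Answer 0 (score: 0)
To get the labels for each sentence, you can use Tree.pos(), which returns a list of ((word, POS tag), parent label) pairs. Then filter that list by the second element of each pair: 'NE' marks a token that belongs to a named entity.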
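def process_content():
    names = []
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            # Each item is ((word, pos_tag), parent_label).
            tags = namedEnt.pos()
            # Keep only the words whose parent chunk is labelled 'NE';
            # this appends one list of words per sentence.
            names.append([x[0][0] for x in tags if x[1] == 'NE'])
    except Exception as e:
        print(str(e))
    return names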
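
Note that the filter above returns individual tokens, so a multi-word name comes back as separate words. If you would rather have whole names, one possible approach is to join runs of consecutive 'NE' tokens. The sketch below is only an illustration: `group_entity_names` is a hypothetical helper (not part of NLTK), and `sample` is hand-written data in the shape that Tree.pos() returns, not real tagger output.

```python
from itertools import groupby


def group_entity_names(pos_pairs, label="NE"):
    """Join runs of consecutive tokens carrying `label` into single strings.

    `pos_pairs` is assumed to have the shape returned by Tree.pos():
    a list of ((word, pos_tag), parent_label) pairs.
    """
    names = []
    # groupby splits the list into runs that share the same parent label.
    for parent_label, run in groupby(pos_pairs, key=lambda pair: pair[1]):
        if parent_label == label:
            names.append(" ".join(word for (word, _tag), _label in run))
    return names


# Hand-written sample in the shape of Tree.pos() output (an assumption,
# not real NLTK output).
sample = [
    (("President", "NNP"), "S"),
    (("George", "NNP"), "NE"),
    (("Bush", "NNP"), "NE"),
    (("visited", "VBD"), "S"),
    (("Texas", "NNP"), "NE"),
    ((".", "."), "S"),
]

print(group_entity_names(sample))  # ['George Bush', 'Texas']
```

In process_content() you could then call names.extend(group_entity_names(tags)) to build one flat list of names. One limitation: two separate entities that sit directly next to each other would be merged into one string. Walking the 'NE' subtrees with Tree.subtrees() should avoid that, at the cost of a little more code.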