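I have checked earlier related threads, but they did not solve my problem. I wrote the following code to extract named entities (NER) from a text.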
text = "Stallone jason's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum."
tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
namedEnt = nltk.ne_chunk(tagged, binary = True)
print namedEnt
namedEnt = nltk.ne_chunk(tagged, binary = False)
This gives the following result:
(S
(NE Stallone/NNP)
jason/NN
's/POS
film/NN
(NE Rocky/NNP)
was/VBD
inducted/VBN
into/IN
the/DT
(NE National/NNP Film/NNP Registry/NNP)
as/IN
well/RB
as/IN
having/VBG
its/PRP$
film/NN
props/NNS
placed/VBN
in/IN
the/DT
(NE Smithsonian/NNP Museum/NNP)
./.)
However, I want only the NEs as the result, like this:
Stallone
Rocky
National Film Registry
Smithsonian Museum
How can I achieve this?
UPDATE
result = ' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"
print result
This gives a syntax error. What is the correct way to write it?
UPDATE2
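text = "Stallone jason's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum."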
tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
namedEnt = nltk.ne_chunk(tagged, binary = True)
print namedEnt
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"]
print np
Error:
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"]
File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 198, in _get_node
raise NotImplementedError("Use label() to access a node label.")
NotImplementedError: Use label() to access a node label.
So I tried:
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.label() == "NE"]
This gives an empty result.
Answer 0 (score: 3)
The returned namedEnt is actually a Tree object, which is a subclass of list. You can parse it as follows:
[' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.node == "NE"]
输出:
['Stallone', 'Rocky', 'National Film Registry', 'Smithsonian Museum']
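The traceback in the question shows that newer NLTK releases no longer accept x.node and ask for label() instead. The following is a minimal sketch of the same extraction written against label(). The helper name extract_entities and the hand-built sample tree are assumptions made for illustration; the sample only mimics part of the ne_chunk(tagged, binary=True) output shown in the question, so that the snippet runs without downloading any NLTK models.

```python
from nltk.tree import Tree


def extract_entities(tree, label="NE"):
    """Return the text of every subtree whose label equals `label`."""
    return [
        " ".join(token for token, _tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == label
    ]


# Hand-built stand-in for part of the chunked tree printed in the question.
# Leaves are (token, POS tag) pairs, as produced by nltk.pos_tag.
sample = Tree("S", [
    Tree("NE", [("Stallone", "NNP")]),
    ("jason", "NN"),
    ("'s", "POS"),
    ("film", "NN"),
    Tree("NE", [("Rocky", "NNP")]),
    ("was", "VBD"),
    ("inducted", "VBN"),
    ("into", "IN"),
    ("the", "DT"),
    Tree("NE", [("National", "NNP"), ("Film", "NNP"), ("Registry", "NNP")]),
])

print(extract_entities(sample))
# ['Stallone', 'Rocky', 'National Film Registry']
```

With a real chunker result, passing namedEnt instead of sample should work the same way, provided the tree was built with binary=True so that entity subtrees carry the label "NE".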
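Setting the binary flag to True only indicates whether a subtree is an NE or not, which is all we need above. When it is set to False, the chunker gives more information, such as whether the NE is an organization, a person, and so on. For some reason, the results with the flag on and off do not seem to agree with each other.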
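When binary is False, the entity subtrees carry type labels rather than a single "NE" label, so a filter on "NE" alone would match nothing. Below is a hedged sketch that collects (label, text) pairs from every subtree except the root. The helper name labelled_entities is hypothetical, and the sample tree is hand-built: its PERSON and ORGANIZATION labels are illustrative and are not claimed to be what ne_chunk actually returns for the sentence in the question.

```python
from nltk.tree import Tree


def labelled_entities(tree):
    """Return (label, text) for every subtree other than the root."""
    return [
        (subtree.label(), " ".join(token for token, _tag in subtree.leaves()))
        for subtree in tree.subtrees()
        if subtree is not tree  # subtrees() also yields the root itself
    ]


# Hypothetical tree in the shape that ne_chunk(tagged, binary=False) produces.
sample = Tree("S", [
    Tree("PERSON", [("Stallone", "NNP")]),
    ("was", "VBD"),
    ("inducted", "VBN"),
    ("into", "IN"),
    ("the", "DT"),
    Tree("ORGANIZATION", [("National", "NNP"), ("Film", "NNP"), ("Registry", "NNP")]),
])

print(labelled_entities(sample))
# [('PERSON', 'Stallone'), ('ORGANIZATION', 'National Film Registry')]
```

Skipping the root by identity rather than by the label "S" keeps the helper independent of whatever label the chunker assigns to the top node.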