Typically the head of a noun phrase is the rightmost noun of the NP. In the tree shown below, tree is the head of the parent NP:
            ROOT
             |
             S
          ___|________________________
         NP                           |
      ___|_____________               |
     |                 PP             VP
     |            _____|____      ____|___
     NP          |          NP   |       PRT
  ___|_______    |          |    |        |
 DT  JJ  NN  NN  IN        NNP  VBD       RP
 |   |   |   |   |          |    |        |
The old oak tree from     India fell     down
Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])
The following code, based on a Java implementation, uses a simplistic rule to find the head of the NP, but I need it to be based on the head-finding rules:
from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t is a leaf (a plain string), so there is nothing to descend into
        return
    else:
        if t.label() == 'NP':
            print('NP:' + str(t.leaves()))
            # naive rule: the last leaf is taken as the head, whatever its tag
            print('NPhead:' + str(t.leaves()[-1]))
            for child in t:
                traverse(child)
        else:
            for child in t:
                traverse(child)

tree = Tree.fromstring(parsestr)
traverse(tree)
The above code gives the output:
NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India
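Although it gives the expected head for the inner NPs of this sentence, I need to add a condition so that only the rightmost noun is extracted as the head. Currently the code does not check whether the last leaf is a noun (NN) at all:
print('NPhead:' + str(t.leaves()[-1]))
So I am after something like the following condition for the NP head in the code above: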
t.leaves().getrightmostnoun()
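Since `leaves()` returns bare words without their tags, a check like this has to go through `Tree.pos()`, which yields (word, tag) pairs. A minimal sketch, assuming that the four Penn Treebank noun tags are what counts as a noun; `rightmost_noun` is a hypothetical helper name, not an NLTK method:

```python
from nltk.tree import Tree

# Assumption: only these four Penn Treebank tags count as nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def rightmost_noun(t):
    """Return the last noun-tagged word under t, or None if there is none."""
    # Tree.pos() gives (word, tag) pairs in sentence order.
    for word, tag in reversed(t.pos()):
        if tag in NOUN_TAGS:
            return word
    return None

np = Tree.fromstring(
    "(NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India))))"
)
print(rightmost_noun(np))     # whole NP, including the PP
print(rightmost_noun(np[0]))  # inner NP only
```

On the whole NP this scans every leaf, including those inside the PP, so it picks India rather than tree. Only the inner NP gives tree.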
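Michael Collins' dissertation (Appendix A) includes head-finding rules for the Penn Treebank, and according to those rules the head is not necessarily the rightmost noun. The condition above should therefore cover such cases as well.
Take the following example, given in one of the answers:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.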
Answer 0 (score: 8)
NLTK has a built-in way to read a bracketed string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.
>>> from nltk.tree import Tree
>>> parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
...
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
...
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']
Note that the rightmost noun is not always the head noun of an NP, e.g.
>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
...
Magnificent
talk
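Arguably, Magnificent can still be the head noun. Another example is when the NP contains a relative clause:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.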
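One way to handle the relative-clause case is to pick a head child among the NP's immediate children instead of looking at its leaves. The sketch below paraphrases the NP rule from Appendix A of Collins' dissertation. The tag sets and search directions in `NP_RULES` should be treated as an assumption and checked against the thesis, and `np_head_child` and `np_head_word` are hypothetical names. Head children that are neither part-of-speech nodes nor NPs would need the rules for other categories, which are not covered here, so they fall back to their last leaf.

```python
from nltk.tree import Tree

# Priority list for NP heads, paraphrased from Collins' Appendix A.
# Assumption: verify the tag sets and directions against the thesis.
NP_RULES = [
    ("right", {"NN", "NNP", "NNPS", "NNS", "NX", "POS", "JJR"}),
    ("left", {"NP"}),
    ("right", {"$", "ADJP", "PRN"}),
    ("right", {"CD"}),
    ("right", {"JJ", "JJS", "RB", "QP"}),
]

def np_head_child(np):
    """Pick the head child of an NP from its immediate children only."""
    children = [c for c in np if isinstance(c, Tree)]
    if not children:
        return None
    # A final possessive marker is taken as the head.
    if children[-1].label() == "POS":
        return children[-1]
    for direction, labels in NP_RULES:
        ordered = children if direction == "left" else reversed(children)
        for child in ordered:
            if child.label() in labels:
                return child
    return children[-1]  # default: the last child

def np_head_word(np):
    """Follow head children down to a single word."""
    child = np_head_child(np)
    if child is None:
        return np.leaves()[-1]
    if child.height() == 2:      # part-of-speech node such as (NN tree)
        return child[0]
    if child.label() == "NP":    # nested NP: apply the same rule again
        return np_head_word(child)
    return child.leaves()[-1]    # simplification: other categories not handled

rel = Tree.fromstring(
    "(NP (NP (DT the) (NN person)) (SBAR (WHNP (WDT that)) "
    "(S (VP (VBD gave) (NP (DT the) (NN talk))))))"
)
oak = Tree.fromstring(
    "(NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India))))"
)
print(np_head_word(rel))
print(np_head_word(oak))
```

Because the search only looks at the NP's direct children, the nouns buried inside the SBAR and the PP never compete with person and tree.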
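Answer 1 (score: 1)
I was looking for a Python script using NLTK that does this task and stumbled across this post. Here is the solution I came up with. It is a little noisy and arbitrary, and it definitely does not always pick the right answer (e.g. for compound nouns). But I wanted to post it in case it helps others to have a solution that mostly works.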
#!/usr/bin/env python
from nltk.tree import Tree

examples = [
    '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))',
    "(ROOT\n  (S\n    (NP\n      (NP (DT the) (NN person))\n      (SBAR\n        (WHNP (WDT that))\n        (S\n          (VP (VBD gave)\n            (NP (DT the) (NN talk))))))\n    (VP (VBD went)\n      (NP (NN home)))))",
    '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
]

def find_noun_phrases(tree):
    return [subtree for subtree in tree.subtrees(lambda t: t.label() == 'NP')]

def find_head_of_np(np):
    noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
    # immediate children that are subtrees (skips bare string leaves)
    top_level_trees = [child for child in np if isinstance(child, Tree)]
    ## search for a top-level noun
    top_level_nouns = [t for t in top_level_trees if t.label() in noun_tags]
    if len(top_level_nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return top_level_nouns[-1][0]
    ## search for a top-level np
    top_level_nps = [t for t in top_level_trees if t.label() == 'NP']
    if len(top_level_nps) > 0:
        ## if you find some, pick the head of the rightmost one, just 'cause
        return find_head_of_np(top_level_nps[-1])
    ## search for any noun
    nouns = [p[0] for p in np.pos() if p[1] in noun_tags]
    if len(nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return nouns[-1]
    ## return the rightmost word, just 'cause
    return np.leaves()[-1]

for example in examples:
    tree = Tree.fromstring(example)
    for np in find_noun_phrases(tree):
        print("noun phrase: " + " ".join(np.leaves()))
        head = find_head_of_np(np)
        print("head: " + head)
For the examples discussed in the question and in the other answer, this is the output:
noun phrase: The old oak tree from India
head: tree
noun phrase: The old oak tree
head: tree
noun phrase: India
head: India
noun phrase: the person that gave the talk
head: person
noun phrase: the person
head: person
noun phrase: the talk
head: talk
noun phrase: home
head: home
noun phrase: Carnac the Magnificent
head: Magnificent
noun phrase: a talk
head: talk