Finding the head of a noun phrase in NLTK and the Stanford parser according to the rules for finding the head of an NP

Asked: 2015-09-18 14:38:38

Tags: python algorithm tree nltk stanford-nlp

Generally, the head of a noun phrase is the rightmost noun of the NP. In the tree shown below, tree is the head of the parent NP:

            ROOT                             
             |                                
             S                               
          ___|________________________        
         NP                           |      
      ___|_____________               |       
     |                 PP             VP     
     |             ____|____      ____|___    
     NP           |         NP   |       PRT 
  ___|_______     |         |    |        |   
 DT  JJ  NN  NN   IN       NNP  VBD       RP 
 |   |   |   |    |         |    |        |   
The old oak tree from     India fell     down

Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])

The following code, based on a Java implementation, uses a simplistic rule to find the head of the NP, but I need it to be based on the head-finding rules:

from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    try:
        t.label()
    except AttributeError:
        # leaves are plain strings and have no label()
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # simplistic rule: take the last leaf as the head
        print('NPhead:' + str(t.leaves()[-1]))
    # recurse into the children whether or not this node is an NP
    for child in t:
        traverse(child)


tree = Tree.fromstring(parsestr)
traverse(tree)

The code above gives this output:

NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India

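Although it gives the correct output for this sentence, I need to add a condition so that only the rightmost noun is extracted as the head. Currently the code does not check whether the last leaf is a noun (NN) at all:

print('NPhead:' + str(t.leaves()[-1]))

So I am looking for something like the following for the NP head condition in the code above: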

t.leaves().getrightmostnoun() 
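
The hypothetical getrightmostnoun() call above could be sketched roughly as follows, assuming that "noun" means any of the Penn Treebank tags NN, NNS, NNP and NNPS. The names rightmost_noun and NOUN_TAGS are illustrative and not part of NLTK; Tree.pos() and Tree.subtrees() are the NLTK calls it relies on.

```python
from nltk.tree import Tree

# Assumption: these four Penn Treebank tags are what counts as a noun.
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def rightmost_noun(np):
    """Return the rightmost noun-tagged word under np, or None if there is none."""
    # Tree.pos() yields (word, tag) pairs for every leaf under the node.
    nouns = [word for word, tag in np.pos() if tag in NOUN_TAGS]
    return nouns[-1] if nouns else None

parsestr = ('(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) '
            '(PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))')
tree = Tree.fromstring(parsestr)

# Pair every NP with the rightmost noun found anywhere beneath it.
heads = [(' '.join(np.leaves()), rightmost_noun(np))
         for np in tree.subtrees(lambda t: t.label() == 'NP')]
for phrase, head in heads:
    print(phrase, '->', head)
```

Because this looks at every leaf under the NP, it still picks India for the outer NP (India is tagged NNP and sits further right than tree). Getting tree there needs rules that look at the NP's direct children rather than at all of its leaves.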

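Michael Collins's dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun. The condition above should therefore take those cases into account.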
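
As a rough illustration of what rule-based head finding for an NP could look like, here is a minimal sketch. It assumes the commonly cited form of Collins's NP rule (a final POS child first, then a right-to-left search for a noun-like child, then a left-to-right search for an NP child, and so on). The priority lists are a paraphrase and should be checked against Appendix A before being relied on. collins_np_head and head_word are hypothetical names, and only the NP rule is covered.

```python
from nltk.tree import Tree

# Assumed paraphrase of the NP rule: (search direction, candidate labels),
# tried in this order over the NP's direct children.
NP_RULES = [
    ('right', {'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}),
    ('left', {'NP'}),
    ('right', {'$', 'ADJP', 'PRN'}),
    ('right', {'CD'}),
    ('right', {'JJ', 'JJS', 'RB', 'QP'}),
]

def collins_np_head(np):
    """Return the head child of an NP, or None if it has no subtree children."""
    children = [c for c in np if isinstance(c, Tree)]
    if not children:
        return None
    # A final possessive marker is taken as the head.
    if children[-1].label() == 'POS':
        return children[-1]
    for direction, wanted in NP_RULES:
        order = children if direction == 'left' else reversed(children)
        for child in order:
            if child.label() in wanted:
                return child
    # Fallback: the last child.
    return children[-1]

def head_word(np):
    """Follow NP head children down to a single word."""
    child = collins_np_head(np)
    if child is None:
        return np.leaves()[-1]
    if child.label() == 'NP':
        return head_word(child)
    # Simplification: for anything that is not an NP, take its last leaf.
    return child.leaves()[-1]

np = Tree.fromstring('(NP (NP (DT The) (JJ old) (NN oak) (NN tree)) '
                     '(PP (IN from) (NP (NNP India))))')
print(head_word(np))  # tree
```

The outer NP has no noun among its direct children (only an NP and a PP), so the search falls through to the leftmost NP child, and its rightmost noun, tree, becomes the head.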

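Consider the following example, given in one of the answers:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.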

2 answers:

Answer 0: (score: 8)

NLTK has a built-in way to turn a string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541

>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
... 
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))


>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
... 
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']

Note that the rightmost noun is not always the head noun of an NP, e.g.

>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
... 
Magnificent
talk

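Arguably, Magnificent can still be the head noun. Another example is when the NP contains a relative clause:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.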
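
To make the relative-clause case concrete, here is a small check. It assumes a Penn-style bracketing of the sentence (the same parse that appears among the examples in the other answer). The variable names are illustrative.

```python
from nltk.tree import Tree

# Assumed parse of "the person that gave the talk went home".
s = ('(ROOT (S (NP (NP (DT the) (NN person)) '
     '(SBAR (WHNP (WDT that)) (S (VP (VBD gave) (NP (DT the) (NN talk)))))) '
     '(VP (VBD went) (NP (NN home)))))')

subject = Tree.fromstring(s)[0][0]          # ROOT -> S -> subject NP
last_leaf = subject.leaves()[-1]            # last word of the whole subject NP
inner_last_leaf = subject[0].leaves()[-1]   # last word of the inner NP "the person"

print(last_leaf)        # talk
print(inner_last_leaf)  # person
```

Taking the last leaf of the full subject NP lands inside the relative clause, while the inner NP, the first child, ends in the actual head noun.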

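Answer 1: (score: 1)

I was looking for a Python script that uses NLTK for this task and stumbled across this post. Here is the solution I came up with. It is a bit noisy and arbitrary, and it definitely does not always pick the right answer (e.g. for compound nouns). But I wanted to post it in case it helps others to have a solution that mostly works.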

#!/usr/bin/env python

from nltk.tree import Tree

examples = [
    '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))',
    "(ROOT\n  (S\n    (NP\n      (NP (DT the) (NN person))\n      (SBAR\n        (WHNP (WDT that))\n        (S\n          (VP (VBD gave)\n            (NP (DT the) (NN talk))))))\n    (VP (VBD went)\n      (NP (NN home)))))",
    '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
]

def find_noun_phrases(tree):
    return [subtree for subtree in tree.subtrees(lambda t: t.label()=='NP')]

def find_head_of_np(np):
    noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
    top_level_trees = [np[i] for i in range(len(np)) if type(np[i]) is Tree]
    ## search for a top-level noun
    top_level_nouns = [t for t in top_level_trees if t.label() in noun_tags]
    if len(top_level_nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return top_level_nouns[-1][0]
    else:
        ## search for a top-level np
        top_level_nps = [t for t in top_level_trees if t.label()=='NP']
        if len(top_level_nps) > 0:
            ## if you find some, pick the head of the rightmost one, just 'cause
            return find_head_of_np(top_level_nps[-1])
        else:
            ## search for any noun
            nouns = [p[0] for p in np.pos() if p[1] in noun_tags]
            if len(nouns) > 0:
                ## if you find some, pick the rightmost one, just 'cause
                return nouns[-1]
            else:
                ## return the rightmost word, just 'cause
                return np.leaves()[-1]

for example in examples:
    tree = Tree.fromstring(example)
    for np in find_noun_phrases(tree):
        print("noun phrase:", " ".join(np.leaves()))
        head = find_head_of_np(np)
        print("head:", head)

For the examples discussed in the question and the other answer, this is the output:

noun phrase: The old oak tree from India
head: tree
noun phrase: The old oak tree
head: tree
noun phrase: India
head: India
noun phrase: the person that gave the talk
head: person
noun phrase: the person
head: person
noun phrase: the talk
head: talk
noun phrase: home
head: home
noun phrase: Carnac the Magnificent
head: Magnificent
noun phrase: a talk
head: talk