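Getting entities from an NLTK Tree result

Date: 2014-10-06 05:41:49

Tags: python tree nltk

My original sentence is:

Tsunami earthquakes have also been linked to the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface, as is thought to be present in areas of significant topography at the top of the oceanic crust, and where propagation was in an up-dip direction, possibly reaching the seafloor.

I passed this sentence to Stanford NLP and got a nice parse tree: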

  

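(ROOT (S (NP (NN Tsunami) (NNS earthquakes)) (VP (VBP have) (ADVP (RB also)) (VP (VBN been) (VP (VBN linked) (PP (TO to) (NP (NP (DT the) (NN presence)) (PP (IN of) (NP (NP (DT a) (JJ thin) (NN layer)) (PP (IN of) (S (VP (VBN subducted) (NP (NP (JJ sedimentary) (NN rock)) (PP (IN along) (NP (NP (NP (DT the) (JJS uppermost) (NN part)) (PP (IN of) (NP (DT the) (NN plate) (NN interface)))) (, ,) (UCP (RB as) (S (VP (VBZ is) (VP (VBN thought) (S (VP (TO to) (VP (VB be) (ADJP (JJ present) (PP (IN in) (NP (NP (NNS areas)) (PP (IN of) (NP (JJ significant) (NN topography)))))) (PP (IN at) (NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))))))))) (, ,) (CC and) (SBAR (WHADVP (WRB where)) (S (NP (NN propagation)) (VP (VBD was) (PP (IN in) (NP (DT an) (JJ up-dip) (NN direction))) (, ,) (ADVP (RB possibly))))))))) (S (VP (VBG reaching) (NP (DT the) (NN seafloor)))))))))))))) (. .)))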

Then I fed the string above into NLTK's Tree:

parsed_tree = nltk.Tree.fromstring(parsetree_string)

The result looks good:

Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', ['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

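My question is: given parsed_tree, how can I get the leaf-level entities such as "top of the oceanic crust" and "a thin layer"?

I was thinking the levels of the parse tree might be useful, but I got lost when looking at the tree levels and don't know how to proceed.

I mostly work in Python, and the Stanford NLP result was obtained through a Python wrapper (https://bitbucket.org/torotoki/corenlp-python).

Can anyone help me and give me some pointers?

1 answer:

Answer 0 (score: 2)

You can try extracting the subtrees labeled NP: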

>>> from nltk import Tree
>>> parsed_tree = Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', 
['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

>>> np = [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP']
>>> np
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'areas', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'propagation', 'an up-dip direction', 'the seafloor']

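But this produces a lot of noise, so let's say that a single word is not a phrase:

>>> np_mwe = [j for j in np if j.count(' ') > 0]
>>> np_mwe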
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Still noisy, so let's say a noun phrase should not contain a comma (not strictly true, but a useful trick):

>>> np_mwe_nocomma = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0 and j.count(',') == 0]
>>> np_mwe_nocomma
['Tsunami earthquakes', 'the presence', 'a thin layer', 'sedimentary rock', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Now it is easy to see phrases nested inside longer phrases, so let's keep only the longest ones:

>>> x = []
>>> for i in sorted(np_mwe_nocomma, key=len, reverse=True):
...     # skip a phrase that is already contained in a longer kept phrase
...     if any(i in j for j in x):
...         continue
...     print(i)
...     x.append(i)
... 
the uppermost part of the plate interface
areas of significant topography
the top of the oceanic crust
Tsunami earthquakes
an up-dip direction
sedimentary rock
the presence
a thin layer
the seafloor
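
One caveat: the `in` test above compares raw strings, so it can match across word boundaries ("the top" is a substring of "the topography", for example). Below is a minimal sketch of a token-level variant. The helper names `contains_tokens` and `keep_maximal` are hypothetical (they are not part of NLTK), and the sketch assumes each phrase is a whitespace-separated string such as the ones built with " ".join(i.leaves()).

```python
def contains_tokens(longer, shorter):
    """Return True if the token list `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[k:k + n] == shorter for k in range(len(longer) - n + 1))


def keep_maximal(phrases):
    """Keep phrases that are not token-wise contained in a longer kept phrase."""
    kept = []
    # Longest first (by token count), so containers are seen before their parts.
    for phrase in sorted(phrases, key=lambda p: len(p.split()), reverse=True):
        tokens = phrase.split()
        if not any(contains_tokens(other.split(), tokens) for other in kept):
            kept.append(phrase)
    return kept


phrases = ['areas of significant topography', 'significant topography',
           'the topography', 'the top']
print(keep_maximal(phrases))
# ['areas of significant topography', 'the topography', 'the top']
```

With a plain substring test, "the top" would have been discarded here because of "the topography"; comparing token lists keeps it.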

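I'm not sure whether this meets your needs, but what is your definition of an "entity"? It needs to be more specific; otherwise almost any NP labeled by the parser could be an "entity".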
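
If an "entity" is taken to mean the largest multi-word NP that contains no comma (one possible definition, assumed here only for illustration), the selection can also be made directly on the tree instead of on joined strings. This is a minimal sketch: `maximal_clean_nps` is a hypothetical helper, not an NLTK function, and the small `demo` tree is made up for the example.

```python
from nltk import Tree


def maximal_clean_nps(tree):
    """Collect the largest NP subtrees below `tree` that span more than
    one word and contain no comma token (assumed definition of 'entity')."""
    found = []
    for child in tree:
        if not isinstance(child, Tree):
            continue  # a leaf token (plain string), nothing to descend into
        leaves = child.leaves()
        if child.label() == 'NP' and len(leaves) > 1 and ',' not in leaves:
            found.append(" ".join(leaves))  # keep it and do not descend further
        else:
            found.extend(maximal_clean_nps(child))  # look inside for smaller NPs
    return found


demo = Tree.fromstring(
    "(S (NP (NP (DT a) (JJ thin) (NN layer)) (, ,) "
    "(NP (NP (DT the) (NN top)) (PP (IN of) "
    "(NP (DT the) (JJ oceanic) (NN crust))))) "
    "(VP (VBD was) (ADJP (JJ present))))"
)
print(maximal_clean_nps(demo))
# ['a thin layer', 'the top of the oceanic crust']
```

Because the walk stops at the first NP that qualifies, nested phrases such as "the top" are never emitted, and no separate substring pass is needed. The function only inspects nodes below the one it is given, which is fine when it is called on the ROOT node.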