How do I split an NLP parse tree into clauses (independent and subordinate)?

Time: 2016-09-04 18:10:51

Tags: nlp nltk grammar stanford-nlp clause

Given an NLP parse tree like the following:
(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

The original sentence is "You could say that they regularly catch a shower, which adds to their exhilaration and joie de vivre."

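How can the clauses be extracted and turned back into plain text? We would split on S and SBAR (to preserve the type of each clause, e.g. subordinate):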

 - (S (NP (PRP You)) (VP (MD could) (VP (VB say) 
 - (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower))
 - (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to)
   (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW
   de) (FW vivre))))))))))))) (. .)))

to arrive at

 - You could say
 - that they regularly catch a shower 
 - , which adds to their exhilaration and joie de vivre.

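Splitting on S and SBAR seems easy enough. The hard part seems to be stripping all the POS tags and chunk labels from the resulting fragments.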
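
The stripping step can be done even on fragments whose parentheses no longer balance, because in this bracketing every word sits in an innermost `(TAG word)` pair. The following is a minimal sketch using only the standard library; the names `LEAF` and `fragment_words` are made up for illustration, and it assumes that neither tags nor tokens contain spaces or parentheses.

```python
import re

# An innermost "(TAG word)" pair: neither part may contain whitespace or parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def fragment_words(fragment):
    """Return the words of a (possibly unbalanced) bracketed fragment, in order."""
    return [word for _tag, word in LEAF.findall(fragment)]

# The three fragments from the question, copied as plain strings.
fragments = [
    "(S (NP (PRP You)) (VP (MD could) (VP (VB say) ",
    "(SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower))",
    "(, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))",
]

for fragment in fragments:
    print(" ".join(fragment_words(fragment)))
```

Tokens are joined with single spaces, so punctuation stays separated from the neighbouring words (for example `shower ,`); reattaching it would need a detokenizer.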

2 answers:

Answer 0: (score: 8)

You can use Tree.subtrees(). For details, see the documentation of the NLTK Tree class.

Code:

from nltk import Tree

parse_str = "(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))"
#parse_str = "(ROOT (S (SBAR (IN Though) (S (NP (PRP he)) (VP (VBD was) (ADJP (RB very) (JJ rich))))) (, ,) (NP (PRP he)) (VP (VBD was) (ADVP (RB still)) (ADJP (RB very) (JJ unhappy))) (. .)))"

t = Tree.fromstring(parse_str)
# print(t)

# Collect the text of every S and SBAR subtree (pre-order, outermost clause first).
subtexts = []
for subtree in t.subtrees():
    if subtree.label() in ("S", "SBAR"):
        subtexts.append(' '.join(subtree.leaves()))
# print(subtexts)

presubtexts = subtexts[:]       # ADDED IN EDIT for leftover check

# Working from the innermost clause outwards, cut each text off
# at the point where the next (nested) clause begins.
for i in reversed(range(len(subtexts)-1)):
    subtexts[i] = subtexts[i][0:subtexts[i].index(subtexts[i+1])]

for text in subtexts:
    print(text)

# ADDED IN EDIT - Not sure for generalized cases
# Whatever follows the first nested clause in the full sentence (here the final period).
leftover = presubtexts[0][presubtexts[0].index(presubtexts[1])+len(presubtexts[1]):]
print(leftover)

Output:

You could say 
that 
they regularly catch a shower , 
which 
adds to their exhilaration and joie de vivre
 .
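
The code above recovers each piece by searching for the nested clause's text inside its parent's text, which the answer itself flags as uncertain for general cases. A structural alternative is to walk each S or SBAR node and skip any nested S or SBAR child, so no string search is needed. This is only a sketch: `CLAUSE_LABELS`, `own_words` and `split_clauses` are illustrative names, and it assumes that S and SBAR are the only labels that mark clauses.

```python
from nltk import Tree

CLAUSE_LABELS = {"S", "SBAR"}  # assumption: only these two labels mark clauses

def own_words(node):
    """Return the words under `node`, skipping any nested S/SBAR subtree."""
    words = []
    for child in node:
        if isinstance(child, str):            # a leaf token
            words.append(child)
        elif child.label() not in CLAUSE_LABELS:
            words.extend(own_words(child))    # descend into NP, VP, PP, ...
    return words

def split_clauses(tree):
    """Pair each S/SBAR node with the words it does not share with a nested clause."""
    return [(sub.label(), " ".join(own_words(sub)))
            for sub in tree.subtrees(lambda s: s.label() in CLAUSE_LABELS)]

parse_str = "(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))"

for label, text in split_clauses(Tree.fromstring(parse_str)):
    print(label, "|", text)
```

With this approach the sentence-final period stays with the outermost S, and each SBAR contributes only its own introducing word (`that`, `which`). Joining an SBAR entry with the S entry that follows it gives pieces close to the ones asked for in the question, with the clause label preserved.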

Answer 1: (score: 0)

First, get the parse tree:

# stanza.install_corenlp()

from stanza.server import CoreNLPClient

text = "Joe realized that the train was late while he waited at the train station"

with CoreNLPClient(
        annotators=['tokenize', 'pos', 'lemma', 'parse', 'depparse'],
        output_format="json",
        timeout=30000,
        memory='16G') as client:
    output = client.annotate(text)
    # print(output.sentence[0])
    parse_tree = output['sentences'][0]['parse']
    parse_tree = ' '.join(parse_tree.split())

Then use this gist to extract the clauses by calling:

print_clauses(parse_str=parse_tree)

The output will be:

{'the train was late', 'he waited at the train station', 'Joe realized'}