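Python and NLTK: How to analyze sentence grammar?

Time: 2014-01-07 22:40:56

Tags: python-2.7 tree nlp nltk

I have this code, which is supposed to show the syntactic structure of a sentence according to the defined grammar. However, it returns an empty []. What am I missing or doing wrong?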

import nltk

grammar = nltk.parse_cfg("""
S -> NP VP 
PP -> P NP
NP -> Det N | Det N PP 
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
""")

def main():
    sent = "Kim arrived or Dana left and everyone cheered".split()
    parser = nltk.ChartParser(grammar)
    trees = parser.nbest_parse(sent)
    for tree in trees:
        print tree

if __name__ == '__main__':
    main()

2 answers:

Answer 0 (score: 11)

Let's do some reverse engineering:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

It seems the rules cannot even recognize the first word as an NP. So let's try injecting NP -> N:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP | N
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('NP', [Tree('N', ['Kim'])])]
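
To see why the extra NP -> N alternative matters without needing NLTK at all, here is a minimal pure-Python recognizer sketch. The helper names `parse_rules` and `recognizes` are made up for this illustration and are not part of NLTK; the sketch also assumes the grammar has no empty productions and no unary cycles, which holds for the grammars in this thread.

```python
from functools import lru_cache


def parse_rules(text):
    """Read lines like "NP -> Det N | N" into {lhs: [tuple of symbols, ...]}."""
    rules = {}
    for line in text.strip().splitlines():
        if "->" not in line:
            continue
        lhs, rhs = line.split("->")
        for alternative in rhs.split("|"):
            rules.setdefault(lhs.strip(), []).append(tuple(alternative.split()))
    return rules


def recognizes(rules, start, words):
    """True if `start` derives exactly `words`.

    Assumes no empty productions and no unary cycles.
    """
    words = tuple(words)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        if symbol.startswith("'"):  # quoted terminal such as 'Kim'
            return j - i == 1 and symbol.strip("'") == words[i]
        # A nonterminal with no productions (e.g. Det) derives nothing.
        return any(covers(rhs, i, j) for rhs in rules.get(symbol, ()))

    @lru_cache(maxsize=None)
    def covers(rhs, i, j):
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # First symbol takes words[i:k]; every remaining symbol needs >= 1 word.
        return any(
            derives(rhs[0], i, k) and covers(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return derives(start, 0, len(words))


original = parse_rules("""
NP -> Det N | Det N PP
N -> 'Kim' | 'Dana' | 'everyone'
""")
patched = parse_rules("""
NP -> Det N | Det N PP | N
N -> 'Kim' | 'Dana' | 'everyone'
""")

print(recognizes(original, "NP", ["Kim"]))  # False
print(recognizes(patched, "NP", ["Kim"]))   # True
```

Every NP alternative in the original grammar needs at least two words, so a one-word NP can never match until the single-symbol alternative is added.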

Now that it works, let's continue with "Kim arrived or Dana and":

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | VP PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

It seems there is no way to get a VP, with or without the P: V requires an NP after it, or it has to go up the tree and become a VP before it can take the PP. So let's relax the rules and use VP -> V PP instead of VP -> VP PP:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('S', [Tree('NP', [Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['arrived']), Tree('PP', [Tree('P', ['or']), Tree('NP', [Tree('N', ['Dana'])])])])])]

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana left".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> sent = "Kim arrived or Dana left and".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone cheered".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

OK, we are getting closer, but as the empty results above show, the next word breaks the CFG rules again.

I hope the examples above show that trying to patch the rules to incorporate linguistic phenomena from left to right is hard.

Instead of working from left to right to achieve

[[[[[[[[Kim] arrived] or] Dana] left] and] everyone] cheered]

why not try writing more linguistically sound rules to achieve:

  1. [[[Kim arrived] or [Dana left]] and [everyone cheered]]
  2. [[Kim arrived] or [[Dana left] and [everyone cheered]]]

Try this instead:

import nltk

grammar = nltk.parse_cfg("""
S -> CP | VP
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
C -> 'or' | 'and'
""")

print "======= Kim arrived ========="
sent = "Kim arrived".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "\n======= Kim arrived or Dana left ========="
sent = "Kim arrived or Dana left".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "\n=== Kim arrived or Dana left and everyone cheered ===="
sent = "Kim arrived or Dana left and everyone cheered".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

[out]:

======= Kim arrived =========
(S (VP (NP Kim) (V arrived)))

======= Kim arrived or Dana left =========
(S (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))))

=== Kim arrived or Dana left and everyone cheered ====
(S (CP (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))) (C and) (VP (NP everyone) (V cheered))))
(S (CP (VP (NP Kim) (V arrived)) (C or) (CP (VP (NP Dana) (V left)) (C and) (VP (NP everyone) (V cheered)))))

The solution above shows that your CFG rules need to be robust enough to capture not only the full sentence but also parts of it.
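
The last sentence gets two trees because the coordination rules are ambiguous: "or" and "and" can group to the left or to the right. As a rough cross-check that does not depend on NLTK, the sketch below counts the distinct derivations of each sentence under the same rules. The `RULES` dictionary and `count_parses` are names invented for this illustration, terminals are written as plain words rather than quoted strings, and the code assumes no empty productions and no unary cycles.

```python
from functools import lru_cache

# The coordination grammar from this answer, written as a plain dictionary.
# Any symbol that is not a key is treated as a terminal word.
RULES = {
    "S": [("CP",), ("VP",)],
    "CP": [("VP", "C", "VP"), ("CP", "C", "VP"), ("VP", "C", "CP")],
    "VP": [("NP", "V")],
    "NP": [("Kim",), ("Dana",), ("everyone",)],
    "V": [("arrived",), ("left",), ("cheered",)],
    "C": [("or",), ("and",)],
}


def count_parses(sentence, start="S"):
    """Count the distinct parse trees of `sentence` rooted at `start`."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        if symbol not in RULES:  # terminal: must match exactly one word
            return int(j - i == 1 and words[i] == symbol)
        return sum(count_seq(rhs, i, j) for rhs in RULES[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        # Try every split point; each remaining symbol needs at least one word.
        return sum(
            count(rhs[0], i, k) * count_seq(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return count(start, 0, len(words))


print(count_parses("Kim arrived"))                                    # 1
print(count_parses("Kim arrived or Dana left"))                       # 1
print(count_parses("Kim arrived or Dana left and everyone cheered"))  # 2
```

The counts agree with the number of trees printed for each sentence. If only one reading is wanted, dropping either CP -> CP C VP or CP -> VP C CP from the grammar leaves a single tree for the full sentence.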

Answer 1 (score: 5)

Det is not defined in your grammar, but by the grammar's rules every NP (and therefore every S) must contain one.

Compare:
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... NP -> Det N | Det N PP
... VP -> V NP | VP PP
... Det -> 'a' | 'the'
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... """)
>>>
>>> parser = nltk.ChartParser(grammar)
>>> parser.nbest_parse('the Kim left a Dana'.split())
[Tree('S', [Tree('NP', [Tree('Det', ['the']), Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['left']), Tree('NP', [Tree('Det', ['a']), Tree('N', ['Dana'])])])])]
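
A missing definition like Det is easy to overlook, because the parser simply returns no trees instead of complaining. A small check along these lines can flag it before parsing. This is only a sketch: `undefined_nonterminals` is a hypothetical helper, not an NLTK function, and it assumes one rule per line with terminals written in quotes, as in the grammars above.

```python
def undefined_nonterminals(grammar_text):
    """Return nonterminals used on a right-hand side but never defined."""
    defined, used = set(), set()
    for line in grammar_text.strip().splitlines():
        if "->" not in line:
            continue
        lhs, rhs = line.split("->")
        defined.add(lhs.strip())
        for alternative in rhs.split("|"):
            for symbol in alternative.split():
                # Quoted symbols are terminal words, everything else is a nonterminal.
                if not symbol.startswith(("'", '"')):
                    used.add(symbol)
    return sorted(used - defined)


question_grammar = """
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
"""

print(undefined_nonterminals(question_grammar))  # ['Det']
```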