Based on the grammar in chapter 7 of the NLTK Book:
grammar = r"""
NP: {<DT|JJ|NN.*>+} # ...
"""
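I would like to extend NP (noun phrase) to cover several NPs joined by CC (coordinating conjunction: and) or , (comma), so that coordinated noun phrases such as 'The house and tree' are captured.
I cannot get my modified grammar to capture this as a single NP: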
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
Result:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
I tried moving NP to the beginning, NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+}, but got the same result:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
Answer 0 (score: 4)
Let's start small and capture plain NPs (noun phrases) correctly:
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S (NP The/DT house/NN) and/CC (NP tree/NN))
Now let's try to catch that and/CC. Simply add a higher-level phrase that reuses the <NP> rule:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC><NP>}
"""
sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S (CNP (NP The/DT house/NN) and/CC (NP tree/NN)))
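The staged grammar seems to work because each stage sees the chunks produced by the stages before it, so <NP> already exists by the time the CNP rule runs. A minimal sketch to check the resulting nesting, assuming hand-tagged tokens (written out by hand here so the example does not depend on the tagger model):

```python
import nltk

# Hand-tagged tokens: an assumption that stands in for nltk.pos_tag output.
tagged = [("The", "DT"), ("house", "NN"), ("and", "CC"), ("tree", "NN")]

grammar = r"""
NP: {<DT|JJ|NN.*>+}    # stage 1: base noun phrases
CNP: {<NP><CC><NP>}    # stage 2: two NP chunks joined by a conjunction
"""

tree = nltk.RegexpParser(grammar).parse(tagged)

# Walk the tree top-down and list the label of every node.
labels = [subtree.label() for subtree in tree.subtrees()]
print(labels)  # ['S', 'CNP', 'NP', 'NP']
```

The two NP nodes sit inside the CNP node, which matches the bracketed output above.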
Now that we capture NP CC NP phrases, let's get a little fancier and see whether it also catches commas:
grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""
sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
Now we see that it only captures the first, left-bounded NP CC|, NP and leaves the last NP out.
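To make that limitation concrete, here is a small check of the top-level structure, again a sketch that assumes hand-tagged tokens instead of calling the tagger:

```python
import nltk

# Hand-tagged tokens: an assumption that stands in for nltk.pos_tag output.
tagged = [("The", "DT"), ("house", "NN"), (",", ","),
          ("the", "DT"), ("bear", "NN"), ("and", "CC"), ("tree", "NN")]

grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""

tree = nltk.RegexpParser(grammar).parse(tagged)

# Summarise the direct children of S: chunk labels for subtrees,
# (word, tag) pairs for tokens left outside any chunk.
top_level = [child.label() if isinstance(child, nltk.Tree) else child
             for child in tree]
print(top_level)  # ['CNP', ('and', 'CC'), 'NP']
```

The first two NPs are joined into one CNP, while and/CC and the final NP stay outside it.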
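Since we know that conjoined phrases in English have a conjunction on the left and an NP on the right, i.e. CC|, NP (for example, and the tree), we can see that the CC|, NP pattern repeats, so we can use it as an intermediate representation.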
grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""
sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP and/CC (NP tree/NN))))
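Because every conjunct is still an NP nested inside the CNP, the individual items of the list can be pulled out as well. A minimal sketch, assuming hand-tagged tokens and using Tree.subtrees with a filter function:

```python
import nltk

# Hand-tagged tokens: an assumption that stands in for nltk.pos_tag output.
tagged = [("The", "DT"), ("house", "NN"), (",", ","),
          ("the", "DT"), ("bear", "NN"), ("and", "CC"), ("tree", "NN")]

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

tree = nltk.RegexpParser(grammar).parse(tagged)

# For each CNP chunk, collect the text of every NP nested anywhere inside it.
conjuncts = [
    " ".join(token for token, _tag in noun_phrase.leaves())
    for cnp in tree.subtrees(lambda t: t.label() == "CNP")
    for noun_phrase in cnp.subtrees(lambda t: t.label() == "NP")
]
print(conjuncts)  # ['The house', 'the bear', 'tree']
```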
Finally, the CNP (conjunctive NP) grammar captures chained noun phrase conjunctions in English, even complicated ones:
import nltk
grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""
sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))
[out]:
(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP ,/, (NP the/DT green/JJ house/NN))
    (XNP and/CC (NP a/DT tree/JJ)))
  went/VBD
  to/TO
  (CNP (NP the/DT park/NN) (XNP or/CC (NP the/DT river/NN)))
  ./.)
And if you are only interested in extracting the noun phrases, adapt the approach from How to Traverse an NLTK Tree object?:
noun_phrases = []

def traverse_tree(tree):
    # Record the text of every CNP chunk, then recurse into child subtrees.
    if tree.label() == 'CNP':
        noun_phrases.append(' '.join([token for token, tag in tree.leaves()]))
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree):
            traverse_tree(subtree)
    return noun_phrases

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(traverse_tree(chunkParser.parse(tagged)))
[out]:
['The house , the bear , the green house and a tree', 'the park or the river']
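Note that traverse_tree appends to a module-level list, so calling it on a second sentence keeps the phrases from the first. One possible alternative is sketched below, assuming hand-tagged tokens; the helper name extract_phrases is hypothetical and not part of NLTK:

```python
import nltk

def extract_phrases(tree, label="CNP"):
    """Return the text of every subtree with the given label (hypothetical helper)."""
    return [
        " ".join(token for token, _tag in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == label)
    ]

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

# Hand-tagged tokens: an assumption that stands in for nltk.pos_tag output.
tagged = [("The", "DT"), ("house", "NN"), ("and", "CC"), ("a", "DT"),
          ("tree", "NN"), ("went", "VBD"), ("to", "TO"), ("the", "DT"),
          ("park", "NN"), ("or", "CC"), ("the", "DT"), ("river", "NN"),
          (".", ".")]

tree = nltk.RegexpParser(grammar).parse(tagged)
phrases = extract_phrases(tree)
print(phrases)  # ['The house and a tree', 'the park or the river']
```

Since the list is built fresh on each call, repeated calls return the same result instead of accumulating entries.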
Also, see Python (NLTK) - more efficient way to extract noun phrases?