Question

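NLTK's ConllChunkCorpusReader class has a parameter called chunk_types. I expected it to return only the matching chunks of the given text, but I don't understand what chunk_types actually does.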

text = '''
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O'''

After loading this text with ConllChunkCorpusReader as reader, I get the following result.

>>> reader.chunked_sents(chunk_types='NP')
[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), ('had', 'VBD'),
('been', 'VBN'), Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]), 
('of', 'IN'), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]

But I am looking for output that contains only the NP chunks, like this.

>>> reader.chunked_sents(chunk_types='NP')
[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
 Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]),
 Tree('NP', [('Balcor', 'NNP')])]

Answer 1

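A chunk tree is a tree with at most three levels: the root of the tree (node S), whose children are either lexical items or chunks; and each chunk is in turn a tree of depth one, with lexical items as its children.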
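To make the three levels concrete, here is a small sketch that builds such a tree by hand with nltk.tree.Tree (a shortened version of the sentence from the question) and inspects it. The variable names are only illustrative.

```python
from nltk.tree import Tree

# Root S whose children are a mix of chunk subtrees and bare (word, tag) leaves.
sent = Tree('S', [
    Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
    ('had', 'VBD'),
    ('been', 'VBN'),
    Tree('NP', [('Balcor', 'NNP')]),
    ('.', '.'),
])

# Each child of the root is either a chunk (a Tree) or a lexical item (a tuple).
kinds = ['chunk' if isinstance(child, Tree) else 'word' for child in sent]

# height() counts the levels: root, chunk, lexical item.
levels = sent.height()

print(kinds)
print(levels)
```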

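If you look closely, you will see that the VP chunk in your input has disappeared from the output: the root of the tree is connected directly to the lexical items ('had', 'VBD') and ('been', 'VBN'). The same happened to the PP chunk around ('of', 'IN'). That is what chunk_types does: chunks whose type is not listed are dissolved, and their words are attached directly to the root.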
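The effect is easy to see by reading the same data with two different chunk_types values. The following is a minimal sketch: the temporary directory and the file name sample.conll are assumptions made for this example, since the question does not show how the reader was created.

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader
from nltk.tree import Tree

conll_text = """Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

# Hypothetical setup: write the sample to a temporary file for the reader to load.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'sample.conll'), 'w', encoding='utf8') as f:
    f.write(conll_text)

# The third constructor argument is the default set of chunk types to keep.
reader = ConllChunkCorpusReader(root, r'.*\.conll', ('NP', 'VP', 'PP'))

all_chunks = reader.chunked_sents()[0]                  # uses the constructor default
np_only = reader.chunked_sents(chunk_types=('NP',))[0]  # overrides it for this call

# Collect the labels of the chunk subtrees directly under the root.
all_labels = [c.label() for c in all_chunks if isinstance(c, Tree)]
np_labels = [c.label() for c in np_only if isinstance(c, Tree)]

print(all_labels)
print(np_labels)
```

With all three types listed, the VP and PP chunks show up as subtrees again; with only NP listed, their words become direct children of S.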

You can visualize a tree returned by the reader by printing it or by calling its draw() method:

>>> trees = reader.chunked_sents(chunk_types='NP')
>>> print(trees[0])
(S
  (NP Mr./NNP Meador/NNP)
  had/VBD
  been/VBN
  (NP executive/JJ vice/NN president/NN)
  of/IN
  (NP Balcor/NNP)
  ./.)
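If what you want is a flat list of just the NP chunks, one option is to filter the subtrees of each sentence tree. The sketch below parses the sample with nltk.chunk.conllstr2tree so that it runs without any corpus files; the same subtrees() call should work on each tree returned by reader.chunked_sents(chunk_types='NP').

```python
from nltk.chunk import conllstr2tree

conll_text = """Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

# Parse the CoNLL string directly, keeping only NP chunks.
tree = conllstr2tree(conll_text, chunk_types=('NP',))

# subtrees() visits every subtree, including the root; the filter keeps the
# NP chunks and drops the root S. Loose (word, tag) leaves are not subtrees.
np_chunks = list(tree.subtrees(filter=lambda t: t.label() == 'NP'))

for chunk in np_chunks:
    print(chunk)
```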
