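I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard in terms of part-of-speech annotation but differs in terms of phrases/chunks.
Accordingly, I have POS-tagged my data with the NLTK defaults and built a regex chunker with nltk.RegexpParser.
Basically, the output is now an NLTK-style phrase structure tree:
Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
There are some things I'd like to annotate manually on top of this, however: systemic grammars break participants and verbal groups down into subtypes that probably can't be annotated automatically. So I'm hoping to convert the parse tree format into something an annotation tool (preferably BRAT) can process, and then go through the text and specify the subtypes by hand, as in (one possible solution):
Perhaps the solution would be to trick BRAT into treating phrase structure like dependencies? I can modify the chunking regex if need be. Are there any converters out there? (Brat provides ways of converting from CONLL2000 and Stanford Core NLP, so if I could get the phrase structure into either of those forms, that would be acceptable too.)
Thanks!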
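Answer 0 (score: 2)
Representing non-binary trees as arcs would be difficult, but it is possible to nest "entity" annotations and use them for a constituency parse structure. Note that I am not creating nodes for the terminals (part-of-speech tags) of the tree, partly because Brat is currently not good at displaying the unary rules that often apply to terminals. A description of the target format can be found in the Brat standoff format documentation.
First, we need a function that produces standoff annotations. Brat expects standoff in terms of characters, but the function below works with token offsets only; the conversion to character offsets comes in a later step.
(Note: this uses NLTK 3.0b and Python 3.)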
def _standoff(path, leaves, slices, offset, tree):
    # Walk the tree depth-first, collecting leaf tokens and one slice per subtree.
    width = 0
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            # A terminal: a (token, tag) pair.
            tok, tag = child
            leaves.append(tok)
            width += 1
        else:
            # A subtree: recurse, tracking the index path from the root.
            path.append(i)
            width += _standoff(path, leaves, slices, offset + width, child)
            path.pop()
    slices.append((tuple(path), tree.label(), offset, offset + width))
    return width


def standoff(tree):
    leaves = []
    slices = []
    _standoff([], leaves, slices, 0, tree)
    return leaves, slices
Applying this to your example:
>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
>>> standoff(tree)
(['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
[((0, 0, 0), 'Participant', 0, 1),
((0, 0, 1), 'Verbal-group', 1, 2),
((0, 0, 2), 'Participant', 2, 4),
((0, 0, 3), 'Circumstance', 4, 7),
((0, 0), 'Process-dependencies', 0, 7),
((0,), 'Clause', 0, 7),
((), 'S', 0, 8)])
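This returns the leaf tokens, followed by a list of tuples, one per subtree, with the elements (index path from the root, label, start leaf, stop leaf). The stop index is exclusive.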
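Because each path is the sequence of child indices leading down from the root, a span's parent is simply its path with the last index dropped. The following is a minimal sketch of recovering the nesting from the slices alone; the helper name `children_by_parent` is hypothetical, and the list is copied from the output above so that the snippet runs without NLTK.

```python
from collections import defaultdict

# Token-level spans copied from the standoff(tree) output above.
slices = [
    ((0, 0, 0), 'Participant', 0, 1),
    ((0, 0, 1), 'Verbal-group', 1, 2),
    ((0, 0, 2), 'Participant', 2, 4),
    ((0, 0, 3), 'Circumstance', 4, 7),
    ((0, 0), 'Process-dependencies', 0, 7),
    ((0,), 'Clause', 0, 7),
    ((), 'S', 0, 8),
]


def children_by_parent(slices):
    """Group span paths under the path of their parent span (hypothetical helper)."""
    children = defaultdict(list)
    for path, label, start, stop in slices:
        if path:  # the root has the empty path and therefore no parent
            children[path[:-1]].append(path)
    return dict(children)


nesting = children_by_parent(slices)
labels = {path: label for path, label, _, _ in slices}
for parent, kids in sorted(nesting.items()):
    print(labels[parent], '->', [labels[k] for k in kids])
```

With NLTK available, indexing the tree with one of these paths (for example `tree[0, 0, 2]`) should give back the corresponding subtree, which is a convenient way to attach manually assigned subtypes to the original parse later.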
To convert this to character standoff:
def char_standoff(tree):
    leaves, tok_standoff = standoff(tree)
    # The text is the tokens joined by single spaces.
    text = ' '.join(leaves)
    # Map leaf index to its start character; the final entry is one past the end.
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1
    starts.append(offset)
    # Subtract 1 from the end so the span stops before the separating space.
    return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                  for path, label, start_tok, end_tok in tok_standoff]
Then:
>>> char_standoff(tree)
('This is a representation of the grammar .',
[((0, 0, 0), 'Participant', 0, 4),
((0, 0, 1), 'Verbal-group', 5, 7),
((0, 0, 2), 'Participant', 8, 24),
((0, 0, 3), 'Circumstance', 25, 39),
((0, 0), 'Process-dependencies', 0, 39),
((0,), 'Clause', 0, 39),
((), 'S', 0, 41)])
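The character offsets are half-open, so `text[start:stop]` yields exactly the covered words. A small sanity check along these lines can catch off-by-one errors before anything is loaded into Brat. This is only a sketch: `span_texts` is a hypothetical helper, and the text and spans are copied from the output above (paths omitted) so it runs on its own.

```python
text = 'This is a representation of the grammar .'

# (label, start, stop) triples copied from the char_standoff(tree) output above.
spans = [
    ('Participant', 0, 4),
    ('Verbal-group', 5, 7),
    ('Participant', 8, 24),
    ('Circumstance', 25, 39),
    ('Process-dependencies', 0, 39),
    ('Clause', 0, 39),
    ('S', 0, 41),
]


def span_texts(text, spans):
    """Return (label, covered text) for each span (hypothetical helper)."""
    result = []
    for label, start, stop in spans:
        covered = text[start:stop]
        # A well-formed span neither starts nor ends on a separating space.
        assert covered == covered.strip(), (label, start, stop)
        result.append((label, covered))
    return result


covered = span_texts(text, spans)
for label, words in covered:
    print(label, repr(words))
```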
Finally, we can write a function that converts this to Brat's format:
def write_brat(tree, filename_prefix):
    text, standoff = char_standoff(tree)
    # The .txt file holds the raw text that the offsets refer to.
    with open(filename_prefix + '.txt', 'w') as f:
        print(text, file=f)
    # The .ann file holds one tab-separated text-bound annotation per subtree.
    with open(filename_prefix + '.ann', 'w') as f:
        for i, (path, label, start, stop) in enumerate(standoff):
            print('T{}'.format(i), '{} {} {}'.format(label, start, stop), text[start:stop], sep='\t', file=f)
This writes the following to /path/to/something.txt:
This is a representation of the grammar .
and to /path/to/something.ann (fields are tab-separated):
T0 Participant 0 4 This
T1 Verbal-group 5 7 is
T2 Participant 8 24 a representation
T3 Circumstance 25 39 of the grammar
T4 Process-dependencies 0 39 This is a representation of the grammar
T5 Clause 0 39 This is a representation of the grammar
T6 S 0 41 This is a representation of the grammar .
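Brat reads the permitted annotation types from an `annotation.conf` file in the collection directory, so the labels used above need to be declared there before the files will display. The following is a minimal sketch, assuming a flat list of entity types is enough to get started and that the `.ann` content is tab-separated as written by `write_brat`. The helper names `entity_types` and `annotation_conf` are hypothetical and not part of Brat.

```python
def entity_types(ann_text):
    """Collect distinct types of text-bound (T) annotations, in order of first use."""
    types = []
    for line in ann_text.splitlines():
        if not line.startswith('T'):
            continue  # skip blank lines and other annotation kinds
        ann_id, type_and_offsets, covered = line.split('\t', 2)
        label = type_and_offsets.split(' ')[0]
        if label not in types:
            types.append(label)
    return types


def annotation_conf(types):
    """Build annotation.conf content: entity types listed, other sections left empty."""
    lines = ['[entities]'] + list(types)
    for section in ('[relations]', '[events]', '[attributes]'):
        lines += ['', section]
    return '\n'.join(lines) + '\n'


# The .ann content from above, with explicit tab separators.
ann_text = '\n'.join([
    'T0\tParticipant 0 4\tThis',
    'T1\tVerbal-group 5 7\tis',
    'T2\tParticipant 8 24\ta representation',
    'T3\tCircumstance 25 39\tof the grammar',
    'T4\tProcess-dependencies 0 39\tThis is a representation of the grammar',
    'T5\tClause 0 39\tThis is a representation of the grammar',
    'T6\tS 0 41\tThis is a representation of the grammar .',
])

conf = annotation_conf(entity_types(ann_text))
print(conf)
```

The printed text can be saved as `annotation.conf` next to the `.txt` and `.ann` files. Brat's configuration documentation also describes how to declare subtypes under a parent type, which may suit the Participant and Verbal-group subtypes from the question.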