Question

我有以下来自斯坦福分析师的输出：

nicaragua president ends visit to finland .

nn(ends-3, nicaragua-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(finland-6, to-5)
xcomp(visit-4, finland-6)

guatemala president ends visit to tropos .

nn(ends-3, guatemala-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(tropos-6, to-5)
xcomp(visit-4, tropos-6)

[...]

我必须对此输出进行分段，以便获取包含句子的元组和所有依赖项的列表（如(sentence,[list of dependencies])中的每个句子。有人可以建议我在Python中执行此操作的方法吗？谢谢！

Answer 1

你可以做这样的事情，虽然它可能对你正在解析的结构有些过分。如果您还需要解析依赖项，那么扩展应该相对容易。我还没有运行它，甚至检查语法，所以如果它不能立即起作用，不要杀了我。

READ_SENT = 0
PRE_DEPS = 1
DEPS = 2
POST_DEPS = 3
def parse_output(input):
    state = READ_SENT
    results = []
    sent = None
    deps = []
    for line in input.splitlines():
        if state == READ_SENT:
            sent = line
            state = PRE_DEPS
        elif state == PRE_DEPS:
             if line:
                 raise Exception('invalid format')
             else:
                 state = DEPS
         elif state == DEPS:
             if line:
                 deps.append(line)
             else:
                 state = POST_DEPS
         elif state == POST_DEPS:
             if line:
                 raise Exception('invalid format')
             else:
                 results.append((sent, deps))
                 sent = None
                 deps = []
                 state = READ_SENT
    return results

使用正则表达式分隔文本块 - Python

1 个答案: