I have the following output from the Stanford parser:
nicaragua president ends visit to finland .
nn(ends-3, nicaragua-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(finland-6, to-5)
xcomp(visit-4, finland-6)
guatemala president ends visit to tropos .
nn(ends-3, guatemala-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(tropos-6, to-5)
xcomp(visit-4, tropos-6)
[...]
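I need to segment this output so that, for each sentence, I get a tuple containing the sentence and a list of all its dependencies, i.e. (sentence, [list of dependencies]). Can anyone suggest a way to do this in Python? Thanks!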
Answer 0 (score: 0)
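You could do something like this, although it may be overkill for the structure you are parsing. If you also need to parse the dependencies themselves, extending it should be relatively easy. I haven't run it or even checked the syntax, so don't kill me if it doesn't work right away.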
READ_SENT = 0
PRE_DEPS = 1
DEPS = 2
POST_DEPS = 3
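def parse_output(text):
    """Split parser output into (sentence, [dependency lines]) tuples.

    Expects: a sentence line, one blank line, the dependency lines,
    then two blank lines before the next sentence.
    """
    state = READ_SENT
    results = []
    sent = None
    deps = []
    for line in text.splitlines():
        if state == READ_SENT:
            sent = line
            state = PRE_DEPS
        elif state == PRE_DEPS:
            # Exactly one blank line must follow the sentence.
            if line:
                raise ValueError('invalid format')
            state = DEPS
        elif state == DEPS:
            if line:
                deps.append(line)
            else:
                state = POST_DEPS
        elif state == POST_DEPS:
            # A second blank line closes the record.
            if line:
                raise ValueError('invalid format')
            results.append((sent, deps))
            sent = None
            deps = []
            state = READ_SENT
    # Flush the last record when the input ends without the closing blank line.
    if state in (DEPS, POST_DEPS):
        results.append((sent, deps))
    return results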
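
The state machine above depends on the blank lines between records. If they are missing, as in the sample as pasted in the question, a simpler alternative is to classify each line by its shape. The following is only a sketch under assumptions: the pattern `DEP_RE` and the helper names `group_sentences` and `parse_dependency` are made up for illustration, and the pattern assumes every dependency line looks like `relation(head-index, dependent-index)` with a relation name made of word characters. Any other non-blank line is treated as a sentence, so a placeholder such as `[...]` would be too.

```python
import re

# Assumed shape of a dependency line: relation(head-index, dependent-index)
DEP_RE = re.compile(r"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")


def group_sentences(text):
    """Return a list of (sentence, [dependency lines]) tuples."""
    results = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separators carry no information here
        if DEP_RE.match(line):
            if not results:
                raise ValueError("dependency line before any sentence: %r" % line)
            results[-1][1].append(line)
        else:
            # Anything that is not a dependency line starts a new record.
            results.append((line, []))
    return results


def parse_dependency(line):
    """Split one dependency line into (relation, (head, index), (dependent, index))."""
    match = DEP_RE.match(line)
    if match is None:
        raise ValueError("not a dependency line: %r" % line)
    rel, head, head_idx, dep, dep_idx = match.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))
```

Calling `group_sentences(parser_output)` on the text from the question would give one tuple per sentence, and `parse_dependency` can then be mapped over each dependency list if the individual fields are needed.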