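I need to extract subject-verb-object triples from a Dutch text. The text was analyzed with Frog, a Dutch NLP tool that tokenizes, parses, tags, lemmatizes, ... it. Frog produces either FoLiA XML or tab-delimited column output, with one token per line. Because of some problems with the XML files, I chose the column format. This example represents one sentence. Now I need to extract the SVO triple for each sentence, so I need the last column, which holds the dependency relation. That means I need to get the ROOT element plus the su and obj1 elements that belong to that ROOT. Unfortunately, the example sentence has no obj1. Let's pretend it does. My idea was to first create a nested list, with one list per sentence.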
import csv

with open('romanfragment_frogged.tsv','r') as f:
    reader = csv.reader(f,delimiter='\t')
    tokens = []
    sentences = []
    list_of_sents = []
    for line in reader:
        tokens.append(line)
    #print(tokens)
    for token in tokens:
        if token == '1':
            previous_sentence = list_of_sents
            sentences.append(previous_sentence)
            list_of_sents = []
        list_of_sents.append(tokens)
    print(list_of_sents)
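When I print 'tokens', I get a list containing all the tokens. So that part is correct, but I am still stuck on creating a nested list with one list (of tokens) per sentence. Can someone help me with this?
(P.S. A second problem is that I am not sure how to continue once I have the nested list.)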
Answer 0 (score: 1)
Maybe something like this would work:
import csv

def iter_sentences(fn):
    with open(fn, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        sentence = []
        for row in reader:
            if not row:
                # Ignore blank lines.
                continue
            if row[0] == '1' and sentence:
                # A new sentence started.
                yield sentence
                sentence = []
            sentence.append(row)
        # Last sentence.
        if sentence:
            yield sentence

def iter_triples(fn):
    for sentence in iter_sentences(fn):
        # Get all subjects and objects.
        subjects = [tok for tok in sentence if tok[-1] == 'su']
        objects = [tok for tok in sentence if tok[-1] == 'obj1']
        # Now try to map them: find pairs with a head in the same position.
        for obj in objects:
            for subj in subjects:
                # row[-2] is the position of the head.
                if subj[-2] == obj[-2]:
                    # Matching subj-obj pair found.
                    # Now get the verb (the head of both subj and obj).
                    # Its position is given in the second-to-last column.
                    position = int(subj[-2])
                    # Subtract 1, as the positions start counting at 1.
                    verb = sentence[position-1]
                    yield subj, verb, obj

for subj, verb, obj in iter_triples('romanfragment_frogged.tsv'):
    # Only print the surface forms.
    print(subj[1], verb[1], obj[1])
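A quick explanation:
iter_sentences iterates over the sentences. Each sentence is a nested list: a list of tokens, where each token is itself a list (containing the token number, surface form, lemma, POS tag, dependency relation, etc.).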
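To make the nested structure concrete, here is a minimal sketch of the same grouping logic run on an in-memory sample instead of a file. The rows in SAMPLE are made up and use a simplified column layout (index, token, lemma, POS, head, relation), which is an assumption and not the full Frog output; group_sentences is a hypothetical helper name. Note that the check is on row[0] (the first column), not on the row itself: comparing a whole row to '1' is never true, because a row is a list.

```python
import csv
import io

# Hypothetical, simplified Frog-like rows (NOT the real column layout):
# index, token, lemma, POS, head index, dependency relation.
SAMPLE = (
    "1\tJan\tJan\tN\t2\tsu\n"
    "2\tleest\tlezen\tV\t0\tROOT\n"
    "3\tboeken\tboek\tN\t2\tobj1\n"
    "\n"
    "1\tZij\tzij\tPron\t2\tsu\n"
    "2\tslaapt\tslapen\tV\t0\tROOT\n"
)

def group_sentences(lines):
    """Group tab-separated token rows into one list of rows per sentence."""
    sentences, current = [], []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            # Blank separator line between sentences.
            continue
        if row[0] == "1" and current:
            # Token index 1 marks the start of a new sentence.
            sentences.append(current)
            current = []
        current.append(row)
    if current:
        # Do not forget the last sentence.
        sentences.append(current)
    return sentences

sentences = group_sentences(io.StringIO(SAMPLE))
print(len(sentences))   # number of sentences
print(sentences[0][0])  # first token row of the first sentence
```

If the text contains quotation marks as tokens, the default csv quoting rules may merge rows; passing quoting=csv.QUOTE_NONE to csv.reader is one way to avoid that.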
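iter_triples iterates over the triples: for every su and obj1 that share the same head, it yields the subject, the verb (their common head), and the object.
The last three lines of code are just an example of how to use iter_triples.
I don't know how much and which information you need from each triple...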
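On that last point, here is a minimal sketch of one possible variation: returning lemmas instead of surface forms, and looking up the verb by its token index in a dictionary rather than by list position. The function name triples_as_lemmas and the sample rows are hypothetical, and the column layout (index, token, lemma, POS, head, relation) is a simplified assumption, so the indices would need adjusting for real Frog output.

```python
def triples_as_lemmas(sentence):
    """Return (subject, verb, object) lemma triples for one sentence.

    Assumes each token row is [index, token, lemma, POS, head, relation].
    """
    # Map token index -> row, so the head can be looked up directly.
    by_index = {tok[0]: tok for tok in sentence}
    subjects = [tok for tok in sentence if tok[-1] == "su"]
    objects = [tok for tok in sentence if tok[-1] == "obj1"]
    triples = []
    for obj in objects:
        for subj in subjects:
            head = subj[-2]
            # Subject and object must share a head that exists in the sentence.
            if head == obj[-2] and head in by_index:
                verb = by_index[head]
                triples.append((subj[2], verb[2], obj[2]))
    return triples

# Made-up example sentence in the simplified layout described above.
sentence = [
    ["1", "Jan", "Jan", "N", "2", "su"],
    ["2", "leest", "lezen", "V", "0", "ROOT"],
    ["3", "boeken", "boek", "N", "2", "obj1"],
]
print(triples_as_lemmas(sentence))
```

A sentence without an obj1 simply produces an empty list, so no special handling is needed for intransitive sentences.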