Extracting triples from preprocessed text

Time: 2017-05-31 08:45:04

Tags: python syntax nlp triples

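I need to extract subject-verb-object (SVO) triples from a Dutch text. The text has been analyzed with Frog, a Dutch NLP tool that tokenizes, parses, tags, lemmatizes, ... it. Frog produces either FoLiA XML or tab-delimited column output with one line per token. Because I had some problems with the XML file, I chose the column format. The example output, posted as an image, represents one sentence. Now I need to extract the SVO triples for every sentence, so I need the last column, which holds the dependency relation. That means I need to get the ROOT element and the su and obj1 elements that belong to that ROOT. Unfortunately, the example sentence has no obj1; let's pretend it does. My idea was to first create a nested list, with one list per sentence.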

    import csv
    with open('romanfragment_frogged.tsv','r') as f:
         reader = csv.reader(f,delimiter='\t')
         tokens = []
         sentences = []
         list_of_sents = []
         for line in reader:
             tokens.append(line)
             #print(tokens)
             for token in tokens:
                 if token == '1':
                    previous_sentence = list_of_sents
                    sentences.append(previous_sentence)
         list_of_sents = []
         list_of_sents.append(tokens)
         print(list_of_sents)

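When I print `tokens`, I get one list containing all the tokens. So that part is correct, but I am still stuck on creating a nested list with one list (of tokens) per sentence. Can someone help me with this?

(P.S. A second problem is that I am not sure how to continue once I have the nested list.)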

1 Answer:

Answer 0: (score: 1)

Maybe something like this would work:

    import csv

    def iter_sentences(fn):
        """Yield each sentence in the file as a list of token rows."""
        with open(fn, 'r') as f:
            # QUOTE_NONE keeps a literal '"' token from being read as a CSV quote.
            reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
            sentence = []
            for row in reader:
                if not row:
                    # Ignore blank lines.
                    continue
                if row[0] == '1' and sentence:
                    # Token index 1 means a new sentence started.
                    yield sentence
                    sentence = []
                sentence.append(row)
            # Last sentence.
            if sentence:
                yield sentence

    def iter_triples(fn):
        """Yield (subj, verb, obj) token rows for each su/obj1 pair sharing a head."""
        for sentence in iter_sentences(fn):
            # Get all subjects and objects (dependency label is the last column).
            subjects = [tok for tok in sentence if tok[-1] == 'su']
            objects = [tok for tok in sentence if tok[-1] == 'obj1']
            # Now try to map them: find pairs with a head in the same position.
            for obj in objects:
                for subj in subjects:
                    # tok[-2] is the position of the head.
                    if subj[-2] == obj[-2]:
                        # Matching subj-obj pair found.
                        # The verb is the head of both subj and obj.
                        position = int(subj[-2])
                        # Subtract 1, as the positions start counting at 1.
                        verb = sentence[position - 1]
                        yield subj, verb, obj

    for subj, verb, obj in iter_triples('romanfragment_frogged.tsv'):
        # Only print the surface forms.
        print(subj[1], verb[1], obj[1])

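Quick explanation: iter_sentences iterates over the sentences. Each sentence is a nested list: it is a list of tokens, and each token is itself a list (containing the token index, surface form, lemma, POS, dependency relation, etc.). The iter_triples function iterates over the triples. Each element of such a triple represents one token (i.e., a list, again).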
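
To make the nested structure concrete, here is a minimal sketch of the same sentence-splitting idea run on an in-memory sample instead of a file. The rows in `SAMPLE` are made up for illustration, and the ten-column layout is my assumption about Frog's column output, so check it against your own file. `split_sentences` is a hypothetical helper name, not part of Frog.

```python
import csv
import io

# Made-up Frog-style rows (assumed columns: index, token, lemma, morphology,
# POS, confidence, NER, chunk, head index, dependency relation).
SAMPLE = (
    "1\tJan\tJan\t[Jan]\tSPEC(deeleigen)\t1.0\tB-PER\tB-NP\t2\tsu\n"
    "2\tleest\tlezen\t[lees][t]\tWW(pv,tgw,met-t)\t1.0\tO\tB-VP\t0\tROOT\n"
    "3\tboeken\tboek\t[boek][en]\tN(soort,mv,basis)\t1.0\tO\tB-NP\t2\tobj1\n"
    "\n"
    "1\tZij\tzij\t[zij]\tVNW(pers,pron)\t1.0\tO\tB-NP\t2\tsu\n"
    "2\tslaapt\tslapen\t[slaap][t]\tWW(pv,tgw,met-t)\t1.0\tO\tB-VP\t0\tROOT\n"
)

def split_sentences(rows):
    """Group token rows into sentences; a row with index '1' starts a new one."""
    sentence = []
    for row in rows:
        if not row:
            # Blank separator line between sentences.
            continue
        if row[0] == '1' and sentence:
            yield sentence
            sentence = []
        sentence.append(row)
    if sentence:
        yield sentence

reader = csv.reader(io.StringIO(SAMPLE), delimiter='\t', quoting=csv.QUOTE_NONE)
sentences = list(split_sentences(reader))

for sent in sentences:
    # Each sentence is a list of rows; column 1 is the surface form.
    print([tok[1] for tok in sent])
```

The outer list has one entry per sentence, and each entry is a list of ten-element rows, which is the shape iter_triples works on.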

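The last three lines of code are just an example of how to use the iter_triples function. I don't know how much and which information you need from each triple...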
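
If you want something other than the surface forms, one option is to give the columns names. The sketch below is only an illustration: the field names in `Token` reflect my assumption about the column order in Frog's output, `root_lemma_triples` is a hypothetical helper, and the example rows are made up. It keeps only triples whose verb is labeled ROOT (which is what the question asks for) and returns the lemmas.

```python
from collections import namedtuple

# Assumed column order of Frog's tab-separated output; verify against your file.
Token = namedtuple(
    'Token', 'index form lemma morph pos conf ner chunk head deprel')

def root_lemma_triples(triples):
    """Yield (subj, verb, obj) lemmas for triples whose verb is the ROOT."""
    for triple in triples:
        # Token(*row) raises TypeError if a row does not have exactly 10 columns.
        subj, verb, obj = (Token(*row) for row in triple)
        if verb.deprel == 'ROOT':
            yield subj.lemma, verb.lemma, obj.lemma

# Made-up triple in the shape yielded by iter_triples (three token rows).
example = [(
    ['1', 'Jan', 'Jan', '[Jan]', 'SPEC(deeleigen)', '1.0', 'B-PER', 'B-NP', '2', 'su'],
    ['2', 'leest', 'lezen', '[lees][t]', 'WW(pv,tgw,met-t)', '1.0', 'O', 'B-VP', '0', 'ROOT'],
    ['3', 'boeken', 'boek', '[boek][en]', 'N(soort,mv,basis)', '1.0', 'O', 'B-NP', '2', 'obj1'],
)]

result = list(root_lemma_triples(example))
print(result)
```

In real use you would pass the generator from iter_triples instead of `example`. Dropping the ROOT check gives you triples for subordinate clauses as well.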