Question

I have an input file (input.txt) containing data in a standard format, with lines like the following:

<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de .

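I want to extract the English strings, i.e. those enclosed in quotes ("...") and followed by @en, into outputfile-en.txt, and the German strings, i.e. those enclosed in quotes and followed by @de, into outputfile-de.txt.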

In this example, outputfile-en.txt should contain:

Political inclusion

and outputfile-de.txt should contain:

Politische Inklusion
Radiologische Kampfmittel

Which regular expression would work for this?

Answer 1

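With a pattern this simple there is no need for regex at all, and especially no need to go over the same data repeatedly to pick out different languages. You can stream-parse the input and write the results on the fly: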

with open("input.txt", "r") as f:  # open the input file
    file_handles = {}  # a map of our individual output file handles
    for line in f:  # read it line by line
        rindex = line.rfind("@")  # find the last `@` character
        if rindex != -1:  # char found, consider the line...
            language = line[rindex+1:rindex+3]  # grab the following two characters as language
            lindex = line.rfind("\"", 0, rindex-1)  # find the opening quotation mark
            if lindex != -1:  # found, we have a match
                if language not in file_handles:  # add a file handle for this language:
                    file_handles[language] = open("outputfile-{}.txt".format(language), "w")
                # write the found slice between `lindex` and `rindex` + a new line
                file_handles[language].write(line[lindex+1:rindex-1] + "\n")
    for handle in file_handles.values():  # lets close our output file handles
        handle.close()

This should be noticeably faster than regex, and as a bonus it works with any language: if you have ...@it lines, it will also write outputfile-it.txt.
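To make the index arithmetic easier to follow, here is a minimal sketch of the same slicing applied to a single line. The helper name `parse_line` is hypothetical (it is not part of the answer above), and it assumes each line holds one quoted literal followed directly by a two-letter `@xx` tag.

```python
def parse_line(line):
    """Return (language, label) for a line containing "..."@xx, else None.

    Hypothetical helper: assumes one quoted literal per line,
    immediately followed by a two-letter language tag.
    """
    rindex = line.rfind("@")  # position of the last `@`
    if rindex == -1:
        return None
    # The closing quote sits at rindex-1; the search range [0, rindex-1)
    # excludes it, so rfind lands on the opening quote instead.
    lindex = line.rfind('"', 0, rindex - 1)
    if lindex == -1:
        return None
    language = line[rindex + 1:rindex + 3]  # two characters after `@`
    label = line[lindex + 1:rindex - 1]     # text between the two quotes
    return language, label


example = '<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .'
result = parse_line(example)
```

Lines without an `@` or without a preceding quote yield `None`, which mirrors the two `!= -1` checks in the loop above.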

Answer 2

You can do it like this:

import re

# named `text` rather than `str` to avoid shadowing the built-in type
text = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de ."""

# `.` does not match newlines, so each match stays within one line
german = re.compile('"(.*)"@de')
english = re.compile('"(.*)"@en')

print(german.findall(text))
print(english.findall(text))

This gives you ['Politische Inklusion', 'Radiologische Kampfmittel'] and ['Political inclusion']. Now you just need to iterate over these results and write them to the corresponding files.
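The final step of writing the results out could look roughly like the sketch below. It is not the answerer's code: the names `LABEL_RE`, `group_labels` and `write_labels` are hypothetical, and it swaps the two per-language patterns for a single pattern that captures the language tag as well, assuming tags are two lowercase letters.

```python
import re
from collections import defaultdict
from pathlib import Path

# Captures the quoted literal and the two-letter language tag after it.
# Assumption: tags are exactly two lowercase letters, as in the sample data.
LABEL_RE = re.compile(r'"([^"]*)"@([a-z]{2})')


def group_labels(text):
    """Return {language: [label, ...]} for every "..."@xx literal in text."""
    labels = defaultdict(list)
    for label, language in LABEL_RE.findall(text):
        labels[language].append(label)
    return dict(labels)


def write_labels(labels, directory="."):
    """Write outputfile-<language>.txt files, one label per line."""
    for language, items in labels.items():
        path = Path(directory) / f"outputfile-{language}.txt"
        path.write_text("\n".join(items) + "\n", encoding="utf-8")


sample = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de ."""

labels = group_labels(sample)
```

For the real file, something like `write_labels(group_labels(Path("input.txt").read_text(encoding="utf-8")))` should produce one output file per language found in the input.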

Regex to extract a list of strings from a file
