I have an input file (input.txt) containing data in a standard format similar to the following lines:
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de .
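I want to extract the English strings, i.e. the text inside the quotes that precede @en, into outputfile-en.txt, and the German strings, the text inside the quotes that precede @de, into outputfile-de.txt.
In this example, outputfile-en.txt should contain: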
Political inclusion
and outputfile-de.txt should contain:
Politische Inklusion
Radiologische Kampfmittel
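Which regular expression is suitable for this?
Answer 0: (score: 2)
With a pattern this simple there is no need for regex at all, and certainly no need to iterate over the same data again for each language: you can stream-parse the input and write the results on the fly: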
with open("input.txt", "r") as f:  # open the input file
    file_handles = {}  # a map of our individual output file handles
    for line in f:  # read it line by line
        rindex = line.rfind("@")  # find the last `@` character
        if rindex != -1:  # char found, consider the line...
            language = line[rindex+1:rindex+3]  # grab the following two characters as language
            lindex = line.rfind("\"", 0, rindex-1)  # find the preceding quotation
            if lindex != -1:  # found, we have a match
                if language not in file_handles:  # add a file handle for this language:
                    file_handles[language] = open("outputfile-{}.txt".format(language), "w")
                # write the found slice between `lindex` and `rindex` + a new line
                file_handles[language].write(line[lindex+1:rindex-1] + "\n")
    for handle in file_handles.values():  # lets close our output file handles
        handle.close()
This should be noticeably faster than regex, and as a bonus it works with any language, so if you have ...@it lines it will also write outputfile-it.txt.
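To see what the two `rfind` calls select, here is a small sketch that applies the same slicing to a single line. The helper name `split_label` is hypothetical and only used for illustration; it assumes, like the code above, that the language tag is exactly two characters long.

```python
def split_label(line):
    """Return (language, label) for one input line, or None if it does not match."""
    rindex = line.rfind("@")  # position of the last `@`
    if rindex == -1:
        return None
    # the closing quote sits at rindex - 1, so search only the text before it
    lindex = line.rfind('"', 0, rindex - 1)
    if lindex == -1:
        return None
    # two characters after `@` are the language, the slice between the quotes is the label
    return line[rindex + 1:rindex + 3], line[lindex + 1:rindex - 1]


sample = '<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .\n'
print(split_label(sample))  # ('en', 'Political inclusion')
```

Because the end argument of `rfind` is exclusive, passing `rindex - 1` skips the closing quote and lands on the opening one.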
Answer 1: (score: 1)
You can do it like this:
import re

# sample data; `text` is used instead of `str` to avoid shadowing the built-in
text = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"@de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"@en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"@de . """
german = re.compile('"(.*)"@de')  # capture the quoted text before @de
english = re.compile('"(.*)"@en')  # capture the quoted text before @en
print(german.findall(text))
print(english.findall(text))
This gives you ['Politische Inklusion', 'Radiologische Kampfmittel'] and ['Political inclusion']. Now you only need to iterate over these results and write them to the respective files.
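A minimal sketch of that last step, assuming the whole input is already in a string. The helper name `write_labels`, its `directory` parameter, and the tighter `[^"]*` pattern are illustrative choices and not part of the answer above.

```python
import os
import re


def write_labels(text, languages=("en", "de"), directory="."):
    """Write the labels for each language tag to outputfile-<lang>.txt."""
    found = {}
    for lang in languages:
        # [^"]* keeps each match inside a single pair of quotes
        pattern = re.compile(r'"([^"]*)"@' + re.escape(lang))
        labels = pattern.findall(text)
        found[lang] = labels
        path = os.path.join(directory, "outputfile-{}.txt".format(lang))
        with open(path, "w", encoding="utf-8") as out:
            for label in labels:
                out.write(label + "\n")  # one label per line
    return found
```

Calling `write_labels(open("input.txt", encoding="utf-8").read())` would then produce outputfile-en.txt and outputfile-de.txt in the current directory. Unlike the streaming approach, this scans the text once per language.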