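I have a text file and want to match/find/parse all characters between certain delimiters ([\n" text to match "\n]). The text itself can vary widely in structure and content (it may contain any possible character).
I have posted this question before (sorry for the duplicate), but it has not been solved so far, so I want to state the problem more precisely this time.
The text in the file is structured like this: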
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
The output I want is a list (for example) in which each text between the delimiters is one element, like this:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
I tried to solve it with a regex. Here are the two attempts I came up with, each followed by its output:
my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']
OK, this is close. It lists the first two elements as it should, but unfortunately not the third, because that one contains newlines.
my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char."\n ], \n [\n "like *.;#]§< and many "" more."\n ], \n [\n "plus there are even\nnewlines\n \n in it.']
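OK, this time every element is included, but the list contains only a single element, and the lookahead does not seem to work the way I expected.
So what is the right regex to get the output I want? Why does the second attempt not respect the lookahead?
Or is there a cleaner, faster way to get what I want (BeautifulSoup or some other method)?
Many thanks for any help and hints.
I am using Python 3.6.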
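Answer 0 (score: 1)
You should use the DOTALL flag so that the dot (.) also matches newlines, together with a non-greedy quantifier (.*?):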
print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))
Output:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
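As for why the second attempt in the question returned a single element: [\s\S]* is greedy, so it runs to the end of the string and backtracks only as far as the last position where the lookahead succeeds. The sketch below contrasts the greedy and non-greedy forms on a shortened sample. The sample text and its 8-space indentation are assumptions made for illustration, not the asker's actual file.

```python
import re

# Shortened sample modelled on the question; the indentation is an assumption.
test = (
    '\n'
    '        [\n'
    '        "first"\n'
    '        ],\n'
    '        [\n'
    '        "second\n'
    '        line"\n'
    '        ]'
)

# Greedy: .* grabs as much as possible, so everything from the first
# opening quote to the last closing quote ends up in one match.
greedy = re.findall(r'\[\n\s+"(.*)"\n\s+\]', test, re.DOTALL)

# Non-greedy: .*? stops at the first place where the rest of the pattern fits.
lazy = re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL)

print(len(greedy))  # 1
print(lazy)         # ['first', 'second\n        line']
```

Note that the non-greedy result still contains the original newline and indentation inside the second element; they have to be normalized separately if a single-line string is wanted.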
Answer 1 (score: 1)
You can use the pattern
(?s)\[[^"]*"(.*?)"[^]"]*\]
to capture each element enclosed in double quotes inside the brackets:
https://regex101.com/r/SguEAU/1
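You can then use a list comprehension with re.sub to replace each run of whitespace characters (including newlines) in every captured substring with a single ordinary space: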
import re

test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
# Collapse every whitespace run in each captured group to a single space.
output = [re.sub(r'\s+', ' ', m.group(1)) for m in re.finditer(r'(?s)\[[^"]*"(.*?)"[^]"]*\]', test)]
print(output)
Result:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
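For readability, the same pattern can be written with re.VERBOSE and wrapped in a small helper. This is only a sketch: the names ITEM_PATTERN, extract_items and sample are hypothetical, and the sample string is made up for illustration.

```python
import re

# Commented form of the pattern above; re.VERBOSE ignores layout whitespace
# and comments outside character classes.
ITEM_PATTERN = re.compile(
    r'''
    \[        # opening bracket
    [^"]*     # anything before the first double quote
    "         # opening quote
    (.*?)     # the text to capture, as short as possible
    "         # closing quote candidate
    [^]"]*    # only non-quote, non-bracket characters may follow
    \]        # closing bracket
    ''',
    re.DOTALL | re.VERBOSE,
)


def extract_items(text):
    """Return each quoted block with whitespace runs collapsed to one space."""
    return [re.sub(r'\s+', ' ', m.group(1)) for m in ITEM_PATTERN.finditer(text)]


sample = '[\n  "a ""quoted"" word"\n],\n[\n  "two\n  lines"\n]'
print(extract_items(sample))  # ['a ""quoted"" word', 'two lines']
```

One limitation to keep in mind: a closing quote is accepted as soon as only non-quote characters separate it from the next ], so a text containing a double quote followed later by ] with no further quote in between would be cut short at that point.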