Question

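I have a text file and I want to match/find/parse all characters between certain characters ( [\n" text to match "\n] ). The text itself can vary a lot in structure and characters (it can contain every possible character).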

I posted this question before (sorry for the duplicate), but it has not been solved so far, so I am now trying to state the problem more precisely.

The text in the file is built up like this:

    test =""" 
        [
        "this is a text and its supposed to contain every possible char."
        ], 
        [
        "like *.;#]§< and many "" more."
        ], 
        [
        "plus there are even
newlines

in it."
        ]"""

My desired output would be a list (for example) with each text between the separators as an element, like this:

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']

I tried to solve it with regex and came up with two attempts, shown here with their outputs:

my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']

Well, this one is close. It lists the first two elements as it should, but unfortunately not the third one, because it contains newlines.

my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)

['this is a text and its supposed to contain every possible char."\n        ], \n        [\n        "like *.;#]§< and many "" more."\n        ], \n        [\n        "plus there are even\nnewlines\n        \n        in it.']

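OK, this time every element is included, but the list contains only a single element, and the lookahead does not seem to work the way I thought it would.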

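So what is the right regex to get my desired output? And why does the second approach seem to ignore the lookahead?

Or is there an even cleaner, faster way to get what I want (BeautifulSoup or some other method)?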

Thanks a lot for any help and hints.

I am using Python 3.6.

Answer 1

You should use the DOTALL flag so that the dot also matches newlines, together with a non-greedy capture group:

print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))

Output:

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
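As for why the second attempt in the question returned a single element: the lookahead itself most likely works, but [\s\S]* is greedy, so it runs to the end of the string and backtracks only as far as the last place where the lookahead succeeds. A minimal sketch of the difference, using a small made-up sample (the variable names and the 4-space indent are assumptions for illustration, not the original file):

```python
import re

# Hypothetical sample in the same shape as the question's data,
# assuming a 4-space indent before each quote and bracket.
sample = '[\n    "first"\n    ],\n    [\n    "second\nline"\n    ]'

# Greedy: [\s\S]* grabs everything, then backtracks to the LAST position
# where the lookahead matches, so all blocks end up in one match.
greedy = re.findall(r'(?<=\[\n {4}")[\s\S]*(?="\n {4}\])', sample)

# Lazy: [\s\S]*? stops at the FIRST position where the lookahead matches.
lazy = re.findall(r'(?<=\[\n {4}")[\s\S]*?(?="\n {4}\])', sample)

print(len(greedy))  # 1
print(lazy)         # ['first', 'second\nline']
```

The non-greedy (.*?) in the pattern above does the same job as the lazy quantifier here.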

Answer 2

You can use the pattern

(?s)\[[^"]*"(.*?)"[^]"]*\]

to capture each element inside the double quotes within the brackets. Demo:

https://regex101.com/r/SguEAU/1
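To make the pieces of that pattern easier to follow, here is the same expression written with re.VERBOSE so that each part can carry a comment. This is only an annotated sketch; the name BLOCK and the sample string are made up for illustration, and re.DOTALL stands in for the inline (?s) flag.

```python
import re

# Same pattern as above, one piece per line. re.DOTALL replaces (?s).
BLOCK = re.compile(r'''
    \[          # opening bracket of one block
    [^"]*       # anything up to the first double quote (newline, indent)
    "           # opening quote
    (.*?)       # the text to capture, as short as possible
    "           # candidate closing quote
    [^]"]*      # then only characters that are neither a bracket nor a quote
    \]          # closing bracket of the block
''', re.VERBOSE | re.DOTALL)

# Hypothetical sample with embedded quotes and a newline inside the text.
sample = '[\n  "a ""quoted"" word"\n  ],\n  [\n  "two\nlines"\n  ]'

print(BLOCK.findall(sample))  # ['a ""quoted"" word', 'two\nlines']
```

One caveat worth keeping in mind: a quote inside the text that is followed by a closing bracket with no further quote in between would end the capture early, so this relies on the data not containing that combination.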

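Then you can use a list comprehension with re.sub to replace every run of whitespace characters (including newlines) in each captured substring with a single regular space: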

import re

test ="""
    [
    "this is a text and its supposed to contain every possible char."
    ],
    [
    "like *.;#]§< and many "" more."
    ],
    [
    "plus there are even
newlines

in it."
    ]"""

# Raw strings avoid the invalid-escape warning for \s.
output = [re.sub(r'\s+', ' ', m.group(1)) for m in re.finditer(r'(?s)\[[^"]*"(.*?)"[^]"]*\]', test)]

Result:

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
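For the whitespace step on its own, a possible alternative to re.sub is str.split() with no arguments followed by ' '.join(). This is a swapped-in technique, not part of the answer above, and the variable names are made up. The two give the same result for the sample text, but split/join also drops leading and trailing whitespace, which may or may not be what you want:

```python
import re

# One captured string as it comes out of the regex, newlines included.
captured = 'plus there are even\nnewlines\n\nin it.'

with_sub = re.sub(r'\s+', ' ', captured)
with_split = ' '.join(captured.split())

print(with_sub)    # plus there are even newlines in it.
print(with_split)  # plus there are even newlines in it.

# The difference shows up only at the edges of the string.
padded = '  padded\n text  '
padded_sub = re.sub(r'\s+', ' ', padded)
padded_split = ' '.join(padded.split())

print(repr(padded_sub))    # ' padded text '
print(repr(padded_split))  # 'padded text'
```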
