Question

I have a very long string that contains nested braces. I want to extract a pattern from it.

String text:

some random texts......
........................
........................
{{info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................
}} // End of topmost parenthesis
........................
..again some random text
........................
........................ // can also contain {{  }}
......End of string.

I want to extract all the text between the topmost braces, i.e.

Extracted string:

info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................

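Pattern:

1.) It starts with { followed by any number of {.

2.) After that, there can be any amount of whitespace.

3.) The first word after that is always info.

4.) Extract everything until this brace is closed.

What I have tried so far: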

re.findall(r'\{+[^\S\r\n]*info\s*(.*(?:\r?\n.*)*)\}+')

I know this is wrong, because it matches up to the last instance of } in the string. Can someone help me extract the text between these braces? TIA

Answer 1

You need to use a recursive approach:

{
    ((?:[^{}]|(?R))*)
}

Recursion is only supported by the newer regex module, not by the built-in re module; see a demo on regex101.com.
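As a rough sketch of how the recursive pattern could be applied to the question, the snippet below assumes the third-party regex package is installed (pip install regex). The names BALANCED and find_info_block_recursive and the sample string are made up for illustration. The pattern is the compact, single-line form of the one above, and the check for info is done in plain Python afterwards rather than inside the pattern.

```python
import regex  # third-party package; the built-in re module has no (?R)

# One balanced {...} group; (?R) re-enters the whole pattern for nested braces.
BALANCED = regex.compile(r"\{(?:[^{}]|(?R))*\}")


def find_info_block_recursive(text):
    """Return the first top-level {{...}} block whose first word is 'info'.

    The returned string includes the outer braces; None if nothing matches.
    """
    for m in BALANCED.finditer(text):
        # Drop the opening braces and leading whitespace, then test the first word.
        inner = m.group(0).lstrip("{").lstrip()
        if inner.startswith("info"):
            return m.group(0)
    return None


# Hypothetical sample mirroring the shape of the string in the question.
sample = "text {{ skip }}\n{{info a\n..{{ x }}..\n..{{ y }}..\n}}\nmore {{ z }} end"
block = find_info_block_recursive(sample)
```

finditer only yields non-overlapping top-level matches, so nested {{ ... }} groups are never reported on their own.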

Answer 2

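A workaround pattern is to match a line that starts with {{info and then match any 0+ characters, as few as possible, up to a line that contains only }}.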

re.findall(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$', s)

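See the regex demo.

Details

(?sm) - the re.DOTALL flag (. now matches line breaks) and the re.MULTILINE flag (^ now matches at line starts, $ at line ends)
^ - start of a line
{{ - a {{ substring
[^\S\r\n]* - 0+ horizontal whitespace characters
info - a literal substring
\s* - 0+ whitespace characters
(.*?) - Group 1: any 0+ characters, as few as possible
^}}$ - start of a line, }}, then end of the line.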
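A minimal sketch of the workaround on a made-up sample string follows. The braces are escaped here only for readability, and the sample text and the name extract_with_line_anchors are illustrative assumptions. Note that it only works when the closing }} sits alone on its own line.

```python
import re

# Same pattern as above, with the literal braces escaped for readability.
PATTERN = re.compile(r"(?sm)^\{\{[^\S\r\n]*info\s*(.*?)^\}\}$")


def extract_with_line_anchors(text):
    """Return the body of every {{info ... }} block closed by a bare '}}' line."""
    return PATTERN.findall(text)


# Hypothetical sample: the closing braces are the only thing on their line.
s = (
    "some random text\n"
    "{{info first line\n"
    "...{{ some text }}... // nested 1\n"
    "...{{ other }}...\n"
    "}}\n"
    "again {{ x }} text\n"
    "End of string."
)
bodies = extract_with_line_anchors(s)
```

If the closing line carries anything else, such as a trailing comment after }}, the ^\}\}$ part no longer matches and the list comes back empty.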

Answer 3

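This answer explains how to do this with recursion (it deals with round brackets, but is easy to adapt). Personally, though, I would just write it with a while loop: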

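b = 1                    # brace depth; the opening '{' found below counts as 1
i = si = s.index('{')    # si: index of the first '{' in s
i += 1                   # start scanning just after it
while b:                 # raises IndexError if the braces never balance
    if s[i] == '{': b += 1
    elif s[i] == '}': b -= 1
    i += 1

ss = s[si:i]             # the balanced block, outer braces included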

With your string defined as s, this sets the substring ss to:

>>> print(ss)
{{info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................
}}
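The loop above starts at the first { in the string, whatever follows it. A possible variant, sketched here under the assumption that the block to extract always opens with braces followed by info, first locates that opener with re and then counts brace depth from there. The names OPENER and extract_info_block are hypothetical.

```python
import re

# One or more '{', optional horizontal whitespace, then the word 'info'.
OPENER = re.compile(r"\{+[^\S\r\n]*info\b")


def extract_info_block(text):
    """Return the balanced block that starts at the first '{{info' opener.

    The result includes the outer braces. Returns None when there is no
    opener, and raises ValueError when the block is never closed.
    """
    m = OPENER.search(text)
    if m is None:
        return None
    depth = 0
    for i in range(m.start(), len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Slice from the first opening brace through this closing one.
                return text[m.start():i + 1]
    raise ValueError("unbalanced braces: the info block is never closed")


# Hypothetical sample with an unrelated {{ ... }} group before the info block.
sample = "before {{ skip }} text\n{{info a\n..{{ x }}..\n}}\nafter {{ z }}"
block = extract_info_block(sample)
```

Raising ValueError for an unclosed block is a design choice of this sketch; it replaces the IndexError that the bare loop would hit when it runs past the end of the string.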

Regex to find text between nested brackets
