Question

I'm looking for a regular expression that identifies a block in a template, so that I can supply text to replace the whole block:

<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>

and get something like this:

<div>
 mytext
</div>

Answer 1

Try:

re.sub('\{.*[\w\s]*.*\}','mytext',txt)

Output:

'<div>\n mytext\n</div>'

\{ matches the first opening brace, then .*[\w\s]*.* matches everything that follows (including spaces and newlines), up to the last closing brace, \}.
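
As a quick check of that walkthrough, here is a minimal Python 3 sketch that runs the pattern on the template from the question. The variable names are only illustrative, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# Template from the question.
txt = "<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>"

# \{ anchors on the first brace; .* runs to the end of that line;
# [\w\s]* crosses the newlines and the body; .*\} ends at the last brace.
pattern = r"\{.*[\w\s]*.*\}"

result = re.sub(pattern, "mytext", txt)
print(repr(result))  # '<div>\n mytext\n</div>'
```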

You can be more specific with something like this:

re.sub('\{% for link in links.*[\w\s]*.*endfor %\}','mytext',txt)

Then you can be sure it only matches the kind of for loop you specified.
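
To illustrate that the more specific pattern leaves other tags alone, here is a small Python 3 sketch. The {% if user %} line is a made-up addition to the question's template, not part of the original, and the closing tag is spelled endfor to match the template.

```python
import re

# Question's template plus a hypothetical {% if %} block that must survive.
txt = (
    "<p>{% if user %}hi{% endif %}</p>\n"
    "<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>"
)

# Only a block that opens with this exact for-tag and closes with endfor matches.
pattern = r"\{% for link in links.*[\w\s]*.*endfor %\}"

result = re.sub(pattern, "mytext", txt)
print(result)
```

The if/endif tags are untouched because the literal text "for link in links" never matches at their position.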

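EDIT: eyquem pointed out that my answer falls short in many cases, particularly when the block has symbols in the middle. At the risk of naively misunderstanding why my solution didn't work, I simply added an extra bit to my pattern, which even matches his test cases, so we'll see whether it holds up: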

re.sub('\{.*[\W\w\s]*.*\}', 'mytext', txt)

Result (where txt is eyquem's Pink Floyd example):

"Pink Floyd"
<div>
 mytext
</div>
"Fleetwood Mac"

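So I think adding all the non-alphanumeric symbols fixes it. Or I may have broken it even more obviously for some other case. I'm sure someone will point it out. :)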
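
A minimal Python 3 sketch of why the extra \W helps, run on eyquem's Pink Floyd text: together, \W and \w cover every character, so the class no longer stops at the } in the body. The variable names are illustrative.

```python
import re

txt = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

# Original class: [\w\s]* stops at the stray } inside the body.
old = re.sub(r"\{.*[\w\s]*.*\}", "mytext", txt)

# Widened class: [\W\w\s]* matches any character, so the match
# runs from the first { to the last } in the text.
new = re.sub(r"\{.*[\W\w\s]*.*\}", "mytext", txt)

print(new)
```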

EDIT2: It should also be noted that both of our solutions fail if there is more than one for loop on the page. For example:

"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"
{ for link in links % }
   asdfasdfas
{% endfor% }

yields

"Beatles"
<div>
 mytext

The match runs on through the next block, cutting off everything that remains.
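
The failure can be reproduced with a short Python 3 sketch. The second loop below uses well-formed tags chosen for illustration rather than the exact text above; the greedy pattern still swallows everything up to the final }.

```python
import re

txt = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"
{% for link in links %}
   asdfasdfas
{% endfor %}'''

# Greedy: the match runs from the first { to the last } in the whole text.
result = re.sub(r"\{.*[\W\w\s]*.*\}", "mytext", txt)
print(repr(result))  # '"Beatles"\n<div>\n mytext'
```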

Edit 2: eyquem is right again: with a ? after the quantifier, the rest is no longer cut off. His fix also solves my problem:

re.sub('\{.*[\W\w\s]*?.*\}', 'mytext', txt)

is the new pattern.
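
Here is a hedged Python 3 sketch of the non-greedy pattern on a text with two loops (the sample text is illustrative). Because [\W\w\s]*? grows one character at a time, each match ends on the first line that offers a closing }.

```python
import re

txt = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"
{% for link in links %}
   asdfasdfas
{% endfor %}'''

# Lazy middle part: each loop is matched and replaced on its own.
result = re.sub(r"\{.*[\W\w\s]*?.*\}", "mytext", txt)
print(result)
```

Note that a } inside a loop body would still end a match early, since the lazy part stops at the first line that contains one.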

Answer 2

I'm sorry to say that Logan's answer doesn't work in the following cases:

import re

ss1 = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

pat = '(\{.*)([\w\s]*)(.*)(\})'
print ss1
print '---------------------------'
for el in re.findall(pat,ss1):
    print el
print '---------------------------'
print re.sub(pat,':::::',ss1)

RESULT

"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee  # <--------- } here
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
---------------------------
('{% for link in links %}', '\n    aaaY', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Pink Floyd"
<div>
 :::::eee
    12345678
 :::::
</div>
"Fleetwood Mac"


import re

ss2 = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''

pat = '(\{.*)([\w\s]*)(.*)(\})'
print ss2
print '---------------------------'
for el in re.findall(pat,ss2):
    print el
print '---------------------------'
print re.sub(pat,':::::',ss2)

RESULT

"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu  # <-------- = here
    12345678
 {% endfor %}
</div>
"Tino Rossi"
---------------------------
('{% for link in links %', '', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Beatles"
<div>
 :::::
    iiiY=uuu
    12345678
 :::::
</div>
"Tino Rossi"

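The problem is the following (the findall() results are printed in my code to help understanding):

The first .* runs on as long as it does not meet a newline. [\w\s]* runs on as long as it meets characters of these categories: letters, digits, underscore, whitespace.
Whitespace includes the newline, so [\w\s]* can pass from one line to the next. But if [\w\s]* meets a character outside these categories, it stops at that character.

If that character is a }, the following .* matches the empty string '' and \} matches the }. The regex then searches for the next match.

If that character is a =, the following .* cannot reach the closing } of the block, because it cannot get past the next newline. The engine backtracks and ends the match at the } of the opening tag instead, hence a different result from the case of a } in the text.

Replacing .* with .+ changes nothing, as you will see if you make that replacement in the code above.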
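
The explanation above can be checked piece by piece. This Python 3 sketch (names are illustrative) matches growing prefixes of the pattern against the start of the second example to show where each piece stops.

```python
import re

text = "{% for link in links %}\n    iiiY=uuu\n    12345678\n {% endfor %}"

# .* alone cannot cross the newline, so it stops at the end of the tag line.
first = re.match(r"\{.*", text).group()

# [\w\s]* crosses the newline and the indentation but stops at '='.
second = re.match(r"\{.*[\w\s]*", text).group()

# The full pattern finds no } on the '=' line, backtracks,
# and settles for the opening tag alone.
whole = re.match(r"\{.*[\w\s]*.*\}", text).group()

print(repr(first))   # '{% for link in links %}'
print(repr(second))  # '{% for link in links %}\n    iiiY'
print(repr(whole))   # '{% for link in links %}'
```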

My solution

I propose the pattern in this code:

import re
pat = ('\{%[^\r\n]+%\}'
       '.+?'
       '\{%[^\r\n]+%\}')


ss = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''


print '\n',ss,'\n\n---------------------------\n'
print re.sub(pat,':::::',ss,flags=re.DOTALL)

which gives

"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi" 

---------------------------

"Pink Floyd"
<div>
 :::::
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 :::::
</div>
"Tino Rossi"

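For readers on Python 3, here is a minimal sketch of the same pattern wrapped in a helper. The function name replace_blocks and the shortened sample text are assumptions made for illustration; the pattern and the re.DOTALL flag are taken from the code above.

```python
import re

# Opening tag on one line, lazy body, closing tag on one line.
BLOCK = re.compile(
    r"\{%[^\r\n]+%\}"   # e.g. {% for link in links %}
    r".+?"              # body, as short as possible (DOTALL lets . cross lines)
    r"\{%[^\r\n]+%\}",  # e.g. {% endfor %}
    flags=re.DOTALL,
)

def replace_blocks(text, replacement):
    """Replace every {% ... %} ... {% ... %} block in text (hypothetical helper)."""
    return BLOCK.sub(replacement, text)

sample = '''<div>
 {% for link in links %}
    aaaY}eee
 {% endfor %}
</div>
<div>
 {% for link in links %}
    iiiY=uuu
 {% endfor %}
</div>'''

print(replace_blocks(sample, ":::::"))
```

The pattern pairs each tag with the next tag it finds, so it assumes blocks are not nested.
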
Edit

Simpler:

pat = ('\{%[^}]+%\}'
       '.+?'
       '\{%[^}]+%\}')

but only when the {%.....%} lines do not contain the } character.
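
A small Python 3 sketch of that caveat. The tag {% for k in {'a': 1} %} is a hypothetical example of a tag line containing }, invented here for illustration.

```python
import re

text = "<div>\n {% for k in {'a': 1} %}\n     textext\n {% endfor %}\n</div>"

by_line = r"\{%[^\r\n]+%\}.+?\{%[^\r\n]+%\}"   # tag content may hold any character but a line break
by_brace = r"\{%[^}]+%\}.+?\{%[^}]+%\}"        # tag content may not contain }

line_result = re.sub(by_line, ":::::", text, flags=re.DOTALL)
brace_result = re.sub(by_brace, ":::::", text, flags=re.DOTALL)

print(repr(line_result))     # '<div>\n :::::\n</div>'
print(brace_result == text)  # True: the simpler pattern finds no match here
```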

Answer 3

The sledgehammer approach would be:

In [540]: txt = """<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>"""

In [541]: txt
Out[541]: '<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>'

In [542]: re.sub("(?s)<div>.*?</div>", "<div>mytext</div>", txt)
Out[542]: '<div>mytext</div>'
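
The same call as a standalone Python 3 sketch. The first <div> in the sample is an invented addition, included to show why this is a sledgehammer: every <div>, templated or not, gets replaced.

```python
import re

txt = (
    "<div>plain content</div>\n"
    "<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>"
)

# (?s) is the inline form of re.DOTALL; .*? stops at the nearest </div>.
result = re.sub(r"(?s)<div>.*?</div>", "<div>mytext</div>", txt)
print(repr(result))  # '<div>mytext</div>\n<div>mytext</div>'
```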
