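How do I keep characters I don't want removed when using a regular expression in Python?

Time: 2019-04-30 09:18:32

Tags: python regex python-3.x

I use this code to remove all tag elements from an HTML string.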

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('<[^>]*>', '', MyString)
print(MyString)

The output is:

aaaRadio and television.very popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

But now I need to keep <br> and <br/>.

I want the output to look like this:

aaaRadio and television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

How should I modify my code?

1 answer:

Answer 0 (score: 1)

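You can capture the <br> tag in group 1 and match any other tag with a second alternative, then replace the whole match with \1. That keeps the <br> tags and removes all the others. Replace

(?i)(<br\/?>)|<[^>]*>

with \1. The (?i) inline modifier makes the regex case-insensitive so that it also matches <BR> or <BR/>. Alternatively, you can pass flags=re.IGNORECASE to re.sub (by keyword, because the fourth positional parameter of re.sub is count, not flags).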
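The keyword form of the flag can be sketched as follows. This is a minimal illustration, assuming a shortened sample string; the names sample, pattern, and cleaned are made up for this example and do not come from the question.

```python
import re

# Shortened, illustrative sample; not the string from the question.
sample = 'a<p>b<BR>c<br/>d</p>e'

# Same two alternatives as the regex above, without the inline (?i) modifier.
pattern = r'(<br/?>)|<[^>]*>'

# Pass the flag by keyword: the fourth positional parameter of re.sub is count.
# When group 1 did not take part in a match, \1 is replaced with an empty
# string (Python 3.5 and later), which is what deletes the other tags.
cleaned = re.sub(pattern, r'\1', sample, flags=re.IGNORECASE)
print(cleaned)  # ab<BR>c<br/>de
```

Passing re.IGNORECASE positionally as the fourth argument would be interpreted as a replacement count rather than a flag, so the keyword form is the safer habit.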

Regex Demo

Your updated Python code:

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)

prints the string with only the br tags kept and all other tags removed:

aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

As another approach, you can use a negative lookahead to reject br tags with this regex,

(?i)<(?!br/?>)[^>]*>

and then replace each match with an empty string.

Regex Demo using negative lookahead to reject

Python code using the negative lookahead regex:

import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)<(?!br/?>)[^>]*>', r'', MyString)
print(MyString)

prints

aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
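If more than one tag needs to survive, the negative lookahead idea can be generalized by building the pattern from a list of tag names. The following is only a sketch under that assumption: strip_tags_except and its keep parameter are hypothetical names introduced here, and the pattern additionally tolerates whitespace before the slash (as in <BR />), which the regex above does not.

```python
import re

def strip_tags_except(html, keep=("br",)):
    """Remove every tag from html except bare tags named in keep (hypothetical helper)."""
    # Build an alternation such as "br|hr"; re.escape guards against odd names.
    names = "|".join(re.escape(name) for name in keep)
    # Negative lookahead: do not match a tag whose name is in keep and is
    # followed only by optional whitespace, an optional slash, and ">".
    pattern = r"<(?!(?:%s)\s*/?>)[^>]*>" % names
    return re.sub(pattern, "", html, flags=re.IGNORECASE)

print(strip_tags_except("a<p>b<br>c<BR />d<hr>e</p>f"))
# ab<br>c<BR />def
print(strip_tags_except("a<p>b<br>c<hr>d</p>e", keep=("br", "hr")))
# ab<br>c<hr>de
```

Like the regex above, this keeps only bare tags: a kept tag that carries attributes, or a closing tag such as </br>, is still removed.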