I use this code to remove all tag elements from an HTML string.
import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('<[^>]*>', '', MyString)
print(MyString)
The output is:
aaaRadio and television.very popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
But now I need to keep <br> and <br/>.
I want the output to look like this:
aaaRadio and television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
How can I modify my code?
Answer 0 (score: 1)
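You can capture the <br> tags in group 1, match any other tag with a second alternative, and replace the whole match with \1. This keeps the <br> tags and removes every other tag: when a non-br tag matches, group 1 is unmatched and \1 expands to an empty string (Python 3.5 and later). Replace
(?i)(<br\/?>)|<[^>]*>
with \1. The (?i) inline modifier makes the regex case-insensitive, so it also matches <BR> and <BR/>. Alternatively, pass flags=re.IGNORECASE to re.sub. Pass it by keyword: the fourth positional argument of re.sub is count, not flags.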
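As a minimal sketch of the keyword-flag variant, the pattern can be compiled once with re.IGNORECASE instead of embedding (?i). The names BR_OR_TAG and keep_br are made up for this illustration and the sample string is shortened; neither comes from the original answer.

```python
import re

# Hypothetical names for illustration; not part of the original answer.
# Group 1 matches <br> or <br/>; the second alternative matches any other tag.
BR_OR_TAG = re.compile(r'(<br/?>)|<[^>]*>', flags=re.IGNORECASE)

def keep_br(html):
    # \1 re-inserts a matched <br> tag. For any other tag, group 1 is
    # unmatched and expands to an empty string (Python 3.5+).
    return BR_OR_TAG.sub(r'\1', html)

print(keep_br('aaa<p>Radio.<br></p><p>very<BR/> popular</p>bb'))
```

Compiling the pattern avoids the positional-argument pitfall entirely, because the flag is attached to the pattern object rather than to the sub call.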
Your updated Python code:
import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)
prints the string with only the br tags kept and all other tags removed:
aaaRadio and television.<br>very<br/> popular <BR>in the <BR/>world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
As another approach, you can use a negative lookahead to skip the br tags with this regex:
(?i)<(?!br/?>)[^>]*>
and then replace each match with an empty string.
Python code using the negative lookahead regex:
import re
MyString = 'aaa<p>Radio and television.<br></p><p>very<br/> popular <BR>in the <BR/>world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub('(?i)<(?!br/?>)[^>]*>', r'', MyString)
print(MyString)
This prints the same string as the capture-group version.
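To check that the two approaches agree, here is a small sketch on a shortened sample. The function names strip_capture and strip_lookahead and the SAMPLE string are hypothetical, chosen only for this comparison. The sample also includes a `<br />` with a space, which illustrates a limitation: both patterns exempt only the exact forms `<br>` and `<br/>`, so a spaced or attributed br tag is removed like any other tag.

```python
import re

# Hypothetical sample and helper names, used only for this comparison.
SAMPLE = 'a<p>x<br>y<BR/>z</p><span>w</span><br />b'

def strip_capture(html):
    # Approach 1: capture br tags in group 1 and put them back with \1.
    return re.sub(r'(<br/?>)|<[^>]*>', r'\1', html, flags=re.IGNORECASE)

def strip_lookahead(html):
    # Approach 2: refuse to match a tag that is exactly <br> or <br/>.
    return re.sub(r'<(?!br/?>)[^>]*>', '', html, flags=re.IGNORECASE)

print(strip_capture(SAMPLE))
print(strip_lookahead(SAMPLE))
```

If variants such as `<br />` should also be kept, the br part of either pattern would need to be loosened, for example by allowing optional whitespace before the slash.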