Remove tags from HTML, except specific ones (but keep their contents)

Date: 2019-05-06 09:01:52

Tags: python regex python-3.x parsing html-parsing

I use this code to remove all tag elements from HTML. I need to keep <br> and <br/>, so I use this code:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)

Output:

aaaRadio and<BR> television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

The result is correct, but now I also want to keep <p> and </p> in addition to <br> and <br/>.

How can I modify my code?

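3 Answers:

Answer 0: (score: 2)

Using an HTML parser is more robust than using regular expressions. Regular expressions are not suited to parsing nested structures such as HTML.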

Here is a working implementation that iterates over all HTML tags and, for every tag that is not p or br, unwraps it (removes the tag but keeps its contents):

from bs4 import BeautifulSoup

mystring = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'

soup = BeautifulSoup(mystring,'html.parser')
for e in soup.find_all():
    if e.name not in ['p','br']:
        e.unwrap()
print(soup)
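
If installing BeautifulSoup is not an option, the same whitelist idea can be sketched with the standard library's html.parser. This is only an illustrative variant, not part of the answer above: the class name TagFilter and the function filter_tags are hypothetical, and the sketch assumes reasonably well-formed input. Text inside script or style elements would be kept as plain text.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Re-emit the input, dropping every tag whose name is not in `keep`."""

    def __init__(self, keep=("p", "br")):
        # convert_charrefs=False so entities such as &amp; are passed through as written
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # get_starttag_text() returns the tag exactly as written, e.g. "<BR>"
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it once, with no separate end tag
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            # The parser lowercases tag names, so </P> comes out as </p>
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def filter_tags(html, keep=("p", "br")):
    """Return `html` with all tags removed except those named in `keep`."""
    parser = TagFilter(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular <span style="color: black;">98.2%</span></p>bb'
print(filter_tags(sample))
```

Comments and doctype declarations are silently dropped here because the corresponding handlers are left at their default no-op behaviour.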

Answer 1: (score: 0)

I'm not sure regex is the right solution here, but since you asked:

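import re
# html is assumed to already hold the input markup as a string
html = html.replace("<p>", "{p}").replace("</p>", "{/p}")  # protect the p tags
txt = re.sub("<[^>]*>", "", html)  # strip every remaining tag
txt = txt.replace("{p}", "<p>").replace("{/p}", "</p>")  # restore the p tags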

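I basically change the p tags to another token and put them back after all the other tags have been removed.

Using regex to parse HTML is generally not a good idea.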
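
Note that the token swap above only protects the literal strings <p> and </p>, so <P>, <p class="x">, and the <br> tags from the question would still be stripped. One possible way to avoid the placeholder round trip is a single re.sub pass with a replacement function that checks each tag name against a whitelist. The following is a rough sketch; the names TAG_RE and strip_tags_except are made up for illustration, and it shares the usual limits of regex-based HTML handling (for example a ">" inside an attribute value).

```python
import re

# Group 1 captures the tag name when the tag looks like <name ...> or </name ...>;
# anything else between < and > falls through to the second alternative.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>|<[^>]*>")


def strip_tags_except(html, keep=("p", "br")):
    """Remove every tag from `html` except those whose name is in `keep`."""
    allowed = {name.lower() for name in keep}

    def replace(match):
        name = match.group(1)
        # Keep whitelisted tags untouched (including attributes), drop the rest
        if name and name.lower() in allowed:
            return match.group(0)
        return ""

    return TAG_RE.sub(replace, html)


sample = 'aaa<p>Radio and<BR> television.<br></p><p class="x">very<br/> popular <span>98.2%</span></p>bb'
print(strip_tags_except(sample))
```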

Answer 2: (score: 0)

Now I know how to modify it, but the first <p> is lost.

My code:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
# MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>(</?p>)|<[^>]*>', r'\1\2', MyString)
print(MyString)

Output:

aaaRadio and<BR> television.<br><p>very<br/> popular in the world today.<p>Millions of people watch TV. <p>That’s because a radio is very small 98.2%</p>and it‘s easy to carry. haha100%</p>bb
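
The lost tags appear to come from the second alternative, <[^>]*>(<\/?p>), which keeps a p tag only when another tag sits directly in front of it, and even then consumes that preceding tag. A p tag that follows plain text falls through to the last alternative and is removed. A possible fix, sketched here under the assumption that only bare <p>, </p>, <br> and <br/> need to survive, is to put the p tags into the same capturing group as the br tags. The names KEEP_RE and short_sample are illustrative only.

```python
import re

# Group 1 matches the tags to keep; every other tag matches the second
# alternative, leaves group 1 empty, and is therefore replaced by nothing.
KEEP_RE = re.compile(r"(</?p>|<br\s*/?>)|<[^>]*>", re.IGNORECASE)

short_sample = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular.</p><p>small <span>98.2%</span></p>bb'
cleaned = KEEP_RE.sub(r"\1", short_sample)
print(cleaned)
```

This relies on Python 3.5 or later, where a backreference to a group that did not participate in the match is replaced with an empty string.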