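BeautifulSoup: uncomment an HTML comment by tag id

Date: 2020-05-28 19:35:50

Tags: python beautifulsoup

I want to use BeautifulSoup to change the HTML below by uncommenting the commented-out tag, selected by its id.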

<div class="foo">
 cat dog sheep goat
 <!--<p id="p1">test</p>-->
 <p id="p2">
  test
 </p>
</div>

This is my expected result:

<div class="foo">
 cat dog sheep goat
 <p id="p1">test</p>
 <p id="p2">
  test
 </p>
</div>

Here is my Python code using BeautifulSoup, but I don't know how to finish it.

from bs4 import BeautifulSoup,Comment

data = """<div class="foo">
cat dog sheep goat
<!--<p id="p1">test</p>-->
<p id='p2'>test</p>
</div>"""
soup = BeautifulSoup(data, 'html.parser')

for comment in soup(text=lambda text: isinstance(text, Comment)):
    if 'id="p1"' in comment.string: 
        # I don't know how to complete it here.
        # This is my incorrect solution
        # It will output "&lt;p id="p1"&gt;test&lt;/p&gt;",
        # not "<p id='p1'>test</p>"    
        comment.replace_with(comment.string.replace("<!--", "").replace("-->", ""))  
        break   

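Any help would be appreciated.

2 answers:

Answer 0 (score: 2)

You can pass a new soup, rather than a string, to .replace_with(). A plain string is inserted as a text node, so its angle brackets are escaped on output; parsing the comment's content first gives you real tags.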

from bs4 import BeautifulSoup,Comment

data = """<div class="foo">
 cat dog sheep goat
 <!--<p id="p1">test</p>-->
 <p id="p2">
  test
 </p>
</div>"""
soup = BeautifulSoup(data, 'html.parser')

print('Original soup:')
print('-' * 80)
print(soup)
print()

for comment in soup(text=lambda text: isinstance(text, Comment)):
    if 'id="p1"' in comment.string:
        tag = BeautifulSoup(comment, 'html.parser')
        comment.replace_with(tag)
        break

print('New soup:')
print('-' * 80)
print(soup)
print()

Prints:

Original soup:
--------------------------------------------------------------------------------
<div class="foo">
 cat dog sheep goat
 <!--<p id="p1">test</p>-->
 <p id="p2">
  test
 </p>
</div>

New soup:
--------------------------------------------------------------------------------
<div class="foo">
 cat dog sheep goat
 <p id="p1">test</p>
 <p id="p2">
  test
 </p>
</div>
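
The same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name uncomment_by_id is made up for illustration, it parses each comment and looks for the id as a real attribute instead of doing a substring test, and it uses string= in place of the older text= argument. It restores every comment that contains the id, not just the first one.

```python
from bs4 import BeautifulSoup, Comment


def uncomment_by_id(html, tag_id):
    """Return html with every comment that holds a tag with id=tag_id restored.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body so it becomes tags instead of escaped text.
        fragment = BeautifulSoup(str(comment), "html.parser")
        if fragment.find(id=tag_id) is not None:
            comment.replace_with(fragment)
    return str(soup)


data = """<div class="foo">
 cat dog sheep goat
 <!--<p id="p1">test</p>-->
 <p id="p2">
  test
 </p>
</div>"""

print(uncomment_by_id(data, "p1"))
```

Checking the parsed fragment means a comment that merely mentions p1 in its text is left alone, at the cost of parsing every comment in the document.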

Answer 1 (score: 0)

Have you considered just using a regular expression instead of bs4?

Maybe this can help you get started.

>>> import re
>>> re.search("<!--((.*)p1(.*))-->", '<!--<p id="p1">test</p>-->')
<re.Match object; span=(0, 26), match='<!--<p id="p1">test</p>-->'>
>>> re.search("<!--((.*)p1(.*))-->", '<!--<p id="p1">test</p>-->').group(1)
'<p id="p1">test</p>'
>>> regex = re.compile("<!--((.*)p1(.*))-->")
>>> regex.sub('<p id="p1">test</p>', '<!--<p id="p1">test</p>-->')
'<p id="p1">test</p>'
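
Note that the sub() call above hardcodes the replacement text. A slightly more general sketch reuses the captured group instead and keeps the match from running across several comments. The function name uncomment_by_id_regex and the exact pattern are my own assumptions, and regular expressions remain fragile on real-world HTML, so treat this as a starting point only.

```python
import re


def uncomment_by_id_regex(html, tag_id):
    """Strip the <!-- --> wrapper from comments containing id="tag_id".

    Hypothetical helper for illustration.
    """
    # (?:(?!-->).)*? consumes characters without ever crossing a closing
    # "-->", so one match cannot span two separate comments.
    inner = r'(?:(?!-->).)*?'
    pattern = re.compile(
        r'<!--(?P<body>' + inner
        + r'\bid=["\']' + re.escape(tag_id) + r'["\']'
        + inner + r')-->',
        re.DOTALL,  # let "." match newlines inside multi-line comments
    )
    # A function replacement avoids backslash handling in the template.
    return pattern.sub(lambda m: m.group("body"), html)


print(uncomment_by_id_regex('<!--<p id="p1">test</p>-->', "p1"))
```

Comments that do not contain the requested id are left untouched.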