Question

I'm currently using BeautifulSoup to reformat some HTML pages, and I've run into a bit of a problem.

My problem is that the original HTML has things like this:

<li><p>stff</p></li>

and

<li><div><p>Stuff</p></div></li>

and also

<li><div><p><strong>stff</strong></p></div><li>

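Using BeautifulSoup, I want to eliminate the div and p tags (if they exist) but keep the strong tag.

I've been looking through the Beautiful Soup documentation and couldn't find anything. Ideas?

Thanks.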

Answer 1

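You can use replaceWith (named replace_with in bs4) to accomplish what you want. You have to copy the element you want to use as the replacement and then pass it as the argument to replaceWith. The documentation for replaceWith is very clear on how to do this.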
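
A rough sketch of that approach, using the bs4 spelling replace_with. The sample markup is the question's third case with the final li closed, the variable names are made up for illustration, and using extract() to detach the strong element is just one reading of "copy the element":

```python
from bs4 import BeautifulSoup

# Sample markup from the question (assumption: the trailing <li> was meant to be </li>).
soup = BeautifulSoup("<li><div><p><strong>stff</strong></p></div></li>", "html.parser")

div = soup.li.div
# Detach the <strong> element from the tree so it can serve as the replacement.
replacement = div.find("strong").extract()
# Swap the whole <div> (and the now-empty <p> inside it) for the <strong> element.
div.replace_with(replacement)

result = str(soup)
print(result)
```

This only covers the case where a strong tag exists; for the plain-text cases the text node would have to be used as the replacement instead.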

Answer 2

This question probably refers to an older version of BeautifulSoup, because with bs4 you can simply use the unwrap function:

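from bs4 import BeautifulSoup

# the output below (with html/body wrappers) is what the lxml parser produces
s = BeautifulSoup('<li><div><p><strong>stff</strong></p></div><li>', 'lxml')
s.div.unwrap()
>> <div></div>
s.p.unwrap()
>> <p></p>
s
>> <html><body><li><strong>stff</strong></li><li></li></body></html>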
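
To handle all three cases from the question in one pass, something along these lines might work. The helper name unwrap_tags is made up here, and html.parser is used so that no html/body wrapper is added around the fragment:

```python
from bs4 import BeautifulSoup

def unwrap_tags(html, names):
    """Remove the given tags but keep whatever they contain."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a plain list, so the tree can be modified while looping.
    for tag in soup.find_all(names):
        tag.unwrap()
    return str(soup)

samples = [
    "<li><p>stff</p></li>",
    "<li><div><p>Stuff</p></div></li>",
    "<li><div><p><strong>stff</strong></p></div></li>",
]
results = [unwrap_tags(sample, ["div", "p"]) for sample in samples]
for line in results:
    print(line)
```

The strong tag survives because only div and p are in the list of names to unwrap.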

Answer 3

You can write your own function to strip all tags:

import re

def strip_tags(string):
    return re.sub(r'<.*?>', '', string)

strip_tags("<li><div><p><strong>stff</strong></p></div><li>")
'stff'
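
Since the function above also removes the strong tag, here is a hedged variant that strips only selected tags. The name strip_selected_tags and its names parameter are invented for this sketch, and a regex like this is only reasonable for simple, well-formed markup:

```python
import re

def strip_selected_tags(html, names=("div", "p")):
    """Remove opening and closing tags for the given names, keeping their contents."""
    alternatives = "|".join(re.escape(name) for name in names)
    # \b stops "p" from matching longer names such as <pre>;
    # [^>]* skips over any attributes inside the tag.
    pattern = r"</?(?:%s)\b[^>]*>" % alternatives
    return re.sub(pattern, "", html, flags=re.IGNORECASE)

result = strip_selected_tags("<li><div><p><strong>stff</strong></p></div></li>")
print(result)
```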

Answer 4

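I saw a lot of answers to this simple question, and I also came here looking for something useful, but unfortunately I didn't find what I wanted. After a few attempts I found a simple solution to this problem, and here it is: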

soup = BeautifulSoup(htmlData, "html.parser")

h2_headers = soup.find_all("h2")

for header in h2_headers:
    header.name = "h1" # replaces h2 tag with h1

All h2 tags are converted to h1. You can convert any tag just by changing its name.
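
A self-contained version of the same idea, with made-up sample markup and a hypothetical helper name rename_tags. Note that renaming keeps the tag and its attributes in place; it does not remove anything:

```python
from bs4 import BeautifulSoup

def rename_tags(html, old_name, new_name):
    """Rename every old_name tag to new_name, leaving attributes and contents alone."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(old_name):
        tag.name = new_name
    return str(soup)

result = rename_tags('<h2 class="title">One</h2><p>text</p><h2>Two</h2>', "h2", "h1")
print(result)
```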

Answer 5

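A simple solution is to take your whole node (that is, the div) and:

1. Convert it to a string.
2. Replace <tag> with the required tag/string.
3. Replace the corresponding closing tag with an empty string.

Then convert the modified string back into a parseable object by passing it to BeautifulSoup.

Here is what I did for mine.

Example: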

<div class="col-md-12 option" itemprop="text">
<span class="label label-info">A</span>

**-2<sup>31</sup> to 2<sup>31</sup>-1**

sup = opts.sup
if sup:  # opts contains a sup tag
    # convert opts to a string and replace the sup markup
    opt = str(opts).replace("<sup>", "^").replace("</sup>", "")

    # convert the string back into a BeautifulSoup object
    s = BeautifulSoup(opt, 'lxml')

    # reassign to the original variable after the manipulation
    opts = s.find("div", class_="col-md-12 option")

Output:

-2^31 to 2^31-1

Without the manipulation it would look like this: -231 to 231-1
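
Put together as something runnable, the steps above might look like the sketch below. The function name sup_to_caret is invented, the sample div is the example markup with a closing tag added, and html.parser stands in for lxml so nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup

def sup_to_caret(html):
    """Replace <sup>...</sup> with ^... by editing the markup as a string and re-parsing."""
    soup = BeautifulSoup(html, "html.parser")
    opts = soup.find("div", class_="col-md-12 option")
    if opts is not None and opts.sup:
        # Steps 1-3: stringify, swap the opening tag for "^", drop the closing tag.
        text = str(opts).replace("<sup>", "^").replace("</sup>", "")
        # Final step: parse the edited string again and pick the div back out.
        opts = BeautifulSoup(text, "html.parser").find("div", class_="col-md-12 option")
    return opts

html = (
    '<div class="col-md-12 option" itemprop="text">'
    '<span class="label label-info">A</span> '
    '-2<sup>31</sup> to 2<sup>31</sup>-1</div>'
)
result = sup_to_caret(html).get_text()
print(result)
```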

Replacing HTML tags with BeautifulSoup

5 answers: