Question

I am using BeautifulSoup to do some HTML cleaning. I'm a noob to both Python and BeautifulSoup. Based on an answer I found elsewhere on Stack Overflow, I have tags being removed correctly as follows:

[s.extract() for s in soup('script')]

But how do I remove inline styles? For instance, the following:

<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">

Should become:

<p>Text</p>
<img href="somewhere.com">

How do I remove the inline class, id, name and style attributes from all elements?

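The answers to similar questions that I could find all mention using a CSS parser to handle this, rather than BeautifulSoup. But since the task is simply to remove the attributes rather than manipulate them, and it is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.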

Answer 1

You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes:

for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
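
As a quick check, the same blanket rule can be run against the markup from the question. This is a minimal sketch assuming BeautifulSoup 4 and the built-in `html.parser` backend (both are assumptions, not stated in the answer). It swaps `del tag[attribute]` for `tag.attrs.pop(attribute, None)`, which edits the attribute dict directly and does nothing when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute names to strip from every element.
UNWANTED = ("class", "id", "name", "style")

for tag in soup.find_all(True):      # True matches every tag
    for attribute in UNWANTED:
        # pop() with a default is a no-op when the attribute is missing
        tag.attrs.pop(attribute, None)

print(soup.p.attrs)     # {}
print(soup.img.attrs)   # {'href': 'somewhere.com'}
```

After the loop the `<p>` element has no attributes left and the `<img>` element keeps only `href`.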

Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():

[tag.decompose() for tag in soup("script")]

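It is not a big difference, just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, which includes many examples.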
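
The difference between the two calls can be seen side by side. This is a small sketch assuming BeautifulSoup 4 with the `html.parser` backend; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><script>a()</script><script>b()</script><p>Text</p></div>",
    "html.parser",
)

first, second = soup.find_all("script")

# extract() detaches the tag from the tree and hands it back,
# so it can still be inspected or re-inserted elsewhere.
removed = first.extract()

# decompose() detaches the tag and destroys it; it returns None.
discarded = second.decompose()

print(removed.name)             # script
print(discarded)                # None
print(soup.find("script"))      # None, both script tags are gone
```

If the removed element is never used again, `decompose()` states that intent more directly.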

Answer 2

I wouldn't do this in BeautifulSoup - you'll spend a lot of time trying, testing, and working around edge cases.

Bleach does exactly this for you. http://pypi.python.org/pypi/bleach

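If you do want to do this in BeautifulSoup, I'd suggest the "whitelist" approach that Bleach uses. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.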
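
The whitelist idea can be sketched in BeautifulSoup roughly as follows. This assumes BeautifulSoup 4 with the `html.parser` backend; the `ALLOWED` policy and the `sanitize` helper are hypothetical names introduced for illustration, not part of Bleach or BeautifulSoup. Tags outside the policy are unwrapped, which keeps their text, so this is not a hardened sanitizer.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> set of attributes it may keep.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

def sanitize(html, allowed=ALLOWED):
    """Keep only whitelisted tags and, per tag, whitelisted attributes."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so the tree can be edited while looping.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()      # drop the tag itself, keep its children
            continue
        keep = allowed[tag.name]
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)

sample = (
    '<p class="x" style="color:red">Hi <span id="s">there</span> '
    '<a href="/u" onclick="evil()">link</a></p>'
)
cleaned = sanitize(sample)
print(cleaned)
```

Keeping the policy in one mapping means that allowing a new tag or attribute is a data change rather than a code change.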

Answer 3

Based on jmk's function, I use this function to remove attributes that are not on a whitelist:

Works in Python 2 with BeautifulSoup 3.

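def clean(tag, whitelist=()):
    # BeautifulSoup 3 stores attributes as a list of (name, value) tuples
    tag.attrs = []
    for e in tag.findAll(True):
        # iterate over a copy, because entries are deleted inside the loop
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        #e.attrs = []     # delete all attributes
    return tag

# example: keep only title and href
clean(soup, ["title", "href"])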

Answer 4

Here is my solution for Python 3 and BeautifulSoup 4:

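def remove_attrs(soup, whitelist=()):
    for tag in soup.find_all(True):
        # build the list of names first, then delete, so tag.attrs is not
        # modified while it is being iterated
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup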

It supports a whitelist of attributes that should be kept. :) If no whitelist is supplied, all attributes are removed.
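
A usage sketch, with and without a whitelist, on markup adapted from the question. It assumes BeautifulSoup 4 and the `html.parser` backend. The function is restated so the snippet runs on its own, and this copy deletes from `tag.attrs` directly rather than using `del tag[attr]`, which is a small swapped-in variation.

```python
from bs4 import BeautifulSoup

def remove_attrs(soup, whitelist=()):
    # Restated from the answer above so this snippet is self-contained.
    for tag in soup.find_all(True):
        # Collect the names first, then delete from the attribute dict.
        for attr in [a for a in tag.attrs if a not in whitelist]:
            del tag.attrs[attr]
    return soup

html = (
    '<p class="author" id="author_id" style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)

# Keep href, strip everything else.
kept = remove_attrs(BeautifulSoup(html, "html.parser"), whitelist=("href",))
# No whitelist: every attribute is removed.
bare = remove_attrs(BeautifulSoup(html, "html.parser"))

print(kept.img.attrs)   # {'href': 'somewhere.com'}
print(bare.img.attrs)   # {}
```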

Answer 5

Not perfect, but short:

' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)])

Answer 6

What about lxml's Cleaner?

from lxml.html.clean import Cleaner

content_without_styles = Cleaner(style=True).clean_html(content)

Remove all inline styles using BeautifulSoup
