Question

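I'm trying to 'defrontpagify' the HTML of an MS FrontPage-generated website, and I'm writing a BeautifulSoup script to do it.

However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet: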

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

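It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, just hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone know what is wrong here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.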

Answer 1

The line

for tag in soup.findAll(attribute=True):

doesn't find any tags. There may be a way to use findAll; I'm not sure. However, this works:

import BeautifulSoup  # BeautifulSoup 3: tag.attrs is a list of (key, value) pairs
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError: 
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Answer 2

I'm using BeautifulSoup 4 with Python 2.7, and for me tag.attrs is a dictionary rather than a list. Therefore I had to modify this code:

    for tag in soup.recursiveChildGenerator():
        if hasattr(tag, 'attrs'):
            tag.attrs = {key:value for key,value in tag.attrs.iteritems() 
                         if key not in REMOVE_ATTRIBUTES}
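
For readers on Python 3, where dict.iteritems() no longer exists, the same idea might look roughly like the sketch below. It assumes BeautifulSoup 4 (the bs4 package) and the built-in "html.parser"; the helper name strip_attributes, the shortened attribute set, and the sample document are only illustrative and not part of the original answers.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative set; the full REMOVE_ATTRIBUTES list works the same way.
REMOVE_ATTRIBUTES = {"align", "onmouseout", "style"}

def strip_attributes(soup, names):
    """Hypothetical helper: drop every attribute in `names` from every tag, in place."""
    for tag in soup.find_all(True):  # True matches every tag and skips text nodes
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in names}
    return soup

doc = '<p id="firstpara" align="center">This is <a onmouseout="">one</a>.</p>'
soup = strip_attributes(BeautifulSoup(doc, "html.parser"), REMOVE_ATTRIBUTES)
result = str(soup)  # tags and their text are kept, only the listed attributes are gone
```

Because find_all(True) only yields tags, the hasattr check is not needed in this variant.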

Answer 3

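Just for the record: the problem here is that if you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags with an attribute named attribute, because the variable does not get expanded.

This is why hard-coding your attribute name did work [0], and why the code does not fail: the search simply does not match any tags.

To fix the problem, pass the attribute you are looking for as a dict: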

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]
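
The keyword behaviour can be seen without BeautifulSoup at all. In the minimal sketch below, find_all is a hypothetical stand-in that just returns the keyword filters it was given; it is not the library's function.

```python
def find_all(**kwargs):
    """Hypothetical stand-in for soup.find_all: return the keyword filters received."""
    return kwargs

attribute = "style"

# The keyword name is the literal text "attribute"; the variable is never consulted.
literal = find_all(attribute=True)

# Unpacking a dict uses the variable's value as the keyword name.
unpacked = find_all(**{attribute: True})

print(literal)   # {'attribute': True}
print(unpacked)  # {'style': True}
```

With the real find_all, passing attrs={attribute: True} is the safer of the two spellings, since attribute names such as name or limit would otherwise collide with find_all's own parameters.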

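For anyone who finds this in the future, dtk

[0]: Although in your example it needs to be find_all(style=True), without the quotes, because otherwise you get SyntaxError: keyword can't be an expression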

Answer 4

I use this method to remove a list of attributes; it is very compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del: 
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]

Answer 5

I use this:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")
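
If attrs is an ordinary dict (as it is in BeautifulSoup 4), the membership check can probably be folded into the pop call by giving it a default. The sketch below uses a plain dict as a stand-in for div.attrs, so the names are only illustrative.

```python
# Plain dict standing in for div.attrs (assumption: BeautifulSoup 4, where attrs is a dict)
attrs = {"id": "firstpara", "align": "center"}

removed = attrs.pop("align", None)  # key present: returns its value and deletes it
missing = attrs.pop("align", None)  # key already gone: returns None instead of raising KeyError
```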

Thanks to https://stackoverflow.com/a/22497855/1907997

BeautifulSoup: strip the specified attributes, but preserve the tag and its content