How to replace HTML comments with custom <comment> elements

Asked: 2015-02-18 16:27:31

Tags: python html regex xml beautifulsoup

I am using BeautifulSoup in Python to batch-convert a large number of HTML files to XML.

A sample HTML file looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

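I figured out how to find the doctype and replace it with a <doctype>...</doctype> tag, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this sample HTML, I was able to replace the first two HTML comments, but not anything inside the html tag, nor the last comment after the closing html tag.

Here is my code: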

import re

import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gt;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

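This is my first time using BeautifulSoup. How do I use BeautifulSoup to find all HTML comments and replace them with <comment> tags?

Could I convert it to a byte stream via pickle, apply the regex to the serialized form, and then deserialize it back into a BeautifulSoup object? Would that work, or would it just cause more problems?

I tried using pickle on the child tag object, but deserialization failed with TypeError: __new__() missing 1 required positional argument: 'name'

Then I tried pickling only the tag's text, via child.text, but deserialization failed with AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So I have no idea how to modify the text.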

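1 answer:

Answer 0 (score: 4)

You have a few problems:

  1. You can't modify child.text. It is a read-only property that just calls get_text() behind the scenes, and its result is a brand-new string with no connection to your document.

  2. re.sub() never modifies anything in place; it returns a new string. Your line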

    re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
    

    would have to be

    child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
    

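    ... but that wouldn't work anyway, because of point 1.

  3. Trying to modify the document by replacing chunks of text in it with regular expressions is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.

  4. Here is a working solution: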

    import bs4
    
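    # An explicit parser avoids the "no parser specified" warning;
    # "html.parser" is an assumption, any installed parser can be named here.
    with open("example.html") as f:
        soup = bs4.BeautifulSoup(f, "html.parser")
    
    # Comment is a NavigableString subclass, so a text search finds every comment.
    # (BeautifulSoup 4.4.0+ also accepts this argument as string=.)
    for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
        tag = bs4.Tag(name="comment")  # new, empty <comment> element
        tag.string = comment.strip()   # comment text without surrounding whitespace
        comment.replace_with(tag)      # swap the comment node for the element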
    

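    This code first iterates over the results of a call to find_all(), taking advantage of the fact that you can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just shorthand for, e.g.: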

    def is_comment(e):
        return isinstance(e, bs4.Comment)
    
    soup.find_all(text=is_comment)
    

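    Then we create a new Tag with the appropriate name, set its content to the stripped content of the original comment, and replace the comment with the tag we just created.

    Here is the result: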

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    
    <comment>this is an HTML comment</comment>
    <comment>this is another HTML comment</comment>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
            ...
            <comment>here is a comment inside the head tag</comment>
    </head>
    <body>
            ...
            <comment>Comment inside body tag</comment>
    <comment>Another comment inside body tag</comment>
    <comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
    </body>
    </html>
    <comment>This comment is the last line of the file</comment>
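
As a self-contained variant of the approach above, the following sketch wraps the replacement in a helper that works on an HTML string instead of a file. The names replace_comments and SAMPLE, and the choice of "html.parser", are illustrative assumptions and not part of the original answer. It also uses soup.new_tag() and the string= keyword, which newer BeautifulSoup releases document in place of text=.

```python
import bs4


def replace_comments(html: str) -> str:
    """Return html with every comment node swapped for a <comment> element."""
    # "html.parser" is an assumption; any installed parser name could be used.
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Comment subclasses NavigableString, so a string search finds comments at
    # any depth, including those outside the <html> element.
    # find_all() returns a list, so replacing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        tag = soup.new_tag("comment")   # empty <comment> element
        tag.string = comment.strip()    # comment text, whitespace trimmed
        comment.replace_with(tag)       # swap the node in the tree

    return str(soup)


# Hypothetical sample input with comments before, inside and after <html>.
SAMPLE = (
    "<!-- top -->"
    "<html><body><p>hi</p><!-- nested --></body></html>"
    "<!-- last -->"
)

print(replace_comments(SAMPLE))
```

For a batch conversion, a helper like this could be called once per file, with the returned string written to the output path. Comment text is escaped by BeautifulSoup when it becomes element content, so characters such as < and & inside a comment should not produce malformed XML.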