Cleaning up markup with BeautifulSoup while skipping specific HTML comments

Time: 2017-05-31 20:27:19

Tags: python-2.7 beautifulsoup

I'm using BeautifulSoup to semantically clean up some HTML, and I want to move all style, meta, and link tags into the head tag.

Here's the HTML I'm working with:

<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
 xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:o="urn:schemas-microsoft-com:office:office">

<!--[if gte mso 9]><xml>
 <o:OfficeDocumentSettings>
   <o:AllowPNG/>
   <o:PixelsPerInch>96</o:PixelsPerInch>
 </o:OfficeDocumentSettings>
</xml><![endif]-->

<!--[if !mso]><!-- -->
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">
<!--<![endif]-->

<style type="text/css">
h1, h2, h3, h4, h5, h6 {
    margin: 0;
    padding: 0;
    border: 0;
    font-size: 100%;
    font: inherit;
    vertical-align: baseline;
}
</style>

<body>
    <p>Hello, World</p>
</body>
</html>

Here is my Python method:

import re
from bs4 import BeautifulSoup

def cleanup_markup(html):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(['style', 'meta', 'link'])
    conditional_search = r"<!.*\[if(.*)\](.*\n)*(.*)endif\]-->"
    re_flags = re.MULTILINE | re.DOTALL
    search = re.findall(conditional_search, html, flags=re_flags)
    found = filter(lambda a: a not in map(str, tags), search)
    head_tag = soup.head or soup.new_tag('head')

    for tag in tags:
        if tag.name not in found:
            head_tag.append(tag.extract())

    if not soup.head:
        soup.html.insert(0, head_tag)

    return unicode(soup)

But every time I run the method above, the markup comes out looking like this:

<!DOCTYPE html>

<html>
<head>
    <title>
    </title>
</head>

<body>
    <link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">\n<!--<![endif]-->

    <p>Hello, World</p>
    \n\n
    <style type="text/css">
    \nh1, h2, h3, h4, h5, h6 {\n\tmargin: 0;\n\tpadding: 0;\n\tborder: 0;\n\tfont-size: 100%;\n\tfont: inherit;\n\tvertical-align: baseline;\n}\n
    </style>\n<!--[if gte mso 9]><xml>\n <o:OfficeDocumentSettings>\n   <o:AllowPNG/>\n   <o:PixelsPerInch>96</o:PixelsPerInch>\n </o:OfficeDocumentSettings>\n</xml><![endif]-->\n<!--[if !mso]><!== -->
</body>
</html>

I basically need to skip the conditional comments so that they stay where they are, but BeautifulSoup rearranges things in strange ways.

1 Answer:

Answer 0 (score: 0)

With the right parser, you get the "head" element for free. From there, you can simply remove the comments with the extract method.

In [10]: from bs4 import Comment

In [11]: from bs4 import BeautifulSoup

In [12]: soup = BeautifulSoup(html, "html5lib")

In [13]: for c in soup.find_all(text=lambda t: isinstance(t, Comment)):
    ...:     c.extract()
    ...: 

In [14]: soup
Out[14]: 
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml"><head><link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet"/>


<style type="text/css">
h1, h2, h3, h4, h5, h6 {
    margin: 0;
    padding: 0;
    border: 0;
    font-size: 100%;
    font: inherit;
    vertical-align: baseline;
}
</style>

</head><body>
    <p>Hello, World</p>

</body></html>
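
Note that the session above strips every comment, conditional ones included. If the conditional comments have to stay where they are, as the question asks, one possible variation is to walk the tree in document order, track whether we are between an opening `[if ...]` comment and its closing `[endif]` comment, and move only the style, meta, and link tags found outside such a block. This is only a rough sketch: the names `move_to_head` and `MOVABLE` are made up for illustration, and the substring checks on the comment text are a heuristic I am assuming is good enough for markup like the sample, not a full conditional-comment parser.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Hypothetical helper names; tag names taken from the question.
MOVABLE = {"style", "meta", "link"}


def move_to_head(html, parser="html.parser"):
    """Move style/meta/link tags into <head>, leaving conditional blocks alone."""
    soup = BeautifulSoup(html, parser)

    inside_conditional = False
    to_move = []

    # First pass: only collect candidates, so the tree is not modified
    # while we are still iterating over it.
    for node in soup.descendants:
        if isinstance(node, Comment):
            opens = "[if" in node
            closes = "endif]" in node
            if opens and not closes:
                # e.g. "<!--[if !mso]><!-- -->": real markup follows
                inside_conditional = True
            elif closes and not opens:
                # e.g. "<!--<![endif]-->": end of the revealed block
                inside_conditional = False
            # A comment holding both "[if" and "endif]" is self-contained
            # and is simply left untouched.
        elif isinstance(node, Tag) and node.name in MOVABLE:
            if not inside_conditional:
                to_move.append(node)

    # Create <head> only if the chosen parser did not provide one.
    head = soup.head
    if head is None:
        head = soup.new_tag("head")
        root = soup.html if soup.html is not None else soup
        root.insert(0, head)

    # Second pass: relocate the collected tags.
    for tag in to_move:
        if tag.parent is not head:
            head.append(tag.extract())

    return soup
```

Calling `move_to_head(html)` on the sample from the question should leave the font link between its two conditional comments and move the style block into head. Passing `"html5lib"` as the second argument (assuming that package is installed) uses the parser from the session above, which creates head on its own.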