I am using BeautifulSoup to clean up some HTML semantically, and I want to move all style, meta, and link tags into the head tag.
Here is the HTML I am working with:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office">
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
<o:PixelsPerInch>96</o:PixelsPerInch>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<!--[if !mso]><!-- -->
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">
<!--<![endif]-->
<style type="text/css">
h1, h2, h3, h4, h5, h6 {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
</style>
<body>
<p>Hello, World</p>
</body>
</html>
Here is my Python method:
import re

from bs4 import BeautifulSoup

def cleanup_markup(html):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(['style', 'meta', 'link'])
    conditional_search = r"<!.*\[if(.*)\](.*\n)*(.*)endif\]-->"
    re_flags = re.MULTILINE | re.DOTALL
    search = re.findall(conditional_search, html, flags=re_flags)
    found = filter(lambda a: a not in map(str, tags), search)
    head_tag = soup.head or soup.new_tag('head')
    for tag in tags:
        if tag.name not in found:
            head_tag.append(tag.extract())
    if not soup.head:
        soup.html.insert(0, head_tag)
    return unicode(soup)
But every time the method above runs, the resulting markup looks like this:
<!DOCTYPE html>
<html>
<head>
<title>
</title>
</head>
<body>
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">\n<!--<![endif]-->
<p>Hello, World</p>
\n\n
<style type="text/css">
\nh1, h2, h3, h4, h5, h6 {\n\tmargin: 0;\n\tpadding: 0;\n\tborder: 0;\n\tfont-size: 100%;\n\tfont: inherit;\n\tvertical-align: baseline;\n}\n
</style>\n<!--[if gte mso 9]><xml>\n <o:OfficeDocumentSettings>\n <o:AllowPNG/>\n <o:PixelsPerInch>96</o:PixelsPerInch>\n </o:OfficeDocumentSettings>\n</xml><![endif]-->\n<!--[if !mso]><!== -->
</body>
</html>
I basically need to skip the conditional tags so that they stay in place, but BeautifulSoup rearranges things in odd ways.
Answer 0 (score: 0)
With the right parser, you get the "head" element for free. From there, you only need to remove the comments using the extract method.
In [10]: from bs4 import Comment
In [11]: from bs4 import BeautifulSoup
In [12]: soup = BeautifulSoup(html, "html5lib")
In [13]: for c in soup.find_all(text=lambda t: isinstance(t, Comment)):
    ...:     c.extract()
    ...:
In [14]: soup
Out[14]:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml"><head><link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet"/>
<style type="text/css">
h1, h2, h3, h4, h5, h6 {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
</style>
</head><body>
<p>Hello, World</p>
</body></html>
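
For use outside an interactive session, the same idea can be wrapped in a function. This is only a minimal sketch: it assumes the html5lib package is installed alongside bs4, the function name normalize_head, its strip_comments flag, and the shortened SAMPLE_HTML are hypothetical and not part of the original answer, and it uses the string= keyword, which newer Beautiful Soup releases prefer over text=.

```python
from bs4 import BeautifulSoup, Comment

# Shortened stand-in for the question's markup (illustrative only).
SAMPLE_HTML = """<!DOCTYPE HTML>
<html>
<!--[if gte mso 9]><xml><o:AllowPNG/></xml><![endif]-->
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">
<style type="text/css">
h1 { margin: 0; }
</style>
<body>
<p>Hello, World</p>
</body>
</html>"""


def normalize_head(html, strip_comments=True):
    """Parse with html5lib so a <head> is synthesized; optionally drop comments.

    Hypothetical helper: the name and the strip_comments flag are assumptions.
    """
    soup = BeautifulSoup(html, "html5lib")
    if strip_comments:
        # find_all returns a list, so detaching nodes while looping is safe.
        for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
            comment.extract()
    return soup


soup = normalize_head(SAMPLE_HTML)
print(soup.head.find("link") is not None)   # is the link inside <head>?
print(soup.head.find("style") is not None)  # is the style inside <head>?
```

Calling normalize_head(html, strip_comments=False) skips the extraction loop, so the conditional comments remain in the parsed tree. That may be closer to what the question asks for, but it is worth checking against the real markup.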