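I have some HTML nested inside an XML document. The HTML sits inside several other, more deeply nested tags and still contains its own HTML, BODY, and HEAD tags, but BeautifulSoup is removing or altering them. Is there a way to stop BS from breaking the order of these tags?
Edit: code added: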
from bs4 import BeautifulSoup

html1 = """
<?xml version="1.0" encoding="UTF-8"?>
<sss>
<aaa>
<bbbb>
<ppe>
<html class="a-no-js" data-19ax5a9jf="dingo">
<head>
<script type="text/javascript">
</script>
<script type="text/javascript">
</script>
<script type="text/javascript">
</script>
<script language="Javascript1.1" type="text/javascript">
</script>
<title>
</title>
<script type="text/javascript">
</script>
</head>
<body class="pet_products en_US" id="dp">
<div id="a-page">
<script>
</script>
<script type="text/javascript">
</script>
<div id="PrimeStripeContent">
</div>
<div id="rwImages_hidden" style="display:none;">
</div>
<div class="a-container">
</div>
</div>
</body>
</html>
</ppe>
</bbbb>
</aaa>
</sss>"""
html = BeautifulSoup(html1)
print(html.prettify())
It simply strips out the html, head, and body tags and rearranges the content.
Answer 0 (score: 2)
When parsing an XML file with BeautifulSoup, the constructor call should be
html = BeautifulSoup(html1, features="xml")
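This is covered in the Beautiful Soup documentation. Note, however, that the xml feature requires lxml to be installed; see the lxml installation instructions.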
>>> html = BeautifulSoup(html1, features="xml")
>>> print(html.prettify())
<?xml version="1.0" encoding="utf-8"?>
<sss>
<aaa>
<bbbb>
<ppe>
<html class="a-no-js" data-19ax5a9jf="dingo">
<head>
<script type="text/javascript">
</script>
<script type="text/javascript">
</script>
<script type="text/javascript">
</script>
<script language="Javascript1.1" type="text/javascript">
</script>
<title>
</title>
<script type="text/javascript">
</script>
</head>
<body class="pet_products en_US" id="dp">
<div id="a-page">
<script>
</script>
<script type="text/javascript">
</script>
<div id="PrimeStripeContent">
</div>
<div id="rwImages_hidden" style="display:none;">
</div>
<div class="a-container">
</div>
</div>
</body>
</html>
</ppe>
</bbbb>
</aaa>
</sss>
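Since the xml feature depends on lxml, it may help to check for it before parsing and then confirm that the nested tags survived. The following is only a minimal sketch under stated assumptions: the helper names `missing_packages` and `parse_as_xml` and the shortened `SAMPLE` document are hypothetical and not part of the original post. Only the `BeautifulSoup(markup, features="xml")` call comes from the answer above.

```python
import importlib.util

# Import names of the third-party packages this sketch relies on.
REQUIRED = ("bs4", "lxml")

# Hypothetical, shortened stand-in for the document in the question.
SAMPLE = """<sss><ppe>
<html class="a-no-js"><head><title>t</title></head>
<body id="dp"><div id="a-page"></div></body></html>
</ppe></sss>"""


def missing_packages():
    """Return the required packages that cannot be imported."""
    return [name for name in REQUIRED
            if importlib.util.find_spec(name) is None]


def parse_as_xml(markup):
    """Parse markup with the XML parser so nested html/head/body are kept."""
    missing = missing_packages()
    if missing:
        raise RuntimeError("missing packages: " + ", ".join(missing))
    from bs4 import BeautifulSoup  # imported lazily, after the check
    return BeautifulSoup(markup, features="xml")


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install first:", ", ".join(missing))
    else:
        soup = parse_as_xml(SAMPLE)
        # Report whether each nested tag is still present in the tree.
        for name in ("html", "head", "body"):
            print(name, soup.find(name) is not None)
```

The check runs before the import so that a missing parser produces a readable message instead of a failure deep inside the constructor.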