BeautifulSoup strips HTML tags when parsing XML

Time: 2014-04-02 19:08:04

Tags: python html xml beautifulsoup

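I have some HTML nested inside an XML document. The HTML sits inside several other, more deeply nested tags, and it still contains its own HTML, BODY and HEAD tags, but BeautifulSoup is removing or altering them. Is there any way to stop BS from destroying the order of these tags?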

Edited to add code:

html1 = """
<?xml version="1.0" encoding="UTF-8"?>
<sss>
  <aaa>
    <bbbb>
      <ppe>
        <html class="a-no-js" data-19ax5a9jf="dingo">
         <head>
          <script type="text/javascript">
          </script>
          <script type="text/javascript">
          </script>
          <script type="text/javascript">
          </script>
          <script language="Javascript1.1" type="text/javascript">
          </script>
          <title>
          </title>
          <script type="text/javascript">
          </script>
         </head>
         <body class="pet_products en_US" id="dp">
          <div id="a-page">
           <script>
           </script>
           <script type="text/javascript">
           </script>
           <div id="PrimeStripeContent">
           </div>
           <div id="rwImages_hidden" style="display:none;">
           </div>
           <div class="a-container">
           </div>
          </div>
         </body>
        </html>
      </ppe>
    </bbbb>
  </aaa>
</sss>"""

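from bs4 import BeautifulSoup

# No parser is specified here, so BeautifulSoup falls back to an HTML parser.
html = BeautifulSoup(html1)

print(html.prettify())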

It simply rips out the html, head and body tags and rearranges everything.

1 answer:

Answer 0: (score: 2)

When parsing an XML file with BeautifulSoup, the constructor call should be

html = BeautifulSoup(html1, features="xml")

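This is covered in the BeautifulSoup documentation. Note, however, that the xml feature requires lxml to be installed; see the lxml installation instructions.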
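If lxml is missing, BeautifulSoup cannot satisfy the "xml" feature request and raises bs4.FeatureNotFound. The following is only a minimal sketch of guarding against that, assuming Python 3 and bs4; the helper name parse_as_xml and the short sample string are made up for illustration and are not part of the original question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_as_xml(markup):
    """Parse markup with the XML tree builder (hypothetical helper)."""
    try:
        return BeautifulSoup(markup, features="xml")
    except FeatureNotFound as exc:
        # bs4 raises this when no installed parser provides the "xml" feature,
        # which in practice means lxml is not installed.
        raise RuntimeError("features='xml' requires lxml (pip install lxml)") from exc


# Hypothetical, trimmed-down sample: an <html> element nested inside XML.
soup = parse_as_xml("<sss><html><head/><body/></html></sss>")

# The nested <body> should still sit inside <html>, which sits inside <sss>.
print(soup.find("body").parent.name)
print(soup.find("html").parent.name)
```

With lxml available, the nested html and body elements should keep their place in the tree rather than being hoisted or dropped.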

>>> html = BeautifulSoup(html1, features="xml")
>>> print(html.prettify())
<?xml version="1.0" encoding="utf-8"?>
<sss>
 <aaa>
  <bbbb>
   <ppe>
    <html class="a-no-js" data-19ax5a9jf="dingo">
     <head>
      <script type="text/javascript">
      </script>
      <script type="text/javascript">
      </script>
      <script type="text/javascript">
      </script>
      <script language="Javascript1.1" type="text/javascript">
      </script>
      <title>
      </title>
      <script type="text/javascript">
      </script>
     </head>
     <body class="pet_products en_US" id="dp">
      <div id="a-page">
       <script>
       </script>
       <script type="text/javascript">
       </script>
       <div id="PrimeStripeContent">
       </div>
       <div id="rwImages_hidden" style="display:none;">
       </div>
       <div class="a-container">
       </div>
      </div>
     </body>
    </html>
   </ppe>
  </bbbb>
 </aaa>
</sss>
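One way to confirm that the structure survived, rather than comparing the pretty-printed output by eye, is to walk up the tree from an inner element. This is a rough sketch under a few assumptions: Python 3 with bs4 and lxml installed, a shortened stand-in for the document above, and a hypothetical helper named ancestor_names.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's document (illustrative only).
sample = (
    "<sss><aaa><bbbb><ppe>"
    '<html class="a-no-js"><head><title>t</title></head>'
    '<body id="dp"><div id="a-page"></div></body></html>'
    "</ppe></bbbb></aaa></sss>"
)


def ancestor_names(soup, tag_name):
    """Return the ancestor tag names of the first <tag_name>, innermost first."""
    tag = soup.find(tag_name)
    return [parent.name for parent in tag.parents]


xml_soup = BeautifulSoup(sample, features="xml")

# With the XML parser, <div> should still be nested in body > html > ppe > ...
chain = ancestor_names(xml_soup, "div")
print(chain)

# Attributes on the nested <html> element should be preserved as well.
print(xml_soup.find("html")["class"])
```

The XML tree builder treats html, head and body as ordinary element names, so it has no reason to move them; an HTML parser, by contrast, applies HTML-specific rules about where those elements may appear.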