Question

I have BeautifulSoup4 v4.6.0 and lxml v3.8.0 installed. I am trying to parse the following XHTML.

The code I am using to parse it:

from bs4 import BeautifulSoup

xhtml_string = """
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>

    <body class="sgc-1">
      <svg xmlns="http://www.w3.org/2000/svg" height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
        <image height="800" width="573" xlink:href="../Images/Cover.jpg"></image>
      </svg>
    </body>
</html>
"""

soup = BeautifulSoup(xhtml_string, 'xml')

However, when I inspect the soup, it looks like BeautifulSoup has stripped xmlns="http://www.w3.org/2000/svg" and xmlns:xlink="http://www.w3.org/1999/xlink" from the <svg> tag, as well as the xlink prefix from the href attribute on the <image> tag.

That is, soup.prettify() returns the following:

<?xml version="1.0" encoding="unicode-escape"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 </head>
 <body class="sgc-1">
  <svg height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%">
   <image height="800" href="../Images/Cover.jpg" width="573"/>
  </svg>
 </body>
</html>

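I don't have the option of changing the source XHTML, and from what I have seen the xmlns declarations are valid. Is there any way to have BeautifulSoup preserve the XHTML as is?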

Answer 1

You should use the lxml parser instead of xml.

soup = BeautifulSoup(xhtml_string, 'lxml')
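As a quick way to check this on your own setup, here is a minimal sketch that parses a trimmed-down copy of the document with the 'lxml' parser and looks at the attributes in question. The shortened markup (no XML declaration or DOCTYPE) and the lookup by attribute name are my own simplifications, not part of the original question. Keep in mind that 'lxml' is an HTML parser, so it may normalize other details of the document (for example the case of attribute names); it is worth inspecting the full output before relying on it.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the document from the question (assumption: the
# XML declaration and DOCTYPE are left out to keep the sketch short).
xhtml_string = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body class="sgc-1">
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1">
      <image height="800" width="573" xlink:href="../Images/Cover.jpg"></image>
    </svg>
  </body>
</html>"""

# 'lxml' selects lxml's HTML parser; 'xml' selects lxml's XML parser.
soup = BeautifulSoup(xhtml_string, 'lxml')

svg = soup.find('svg')
# Find the element by the attribute itself rather than by tag name.
image = soup.find(attrs={'xlink:href': True})

# With the HTML parser these are treated as ordinary attributes.
print(svg.get('xmlns'))
print(svg.get('xmlns:xlink'))
print(image.get('xlink:href'))
```

If all three values print as they appear in the source, the namespace declarations and the xlink prefix survived parsing.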
