BeautifulSoup changes > to &gt;

Time: 2019-06-03 13:45:14

Tags: python beautifulsoup

I need to edit some existing HTML files with BeautifulSoup. A problem arises when the DOCTYPE contains an ATTLIST declaration.

Here is a minimal example.

from bs4 import BeautifulSoup

doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
    <meta content="CA43667" name="dc:identifier"/>
  </head>
</html>
"""

soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify())

The output is

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]&gt;
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
  <meta content="CA43667" name="dc:identifier"/>
 </head>
</html>

As you can see, the final '>' of the DOCTYPE is turned into an entity (&gt;). With

print(soup.prettify(formatter=None))

I get

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type">
  <meta content="CA43667" name="dc:identifier">
 </head>
</html>

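Now the DOCTYPE is fine, but the slashes in the meta elements are gone, and the document will not validate on our system. The other formatter options do not seem to help either.

Is there a workaround?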
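A possible explanation for the stray `]&gt;`, offered as a sketch rather than a confirmed diagnosis: Python's standard `html.parser` appears to end a DOCTYPE at the first '>' it finds, so the closing `]>` of the internal subset is handed on as ordinary text, and text is what gets entity-escaped on output. The class name `DeclRecorder` and the shortened document are made up for illustration; only the standard library is used, not BeautifulSoup itself.

```python
from html.parser import HTMLParser


class DeclRecorder(HTMLParser):
    """Hypothetical helper: records declarations and text as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.decls = []
        self.text = []

    def handle_decl(self, decl):
        # Called with whatever the parser considers the DOCTYPE content.
        self.decls.append(decl)

    def handle_data(self, data):
        # Called with character data found between tags.
        self.text.append(data)


# Shortened version of the document from the question.
DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html></html>
"""

recorder = DeclRecorder()
recorder.feed(DOC)
recorder.close()

print("declarations:", recorder.decls)
print("text chunks: ", recorder.text)
```

If the declaration stops at `#IMPLIED` and `]>` shows up among the text chunks, that would match the output shown in the question.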

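1 answer:

Answer 0 (score: 0)

Are you running the latest version of BeautifulSoup? I think you just need to update BeautifulSoup, or it may be a broken installation. Try the following on the command line: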

pip uninstall beautifulsoup4
pip install beautifulsoup4
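To confirm which version is installed before and after reinstalling, something like the following may help. It is a minimal sketch that relies only on the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "beautifulsoup4" is the distribution name used by pip; the import name is bs4.
print("beautifulsoup4:", installed_version("beautifulsoup4"))
```

A result of `None` would suggest that the interpreter running the script is not the one pip installed into.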

When I run this:

from bs4 import BeautifulSoup

doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
    <meta content="CA43667" name="dc:identifier"/>
  </head>
</html>
"""

soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify(formatter=None))

This is the output:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
  <meta content="CA43667" name="dc:identifier"/>
 </head>
</html>

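I believe this is what you are looking for. I also tested it in an online IDE, and the result matches what I get on my machine. Here is the link: https://onlinegdb.com/HyzXahzAE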
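If upgrading is not possible, one fallback is to keep the default formatter (which preserves the `/>` on the meta elements in the question's output) and repair the DOCTYPE afterwards. The following is only a sketch of that idea: the function name `restore_doctype_close` and the regular expression are my own assumptions, not part of BeautifulSoup, and it only targets the first `]&gt;` that closes a DOCTYPE internal subset.

```python
import re

# Matches "<!DOCTYPE ... [ ... ]" followed by optional whitespace and an escaped "&gt;".
# The "[^\[>]*" part refuses to match a DOCTYPE that has no internal subset.
_DOCTYPE_CLOSE = re.compile(
    r"(<!DOCTYPE\b[^\[>]*\[.*?\])(\s*)&gt;",
    re.DOTALL | re.IGNORECASE,
)


def restore_doctype_close(markup):
    """Turn the escaped '&gt;' that closes a DOCTYPE internal subset back into '>'."""
    return _DOCTYPE_CLOSE.sub(r"\1\2>", markup, count=1)


# Shortened sample mimicking the escaped output shown in the question.
broken = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    "[<!ATTLIST span bodyref CDATA #IMPLIED>\n"
    "]&gt;\n"
    "<html><p>a &gt; b</p></html>\n"
)
print(restore_doctype_close(broken))
```

It would be applied to the serialized string, for example `restore_doctype_close(soup.prettify())`. Escaped `&gt;` entities elsewhere in the document are left alone.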