I need to edit some existing HTML files with BeautifulSoup. I run into a problem when the DOCTYPE contains an ATTLIST declaration.
Here is a minimal example.
from bs4 import BeautifulSoup
doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""
soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify())
The output is
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]&gt;
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
As you can see, the final '>' of the DOCTYPE is turned into an entity. With
print(soup.prettify(formatter=None))
I get
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type">
<meta content="CA43667" name="dc:identifier">
</head>
</html>
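Now the DOCTYPE is fine, but the slashes in the "meta" elements are gone, and the document no longer validates on our system. The other formatter options do not seem to work either.
Is there a workaround?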
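A plausible explanation, assuming the stock `html.parser` backend: Python's `HTMLParser` appears to end a DOCTYPE declaration at the first `>`, which here is the one that closes the ATTLIST. The remaining `]>` would then be handed over as ordinary text and escaped on output. The sketch below uses only the standard library to make this visible; the `DeclProbe` class name is made up for illustration.

```python
from html.parser import HTMLParser


class DeclProbe(HTMLParser):
    """Hypothetical helper: records declarations and text seen by HTMLParser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.decls = []  # contents of <!...> declarations
        self.text = []   # plain text chunks between tags

    def handle_decl(self, decl):
        self.decls.append(decl)

    def handle_data(self, data):
        self.text.append(data)


doc = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"\n'
    '[<!ATTLIST span bodyref CDATA #IMPLIED>]>\n'
    '<html></html>'
)

probe = DeclProbe()
probe.feed(doc)
probe.close()

# Text chunks may arrive in several pieces, so join them before inspecting.
leftover = "".join(probe.text).strip()

print("declaration seen by the parser:", repr(probe.decls[0]))
print("text left over after it:", repr(leftover))
```

If the declaration stops right after `#IMPLIED` and the leftover text begins with `]>`, the escaping happens because that text is treated as document content, not as part of the DOCTYPE.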
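Answer 0 (score: 0)
Are you running the latest version of BeautifulSoup? I think you just need to update BeautifulSoup. Or it could be a broken BeautifulSoup installation. Try the following on the command line: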
pip uninstall beautifulsoup4
pip install beautifulsoup4
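Before or after reinstalling, it may help to check which version Python actually picks up. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the `installed_version` helper name is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "beautifulsoup4" is the distribution name used by pip; the import name is bs4.
bs4_version = installed_version("beautifulsoup4")
print("beautifulsoup4:", bs4_version or "not installed")
```

Running this with the same interpreter that runs the script shows whether the reinstall landed in the environment you expect.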
When I run this:
from bs4 import BeautifulSoup
doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""
soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify(formatter=None))
This is the output:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
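I believe that is what you are looking for. I also tested it in an online IDE, and the result matches what I get on my machine. Here is the link: https://onlinegdb.com/HyzXahzAE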
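If reinstalling does not change the output, one possible workaround (not from the answer above, just a sketch) is to keep the prolog away from the parser: cut off the XML declaration and DOCTYPE, parse only the rest, and put the prolog back verbatim when writing the file. `PROLOG_RE` and `split_prolog` are hypothetical names, and the regex assumes the internal subset contains no `]` character.

```python
import re

# Optional XML declaration, then a DOCTYPE with an optional [internal subset].
# Assumption: the internal subset itself contains no ']' character.
PROLOG_RE = re.compile(
    r"\s*(?:<\?xml[^>]*\?>\s*)?"        # optional <?xml ... ?>
    r"<!DOCTYPE[^\[>]*"                 # DOCTYPE name and identifiers
    r"(?:\[[^\]]*\])?\s*>",             # optional [ ... ] subset, closing '>'
    re.IGNORECASE,
)


def split_prolog(markup):
    """Split markup into (prolog, body); prolog is '' when no DOCTYPE is found."""
    match = PROLOG_RE.match(markup)
    if match is None:
        return "", markup
    return markup[:match.end()], markup[match.end():]


doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""

prolog, body = split_prolog(doc)
print(prolog)
print(body)
```

The `body` string could then go through `BeautifulSoup(body, features='html.parser')` with the default formatter, and the file written back would be `prolog` followed by `str(soup)`, so the DOCTYPE never passes through the parser.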