Editing the DOCTYPE tag with BeautifulSoup

Date: 2019-04-29 20:14:22

Tags: python beautifulsoup

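I need to add an ATTLIST declaration to the DOCTYPE tag of an HTML document.

After reading the documentation and searching on Google, this is what I came up with: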

from bs4 import BeautifulSoup, Doctype

# minimal html document
doc = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html/>"""

soup = BeautifulSoup(doc, features='html.parser')

# the modified doctype tag
doctype = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>] >"""

dt = BeautifulSoup(doctype, features='html.parser')

for item in soup.contents:
    if isinstance(item, Doctype):
        item.replace_with(dt)
        break

print(soup.prettify(formatter=None))

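This produces the desired result, but it feels a bit "hacky". I only want to insert the ATTLIST part into the tag, rather than replacing the whole thing as I do here. Does anyone know how to do that?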

1 answer:

Answer 0: (score: 0)

A small improvement is to build a Doctype object and replace the existing one with it, for example:

from bs4 import BeautifulSoup, Doctype

# minimal html document
doc = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html/>"""

# the modified doctype tag
doctype = """html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]"""

soup = BeautifulSoup(doc, features='html.parser')

for item in soup.contents:
    if isinstance(item, Doctype):
        item.replace_with(Doctype(doctype))
        break

print(soup.prettify(formatter=None))

This gives:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html>
</html>
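
If the goal is to avoid retyping the public and system identifiers, one option is to reuse the text of the existing Doctype node, since Doctype is a string subclass. The following is only a sketch under that assumption: the helper name add_internal_subset is hypothetical (not part of bs4), and the node is still swapped with replace_with rather than edited in place.

```python
from bs4 import BeautifulSoup, Doctype

# minimal html document (same as in the question)
doc = """<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html/>"""


def add_internal_subset(soup, subset):
    """Hypothetical helper: append an internal subset to the first Doctype.

    Returns the new Doctype node, or None if the document has no doctype.
    """
    for item in soup.contents:
        if isinstance(item, Doctype):
            # Doctype behaves like a str, so the existing identifiers can be
            # reused; rstrip() drops the whitespace that preceded the '>'.
            new_doctype = Doctype(item.rstrip() + "\n[" + subset + "]")
            item.replace_with(new_doctype)
            return new_doctype
    return None


soup = BeautifulSoup(doc, features='html.parser')
new_doctype = add_internal_subset(soup, "<!ATTLIST span bodyref CDATA #IMPLIED>")

print(soup.prettify(formatter=None))
```

With the document above this should print the same result as the previous snippet, but only the ATTLIST declaration has to be written by hand.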