Question

I have an XHTML file with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
</html>

I'm using BeautifulSoup, and I want to remove the XML declaration from the document so that it looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
</html>

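I can't find a way to get hold of the XML declaration in order to remove it. As far as I can tell, it doesn't appear to be a Doctype, Declaration, Tag, or NavigableString. Is there a way to find it so that I can extract it?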

As a working example, I can remove the Doctype with code like this (assuming the document text is in the variable "html"):

from bs4 import BeautifulSoup, Doctype
soup = BeautifulSoup(html)
[item.extract() for item in soup.contents if isinstance(item, Doctype)]

Answer 1

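You can use the following approach. With the 'html.parser' parser, the XML declaration shows up as a ProcessingInstruction node, which you can extract: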

import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for e in soup:
    if isinstance(e, bs4.element.ProcessingInstruction):
        e.extract()
        break
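
For reference, here is a minimal end-to-end sketch of the same idea applied to a document shaped like the one in the question. The sample markup and the helper name strip_processing_instructions are made up for illustration, and it assumes the 'html.parser' parser; other parsers may represent the declaration differently, so treat this as a sketch rather than a guaranteed recipe.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction

# Sample document (illustrative; stands in for the "html" variable in the question).
html = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head><title>t</title></head>
<body><p>hello</p></body>
</html>"""


def strip_processing_instructions(markup):
    """Parse markup with html.parser and drop top-level processing instructions."""
    soup = BeautifulSoup(markup, "html.parser")
    # Iterate over a copy of the child list so extract() does not disturb the loop.
    for node in list(soup.contents):
        if isinstance(node, ProcessingInstruction):
            node.extract()
    return soup


soup = strip_processing_instructions(html)
result = str(soup)
print(result)
```

The Doctype and the rest of the tree are left alone; only the top-level processing instruction is removed. The newline that followed the declaration stays in the output as a text node.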

Answer 2

In some very simple cases, this worked for me:

from bs4 import BeautifulSoup
s = "<a value='label'/>"
s = BeautifulSoup(s, 'xml')
print(s)
## <?xml version="1.0" encoding="utf-8"?>
## <a value="label"/>

Using BeautifulSoup's own API:

s.decode_contents()
## '<a value="label"/>'

Using str.split:

str(s).split("\n")[-1]
## '<a value="label"/>'
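
Note that split("\n")[-1] keeps only the last line, so it only works when the whole document fits on one line after the declaration. A rough plain-string sketch that copes with multi-line output might look like the following. The function name drop_xml_declaration is hypothetical, and it assumes the declaration, if present, is the first thing in the string.

```python
def drop_xml_declaration(text):
    """Return text without a leading <?xml ...?> declaration, if there is one."""
    stripped = text.lstrip()
    # "<?xml " with a trailing space avoids matching e.g. <?xml-stylesheet ...?>.
    if stripped.startswith("<?xml "):
        end = stripped.find("?>")
        if end != -1:
            # Skip past "?>" and any whitespace that follows it.
            return stripped[end + 2:].lstrip()
    # No declaration found: return the input unchanged.
    return text


single = '<?xml version="1.0" encoding="utf-8"?>\n<a value="label"/>'
multi = '<?xml version="1.0" encoding="utf-8"?>\n<root>\n<a value="label"/>\n</root>'

single_result = drop_xml_declaration(single)
multi_result = drop_xml_declaration(multi)
print(single_result)
print(multi_result)
```

This works on the serialized string, e.g. drop_xml_declaration(str(s)), and never touches the parse tree.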
