Question

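Apparently, a <p> tag only needs an explicit closing tag when the tag that follows is one that is allowed inside a paragraph but is not meant to be part of it.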

# case 1
<div>
<p>Sentence 1.
<span>Interjection!<span>
<p>Sentence 2.
</div>

This ends up as two paragraphs (I think), just as if it had been written like this:

# case 2
<div>
<p>Sentence 1. <span>Interjection!<span></p>
<p>Sentence 2.</p>
</div>

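In my case, I would like BeautifulSoup to parse the paragraphs the way the standard prescribes. In particular, in the example below (case 3), I need the first paragraph to contain only "Sentence 1.", with the heading left out of it.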

# case 3
<div>
<p>Sentence 1. 
<h2>Interjection!<h2>
<p>Sentence 2.
</div>

Currently, BeautifulSoup keeps parsing past that point (as in case 4), but that is not what I see when I browse sites like this in Chrome.

# case 4 (bs4 currently)
<div>
<p>Sentence 1. <h2>Interjection!<h2> <p>Sentence 2.
</div>
</p>
</p>

I am using html.parser. Would a different parser help here?

Answer 1

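I assume the <span> and <h2> in your examples contain typos: the closing tags are missing the /, so instead of closing the element they would create additional empty tags.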
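To illustrate the point about the typo, here is a minimal sketch using BeautifulSoup with html.parser. The names `TYPO`, `FIXED`, and `span_texts` are made up for this example; the markup is the paragraph line from the question, with and without the missing slash.

```python
from bs4 import BeautifulSoup

# The paragraph from the question, as posted (second <span> lacks the slash)
# and with the presumed typo corrected.
TYPO = "<p>Sentence 1. <span>Interjection!<span></p>"
FIXED = "<p>Sentence 1. <span>Interjection!</span></p>"


def span_texts(markup):
    """Return the text of every <span> that html.parser ends up creating."""
    soup = BeautifulSoup(markup, "html.parser")
    return [span.get_text() for span in soup.find_all("span")]


if __name__ == "__main__":
    # With the typo, a second, empty <span> is nested inside the first one.
    print(span_texts(TYPO))
    print(span_texts(FIXED))
```

With the typo left in, the second `<span>` is treated as a new opening tag, which is why an extra empty element shows up in the tree.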

Yes, using a different parser such as lxml helps repair the structure, and the result matches what the HTML standard (and Chrome) produces.
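A minimal sketch of switching parsers, assuming both `beautifulsoup4` and `lxml` are installed. The helper names `paragraph_texts` and `heading_parent` are invented for this example, and the markup is the heading case from the question with the closing `</h2>` corrected.

```python
from bs4 import BeautifulSoup

# The heading example from the question, with the closing </h2> typo fixed.
HTML = """<div>
<p>Sentence 1.
<h2>Interjection!</h2>
<p>Sentence 2.
</div>"""


def paragraph_texts(markup, parser):
    """Return the stripped text of every <p> inside the first <div>."""
    soup = BeautifulSoup(markup, parser)
    return [p.get_text(" ", strip=True) for p in soup.div.find_all("p")]


def heading_parent(markup, parser):
    """Return the name of the element that directly contains the <h2>."""
    soup = BeautifulSoup(markup, parser)
    return soup.h2.parent.name


if __name__ == "__main__":
    # Compare how the two parsers nest the same markup.
    for parser in ("html.parser", "lxml"):
        print(parser, heading_parent(HTML, parser), paragraph_texts(HTML, parser))
```

The only change relative to the question's setup is the second argument to `BeautifulSoup`; the rest of the scraping code can stay as it is.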

Case 1 input:

<div>
<p>Sentence 1.
<span>Interjection!</span>
<p>Sentence 2.
</div>

Case 1 result, ignoring the <html><body> wrapper:

<div>
<p>Sentence 1. <span>Interjection!</span></p>
<p>Sentence 2.</p>
</div>

Case 2 input:

<div>
<p>Sentence 1. 
<h2>Interjection!</h2>
<p>Sentence 2.
</div>

Case 2 result:

<div>
<p>Sentence 1.</p>
<h2>Interjection!</h2>
<p>Sentence 2.</p>
</div>

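The difference is that the h2 is not wrapped in the p: it is a block element (it starts on a new line), so the parser stops and closes the open <p> tag, whereas span is an inline element and stays inside the paragraph.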
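The rule described above can be sketched with the standard library alone. This is a simplified illustration, not how lxml or Chrome is actually implemented: `CLOSES_P` is an assumed, partial list of block-level tags, and `ParagraphCloser` and `close_paragraphs` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed, partial list of tags whose appearance implies </p>;
# the list in the HTML standard is longer.
CLOSES_P = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "table", "pre", "blockquote"}


class ParagraphCloser(HTMLParser):
    """Re-emit markup, inserting </p> where a block-level tag implies it."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        # A block-level start tag ends any paragraph that is still open.
        if self.p_open and tag in CLOSES_P:
            self.out.append("</p>")
            self.p_open = False
        if tag == "p":
            self.p_open = True
        self.out.append(f"<{tag}>")  # attributes dropped for brevity

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # stray </p>, ignored in this sketch
            self.p_open = False
        elif self.p_open and tag in CLOSES_P:
            # Simplification: any block-level end tag closes the open paragraph.
            self.out.append("</p>")
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def close_paragraphs(markup):
    """Return markup with the implied </p> tags written out."""
    parser = ParagraphCloser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(close_paragraphs("<div><p>Sentence 1.<h2>Interjection!</h2><p>Sentence 2.</div>"))
    print(close_paragraphs("<div><p>Sentence 1.<span>Interjection!</span><p>Sentence 2.</div>"))
```

Inline tags such as span are simply absent from `CLOSES_P`, so they pass through without ending the paragraph, which mirrors the distinction drawn in the answer.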