Question

I have over 1000 HTML files with different formats, elements, and content. I need to go through each one recursively and select all elements except the <h1> element.

Here is a sample file (note that this is the smallest and simplest of the files; the rest are considerably larger and more complex, with many elements that do not conform to any single template, other than starting with an <h1> element):

<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>

<ul>
<li>Note differences in density.</li>
<li>Identify the site of the pathology by noting silhouettes.</li>
<li>If you can’t see lung vessels, then the pathology must be within the lung.</li>
<li>Loss of the ability to see lung vessels is supplanted by the ability to see air-bronchograms.</li>
</ul>

<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>
<p><a href="./B-SOME-RADIOLOGICAL-PATHOLOGY-2610a46c-44ca-4f81-a496-9ea3b911cb4e.md">B. SOME RADIOLOGICAL PATHOLOGY</a></p>
<p><a href="./C-Approach-to-common-clinical-scenarios-0e8f5c90-b14b-48d4-8484-0b0f8ca4464c.md">C. Approach to common clinical scenarios</a></p>

I wrote this code with beautifulsoup:

with open("file.htm") as ip:
    #HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    selection = soup.select("h1 > ")
    print(selection)

I was hoping this would select everything below the <h1> element, but it does not. Selecting <h1> picks up only that single line, not everything below it. What should I do?

Answer 1

Use .extract() to remove the selected tag:

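from bs4 import BeautifulSoup

with open("file.htm") as ip:
    # HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # extract() detaches the first <h1> from the tree and returns it.
    soup.h1.extract()
    output = soup

# Everything except the removed <h1>.
print(output)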
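Because the files vary so much, some may have no <h1> at all or more than one. In that case soup.h1 is None and soup.h1.extract() raises AttributeError. The following is a minimal sketch of a more defensive variant, not a definitive implementation: the helper name strip_h1 is hypothetical, and an inline sample string stands in for a real file.

```python
from bs4 import BeautifulSoup


def strip_h1(html):
    """Return the markup with every <h1> element removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns an empty list when no <h1> exists, so the loop is safe.
    for tag in soup.find_all("h1"):
        tag.extract()  # detaches the tag from the tree and returns it
    return str(soup)


# Inline sample assumed for illustration; real input would come from a file.
sample = "<h1>CXR Introduction</h1><h2>Basic Principles</h2><p>Note differences in density.</p>"
result = strip_h1(sample)
print(result)
```

Looping over find_all("h1") instead of using soup.h1 handles zero, one, or several headings with the same code path.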

Answer 2

Have you considered using .decompose() to remove the <h1>...</h1> element, and then simply taking all the remaining elements?
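A minimal sketch of that suggestion, assuming an inline sample string in place of a real file. Unlike .extract(), .decompose() destroys the tag and returns nothing, so it fits when the heading is not needed afterwards.

```python
from bs4 import BeautifulSoup

# Inline sample assumed for illustration; real input would come from a file.
html = "<h1>CXR Introduction</h1><h2>Basic Principles</h2><ul><li>Note differences in density.</li></ul>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")
if heading is not None:
    heading.decompose()  # removes the tag from the tree and destroys it

# Names of every tag still in the tree, in document order.
remaining = [tag.name for tag in soup.find_all(True)]
print(remaining)
```

Passing True to find_all matches every tag, so the list covers nested elements such as <li> as well as top-level ones.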

Select everything except a specific tag with Python beautifulsoup
