Question

I have some HTML that looks like this:

<h1>Title</h1>

//a random amount of p/uls or tagless text

<h1> Next Title</h1>

I want to copy all of the HTML from the first h1 up to the next h1. How can I do this?

Answer 1

This is the straightforward BeautifulSoup way, which works when the second h1 tag is a sibling of the first:

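# soup is an already-parsed BeautifulSoup document.
html = ""
for tag in soup.find("h1").next_siblings:
    # Text nodes have name None, so only a real <h1> tag stops the loop.
    if tag.name == "h1":
        break
    html += str(tag)  # str() serializes both tags and bare text nodes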
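
For anyone who wants to try the sibling loop above on its own, here is a minimal self-contained sketch. The function name html_between_headings, the sample markup, and the choice of the "html.parser" parser are my own assumptions for illustration, not part of the answer. It returns an empty string when no heading is found.

```python
from bs4 import BeautifulSoup


def html_between_headings(markup, tag_name="h1"):
    """Return the serialized siblings between the first two `tag_name` tags.

    Hypothetical helper: assumes both headings share the same parent.
    """
    soup = BeautifulSoup(markup, "html.parser")
    first = soup.find(tag_name)
    if first is None:
        return ""
    parts = []
    for node in first.next_siblings:
        # Text nodes report name None, so they never end the loop.
        if node.name == tag_name:
            break
        parts.append(str(node))
    return "".join(parts)


# Invented sample input, loosely modelled on the question.
sample = (
    "<h1>Title</h1><p>hi</p>loose text<ul><li>a</li></ul>"
    "<h1> Next Title</h1><p>later</p>"
)
between = html_between_headings(sample)
print(between)
```

If there is no second heading, this sketch simply returns everything after the first one, which may or may not be what you want.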

Answer 2

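I had the same problem. I'm not sure whether there is a better solution, but what I did was use a regular expression to get the indices of the two nodes I was looking for. With those, I extracted the HTML between the two indices and created a new BeautifulSoup object from it.

Example: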

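import re
from bs4 import BeautifulSoup

# Non-greedy match from the first heading up to the next opening <h1>.
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')  # drop the trailing '<h1>' from the slice
target_html = html[s:e]
new_bs = BeautifulSoup(target_html, "html.parser")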
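
A variant of the same regex idea, as a rough sketch only: it uses a capture group plus a lookahead so no index arithmetic is needed, and it falls back to an empty string when nothing matches. Unlike the snippet above, it leaves the first heading out of the result. The sample markup and the variable names pattern and fragment are assumptions of mine, and the pattern still depends on the heading text being exactly `<h1>Title</h1>`.

```python
import re

from bs4 import BeautifulSoup

# Invented sample input, loosely modelled on the question.
html = "<h1>Title</h1>\n<p>hi</p>\nloose text\n<h1> Next Title</h1>\n<p>later</p>"

# Capture lazily after the first heading; the lookahead stops at the
# next opening <h1 without consuming it.
pattern = re.compile(r"<h1>Title</h1>(.*?)(?=<h1\b)", re.DOTALL)

m = pattern.search(html)
target_html = m.group(1) if m else ""

# Re-parse the extracted fragment so it can be queried like any other soup.
fragment = BeautifulSoup(target_html, "html.parser")
print(target_html)
```

Regexes are brittle against attributes, different casing, or nested markup, so this is only reasonable when you control the input.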

Answer 3

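Interesting question. You can't select this using the DOM alone. You'll have to loop through all the elements up to and including the first h1 and put them into intro = str(intro), then collect everything up to the second h1 into chapter1. Then remove the intro from chapter1 using:

chapter = chapter1.replace(intro, '')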
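
One way to read the steps above, as a hedged sketch rather than the answerer's actual code: serialize everything up to and including the first h1 as intro, everything before the second h1 as chapter1, and strip the first from the second. The helper serialized_up_to and the sample markup are hypothetical, and both headings are assumed to be siblings.

```python
from bs4 import BeautifulSoup

# Invented sample input with some content before the first heading.
markup = (
    "<p>preface</p><h1>Title</h1><p>hi</p>loose text"
    "<h1> Next Title</h1><p>later</p>"
)
soup = BeautifulSoup(markup, "html.parser")
headings = soup.find_all("h1")


def serialized_up_to(heading, inclusive):
    """Serialize the siblings before `heading`, optionally with the heading."""
    # previous_siblings walks backwards, so restore document order first.
    parts = [str(node) for node in reversed(list(heading.previous_siblings))]
    if inclusive:
        parts.append(str(heading))
    return "".join(parts)


intro = serialized_up_to(headings[0], inclusive=True)
chapter1 = serialized_up_to(headings[1], inclusive=False)

# Remove only the leading intro; count=1 guards against repeated text.
chapter = chapter1.replace(intro, "", 1)
print(chapter)
```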

Answer 4

Here is a complete, up-to-date solution:

Contents of temp.html:

<h1>Title</h1>
<p>hi</p>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>

Code:

import copy

from bs4 import BeautifulSoup

with open("resources/temp.html") as file_in:
    soup = BeautifulSoup(file_in, "lxml")

print(f"Before:\n{soup.prettify()}")

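# lxml wraps the fragment in <html><body>, so search inside <body>.
first_header = soup.find("body").find("h1")

# Detached copies of every node between the two headings.
siblings_to_add = []

for curr_sibling in first_header.next_siblings:
    if curr_sibling.name == "h1":
        # Each copy is inserted directly after the second <h1>, so the
        # copies end up in reverse order (see the output).
        for curr_sibling_to_add in siblings_to_add:
            curr_sibling.insert_after(curr_sibling_to_add)
        break
    else:
        # Copy so the original nodes stay where they are in the tree.
        siblings_to_add.append(copy.copy(curr_sibling))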

print(f"\nAfter:\n{soup.prettify()}")

Output:

Before:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
 </body>
</html>

After:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
  //a random amount of p/uls or tagless text
  <p>
   hi
  </p>
 </body>
</html>
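
Note that in the output the copied nodes come out in reverse order, because each one is inserted directly after the second h1. If the original order matters, a small variation might look like the sketch below. It parses an inline string with "html.parser" instead of reading temp.html with lxml, and the sample markup and variable names are my own assumptions.

```python
import copy

from bs4 import BeautifulSoup

# Invented inline sample instead of reading temp.html from disk.
markup = "<h1>Title</h1><p>hi</p>loose text<h1> Next Title</h1>"
soup = BeautifulSoup(markup, "html.parser")
first_header = soup.find("h1")

copies = []
next_header = None
for sibling in first_header.next_siblings:
    if sibling.name == "h1":
        next_header = sibling
        break
    # Copy each node so the originals stay in place.
    copies.append(copy.copy(sibling))

if next_header is not None:
    # Inserting in reverse keeps the copies in document order, since
    # every insert lands immediately after the second heading.
    for node in reversed(copies):
        next_header.insert_after(node)

result = str(soup)
print(result)
```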

Using BeautifulSoup to get all the HTML between two tags
