Question

I want to wrap the contents of a tag with BeautifulSoup. This:

<div class="footnotes">
    <p>Footnote 1</p>
    <p>Footnote 2</p>
</div>

should become this:

<div class="footnotes">
  <ol>
    <p>Footnote 1</p>
    <p>Footnote 2</p>
  </ol>
</div>

So I'm using the following code:

footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol") 
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)

footnotes[0].clear()
footnotes[0].append(new_ol)

print footnotes[0]

But I get the following:

<div class="footnotes"><ol><
    ></
    ><<p>Footnote 1</p>></<p>Footnote 1</p>><
    ></
    ><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>

Any suggestions?

Answer 1

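Just move the .contents of your tag over using tag.extract(); don't try to recreate them with soup.new_tag (which only takes a tag name, not a whole tag object). Don't call .clear() on the original tag either; .extract() has already removed the elements.

Move the items over in reverse, because the contents list is being modified while you loop over it, and elements get skipped if you are not careful.

Finally, use .find() when you only need to do this for one tag.

Keep in mind that the contents list is modified in place, so you cannot simply loop over it front to back; iterate in reverse as below, or loop over a copy of the list: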

footnotes = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnotes.contents):
    new_ol.insert(0, content.extract())

footnotes.append(new_ol)
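
To illustrate why the loop above goes through the contents in reverse, here is a minimal sketch of what happens with a plain forward loop over the live .contents list. The three-footnote markup and the variable names are made up for this illustration, and "html.parser" is just an assumed parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written without whitespace between tags so
# that .contents holds only the three <p> elements.
html = ('<div class="footnotes"><p>Footnote 1</p>'
        '<p>Footnote 2</p><p>Footnote 3</p></div>')

# Forward iteration over the live list: every extract() shifts the
# remaining items one position to the left, so the loop steps over
# every second element.
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "footnotes"})
moved = [child.extract() for child in div.contents]
moved_texts = [p.get_text() for p in moved]
left_texts = [p.get_text() for p in div.contents]
print(moved_texts)  # the footnotes that were actually moved
print(left_texts)   # the footnotes left behind in the div

# Iterating over a copy of the list avoids the problem.
soup2 = BeautifulSoup(html, "html.parser")
div2 = soup2.find("div", {"class": "footnotes"})
moved_all = [child.extract() for child in list(div2.contents)]
print(len(moved_all), len(div2.contents))
```

With this input the forward loop leaves the second footnote behind, while the loop over the copy moves all three.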

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <div class="footnotes">
...     <p>Footnote 1</p>
...     <p>Footnote 2</p>
... </div>
... ''')
>>> footnotes = soup.find("div", { "class" : "footnotes" })
>>> new_ol = soup.new_tag("ol")
>>> for content in reversed(footnotes.contents):
...     new_ol.insert(0, content.extract())
... 
>>> footnotes.append(new_ol)
>>> print(footnotes)
<div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div>
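
If you need this in more than one place, the same steps could be folded into a small helper. This is only a sketch: wrap_contents is a name invented here, not part of BeautifulSoup, it loops over a copy of the contents rather than in reverse, and "html.parser" is an assumed parser choice:

```python
from bs4 import BeautifulSoup


def wrap_contents(soup, tag, wrapper_name):
    """Move every child of `tag` into a new wrapper element inside `tag`.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    wrapper = soup.new_tag(wrapper_name)
    # Loop over a copy, because extract() mutates tag.contents in place.
    for child in list(tag.contents):
        wrapper.append(child.extract())
    tag.append(wrapper)
    return wrapper


soup = BeautifulSoup(
    '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>',
    "html.parser",
)
footnotes = soup.find("div", {"class": "footnotes"})
if footnotes is not None:  # find() returns None when nothing matches
    wrap_contents(soup, footnotes, "ol")
result = str(footnotes)
print(result)
```

Because the wrapper is filled by appending in document order, there is no need for insert(0, ...) here.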

Answer 2

Using lxml:

import lxml.html as LH
import lxml.builder as builder
E = builder.E

doc = LH.parse('data')
footnote = doc.find('//div[@class="footnotes"]')
ol = E.ol()
for tag in footnote:
    ol.append(tag)
footnote.append(ol)
print(LH.tostring(doc.getroot()))

prints

<html><body><div class="footnotes">
    <ol><p>Footnote 1</p>
    <p>Footnote 2</p>
</ol></div></body></html>

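Note that with lxml, an Element (tag) can only be in one place in the tree (since every Element has only one parent), so appending tag to ol also removes it from footnote. So unlike with BeautifulSoup, you do not need to iterate over the contents in reverse order, nor use insert(0, ...). You just append in order.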
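
The move-on-append behaviour can be checked in isolation. The following is a small sketch using an inline fragment instead of the 'data' file; the variable names are made up for illustration:

```python
import lxml.html as LH

# Inline fragment (an assumption for this sketch) with no whitespace
# between tags, so there are no text/tail nodes to think about.
div = LH.fromstring(
    '<div class="footnotes"><p>Footnote 1</p><p>Footnote 2</p></div>'
)
ol = div.makeelement("ol")

first = div[0]
ol.append(first)                 # appending moves `first` out of div
len_after_first_move = len(div)  # one <p> is left in div

ol.append(div[0])                # move the remaining <p>
div.append(ol)                   # finally attach the wrapper

child_tags = [child.tag for child in div]
footnote_texts = [p.text for p in ol]
first_parent_tag = first.getparent().tag
print(LH.tostring(div))
```

After the first append the div holds a single child, and the moved element reports ol as its parent, which is why no explicit removal step is needed.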

Using BeautifulSoup:

import bs4 as bs
with open('data', 'r') as f:
    soup = bs.BeautifulSoup(f)

footnote = soup.find("div", { "class" : "footnotes" })
new_ol = soup.new_tag("ol")

for content in reversed(footnote.contents):
    new_ol.insert(0, content.extract())

footnote.append(new_ol)
print(soup)

prints

<html><body><div class="footnotes"><ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol></div></body></html>

Wrapping the contents of a tag with BeautifulSoup
