I have a template like this:
<html><body><div id="here"></div></body></html>
and input HTML like this:
<html><body>COMPLEX HTML</body></html>
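where COMPLEX HTML stands for many child tags (the markup is clean; it has been validated).
I'm trying to move the HTML inside the body tag of the input HTML into the #here div of the template, to get this: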
<html><body><div id="here">COMPLEX HTML</div></body></html>
I tried:
t = BeautifulSoup("<html><body><div id=\"here\"></div></body></html>")
pc = t.find("div", id="here")
s = BeautifulSoup(open("complex.html"))
# this prints every tag in body
for b in s.body.contents:
    print(b.name)
# this prints only some of the tags
for b in s.body.contents:
    print(b.name)
    pc.append(b)
It is as if appending b to pc moves the iterator over s.body forward. How can I take an HTML structure out of one soup and put it into another?
Answer 0 (score: 1)
You can do it like this:
from bs4 import BeautifulSoup
html = """<html><body><div id="here"></div></body></html>"""
soup = BeautifulSoup(html)
div = soup.find("div", id="here")
html2 = """<html><body><script src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="//cdn.sstatic.net/Js/stub.en.js?v=283ea58c715b"></script>
<link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/stackoverflow/all.css?v=71d362e7c10c">
</body></html>"""
soup1 = BeautifulSoup(html2)
value = soup1.body.extract()
div.append(value)
print(div)
The output is:
<div id="here"><body><script src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="//cdn.sstatic.net/Js/stub.en.js?v=283ea58c715b"></script>
<link href="//cdn.sstatic.net/stackoverflow/all.css?v=71d362e7c10c" rel="stylesheet" type="text/css">
</link></body></div>
If you only want the contents inside body, you can do this:
# same lines as above
soup1 = BeautifulSoup(html2)
value = soup1.body.extract()
div.append(value)
# unwrap() replaces a tag with whatever is inside that tag
div.body.unwrap()
print(div)
The output is:
<div id="here"><script src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
<script src="//cdn.sstatic.net/Js/stub.en.js?v=283ea58c715b"></script>
<link href="//cdn.sstatic.net/stackoverflow/all.css?v=71d362e7c10c" rel="stylesheet" type="text/css">
</link></div>
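The extract-then-unwrap steps above can be bundled into a single helper. This is only a sketch: the function name move_body_into, its parameters, the sample input, and the explicit "html.parser" argument are assumptions for illustration and do not appear in the original answer.

```python
from bs4 import BeautifulSoup


def move_body_into(template_html, input_html, target_id):
    """Hypothetical helper: move the children of the input <body> into
    the template's <div id=target_id> and return the template soup."""
    # Assumption: the stdlib-backed parser is good enough for clean input.
    template = BeautifulSoup(template_html, "html.parser")
    source = BeautifulSoup(input_html, "html.parser")

    target = template.find("div", id=target_id)
    body = source.body.extract()   # detach <body> from the input soup
    target.append(body)            # <div id="here"><body>...</body></div>
    target.body.unwrap()           # drop the <body> wrapper, keep its children
    return template


result = move_body_into(
    '<html><body><div id="here"></div></body></html>',
    "<html><body><p>first</p><p>second</p></body></html>",
    "here",
)
print(result)
```

Because the body tag is moved as a single node, nothing is iterated while the tree changes, so no child can be skipped.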
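Answer 1 (score: 0)
OK, so append(tag) removes the tag from the structure it currently belongs to, which means the next tag is effectively skipped (because you are changing the structure while iterating over it).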
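The skipping can be reproduced with a tiny document. The sketch below is an illustration only: the four-paragraph input, the helper fresh_pair, and the "html.parser" argument are assumptions, not part of the original post. It also shows one possible workaround, iterating over a snapshot made with list().

```python
from bs4 import BeautifulSoup


def fresh_pair():
    """Hypothetical helper: build a small source soup and an empty target div."""
    source = BeautifulSoup(
        "<body><p>1</p><p>2</p><p>3</p><p>4</p></body>", "html.parser"
    )
    target = BeautifulSoup('<div id="here"></div>', "html.parser").div
    return source, target


# Problem: appending a child removes it from source.body.contents,
# so the list shrinks under the loop and every other child is skipped.
source, target = fresh_pair()
for child in source.body.contents:
    target.append(child)
moved_live = [p.get_text() for p in target.contents]
left_live = [p.get_text() for p in source.body.contents]
print(moved_live, left_live)

# Workaround: iterate over a copy of the list, which does not change.
source, target = fresh_pair()
for child in list(source.body.contents):
    target.append(child)
moved_snapshot = [p.get_text() for p in target.contents]
print(moved_snapshot, len(source.body.contents))
```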
I used this instead:
bc = soup.body.contents
while len(bc) > 0:
    pc.append(bc[0])
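This still removes bc[0] from the body, but I no longer depend on the structure staying unchanged.
That is fine for me, because I don't need the original soup.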
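If the original soup does need to stay intact, one option is to append copies rather than the tags themselves. This is a sketch under assumptions: it relies on copy.copy() support for Beautiful Soup objects (available in bs4 4.4.0 and later, as far as I know), and the sample markup and variable names are made up for illustration.

```python
import copy

from bs4 import BeautifulSoup

template = BeautifulSoup(
    '<html><body><div id="here"></div></body></html>', "html.parser"
)
source = BeautifulSoup(
    "<html><body><p>one</p><p>two</p></body></html>", "html.parser"
)
target = template.find("div", id="here")

# Iterating source.body.contents directly is safe here, because only
# detached copies are appended and the source tree is never modified.
for child in source.body.contents:
    target.append(copy.copy(child))

print(template)
print(source)
```

The trade-off is extra memory for the duplicated nodes, which only matters for very large inputs.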