Extracting the contents of a div, including tags, with BeautifulSoup 4

Time: 2017-08-07 13:03:30

Tags: python html beautifulsoup python-requests

Hello, when I run the following:

soup.find('div', id='id1')

I get this:

<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>

I only need this:

<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>

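Is there a way to get the content above, without the enclosing div? I tried .contents, but it did not give me what I need.

Thanks

4 answers:

Answer 0 (score: 1)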

from bs4 import BeautifulSoup

html = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
el = soup.find('div', id='id1')
# decode_contents() renders only the children of the tag, not the tag itself
print(el.decode_contents(formatter="html"))

Output:

<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>

Answer 1 (score: 0)

Using contents, I got the following:

[u'\n', <p id="ptag"> hello this is "p" tag</p>, u'\n', <span id="spantag"> hello this is "p" tag</span>, u'\n', <div id="divtag"> hello this is "p" tag</div>, u'\n', <h1 id="htag"> hello this is "p" tag</h1>, u'\n']

Iterate over the list and you can easily build the desired output (skipping the \n elements).
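As a rough sketch of that iteration (the names children and inner_html are only illustrative, and filtering on Tag is one possible way to skip the newline strings, not necessarily what the answerer had in mind):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
el = soup.find('div', id='id1')

# .contents mixes child tags with the bare '\n' strings between them.
# Keeping only Tag instances drops those whitespace-only strings.
children = [str(child) for child in el.contents if isinstance(child, Tag)]

# Join the serialized children back into one HTML fragment.
inner_html = "\n".join(children)
print(inner_html)
```

Note that filtering on Tag would also drop any real text placed directly inside the outer div, so this only fits markup shaped like the example in the question.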

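Answer 2 (score: 0)

Assuming the result of soup.find(...) has been converted to a string, then:

import re
# strip only the outermost opening <div ...> and the final </div>
inner = re.sub(r'\A<div[^>]*>|</div>\Z', '', str(soup.find('div', id='id1')))

might work.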

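Answer 3 (score: 0)

BeautifulSoup has a dedicated method that does exactly what you need: unwrap(). The documentation describes it as follows:

  "Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's good for stripping out markup."

Working example: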

from bs4 import BeautifulSoup


data = """
<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
soup.div.unwrap()

print(soup)

Prints:

<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
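One caveat: soup.div refers to the first div in the document, which on a real page may not be the one you want. A minimal sketch that targets the div from the question by its id instead (the shortened markup and the names outer and result are only for illustration):

```python
from bs4 import BeautifulSoup

data = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<div id="divtag"> hello this is "p" tag</div>
</div>"""

soup = BeautifulSoup(data, 'html.parser')

# Look the wrapper up by id rather than relying on it being the first div.
outer = soup.find('div', id='id1')

# unwrap() modifies the tree in place: the wrapper is removed
# and its children take its position in the document.
outer.unwrap()

result = str(soup)
print(result)
```

Because unwrap() changes the parsed tree itself, use decode_contents() instead if the rest of the document still needs the original structure.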