HOWTO: Removing content and tags in beautifulsoup4 via find or find_all

Date: 2016-06-23 07:11:29

Tags: beautifulsoup

I'm stuck on this problem. I can locate the tag and its content with find, but the question is how to delete it. Here is my example:

html="""<a href="http://digi.tech.com/a/20160621/050783.htm" rel="nofollow" <table cellpadding="0" cellspacing="0" class="atd"><tbody><tr><td id="article_content"><p align="center" class="pageLink">
</p>
</td></tr></tbody></table>
<p style="text-align: center;"><img alt=" " data-bd-imgshare-binded="1" height="220" src="/skin/vr186/images/wxin.jpg" width="220"/></p>
<p style="text-align: center;"><span style="color: rgb(102, 204, 204);"><strong>every day 5</strong></span></p>
<div id="click_div"><div class="left_boxs_tit4"><div class="blank10"></div>
<a name="pl"></a>
<div class="blank20"></div><div class="feelings"><iframe frameborder="0" height="200" id="mood_frame" marginheight="0" marginwidth="0" scrolling="no" src="/e/extend/mood/?classid=2&amp;id=4559" width="538"></iframe></div></div></div>"""

I can currently get the content and tags above with the following:

from bs4 import BeautifulSoup

a = BeautifulSoup(html)
fst = a.find(class_="atd")
next_siblings = fst.find_next_siblings()

This gives the following string:

<table cellpadding="0" cellspacing="0" class="atd"><tbody><tr><td id="article_content"><p align="center" class="pageLink">
</p>
</td></tr></tbody></table>
<p style="text-align: center;"><img alt=" " data-bd-imgshare-binded="1" height="220" src="/skin/vr186/images/wxin.jpg" width="220"/></p>
<p style="text-align: center;"><span style="color: rgb(102, 204, 204);"><strong>every day 5</strong></span></p>
<div id="click_div"><div class="left_boxs_tit4"><div class="blank10"></div>
<a name="pl"></a>
<div class="blank20"></div><div class="feelings"><iframe frameborder="0" height="200" id="mood_frame" marginheight="0" marginwidth="0" scrolling="no" src="/e/extend/mood/?classid=2&amp;id=4559" width="538"></iframe></div></div>

However, I can't remove it with del ['tag_name'], because it is only a small part of a long article. How do I delete a tag and its content by its id?

1 answer:

Answer 0: (score: 0)

You just need to select the element and extract it:

html="""<a href="http://digi.tech.com/a/20160621/050783.htm" rel="nofollow" <table cellpadding="0" cellspacing="0" class="atd"><tbody><tr><td id="article_content"><p align="center" class="pageLink">
</p>
</td></tr></tbody></table>
<p style="text-align: center;"><img alt=" " data-bd-imgshare-binded="1" height="220" src="/skin/vr186/images/wxin.jpg" width="220"/></p>
<p style="text-align: center;"><span style="color: rgb(102, 204, 204);"><strong>every day 5</strong></span></p>
<div id="click_div"><div class="left_boxs_tit4"><div class="blank10"></div>
<a name="pl"></a>
<div class="blank20"></div><div class="feelings"><iframe frameborder="0" height="200" id="mood_frame" marginheight="0" marginwidth="0" scrolling="no" src="/e/extend/mood/?classid=2&amp;id=4559" width="538"></iframe></div></div></div>"""


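from bs4 import BeautifulSoup

soup = BeautifulSoup(html)

# Locate the element with class "atd"
fst = soup.find(class_="atd")

# Remove the #click_div element (and everything inside it) from the tree
fst.select_one("#click_div").extract()
print(fst)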

Which would give you:

<a cellpadding="0" cellspacing="0" class="atd" href="http://digi.tech.com/a/20160621/050783.htm" rel="nofollow"><tbody><tr><td id="article_content"><p align="center" class="pageLink">
</p>
</td></tr></tbody>
<p style="text-align: center;"><img alt=" " data-bd-imgshare-binded="1" height="220" src="/skin/vr186/images/wxin.jpg" width="220"/></p>
<p style="text-align: center;"><span style="color: rgb(102, 204, 204);"><strong>every day 5</strong></span></p>
</a>

If you prefer, you can use fst.find(id="click_div").extract() instead; the result will be the same.
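For comparison, here is a minimal sketch of the two removal methods side by side. The sample markup, the "ad" class, and the variable names are made up for illustration and are not the HTML from the question. It assumes the built-in "html.parser" backend. extract() detaches a tag and returns it, while decompose() removes a tag and destroys it, which is handy when looping over find_all results.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample markup (not the HTML from the question).
sample = """
<div id="article">
  <p class="keep">every day 5</p>
  <div id="click_div"><iframe src="/mood"></iframe></div>
  <p class="ad">ad one</p>
  <p class="ad">ad two</p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# extract() detaches the tag (with its children) and returns it for reuse.
removed = soup.find(id="click_div").extract()

# decompose() removes the tag and destroys it; find_all returns a list,
# so it is safe to modify the tree while iterating over the results.
for tag in soup.find_all("p", class_="ad"):
    tag.decompose()

# Collect what is left so the effect can be inspected.
remaining_ids = [t.get("id") for t in soup.find_all(id=True)]
remaining_text = soup.get_text(strip=True)
print(soup)
```

If the removed element is never needed again, decompose() is probably the more natural choice; extract() is useful when the element should be kept or moved elsewhere.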