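I saved a web page as an .htm file. Essentially, I need to parse six levels of nested divs and pull specific data out of them. I'm confused about how to approach this. I've tried different techniques, but nothing has worked.
The HTM file has a bunch of tags, but one div looks like this: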
<div id="fbbuzzresult" class.....>
<div class="postbuzz"> .... </div>
<div class="linkbuzz">...</div>
<div class="descriptionbuzz">...</div>
<div class="metabuzz">
<div class="time">...</div>
<div>
<div class="postbuzz"> .... </div>
<div class="postbuzz"> .... </div>
<div class="postbuzz"> .... </div>
</div>
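I'm trying BeautifulSoup. Some more background...
I need to extract and print each of the items shown above from within every postbuzz div.
Any help and guidance with some skeleton code would be much appreciated! P.S. Ignore the dashes in the div classes. Thanks!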
Answer 0: (score: 3)
You should be able to use your result the same way as the parent soup:
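# BeautifulSoup 3 import; with bs4, use: from bs4 import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)  # html: the contents of the saved page, as a string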
div = soup.find("div",{"id":"fbbuzzresult"})
post_buzz = div.findAll("div",{"class":"postbuzz"})
However, I have run into errors doing this before, so as a fallback you can build a kind of sub_soup:
from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)
div = soup.find("div",{"id":"fbbuzzresult"})
sub_soup = bs(str(div))  # re-parse the div's markup as a standalone soup
post_buzz = sub_soup.findAll("div",{"class":"postbuzz"})
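For comparison, here is a minimal sketch of the first approach using the newer bs4 package, where the Tag returned by find() can be searched directly, so re-parsing into a sub_soup is usually not needed. The sample markup is made up for illustration and only loosely modeled on the question; the real page will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question; the real page will differ.
html = """
<div id="fbbuzzresult">
  <div class="postbuzz"><div class="linkbuzz">first link</div></div>
  <div class="postbuzz"><div class="linkbuzz">second link</div></div>
</div>
<div class="postbuzz">outside the container</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a Tag, or None when nothing matches.
container = soup.find("div", {"id": "fbbuzzresult"})

# A Tag supports the same search methods as the soup, limited to its descendants.
post_buzz = container.find_all("div", {"class": "postbuzz"}) if container else []

print(len(post_buzz))
```

Checking for None before searching avoids an AttributeError when the container div is missing from the saved page.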
Answer 1: (score: 1)
First, read the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Second, here is a small example to get you going:
from bs4 import BeautifulSoup as bs
soup = bs(your_html_content, "html.parser")
# for fbbuzzresult
buzz = soup.findAll("div", {"id" : "fbbuzzresult"})[0]
# to get postbuzz
pbuzz = buzz.findAll("div", {"class" : "postbuzz"})
"""pbuzz is now an array with the postbuzz divs
so now you can iterate through them, get
the contents, keep traversing the DOM with BS
or do whatever you are trying to do
So say you want the text from an element, you
would just do: the_element.contents[0]. However
if I'm remembering correctly you have to traverse
down through all of its children to get the text.
"""