Parsing multiple layers with BeautifulSoup

Time: 2014-05-12 17:59:05

Tags: python html beautifulsoup extract

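I saved a web page as .htm. Essentially, I need to parse six levels of nested divs and pull specific data out of them. I'm quite confused about how to approach this. I've tried different techniques, but nothing has worked.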

The HTM file has a bunch of tags, but one div looks like this:

<div id="fbbuzzresult" class.....>
   <div class="postbuzz">
      <div class="linkbuzz">...</div>
      <div class="descriptionbuzz">...</div>
      <div class="metabuzz">
         <div class="time">...</div>
      </div>
   </div>
   <div class="postbuzz"> .... </div>
   <div class="postbuzz"> .... </div>
   <div class="postbuzz"> .... </div>
</div>

I'm trying BeautifulSoup. Some more background:

  1. There is only one fbbuzzresult in the entire file.
  2. There are multiple postbuzz divs inside fbbuzzresult.
  3. As shown above, there are further divs inside each postbuzz.

I need to extract and print each of the items shown above from every postbuzz div.

I'd really appreciate any help and guidance in the form of some skeleton code! P.S. Ignore the dashes in the div classes. Thanks!

2 Answers:

Answer 0: (score: 3)

You should be able to use your result the same way you use the parent soup:

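from bs4 import BeautifulSoup as bs  # bs4 is the current package; BeautifulSoup 3 is no longer maintained
soup = bs(html, "html.parser")  # html: the page markup as a string
div = soup.find("div", {"id": "fbbuzzresult"})
post_buzz = div.find_all("div", {"class": "postbuzz"})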

However, I have run into errors doing this before, so as a fallback you can build a kind of sub_soup by re-parsing the matched div:

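from bs4 import BeautifulSoup as bs  # bs4 is the current package; BeautifulSoup 3 is no longer maintained
soup = bs(html, "html.parser")  # html: the page markup as a string
div = soup.find("div", {"id": "fbbuzzresult"})
sub_soup = bs(str(div), "html.parser")  # re-parse just the matched div
post_buzz = sub_soup.find_all("div", {"class": "postbuzz"})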
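
To connect this back to the question, here is a minimal sketch of how the postbuzz results might be walked to pull out each inner div. It assumes bs4 with the built-in "html.parser"; the sample markup, its text values, and the helper name extract_posts are made up for illustration and only mirror the structure described in the question.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mirrors the structure in the question.
SAMPLE_HTML = """
<div id="fbbuzzresult">
  <div class="postbuzz">
    <div class="linkbuzz">first link</div>
    <div class="descriptionbuzz">first description</div>
    <div class="metabuzz"><div class="time">10:00</div></div>
  </div>
  <div class="postbuzz">
    <div class="linkbuzz">second link</div>
    <div class="descriptionbuzz">second description</div>
    <div class="metabuzz"><div class="time">11:30</div></div>
  </div>
</div>
"""


def extract_posts(html):
    """Return one dict per postbuzz div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="fbbuzzresult")
    if container is None:
        return []
    posts = []
    for post in container.find_all("div", class_="postbuzz"):
        record = {}
        # find() searches all descendants, so the nested "time" div is found too.
        for name in ("linkbuzz", "descriptionbuzz", "time"):
            node = post.find("div", class_=name)
            record[name] = node.get_text(strip=True) if node else None
        posts.append(record)
    return posts


for post in extract_posts(SAMPLE_HTML):
    print(post)
```

For a page saved to disk, the same function could be fed the file's contents, for example by reading the .htm file with open() and passing the resulting string.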

Answer 1: (score: 1)

First, read the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Second, here is a small example to get you going:

from bs4 import BeautifulSoup as bs

# your_html_content: the page markup as a string
soup = bs(your_html_content, "html.parser")

# for fbbuzzresult (there is only one, so find() is enough)
buzz = soup.find("div", {"id": "fbbuzzresult"})

# to get postbuzz
pbuzz = buzz.find_all("div", {"class": "postbuzz"})

"""pbuzz is now a list of the postbuzz divs,
   so you can iterate through them, get
   the contents, keep traversing the DOM with BS,
   or do whatever you are trying to do.

   Say you want the text from an element: 
   the_element.contents[0] gives you its first child only.
   To get the text of all of its children, either walk
   down through them or call the_element.get_text().
"""
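
The difference between taking the first child and collecting all nested text can be seen in a short sketch. It assumes bs4 with the built-in "html.parser"; the one-line markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: a div whose text is split around a nested tag.
html = '<div class="descriptionbuzz">Hello <b>nested</b> world</div>'
node = BeautifulSoup(html, "html.parser").find("div", class_="descriptionbuzz")

first_child = node.contents[0]            # only the first child node
all_text = node.get_text()                # text of every descendant, joined
pieces = list(node.stripped_strings)      # each text fragment, whitespace-trimmed

print(repr(str(first_child)))
print(repr(all_text))
print(pieces)
```

When an element contains nested tags, .contents[0] stops at the first piece, so get_text() or stripped_strings is usually the safer choice for printing everything inside a postbuzz div.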