BeautifulSoup parsing: getting information from child tags

Time: 2014-03-24 06:16:53

Tags: python parsing

I have the following "website" (this is part of the HTML):

<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div> 

I want to extract sometext and somelink. To do that, I wrote the following Python code:

for links in soup.find_all('div','moduleBody'):
        for link in links.find_all('div','feature'):
            if not("video" in (link['href'])):
                print "Name: "+link.text
                #sibling_page=urllib2.urlopen("major_link"+link['href'])
                print " Link extracted: "+link['href']

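However, this code does not print anything. Can you tell me where my mistake is?

2 answers:

Answer 0 (score: 1)

Your div has no href attribute. You have to look one level down, at the <a> element.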

from bs4 import BeautifulSoup

html = """
<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div>
"""

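# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        # The href lives on the <a> inside each feature div, not on the div.
        for a in link.find_all("a"):
            if "video" not in a['href']:
                print("Name: " + a.text)
                print("Link extracted: " + a['href'])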

This prints:

Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink

It is found twice because your HTML is broken. BeautifulSoup repairs it as follows:

<div class="moduleBody">
 <div class="feature">
  <div class="feature">
   <h2>
    <a href="somelink">
     sometext
    </a>
   </h2>
   <div class="relatedInfo">
    <span class="relatedTopics">
     <span class="timestamp">
      22 Mar 2014
     </span>
    </span>
   </div>
  </div>
 </div>
</div>
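
If the duplicate is unwanted, one option is to search for the <a> tags from the moduleBody div directly instead of looping over every feature div. The sketch below assumes the built-in "html.parser" and the question's markup; the names features and results are only illustrative. It also checks that the second feature div really ends up nested inside the first one.

```python
from bs4 import BeautifulSoup

# The question's markup, including the unclosed first <div class="feature">.
html = """
<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Both feature divs are found, and the second one sits inside the first.
features = soup.find_all("div", "feature")
print(len(features))
print(features[1].parent is features[0])

# Searching from moduleBody visits each anchor once, however the
# feature divs happen to be nested after repair.
results = []
for body in soup.find_all("div", "moduleBody"):
    for a in body.find_all("a", href=True):
        if "video" not in a["href"]:
            results.append((a.get_text(strip=True), a["href"]))

for name, href in results:
    print("Name: " + name)
    print("Link extracted: " + href)
```

Passing href=True also skips any <a> without an href, so a['href'] cannot raise a KeyError here.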

Answer 1 (score: 0)

In your second for loop, your link variable holds a reference to <div class="feature">, which has no href attribute.

This depends a lot on your structure, but if the <div class="feature"> tag always starts with an <h2> tag that contains only an <a> tag, then what you can do is:

for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        anchor_tag = link.h2.a
        if 'video' not in anchor_tag['href']:
            print('Name: %s' % anchor_tag.text)
            print('Link extracted: %s' % anchor_tag['href'])

By the way, your HTML is not well-formed; you should close the first <div class="feature"> tag:

<div class="moduleBody">
     <div class="feature"></div>
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div>
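
One caveat with the link.h2.a shortcut: once the first feature div is closed, it is empty, so link.h2 is None and link.h2.a raises an AttributeError. A minimal sketch with a guard, assuming "html.parser" and a trimmed version of the corrected markup (extract_links is a hypothetical helper name, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup

# Trimmed version of the corrected markup: the first feature div is closed and empty.
html = """
<div class="moduleBody">
     <div class="feature"></div>
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
     </div>
</div>
"""


def extract_links(markup):
    """Return (text, href) pairs for non-video links under feature headings."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for module in soup.find_all("div", "moduleBody"):
        for feature in module.find_all("div", "feature"):
            heading = feature.h2
            anchor = heading.a if heading is not None else None
            # Skip features with no <h2><a href=...> instead of failing on None.
            if anchor is None or not anchor.has_attr("href"):
                continue
            if "video" not in anchor["href"]:
                found.append((anchor.get_text(strip=True), anchor["href"]))
    return found


links = extract_links(html)
for name, href in links:
    print("Name: %s" % name)
    print("Link extracted: %s" % href)
```

The empty feature div is simply skipped, so only the real link is reported.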