Question

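Can someone help me extract some data from the sample HTML below using Beautiful Soup in Python? This is what I want to extract:

The href link, for example: /movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text, which holds the movie name: Buddy 2013 Malayalam Movie
The thumbnail image, for example: http://i44.tinypic.com/2lo14b8.jpg

(These occur multiple times.)

Full source: http://olangal.com

Sample HTML: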

 <div class="item column-1">
  <h2>
   <a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
    Buddy
   </a>
  </h2>
  <ul class="actions">
   <li class="email-icon">
    <a href="/component/mailto/?tmpl=component&amp;template=beez_20&amp;link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
     <img src="/media/system/images/emailButton.png" alt="Email" />
    </a>
   </li>
  </ul>
  <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
  <p class="readmore">
   <a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
    Read more...
   </a>
  </p>
  <div class="item-separator">
  </div>
 </div>
 <div class="item column-2">
  <h2>
   <a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
    Pigman
   </a>
  </h2>
  <ul class="actions">
   <li class="email-icon">
    <a href="/component/mailto/?tmpl=component&amp;template=beez_20&amp;link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
     <img src="/media/system/images/emailButton.png" alt="Email" />
    </a>
   </li>
  </ul>
  <img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
  <p class="readmore">
   <a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
    Read more...
   </a>
  </p>
  <div class="item-separator">
  </div>
 </div>

Update: I finally cracked it with the help of @kroolik. Thank you.

This worked for me:

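for eachItem in soup.find_all("div", {"class": "item"}):
    # Remove the <ul class="actions"> block so its email icon and link are skipped.
    eachItem.ul.decompose()

    # Only the thumbnail <img> is left once the <ul> is gone.
    imglinks = eachItem.find_all('img')
    for imglink in imglinks:
        imgfullLink = imglink.get('src').strip()

    # Prints once per remaining <a>: the title link and the "Read more..." link.
    links = eachItem.find_all('a')
    for link in links:
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
        print("Extracted : " + names + " , " + imgfullLink + " , " + fullLink)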
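
The loop above prints each movie twice, once for the title link and once for the "Read more..." link. A minimal sketch of a variant that collects one record per item follows. It assumes Beautiful Soup 4 and Python 3; the `extract_items` helper, the `BASE_URL` constant, and the trimmed `SAMPLE_HTML` (with a shortened mailto link) are illustrative names, not part of the original post.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://olangal.com"  # site root taken from the question

# Trimmed copy of one item from the sample HTML (mailto link shortened).
SAMPLE_HTML = """
<div class="item column-1">
  <h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html"> Buddy </a></h2>
  <ul class="actions">
    <li class="email-icon">
      <a href="/component/mailto/" title="Email"><img src="/media/system/images/emailButton.png" alt="Email" /></a>
    </li>
  </ul>
  <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
  <p class="readmore"><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html"> Read more... </a></p>
</div>
"""


def extract_items(html, base_url=BASE_URL):
    """Return one dict per <div class="item"> with its link, alt text and thumbnail."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("div", class_="item"):
        title_link = item.h2.a  # the link inside the <h2> heading
        thumb = item.find("img", recursive=False)  # direct child, skips the email icon
        records.append({
            "href": urljoin(base_url, title_link["href"].strip()),
            "alt": thumb["alt"].strip(),
            "src": thumb["src"].strip(),
        })
    return records


for record in extract_items(SAMPLE_HTML):
    print(record)
```

Using `urljoin` rather than string concatenation keeps the result correct whether the href is relative or already absolute.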

Answer 1

You can get the <img width="110"> and <p class="readmore"> elements with the following:

for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is a direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)

    # Will print the `href` attribute of the first `<a>` element
    # inside `p`.
    print(p.a['href'])

    # Will match `<img width="110">` that is a direct child
    # of the div.
    img = div.find('img', width=110, recursive=False)

    print(img['src'], img['alt'])

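Note that this is for the latest version of Beautiful Soup; `find_all` and the `class_` keyword belong to the Beautiful Soup 4 API.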
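
To illustrate why `recursive=False` matters here, the following sketch compares a default search with a direct-child search on a trimmed copy of one item from the question. It assumes Beautiful Soup 4 with the built-in `html.parser`; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one item from the question's sample HTML.
html = """
<div class="item column-1">
  <ul class="actions">
    <li class="email-icon">
      <a href="/component/mailto/"><img src="/media/system/images/emailButton.png" alt="Email" /></a>
    </li>
  </ul>
  <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
</div>
"""

div = BeautifulSoup(html, "html.parser").find(class_="item")

# The default search walks all descendants, so the nested email icon comes first.
first_any_depth = div.find("img")

# recursive=False only looks at direct children of the div.
first_direct_child = div.find("img", recursive=False)

print(first_any_depth["src"])
print(first_direct_child["src"])
```

The first lookup returns the email button image, while the second returns the movie thumbnail, which is the one the question asks for.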

Answer 2

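I usually use PyQuery for this kind of scraping; it is clean and concise, and you can use jQuery selectors directly. For example, to see your name and reputation, I would just write something like this: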

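from pyquery import PyQuery as pq

# Fetch the profile page and query it with jQuery-style selectors.
d = pq(url='http://stackoverflow.com/users/1234402/gbzygil')
p = d('#user-displayname')                      # display name
t = d('#user-panel-reputation div h1 a span')   # reputation
print(p.html())
print(t.html())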

So unless you cannot switch away from Beautiful Soup, I strongly recommend switching to PyQuery or some other library that supports XPath.
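
As a rough sketch of the XPath route, the question's three fields could be pulled out with lxml as shown below. This assumes lxml is installed; the wrapping `<div id="listing">`, the `ITEM_XPATH` constant, and the trimmed sample markup are assumptions made for the example, not something taken from the original page.

```python
from lxml import html as lxml_html

# Trimmed copy of the question's sample, wrapped in a hypothetical container div.
SAMPLE_HTML = """
<div id="listing">
  <div class="item column-1">
    <h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html"> Buddy </a></h2>
    <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
    <div class="item-separator"></div>
  </div>
  <div class="item column-2">
    <h2><a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html"> Pigman </a></h2>
    <img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
    <div class="item-separator"></div>
  </div>
</div>
"""

# Match the whole class token "item", so "item-separator" is not picked up.
ITEM_XPATH = '//div[contains(concat(" ", normalize-space(@class), " "), " item ")]'

root = lxml_html.fromstring(SAMPLE_HTML)
rows = []
for item in root.xpath(ITEM_XPATH):
    href = item.xpath("./h2/a/@href")[0].strip()
    img = item.xpath("./img")[0]  # direct child of the item only
    rows.append((href, img.get("alt").strip(), img.get("src").strip()))

for row in rows:
    print(row)
```

A plain `contains(@class, "item")` test would also match the `item-separator` divs, which is why the expression pads the class attribute with spaces before comparing.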

How to extract href, alt and imgsrc using Beautiful Soup in Python

2 answers: