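Can someone help me extract some data from the sample HTML below using Beautiful Soup in Python? These are the things I want to extract:
The href link, for example:
/movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text, which contains the movie name:
Buddy 2013 Malayalam Movie
The thumbnail image, for example: http://i44.tinypic.com/2lo14b8.jpg
(Each of these occurs multiple times on the page.)
Full source: http://olangal.com
Sample HTML: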
<div class="item column-1">
<h2>
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Buddy
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
<div class="item column-2">
<h2>
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Pigman
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
Update: I finally cracked it with help from @kroolik. Thank you.
This is what worked for me:
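# `soup` is the BeautifulSoup object built from the page HTML.
for eachItem in soup.find_all("div", {"class": "item"}):
    # Remove the <ul class="actions"> block so its e-mail link and icon are skipped.
    eachItem.ul.decompose()
    imglinks = eachItem.find_all('img')
    for imglink in imglinks:
        imgfullLink = imglink.get('src').strip()
    links = eachItem.find_all('a')
    for link in links:
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
    # Each loop overwrites its variables, so the values from the last
    # <img> and the last <a> in the item are the ones printed.
    print("Extracted : " + names + " , " + imgfullLink + " , " + fullLink)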
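The snippet above builds the absolute link by concatenating strings. As a minimal sketch, assuming Python 3 and the base URL http://olangal.com from the question, the standard library's urljoin can do the same job and also leaves already-absolute URLs (such as the thumbnail) untouched. The helper name absolute_url is hypothetical, introduced only for illustration.

```python
from urllib.parse import urljoin

# Base URL taken from the question; adjust it if the site differs.
BASE_URL = "http://olangal.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the base URL.

    Hypothetical helper: strips surrounding whitespace, then lets
    urljoin decide whether the href is relative or already absolute.
    """
    return urljoin(base, href.strip())


# A relative link is resolved against the base URL.
print(absolute_url("/movies/watch-malayalam-movies-online/6106-watch-buddy.html"))
# An absolute link is returned as-is.
print(absolute_url("http://i44.tinypic.com/2lo14b8.jpg"))
```

Unlike plain concatenation, this does not depend on whether the href starts with a slash.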
Answer 0 (score: 3)
You can get the <img width="110"> and <p class="readmore"> elements with the following:
for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is a direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)
    # Will print the `href` attribute of the first `<a>` element
    # inside `p`.
    print(p.a['href'])
    # Will match `<img width="110">` that is a direct child
    # of the div.
    img = div.find('img', width=110, recursive=False)
    print(img['src'], img['alt'])
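Note that this is written for the latest Beautiful Soup version (Beautiful Soup 4, which provides find_all and the class_ keyword).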
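For a version that runs on its own, here is a minimal sketch of the same approach, assuming Beautiful Soup 4 is installed and using Python's built-in html.parser backend. SAMPLE_HTML is a trimmed copy of the question's markup, and the function name extract_movies and the dictionary keys are hypothetical, chosen only for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question (the e-mail <ul> is omitted).
SAMPLE_HTML = """
<div class="item column-1">
<h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Buddy</a></h2>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore"><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Read more...</a></p>
<div class="item-separator"></div>
</div>
<div class="item column-2">
<h2><a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">Pigman</a></h2>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore"><a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">Read more...</a></p>
<div class="item-separator"></div>
</div>
"""


def extract_movies(html):
    """Return one dict per <div class="item"> with its link, title and thumbnail."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for div in soup.find_all("div", class_="item"):
        # Only look at direct children, as in the answer above.
        p = div.find("p", class_="readmore", recursive=False)
        img = div.find("img", recursive=False)
        if p is None or p.a is None or img is None:
            continue  # Skip items that do not have the expected layout.
        movies.append({
            "href": p.a["href"],
            "title": img.get("alt", "").strip(),
            "thumbnail": img.get("src", "").strip(),
        })
    return movies


for movie in extract_movies(SAMPLE_HTML):
    print(movie["href"], movie["title"], movie["thumbnail"])
```

Collecting the values into a list instead of printing them inside the loop makes it easier to reuse them later, for example to write them to a file.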
Answer 1 (score: 0)
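I usually use PyQuery for this kind of scraping; it is clean and concise, and you can use jQuery selectors directly. For example, to see your name and reputation, I would just write something like this:
from pyquery import PyQuery as pq
# Fetch and parse the page at the given URL.
d = pq(url='http://stackoverflow.com/users/1234402/gbzygil')
# jQuery-style selectors for the display name and the reputation.
p = d('#user-displayname')
t = d('#user-panel-reputation div h1 a span')
print(p.html())
print(t.html())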
So unless you cannot switch away from Beautiful Soup, I strongly recommend switching to PyQuery or to some library that supports XPath.
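To give a rough idea of what the XPath route looks like, here is a minimal sketch that uses the limited XPath subset in Python's standard xml.etree.ElementTree rather than PyQuery. It only works here because the trimmed fragment below is well-formed XML; real-world HTML usually is not, so a dedicated HTML library with XPath support would be needed in practice. The wrapper element with id="listing" and the function name extract_with_xpath are hypothetical, added only for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup inside a hypothetical wrapper.
SAMPLE_XML = """
<div id="listing">
  <div class="item column-1">
    <h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Buddy</a></h2>
    <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
    <p class="readmore"><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Read more...</a></p>
  </div>
  <div class="item column-2">
    <h2><a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">Pigman</a></h2>
    <img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
    <p class="readmore"><a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">Read more...</a></p>
  </div>
</div>
"""


def extract_with_xpath(xml_text):
    """Return (href, alt, src) tuples, one per item div."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("./div"):
        # ElementTree's XPath subset has no contains(), so the class
        # list is checked in Python instead.
        if "item" not in item.get("class", "").split():
            continue
        link = item.find("./p[@class='readmore']/a")
        img = item.find("./img[@width='110']")
        if link is None or img is None:
            continue  # Skip items that do not have the expected layout.
        rows.append((link.get("href"), img.get("alt", "").strip(), img.get("src")))
    return rows


for href, alt, src in extract_with_xpath(SAMPLE_XML):
    print(href, alt, src)
```

The path expressions mirror the structure of the page directly, which is the main appeal of selector-based or XPath-based extraction.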