Getting an error extracting image 'src' with Beautiful Soup

Time: 2014-08-12 11:19:15

Tags: python html web-scraping beautifulsoup

I am having trouble extracting the image src using Python 2.7 and beautifulsoup4 (4.2.1).

The part of the HTML I am interested in is:

<div class="trb_embed_media ">  <figure imgratio="16x9" imgwidth="750" imgheight="450" data-role="imgsize_item" class="trb_embed_imageContainer_figure"><img src="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/750/16x9" data-height="450" data-width="750" data-ratio="16x9" itemprop="image" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" alt="Buzzfeed" class="trb_embed_imageContainer_img" title="Buzzfeed" data-content-naturalwidth="2048" data-content-naturalheight="1365"></figure><div class="trb_embed_related" data-role="lightbox_metadata">      <span class="trb_embed_related_title">Buzzfeed</span>  <div class="trb_embed_related_credit">Jay L. Clendenin / Los Angeles Times</div>  <div class="trb_embed_related_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.</div>  <div class="trb_embed_related_credit_and_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)</div></div>    </div>

The code I am running is:

image_section = soup.find(class_ = "trb_embed_media")
print image_section
print "================="
img = image_section.find('img')['src']
print img

The output of line 2 of the code above is:

<div class="trb_embed_media ">
<figure class="trb_embed_imageContainer_figure" data-role=" delayload  delayload_done imgsize_item">
<img alt="Buzzfeed" class="trb_embed_imageContainer_img" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" data-content-naturalheight="1365" data-content-naturalwidth="2048" itemprop="image" title="Buzzfeed"/>
</figure>
<div class="trb_embed_related" data-role="lightbox_metadata">
<span class="trb_embed_related_title">
         Buzzfeed
</span>
<div class="trb_embed_related_credit">
         Jay L. Clendenin / Los Angeles Times
</div>
<div class="trb_embed_related_caption">
         Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.
</div>
<div class="trb_embed_related_credit_and_caption">
         Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)
</div>
</div>
</div>

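As you can see, the img tag above is missing its src attribute, even though it is present in the original HTML source. What am I missing here? Please advise.

1 answer:

Answer 0: (score: 1)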

That is because the original HTML source does not contain the src attribute; JavaScript adds it after the page has loaded.

The JavaScript code probably uses the data-baseurl attribute to generate the src URL, appending the size and ratio.
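Since the fetched markup has no src, indexing the tag with ['src'] raises a KeyError. A minimal sketch of a more defensive lookup follows. The helper name image_source is hypothetical, and it assumes it is handed a plain mapping of attributes, such as the attrs dict of a Beautiful Soup tag:

```python
def image_source(attrs):
    """Return a usable image URL from a tag's attributes, or None.

    `attrs` is assumed to be a mapping of attribute names to values,
    for example the `attrs` dict of a Beautiful Soup tag.
    """
    # Prefer a real src if the server happened to send one.
    src = attrs.get('src')
    if src:
        return src
    # Otherwise fall back to the base URL the page's JavaScript builds on.
    return attrs.get('data-baseurl')


# Attributes as they appear in the parsed output: no src, but a data-baseurl.
parsed_attrs = {
    'alt': 'Buzzfeed',
    'data-baseurl': 'http://www.trbimg.com/img-53e8dc49/turbine/'
                    'lat-buzzfeed-la0011761750-20131007',
}
print(image_source(parsed_attrs))
```

With a real tag this would be called as image_source(img.attrs); using .get() instead of indexing means a missing attribute yields None rather than an exception.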

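The delayload and imgsize_item values in the data-role attribute of the <figure> tag are a hint as well. You will have to calculate your own aspect ratio from the given data-content-naturalheight and data-content-naturalwidth attributes and go from there.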

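If you resize the page, you will see that the site uses responsive design; different image sizes are loaded depending on the available horizontal space.

A quick experiment shows that you can put any size and any aspect ratio in the URL, and the image is generated automatically based on those.

If you want the full-size image, you can just load the base URL; it returns the unscaled image.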
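The URL shape described above can be sketched as a small helper. The function name scaled_image_url is hypothetical, and the base/width/ratio pattern is an assumption inferred from the src in the question rather than a documented API:

```python
def scaled_image_url(base_url, width=None, ratio=None):
    """Build an image URL of the form base/width/ratio.

    With no width or ratio, return the base URL unchanged, which the
    answer reports as serving the unscaled image.
    """
    if width is None or ratio is None:
        return base_url
    return '{}/{}/{}'.format(base_url, int(width), ratio)


base = ('http://www.trbimg.com/img-53e8dc49/turbine/'
        'lat-buzzfeed-la0011761750-20131007')
print(scaled_image_url(base, 750, '16x9'))  # same shape as the src in the question
print(scaled_image_url(base))               # base URL only
```

Passing 750 and '16x9' reproduces the src value shown in the browser's copy of the HTML in the question.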

The JavaScript used to generate the size and ratio picks between the 16x9, 1x1 and 9x16 aspect ratios, based on the ratio between the height and width in the data attributes:

img = soup.select('div.trb_embed_media img')[0]
width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
ratio = width / float(height)
ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
img_url = '{}/{}/{}'.format(img['data-baseurl'], width, ratio)

For the example, this produces http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9, a valid image:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.latimes.com/business/la-fi-tn-buzzfeed-deal-20140811-story.html')
>>> soup = BeautifulSoup(r.content)
>>> img = soup.select('div.trb_embed_media img')[0]
>>> width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
>>> ratio = width / float(height)
>>> ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
>>> '{}/{}/{}'.format(img['data-baseurl'], width, ratio)
'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9'
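To check the ratio choice without fetching the page, the same thresholds can be isolated in a small function. This is only a sketch; the name pick_ratio is hypothetical, and the 0.9 and 1.1 cut-offs are copied from the snippet above rather than taken from the site's own script:

```python
def pick_ratio(width, height):
    """Choose '1x1', '16x9' or '9x16' from pixel dimensions."""
    # float() keeps the division exact under Python 2 as well as Python 3.
    ratio = width / float(height)
    if 0.9 <= ratio <= 1.1:
        return '1x1'    # roughly square
    if ratio > 1.1:
        return '16x9'   # landscape
    return '9x16'       # portrait


print(pick_ratio(2048, 1365))  # 16x9
print(pick_ratio(100, 100))    # 1x1
print(pick_ratio(450, 750))    # 9x16
```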