I'm having trouble extracting an image's src using Python 2.7 and beautifulsoup4 (4.2.1).
The part of the HTML I'm interested in is:
<div class="trb_embed_media "> <figure imgratio="16x9" imgwidth="750" imgheight="450" data-role="imgsize_item" class="trb_embed_imageContainer_figure"><img src="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/750/16x9" data-height="450" data-width="750" data-ratio="16x9" itemprop="image" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" alt="Buzzfeed" class="trb_embed_imageContainer_img" title="Buzzfeed" data-content-naturalwidth="2048" data-content-naturalheight="1365"></figure><div class="trb_embed_related" data-role="lightbox_metadata"> <span class="trb_embed_related_title">Buzzfeed</span> <div class="trb_embed_related_credit">Jay L. Clendenin / Los Angeles Times</div> <div class="trb_embed_related_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.</div> <div class="trb_embed_related_credit_and_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)</div></div> </div>
The code I'm running is:
image_section = soup.find(class_ = "trb_embed_media")
print image_section
print "================="
img = image_section.find('img')['src']
print img
The output of line 2 of the code above (print image_section) is:
<div class="trb_embed_media ">
<figure class="trb_embed_imageContainer_figure" data-role=" delayload delayload_done imgsize_item">
<img alt="Buzzfeed" class="trb_embed_imageContainer_img" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" data-content-naturalheight="1365" data-content-naturalwidth="2048" itemprop="image" title="Buzzfeed"/>
</figure>
<div class="trb_embed_related" data-role="lightbox_metadata">
<span class="trb_embed_related_title">
Buzzfeed
</span>
<div class="trb_embed_related_credit">
Jay L. Clendenin / Los Angeles Times
</div>
<div class="trb_embed_related_caption">
Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.
</div>
<div class="trb_embed_related_credit_and_caption">
Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)
</div>
</div>
</div>
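As you can see, the img tag in this output is missing the src attribute, even though the attribute is present in the HTML source I see in the browser. What am I missing here? Please advise.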
Answer 0 (score: 1)
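That's because the original HTML source does not contain the src attribute; JavaScript adds it after the page has loaded.
The JavaScript code presumably uses the data-baseurl attribute to generate the src URL, appending a size and a ratio.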
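The delayload and imgsize_item values in the data-role attribute of the parent <figure> tag are a hint too. You'll have to calculate your own aspect ratio from the given data-content-naturalheight and data-content-naturalwidth attributes and go from there.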
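To make the point concrete, here is a minimal sketch that inspects the <img> tag as the server delivers it (copied from the parsed output in the question). It assumes Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages; the ImgAttrCollector class is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser

# The <img> tag as served, taken from the parsed output in the question.
SERVED_IMG = (
    '<img alt="Buzzfeed" class="trb_embed_imageContainer_img" '
    'data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/'
    'lat-buzzfeed-la0011761750-20131007" '
    'data-content-naturalheight="1365" data-content-naturalwidth="2048" '
    'itemprop="image" title="Buzzfeed"/>'
)


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: collect the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags are also routed here by the default handler.
        if tag == 'img':
            self.images.append(dict(attrs))


collector = ImgAttrCollector()
collector.feed(SERVED_IMG)
collector.close()
attrs = collector.images[0]

print('src' in attrs)         # whether the served markup carries a src
print(attrs['data-baseurl'])  # the base URL the page script builds on
```

With BeautifulSoup the same check is img.get('src'), which returns None for a missing attribute, whereas img['src'] raises a KeyError.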
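If you resize the page you'll see that the site uses a responsive design; different image sizes are loaded depending on the horizontal space available.
Quick experimentation shows that you can fill in any size in the URL, as well as any aspect ratio, and the image is generated automatically from those values.
If you want the full-size image, you can simply load the base URL; it returns the unscaled image.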
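The URL layout described above can be sketched as a small helper. The function name build_image_url is hypothetical, and the base/width/ratio pattern is inferred from the src shown in the question; whether the server accepts a given size or ratio rests on the quick experiment mentioned above, not on any documented API.

```python
def build_image_url(base_url, width=None, ratio=None):
    """Return a scaled image URL, or the bare base URL for the unscaled image.

    Assumes the <base>/<width>/<ratio> layout seen in the question's src.
    """
    if width is None or ratio is None:
        # No size requested: the base URL alone returns the full-size image.
        return base_url
    return '{}/{}/{}'.format(base_url.rstrip('/'), int(width), ratio)


BASE = ('http://www.trbimg.com/img-53e8dc49/turbine/'
        'lat-buzzfeed-la0011761750-20131007')

print(build_image_url(BASE, 750, '16x9'))  # same shape as the src in the browser
print(build_image_url(BASE))               # unscaled image
```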
The JavaScript used to generate the size and ratio picks between the 16x9, 1x1 and 9x16 aspect ratios, based on the ratio between the width and height given in the data attributes:
img = soup.select('div.trb_embed_media img')[0]
width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
ratio = width / float(height)
ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
img_url = '{}/{}/{}'.format(img['data-baseurl'], width, ratio)
For your example this generates http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9, a valid image:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.latimes.com/business/la-fi-tn-buzzfeed-deal-20140811-story.html')
>>> soup = BeautifulSoup(r.content)
>>> img = soup.select('div.trb_embed_media img')[0]
>>> width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
>>> ratio = width / float(height)
>>> ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
>>> '{}/{}/{}'.format(img['data-baseurl'], width, ratio)
'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9'
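If you need this for many pages, the same logic can be wrapped so that it tolerates missing attributes. This is only a sketch: pick_ratio and image_url_from_attrs are hypothetical names, the 0.9 to 1.1 thresholds are the ones used above, and falling back to the base URL when the dimensions are absent is an assumption about what is useful, not something the site documents.

```python
def pick_ratio(width, height):
    """Map natural dimensions to one of the three ratio labels used above."""
    ratio = width / float(height)
    if 0.9 <= ratio <= 1.1:
        return '1x1'
    return '16x9' if ratio > 1.1 else '9x16'


def image_url_from_attrs(attrs):
    """Build an image URL from a dict of <img> attributes.

    Returns None without data-baseurl, and the bare base URL (the unscaled
    image) when the natural dimensions are missing or unusable.
    """
    base = attrs.get('data-baseurl')
    if base is None:
        return None
    try:
        width = int(attrs['data-content-naturalwidth'])
        height = int(attrs['data-content-naturalheight'])
    except (KeyError, ValueError):
        return base
    if width <= 0 or height <= 0:
        return base
    return '{}/{}/{}'.format(base, width, pick_ratio(width, height))


sample = {
    'data-baseurl': 'http://www.trbimg.com/img-53e8dc49/turbine/'
                    'lat-buzzfeed-la0011761750-20131007',
    'data-content-naturalwidth': '2048',
    'data-content-naturalheight': '1365',
}
print(image_url_from_attrs(sample))
```

With BeautifulSoup you can pass img.attrs, which is a plain dict of the tag's attributes.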