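What is the effect of passing headers in a request?

Date: 2017-01-12 10:52:53

Tags: image python-2.7 web-scraping request

I would like to know what difference it makes to pass headers to requests.get, i.e. the difference between requests.get(url, headers=headers) and requests.get(url).

I have these two pieces of code: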

from lxml import html
from lxml import etree
import requests
import re

url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"

page = requests.get(url)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print 'type: ',type(image_source[0])
print image_source[0]

The output of this one is the image URL, as you would expect. But this one:

from lxml import html
from lxml import etree
import requests
import re

url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get(url, headers=headers)

tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print 'type: ',type(image_source[0])
print image_source[0]

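prints a string that starts with data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAoHBwgHBgoIC. I guess this is the actual image data rather than a link to the image. Any idea how I can keep it in URL form? More generally, does the presence of headers affect the response we get?

Thanks

1 Answer:

Answer 0: (score: 1)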

Save the response from the first snippet to an HTML file and open it in a browser.
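A minimal sketch of that save step, assuming Python with the requests package installed. The helper names save_html and fetch_and_save and the output file name are made up for illustration and do not come from the original answer:

```python
import io


def save_html(text, path):
    """Write response text to `path` as UTF-8 so a browser can open it."""
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path


def fetch_and_save(url, path, headers=None):
    """Fetch `url` (optionally with headers), save the body, return the status code."""
    import requests  # imported lazily so save_html works without requests

    page = requests.get(url, headers=headers)
    save_html(page.text, path)
    return page.status_code
```

Calling fetch_and_save(url, "no_headers.html") and then opening no_headers.html in a browser shows the page that the server actually returned to the script, which may differ from what a normal browser session gets.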

As you can see, Amazon blocks the request when you do not send headers.
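One way to see why the two calls are treated differently is to look at the User-Agent that requests would send in each case. The sketch below only prepares the requests and never contacts the server; the helper name prepared_user_agent is an assumption made for this illustration:

```python
import requests

URL = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
}


def prepared_user_agent(url, headers=None):
    """Return the User-Agent requests would send, without any network call."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=headers)
    prepared = session.prepare_request(request)  # merges in the default headers
    return prepared.headers.get("User-Agent")


default_ua = prepared_user_agent(URL)
custom_ua = prepared_user_agent(URL, BROWSER_HEADERS)
print(default_ua)  # the library default, which identifies itself as python-requests
print(custom_ua)   # the browser-like string passed in explicitly
```

A server is free to return different content, or a block page, depending on this header, so the presence of headers can indeed change the response.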

Use this XPath:

XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@data-old-hires'

Output:

type:  <class 'lxml.etree._ElementStringResult'>
http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg

This is the raw HTML:

<img alt=".." src="&#10;data:image/webp;base64,UklGRuYIAABXRUJQVlA4INoIAACQQQCdASosAcsAPrFWpEqkIqQhIxN6gIgWCek6r4bUf/..." 
data-old-hires="http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg"

The image URL is in the data-old-hires attribute.
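As a rough sketch of how this could be made a little more robust, the function below prefers data-old-hires and falls back to src only when src is not an inline data: URI. The sample markup, the example.com URL, and the function name image_url are all invented for this illustration; the fallback rule is an assumption, not part of the original answer:

```python
from lxml import html

# Made-up markup that mimics the structure shown above (not real Amazon HTML).
SAMPLE = """
<html><body>
<div id="main-image-container">
  <img alt=".." src="&#10;data:image/webp;base64,AAAA"
       data-old-hires="http://example.com/images/sample.jpg">
</div>
</body></html>
"""


def image_url(tree):
    """Return a real image URL from the main image container, or None."""
    for img in tree.xpath('//*[@id="main-image-container"]//img'):
        hires = (img.get("data-old-hires") or "").strip()
        if hires:
            return hires
        # strip() also removes the leading newline seen in the src attribute
        src = (img.get("src") or "").strip()
        if src and not src.startswith("data:"):
            return src
    return None


tree = html.fromstring(SAMPLE)
print(image_url(tree))  # http://example.com/images/sample.jpg
```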