This is a follow-up to a question I saw earlier today. In that question, a user asked about downloading a pdf from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009

I thought the two download functions below would give the same result, but the urllib2 version downloads some html with a script tag referencing a pdf loader, while the requests version downloads the real pdf. Can someone explain the difference in behavior?
import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests smart enough to wait for a dynamic website to render before downloading?
Edit
Following @cwallenpoole's idea, I compared the headers of the two requests and tried swapping headers from the requests request into the urllib2 request. The magic header is Cookie; the following functions write identical files for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    # Request (capital R) is the class; the Cookie value comes from Edit 2 below
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
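For reference, both functions can now be exercised on the example URL like this (output file names are arbitrary; the two files should be byte-identical):

    url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'
    get_pdf_urllib2(url, 'ex_urllib2.pdf')
    get_pdf_requests(url, 'ex_requests.pdf')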
Next question: where does requests get that Cookie? Does requests make multiple trips to the server?
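One way to check from the requests side, before dropping down to urllib2's debug output: requests keeps the intermediate redirect responses on resp.history, so a short sketch like this (assuming the server still responds as in the transcript below) should show the 302s and their Set-Cookie headers:

    import requests

    resp = requests.get('http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009')
    # resp.history holds the redirect responses that requests followed automatically
    for r in resp.history:
        print(r.status_code, r.url, r.headers.get('Set-Cookie'))
    print(resp.status_code, resp.url)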
编辑2 Cookie来自重定向标题:
>>> req1 = urllib2.Request('http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009')
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
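This transcript also suggests a fix that avoids hard-coding the cookie: give urllib2 a cookie jar so the Set-Cookie from the first 302 is stored and replayed on the redirected request, which is what requests does by default. A minimal sketch (Python 2; the function name is mine):

    import cookielib
    import urllib2

    def get_pdf_urllib2_with_cookies(url, outfile='ex.pdf'):
        # the CookieJar captures Set-Cookie from the 302 and sends the cookie
        # back on the follow-up request, mirroring requests' default behavior
        jar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
        resp = opener.open(url)
        with open(outfile, 'wb') as f:
            f.write(resp.read())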
Answer 0 (score: 2)
I'd bet it's a problem with the User-Agent header (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same thing you report with urllib2). This is the part of the request header that lets a website know what kind of program/user/whatever is accessing the site (not the library, the HTTP request).

By default, it looks like urllib2 uses: Python-urllib/2.1
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}

If you're working on a Mac, I'd bet the site is reading Darwin/13.1.0 or similar and then serving you the macOS-appropriate content. Otherwise, it probably tries to direct you to some default alternative content (or prevents you from scraping that URL).
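A quick way to test this hypothesis (a sketch; the User-Agent value below is just an example): send the same request with a different User-Agent and compare the Content-Type that comes back. As the question's edits show, the decisive header for this URL turned out to be Cookie, but the same swap is how one would confirm or rule out the User-Agent:

    import urllib2
    import requests

    url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'

    # what requests would send by default
    print(requests.utils.default_user_agent())

    # urllib2 with requests' style of User-Agent (example value)
    req = urllib2.Request(url, headers={'User-Agent': 'python-requests/2.12.4'})
    resp = urllib2.urlopen(req)
    print(resp.info().getheader('Content-Type'))  # 'application/pdf' vs 'text/html'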