Downloading files with urllib2 vs. requests: why are these outputs different?

Time: 2016-12-14 16:43:35

Tags: python python-requests urllib2

This is a follow-up to a question I saw earlier today, in which a user asked about problems downloading a PDF from this URL:

http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009

I assumed the two download functions below would give the same result, but the urllib2 version downloads some HTML with a script tag referencing a PDF loader, while the requests version downloads the real PDF. Can someone explain the difference in behavior?

import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    # Fetch the URL with urllib2 and write the raw response body to disk
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    # Same idea with requests; resp.content holds the response body as bytes
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)

Is requests smart enough to wait for a dynamic website to render before downloading?

Edit: Following @cwallenpoole's idea, I compared the headers and tried swapping the headers from the requests request into the urllib2 request. The magic header turned out to be Cookie; the following functions write identical files for the example URL.

def get_pdf_urllib2(url, outfile='ex.pdf'):
    # Attach the cookie that requests apparently acquires on its own
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)

Next question: where does requests get that cookie? Does requests make multiple trips to the server?
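For reference, the redirect chain can also be inspected from the requests side. Here is a minimal sketch using only standard requests attributes (resp.history lists the intermediate 3xx responses requests followed, and resp.request.headers shows the headers of the final request):

import requests

url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'
resp = requests.get(url)

# Each intermediate 3xx response that requests followed automatically
for hop in resp.history:
    print(hop.status_code, hop.url, hop.headers.get('Set-Cookie'))

# Headers of the final request, including any cookie picked up along the way
print(resp.request.headers.get('Cookie'))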

Edit 2: The cookie comes from a header on the redirect:

>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
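So plain urllib2 drops the Set-Cookie from the first 302 and ends up at /action/cookieAbsent, while requests replays the cookie on the redirected request. As a sketch (the helper name is mine, not part of the original functions), urllib2 can be given the same behavior with a cookie-aware opener from the standard library:

import cookielib
import urllib2

def get_pdf_urllib2_cookies(url, outfile='ex.pdf'):
    # HTTPCookieProcessor stores Set-Cookie headers from each response and
    # replays them on follow-up requests, as requests does by default
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    resp = opener.open(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())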

1 Answer:

Answer 0 (score: 2):

I'll bet there's an issue with the User-Agent header (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same thing you report with urllib2). It's part of the request headers and lets a site know what type of program/user/whatever is accessing it (the header belongs to the HTTP request, not the library).

By default, it looks like urllib2 uses: Python-urllib/2.1
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}

If you're on a Mac, I'll bet the site is reading Darwin/13.1.0 or something similar and then serving you macOS-appropriate content. Failing that, it's probably trying to direct you to some default alternative content (or to prevent you from scraping that URL).
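As an aside, both defaults are easy to check, and urllib2's User-Agent can be overridden per request if a site serves different content based on it. A small sketch (requests.utils.default_user_agent() is a real requests helper; the Mozilla/5.0 string below is just a placeholder value):

import urllib2
import requests

# What each library identifies itself as by default
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.12.3'
# urllib2's default is visible in the debug log above: 'Python-urllib/2.7'

# Overriding the User-Agent on a urllib2 request
req = urllib2.Request(
    'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009',
    headers={'User-Agent': 'Mozilla/5.0'},
)
resp = urllib2.urlopen(req)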