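Downloading files with Python (REST URL)

Asked: 2013-09-21 20:47:21

Tags: python http cookies request urllib

I am trying to write a script that will download a bunch of files from a website that uses REST URLs.

Here is the GET request: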

GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close

If the request is good, the server returns a 302 response, for example:

HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

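What I need the script to do is check whether the response is a 302. If it is not, the script should simply "pass"; if it is, it needs to parse the Location header shown here: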

location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6

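Once I have the Location value, I will have to issue another GET request to download the file. I also have to maintain the cookies for my session in order to download the file.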
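To illustrate only the string-handling part of that step, here is a minimal sketch that splits the Location value above into its host, file name and query parameters. It uses just the standard library; the helper name describe_location is made up for this example.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def describe_location(location):
    """Split a redirect target into host, file name and query parameters."""
    parts = urlsplit(location)
    return {
        "host": parts.netloc,
        "filename": posixpath.basename(parts.path),
        # parse_qs maps each parameter name to a list of its values
        "query": parse_qs(parts.query),
    }


# The Location value from the 302 response shown above
location = ("http://downloads.test.stuff.com/5774/stuff/picture.jpg"
            "?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6")
info = describe_location(location)
print(info["filename"])
```

The file name taken from the path could serve as the local name when saving the download.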

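Can someone point me in the right direction as to which library would be best for this? I have not been able to find out how to parse the 302 response and how to add a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.

Any help is much appreciated.

1 answer:

Answer 0 (score: 0)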

import urllib.request as ur
import urllib.error as ue

'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response body
(or up to the next amt bytes) as bytes, not str. urlopen() hands back raw bytes because
it has no way to automatically determine the encoding of the byte stream it receives
from the HTTP server.
'''

url = "http://www.example.org/images/{}.jpg"

dst = ""
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)
for x in arr:
    print(str(x) + "). ".ljust(4), end="")
    try:
        # urlopen() follows redirects (such as a 302) automatically and returns
        # an HTTPResponse; read() returns only the response body, as bytes.
        with ur.urlopen(url.format(x)) as hrio, open(dst + str(x) + ".jpg", "wb") as fh:
            fh.write(hrio.read())
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
    except ue.URLError as e:
        # Report the failure and carry on with the next file.
        print("\t[REQUEST INCOMPLETE]\t", end="")
        print("<Error ~ [{}]>".format(e))
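The loop above relies on urlopen() following redirects silently and sends no cookies. If you want to inspect the 302 yourself and pass the session cookie along, here is a rough sketch using only the standard library. The names NoRedirect, build_request, find_redirect and download are made up for this example, and it assumes the download host accepts the same Cookie header as the first request, which the question suggests but does not confirm.

```python
import urllib.error as ue
import urllib.request as ur


class NoRedirect(ur.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 302 can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises HTTPError.
        return None


def build_request(url, cookie):
    """Build a GET request that carries the session cookie string."""
    return ur.Request(url, headers={"Cookie": cookie})


def find_redirect(url, cookie):
    """Return the Location header if the server answers 302, else None."""
    opener = ur.build_opener(NoRedirect)
    try:
        opener.open(build_request(url, cookie)).close()
    except ue.HTTPError as e:
        if e.code == 302:
            return e.headers.get("Location")
    return None


def download(url, cookie, dest):
    """Fetch the redirect target into dest; return False when there is no 302."""
    location = find_redirect(url, cookie)
    if location is None:
        return False  # not a 302, so "pass"
    # Assumption: the download host wants the same cookie as the first request.
    with ur.urlopen(build_request(location, cookie)) as resp, open(dest, "wb") as fh:
        fh.write(resp.read())
    return True
```

You would call it as download("http://testServer.com/test/download/id/5774/format/testTitle", cookie, "picture.jpg"), where cookie is the Cookie header value copied from the browser. If the server sets new cookies along the way, http.cookiejar.CookieJar together with urllib.request.HTTPCookieProcessor may be a better fit than a fixed header string.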