urllib.urlretrieve and urllib2 corrupt files

Time: 2014-10-22 16:08:27

Tags: python python-2.7 urllib2

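I've hit a frustrating stumbling block in an XBMC extension I'm working on.

In short, if I download the file with Firefox, IE, etc., the file is valid and works fine, but if I use urllib or urllib2 in Python, the file is corrupted.

The file in question is: http://re.zoink.it/00b007c479 (007960DAD4832AC714C465E207055F2BE18CAFF6.torrent)

Here are the SHA1 checksums: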

PY: 2d1528151c62526742ce470a01362ab8ea71e0a7
IE: 60a93c309cae84a984bc42820e6741e4f702dc21

The checksums do not match (the Python download is corrupted; the IE/FF downloads are not).

Here is the function I wrote to perform this task:

def DownloadFile(uri, localpath):
  '''Downloads a file from the specified Uri to the local system.

  Keyword arguments:
  uri -- the remote uri to the resource to download
  localpath -- the local path to save the downloaded resource 
  '''
  remotefile = urllib2.urlopen(uri)
  # Get the filename from the content-disposition header
  cdHeader = remotefile.info()['content-disposition']

  # typical header looks like: 'attachment;   filename="Boardwalk.Empire.S05E00.The.Final.Shot.720p.HDTV.x264-BATV.[eztv].torrent"'
  # use RegEx to slice out the part we want (filename)
  filename = re.findall(r'filename="(.*?)"', cdHeader)[0]
  filepath = os.path.join(localpath, filename)
  if os.path.exists(filepath):
    return

  data = remotefile.read()
  with open(filepath, "wb") as code:
    code.write(data) # this is resulting in a corrupted file

  #this is resulting in a corrupted file as well
  #urllib.urlretrieve(uri, filepath)

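What am I doing wrong? It's hit or miss: some sources download correctly, while others always produce a corrupted file when I download them with Python. They all seem to download correctly if I use a web browser.

Thanks in advance...

1 answer:

Answer 0: (score: 3)

The response is gzip-encoded: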

>>> import urllib2
>>> remotefile = urllib2.urlopen('http://re.zoink.it/00b007c479')
>>> remotefile.info()['content-encoding']
'gzip'

Your browser decodes this for you, but urllib2 does not. You need to do it yourself first:

import zlib

data = remotefile.read()
if remotefile.info().get('content-encoding') == 'gzip':
    data = zlib.decompress(data, zlib.MAX_WBITS + 16)
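
The check above could be wrapped in a small helper so that every download path uses it. This is only a minimal sketch: the name decode_body is hypothetical (it is not part of urllib2 or zlib), and it handles gzip only, passing any other encoding through unchanged.

```python
import zlib


def decode_body(data, content_encoding):
    """Return data with gzip content encoding removed, if present.

    decode_body is a hypothetical helper name; only gzip is handled here.
    """
    if (content_encoding or '').strip().lower() == 'gzip':
        # MAX_WBITS + 16 tells zlib to expect a gzip header and trailer.
        return zlib.decompress(data, zlib.MAX_WBITS + 16)
    # No encoding, or one this sketch does not know: return the bytes as-is.
    return data
```

With the response object from the question, it would be called as decode_body(remotefile.read(), remotefile.info().get('content-encoding')).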

Once decompressed, the data matches the SHA1 checksum of your browser download exactly:

>>> import zlib
>>> import hashlib
>>> data = remotefile.read()
>>> hashlib.sha1(data).hexdigest()
'2d1528151c62526742ce470a01362ab8ea71e0a7'
>>> hashlib.sha1(zlib.decompress(data, zlib.MAX_WBITS + 16)).hexdigest()
'60a93c309cae84a984bc42820e6741e4f702dc21'

You may want to switch to the requests module, which handles content encoding transparently:

>>> import requests
>>> response = requests.get('http://re.zoink.it/00b007c479')
>>> hashlib.sha1(response.content).hexdigest()
'60a93c309cae84a984bc42820e6741e4f702dc21'
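
If adding a dependency is not an option, the zlib fix can be folded back into the question's function using only the standard library. The following is a rough sketch, not the asker's final code: the names download_file and opener are hypothetical, the opener parameter exists only so the function can be tried without network access, and the os.path.basename call is an added precaution that the original function does not have.

```python
import os
import re
import zlib

try:
    from urllib2 import urlopen  # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen  # Python 3


def download_file(uri, localpath, opener=urlopen):
    """Download uri into the directory localpath and return the saved path.

    The opener parameter is a hypothetical addition for testing offline.
    """
    remotefile = opener(uri)
    headers = remotefile.info()

    # Take the filename from the Content-Disposition header, as the question does.
    match = re.search(r'filename="(.*?)"', headers.get('content-disposition') or '')
    if match is None:
        raise ValueError('no filename in Content-Disposition header')
    # basename() drops any directory components sent by the server.
    filepath = os.path.join(localpath, os.path.basename(match.group(1)))
    if os.path.exists(filepath):
        return filepath

    data = remotefile.read()
    if (headers.get('content-encoding') or '').lower() == 'gzip':
        # The server sent a gzip stream; undo it before writing to disk.
        data = zlib.decompress(data, zlib.MAX_WBITS + 16)

    with open(filepath, 'wb') as out:
        out.write(data)
    return filepath
```

Unlike the original, this version returns the file path in both branches and raises ValueError when the header carries no filename, which is a design choice of the sketch rather than a requirement.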