Streaming (chunk-by-chunk) reads with Python urllib2.urlopen return only partial results

Date: 2019-07-19 04:28:23

Tags: python streaming urllib2 chunks

In the top-voted answer to the following post I found a way to do streaming reads with Python:

Stream large binary files with urllib2 to file

But when I do some time-consuming work after reading each chunk, I only receive the first part of the data:

from urllib2 import urlopen
from urllib2 import HTTPError

import sys
import time

CHUNK = 1024 * 1024 * 16


try:
    response = urlopen("XXX_domain/XXX_file_in_net.gz")
except HTTPError as e:
    print e
    sys.exit(1)


while True:
    chunk = response.read(CHUNK)

    print 'CHUNK:', len(chunk)

    # some time-consuming work, just as an example
    time.sleep(60)

    if not chunk:
        break

Without the sleep, the output is correct (I verified that the chunk sizes add up to the actual file size):

    CHUNK: 16777216
    CHUNK: 16777216
    CHUNK: 6888014
    CHUNK: 0

With the sleep:

    CHUNK: 16777216
    CHUNK: 766580
    CHUNK: 0

I then decompressed the chunks and found that only the leading part of the gz file had been read.
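A separate point worth knowing when debugging output like the above: `read(n)` on a socket-backed file object may legitimately return fewer than `n` bytes without being at end of stream, so a short chunk by itself does not always mean the transfer is over. A minimal sketch of a helper that keeps reading until it has exactly `n` bytes or hits EOF (`read_exact` is a hypothetical name, demonstrated here on an in-memory stream rather than a real HTTP response):

```python
import io


def read_exact(stream, n):
    # Keep calling read() until we have n bytes or the stream is exhausted.
    # A single read(n) on a socket-backed file object may return fewer bytes.
    parts = []
    remaining = n
    while remaining > 0:
        piece = stream.read(remaining)
        if not piece:  # EOF (or a closed connection)
            break
        parts.append(piece)
        remaining -= len(piece)
    return b''.join(parts)


data = io.BytesIO(b'abcdefghij')
print(len(read_exact(data, 4)))    # 4
print(len(read_exact(data, 100)))  # 6 -- only six bytes remained
```

Note this helper cannot recover bytes the server never sent; if the connection was closed early, resuming with a Range request (as in the answer below) is still needed.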

1 Answer:

Answer 0 (score: 1)

If the server closes the connection before all the data has been sent, try supporting resumable (ranged) downloads:

    import socket

    from urllib2 import Request
    from urllib2 import urlopen
    from urllib2 import HTTPError

    CHUNK = 16 * 1024 * 1024

    content_size = 0
    handled_size = 0


    def download(the_url):
        global content_size, handled_size

        try:
            # ask for the whole file, starting at byte 0
            request = Request(the_url, headers={'Range': 'bytes=0-'})
            response = urlopen(request, timeout=60)
        except HTTPError as e:
            print e
            return 'Connection Error'

        header_dict = dict(response.info())
        print header_dict

        if 'content-length' in header_dict:
            content_size = int(header_dict['content-length'])

        while True:
            while True:
                try:
                    chunk = response.read(CHUNK)
                except socket.timeout:
                    print 'time_out'
                    break
                if not chunk:
                    break

                DoSomeTimeConsumingJob()

                handled_size = handled_size + len(chunk)

            if handled_size == content_size and content_size != 0:
                break
            else:
                try:
                    # resume from the first byte we have not handled yet
                    request = Request(the_url, headers={'Range': 'bytes=' + str(handled_size) + '-'})
                    response = urlopen(request, timeout=60)
                except HTTPError as e:
                    print e

        response.close()
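The resume logic above boils down to: track how many bytes have been handled, and whenever the connection drops, re-request with `Range: bytes=<handled>-`. A network-free sketch of that loop, using a hypothetical `make_flaky_source` that simulates a server closing the connection after a limited number of bytes per request:

```python
def range_header(handled):
    # Build the Range header for resuming at byte offset `handled`.
    return {'Range': 'bytes=' + str(handled) + '-'}


def make_flaky_source(data, max_per_request):
    # Simulates a server that sends at most `max_per_request` bytes
    # per connection before dropping it.
    def fetch(offset):
        return data[offset:offset + max_per_request]
    return fetch


def download_with_resume(fetch, total_size):
    handled = b''
    while len(handled) < total_size:
        # In the real code this is where a new Request would be issued
        # with headers=range_header(len(handled)).
        chunk = fetch(len(handled))
        if not chunk:
            break
        handled += chunk
    return handled


payload = b'x' * 100
fetch = make_flaky_source(payload, 33)
print(len(download_with_resume(fetch, len(payload))))  # 100
print(range_header(66))
```

This only works if the server honors Range requests (it should reply with HTTP 206 Partial Content); if it ignores the header and always restarts from byte 0, the resumed chunks would duplicate data already handled.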