In the upvoted answer to this post, I found a way to do streaming reads with Python:
Stream large binary files with urllib2 to file.
But when I do some time-consuming work after reading each chunk, I only receive the front part of the data, which is wrong.
from urllib2 import urlopen
from urllib2 import HTTPError
import sys
import time

CHUNK = 1024 * 1024 * 16

try:
    response = urlopen("XXX_domain/XXX_file_in_net.gz")
except HTTPError as e:
    print e
    sys.exit(1)

while True:
    chunk = response.read(CHUNK)
    print 'CHUNK:', len(chunk)
    # some time-consuming work, just as an example
    time.sleep(60)
    if not chunk:
        break
Without the sleep, the output is correct (I verified that the chunk sizes add up to the actual file size; a sketch of such a check appears after the outputs):
CHUNK: 16777216
CHUNK: 16777216
CHUNK: 6888014
CHUNK: 0
With the sleep:
CHUNK: 16777216
CHUNK: 766580
CHUNK: 0
I then decompressed the chunks and found that only the front part of the gz file had been read.
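For reference, a minimal sketch of the size check mentioned above, assuming the server sends a Content-Length header (the URL is the question's placeholder):

from urllib2 import urlopen

CHUNK = 1024 * 1024 * 16
response = urlopen("XXX_domain/XXX_file_in_net.gz")
expected = int(response.info().getheader('Content-Length', '0'))

total = 0
while True:
    chunk = response.read(CHUNK)
    if not chunk:
        break
    total += len(chunk)
print 'read %d of %d bytes' % (total, expected)  # a short count means the connection ended early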
Answer 0 (score: 1)
If the server closes the connection before all the data has been sent, try supporting resumable downloads via the HTTP Range header.
import socket
import sys
from urllib2 import Request, urlopen, HTTPError

the_url = "XXX_domain/XXX_file_in_net.gz"  # placeholder URL from the question
CHUNK = 16 * 1024 * 1024
content_size = 0
handled_size = 0

# Initial request: ask for the whole file starting at byte 0.
try:
    request = Request(the_url, headers={'Range': 'bytes=0-'})
    response = urlopen(request, timeout=60)
except HTTPError as e:
    print e
    sys.exit(1)  # was: return 'Connection Error'

header_dict = dict(response.info())
print header_dict
if 'content-length' in header_dict:
    content_size = int(header_dict['content-length'])

while True:
    # Read until this connection is exhausted or times out.
    while True:
        try:
            chunk = response.read(CHUNK)
        except socket.timeout:
            print 'time_out'
            break
        if not chunk:
            break
        DoSomeTimeConsumingJob()  # placeholder for the slow per-chunk work
        handled_size += len(chunk)
    if handled_size == content_size and content_size != 0:
        break  # everything received
    # Resume from the first byte we have not yet handled.
    try:
        request = Request(the_url, headers={'Range': 'bytes=' + str(handled_size) + '-'})
        response = urlopen(request, timeout=60)
    except HTTPError as e:
        print e
        break
response.close()
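One safeguard worth considering when resuming (a sketch, not part of the original answer): a server that honors the Range header replies with HTTP 206 Partial Content, while a plain 200 means it ignored the range and restarted from byte 0, so appending the new chunks would corrupt the result. A minimal check right after each resumed urlopen call could look like this:

if handled_size > 0 and response.getcode() != 206:
    # The server ignored 'Range'; appending from here is not safe.
    raise IOError('server does not support Range requests, cannot resume')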