I would like to use the Python Requests library to GET a file from a url and use it as a multipart-encoded file in a POST request. The catch is that the file could be very large (50MB-2GB) and I don't want to load it into memory. (Context here.)
Following the examples in the docs (multipart, stream down and stream up), I cooked up something like this:
with requests.get(big_file_url, stream=True) as f:
    requests.post(upload_url, files={'file': ('filename', f.content)})
But I'm not sure I'm doing it right. It is in fact throwing this error (excerpted from the traceback):
with requests.get(big_file_url, stream=True) as f:
AttributeError: __exit__
Any suggestions?
Answer 0 (score: 2)
As other answers have already pointed out, requests doesn't support POSTing multipart-encoded files without loading them into memory.
To upload a large file using multipart/form-data without loading it into memory, you could use poster:
#!/usr/bin/env python
import sys

from urllib2 import Request, urlopen

from poster.encode import multipart_encode  # $ pip install poster
from poster.streaminghttp import register_openers

register_openers()  # install openers globally

def report_progress(param, current, total):
    sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))

url = 'http://example.com/path/'
params = {'file': open(sys.argv[1], "rb"), 'name': 'upload test'}
response = urlopen(Request(url, *multipart_encode(params, cb=report_progress)))
print response.read()
It could be adapted to accept a GET response object instead of a local file:
import posixpath
import sys
from urllib import unquote
from urllib2 import Request, urlopen
from urlparse import urlsplit

from poster.encode import MultipartParam, multipart_encode  # pip install poster
from poster.streaminghttp import register_openers

register_openers()  # install openers globally

class MultipartParamNoReset(MultipartParam):
    def reset(self):
        pass  # do nothing (to allow self.fileobj without seek() method)

get_url = 'http://example.com/bigfile'
post_url = 'http://example.com/path/'

get_response = urlopen(get_url)
param = MultipartParamNoReset(
    name='file',
    filename=posixpath.basename(unquote(urlsplit(get_url).path)),  # XXX \ bslash
    filetype=get_response.headers['Content-Type'],
    filesize=int(get_response.headers['Content-Length']),
    fileobj=get_response)
params = [('name', 'upload test'), param]
datagen, headers = multipart_encode(params, cb=report_progress)  # report_progress() from the snippet above
post_response = urlopen(Request(post_url, datagen, headers))
print post_response.read()
This solution requires a valid Content-Length header (a known file size) in the GET response. If the file size is unknown, chunked transfer encoding can be used to upload the multipart/form-data content. A similar solution could be implemented using the urllib3.filepost module that ships with requests, e.g., based on @AdrienF's answer, without using poster.
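For illustration, here is a rough sketch of that chunked-transfer idea using only requests (the URLs, the field/file names and the multipart_body helper are placeholders of mine, not part of the original answer): a generator yields the multipart preamble, then the GET body chunk by chunk, then the closing boundary. Because data= receives a generator, requests sends the request with Transfer-Encoding: chunked, so no Content-Length is needed.

import uuid
import requests

get_url = 'http://example.com/bigfile'   # placeholder URLs, as in the code above
post_url = 'http://example.com/path/'

def multipart_body(resp, field_name, filename, content_type, boundary):
    # multipart preamble: opening boundary plus the part headers
    yield ('--%s\r\n'
           'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
           'Content-Type: %s\r\n\r\n'
           % (boundary, field_name, filename, content_type)).encode('ascii')
    # stream the downloaded body through without ever buffering it whole
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        yield chunk
    # closing boundary
    yield ('\r\n--%s--\r\n' % boundary).encode('ascii')

boundary = uuid.uuid4().hex
get_response = requests.get(get_url, stream=True)
body = multipart_body(get_response, 'file', 'bigfile.bin',
                      get_response.headers.get('Content-Type', 'application/octet-stream'),
                      boundary)
post_response = requests.post(
    post_url, data=body,
    headers={'Content-Type': 'multipart/form-data; boundary=%s' % boundary})
print(post_response.status_code)

Note that the receiving server has to accept a chunked request body for this to work.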
Answer 1 (score: 1)
You cannot turn just anything you like into a context manager in Python; it requires very specific attributes to be one. With your current code you can do the following:
response = requests.get(big_file_url, stream=True)
post_response = requests.post(upload_url, files={'file': ('filename', response.iter_content())})
Using iter_content will ensure that your file is never entirely in memory: the iterator will be consumed lazily, whereas using the content attribute would load the whole file into memory.
Edit
The only way to reasonably do this is to use chunk-encoded uploads, e.g.,
post_response = requests.post(upload_url, data=response.iter_content())
If you absolutely need to do multipart/form-data encoding, then you will have to create an abstraction layer that takes the generator and the Content-Length header from the response (to provide an answer for len(file)) in its constructor, and exposes a read attribute that reads from the generator. The issue, again, is that I'm pretty sure the entire content would be read into memory before it is uploaded.
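As a side note, an abstraction layer of roughly this kind is available today in the requests-toolbelt package (a separate add-on, not part of requests itself; the sketch below is an illustration of the idea, not code from this answer, and the URL and file name are placeholders). Its MultipartEncoder computes the Content-Length up front and exposes a read() method, so requests streams the body instead of building it in memory:

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

upload_url = 'http://example.com/path/'   # placeholder URL

# the encoder knows the total length from the file on disk, so it can
# send a normal Content-Length header while reading the file lazily
encoder = MultipartEncoder(fields={
    'name': 'upload test',
    'file': ('bigfile.bin', open('bigfile.bin', 'rb'), 'application/octet-stream'),
})
response = requests.post(upload_url, data=encoder,
                         headers={'Content-Type': encoder.content_type})
print(response.status_code)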
Edit #2
You might be able to make your own generator that produces the multipart/form-data encoded data yourself. You could pass it in the same way you would pass chunk-encoded request data, but you would have to make sure you set your own Content-Type and Content-Length headers. I don't have time to sketch an example, but it shouldn't be too difficult.
Answer 2 (score: 1)
There actually is an issue about this on Kenneth Reitz's GitHub repo. I had the same problem (although I'm just uploading a local file), and I added a wrapper class that holds a list of streams corresponding to the different parts of the request. Its read() method iterates through that list, reading each part in turn, and it also computes the values needed for the headers (boundary and content length):
# coding=utf-8

from __future__ import unicode_literals

from mimetools import choose_boundary
from requests.packages.urllib3.filepost import iter_fields, get_content_type
from io import BytesIO
import codecs

writer = codecs.lookup('utf-8')[3]

class MultipartUploadWrapper(object):

    def __init__(self, files):
        """
        Initializer

        :param files:
            A dictionary of files to upload, of the form {'file': ('filename', <file object>)}
        :type files:
            Dict
        """
        super(MultipartUploadWrapper, self).__init__()
        self._cursor = 0
        self._body_parts = None
        self.content_type_header = None
        self.content_length_header = None
        self.create_request_parts(files)

    def create_request_parts(self, files):
        request_list = []
        boundary = choose_boundary()
        content_length = 0

        boundary_string = b'--%s\r\n' % (boundary)
        for fieldname, value in iter_fields(files):
            content_length += len(boundary_string)

            if isinstance(value, tuple):
                filename, data = value
                content_disposition_string = (('Content-Disposition: form-data; name="%s"; '
                                               'filename="%s"\r\n' % (fieldname, filename))
                                              + ('Content-Type: %s\r\n\r\n' % (get_content_type(filename))))
            else:
                data = value
                content_disposition_string = (('Content-Disposition: form-data; name="%s"\r\n' % (fieldname))
                                              + 'Content-Type: text/plain\r\n\r\n')

            request_list.append(BytesIO(str(boundary_string + content_disposition_string)))
            content_length += len(content_disposition_string)
            if hasattr(data, 'read'):
                data_stream = data
            else:
                data_stream = BytesIO(str(data))

            data_stream.seek(0, 2)
            data_size = data_stream.tell()
            data_stream.seek(0)

            request_list.append(data_stream)
            content_length += data_size

            end_string = b'\r\n'
            request_list.append(BytesIO(end_string))
            content_length += len(end_string)

        closing_boundary = b'--%s--\r\n' % (boundary)
        request_list.append(BytesIO(closing_boundary))
        content_length += len(closing_boundary)

        # There's a bug in httplib.py that generates a UnicodeDecodeError on binary uploads if
        # there are *any* unicode strings passed into headers as part of the requests call.
        # For this reason all strings are explicitly converted to non-unicode at this point.
        self.content_type_header = {b'Content-Type': b'multipart/form-data; boundary=%s' % boundary}
        self.content_length_header = {b'Content-Length': str(content_length)}

        self._body_parts = request_list

    def read(self, chunk_size=0):
        remaining_to_read = chunk_size
        output_array = []
        while remaining_to_read > 0:
            body_part = self._body_parts[self._cursor]
            current_piece = body_part.read(remaining_to_read)
            length_read = len(current_piece)
            output_array.append(current_piece)
            if length_read < remaining_to_read:
                # we finished this piece but haven't read enough, moving on to the next one
                remaining_to_read -= length_read
                if self._cursor == len(self._body_parts) - 1:
                    break
                else:
                    self._cursor += 1
            else:
                break

        return b''.join(output_array)
So instead of passing a 'files' keyword arg, you pass this object as the 'data' attribute of your Request.request object.
I've cleaned up the code.
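A hypothetical usage of the wrapper above could look like the following (the file name and URL are placeholders, not from the answer): the wrapper is passed as the request body, and its precomputed headers are supplied explicitly.

import requests

wrapper = MultipartUploadWrapper({'file': ('bigfile.bin', open('bigfile.bin', 'rb'))})
headers = {}
headers.update(wrapper.content_type_header)
headers.update(wrapper.content_length_header)
# passing the wrapper as 'data' makes requests call its read() method and stream the body
response = requests.post('http://example.com/path/', data=wrapper, headers=headers)
print(response.status_code)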
Answer 3 (score: 0)
In theory you can just use the raw object:
In [1]: import requests
In [2]: raw = requests.get("http://download.thinkbroadband.com/1GB.zip", stream=True).raw
In [3]: raw.read(10)
Out[3]: '\xff\xda\x18\x9f@\x8d\x04\xa11_'
In [4]: raw.read(10)
Out[4]: 'l\x15b\x8blVO\xe7\x84\xd8'
In [5]: raw.read() # take forever...
In [6]: raw = requests.get("http://download.thinkbroadband.com/5MB.zip", stream=True).raw
In [7]: requests.post("http://www.amazon.com", {'file': ('thing.zip', raw, 'application/zip')}, stream=True)
Out[7]: <Response [200]>