Can AppEngine *pull* in more than 32MB of data?

Time: 2015-04-29 19:57:57

Tags: python google-app-engine google-cloud-storage

I'm writing an API that needs to ingest HD videos (at least 100MB each). I can only reach the videos through an HTTP XML feed, so all I have is each video's URL to fetch it with (using a GET). The plan is to store the videos in GCS.

But before I can even upload/write to GCS, I hit AppEngine's 32MB-per-request limit.
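To make the limit concrete, here is a minimal sketch of what happens when a plain URL Fetch GET meets a response over 32MB (video_url is a stand-in for one of the feed URLs):

from google.appengine.api import urlfetch

video_url = 'http://example.com/video.mp4'  # stand-in for a feed URL

try:
    result = urlfetch.fetch(video_url, deadline=60)  # plain GET
except urlfetch.ResponseTooLargeError:
    # Raised once the response passes the 32MB cap; allow_truncated=True
    # would instead return the first 32MB with content_was_truncated set
    pass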

Is there a way around this limit in GAE, given these constraints:

  1. It needs to be a GET that AppEngine can initiate
  2. It needs to get the data into GCS
  3. I know of Amazon S3, but since I have to use Google Cloud products, I don't know whether one can be configured to pull in data this large.

Thanks.

1 answer:

Answer 0: (score: 2)

Following Paul Collingwood's suggestion, I came up with the following.

I decided against writing chunks to GCS and then stitching them together. Instead, I opted to do everything in memory, though I may change that depending on resource cost (it has to run as an F4 @ 512MB to avoid exceeding the F2's 256MB soft limit). A sketch of the stitching alternative appears after the logs below.

import logging
import urllib2

import cloudstorage as gcs


def get(self):
    # Work within GAE's 32MB-per-request limit; fetch in 30MB chunks to stay under it
    RANGE = 30 * (1024 ** 2)

    url = self.request.get('url')
    request = urllib2.Request(url)

    # HEAD first, to learn the video's size without downloading it
    request.get_method = lambda: 'HEAD'
    response = urllib2.urlopen(request)
    info = response.info()
    logging.debug('Downloading {}B video'.format(info.get('Content-length')))

    # Reuse the request, switching back to GET for the ranged downloads
    request.get_method = lambda: 'GET'
    _buffer = ''
    start = 0
    while True:
        end = start + RANGE
        # Range is inclusive, so bytes=start-end yields up to end - start + 1 bytes
        request.headers['Range'] = 'bytes={}-{}'.format(start, end)
        logging.debug('Buffering bytes {} to {}'.format(start, end))
        _bytes = urllib2.urlopen(request, timeout=60).read()
        _buffer += _bytes
        logging.info('Buffered bytes {} to {}'.format(start, end))

        # If fewer bytes arrived than the full inclusive range, all bytes
        # have been received; break to avoid an HTTP 416. (A file ending
        # exactly on a chunk boundary would still 416; the update below
        # avoids that by checking Content-Length instead.)
        if len(_bytes) < (end - start + 1):
            break

        start = end + 1

    filename = '/MY-BUCKET/video/test_large.mp4'
    with gcs.open(filename, 'w', content_type='video/mp4') as f:
        f.write(_buffer)
    logging.info('Wrote {}B video to GCS'.format(len(_buffer)))

Which looks like this in the logs:

DEBUG    2015-05-01 02:02:00,947 video.py:27] Buffering bytes 0 to 31457280
INFO     2015-05-01 02:02:11,625 video.py:30] Buffered bytes 0 to 31457280
DEBUG    2015-05-01 02:02:11,625 video.py:27] Buffering bytes 31457281 to 62914561
INFO     2015-05-01 02:02:22,768 video.py:30] Buffered bytes 31457281 to 62914561
DEBUG    2015-05-01 02:02:22,768 video.py:27] Buffering bytes 62914562 to 94371842
INFO     2015-05-01 02:02:32,920 video.py:30] Buffered bytes 62914562 to 94371842
...
Writing to GCS
...
INFO     2015-05-01 02:02:41,274 video.py:42] Wrote 89635441B video to GCS
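For reference, the alternative I decided against would have looked roughly like this: write each chunk as its own GCS object, then stitch them server-side with the GCS JSON API's compose operation. This is an untested sketch; the write_chunk/compose_chunks helpers and part-naming scheme are hypothetical, and the bucket/paths are the ones from my code above:

import json

from google.appengine.api import app_identity, urlfetch
import cloudstorage as gcs

def write_chunk(index, data):
    # One GCS object per chunk (hypothetical naming scheme)
    name = '/MY-BUCKET/video/test_large.mp4.part{}'.format(index)
    with gcs.open(name, 'w', content_type='video/mp4') as f:
        f.write(data)
    return name

def compose_chunks(part_names):
    # GCS compose concatenates up to 32 source objects server-side
    token, _ = app_identity.get_access_token(
        'https://www.googleapis.com/auth/devstorage.read_write')
    body = json.dumps({
        # Source names are bucket-relative, so strip the '/MY-BUCKET/' prefix
        'sourceObjects': [{'name': n.split('/', 2)[2]} for n in part_names],
        'destination': {'contentType': 'video/mp4'},
    })
    urlfetch.fetch(
        'https://www.googleapis.com/storage/v1/b/MY-BUCKET/o/'
        'video%2Ftest_large.mp4/compose',
        payload=body, method=urlfetch.POST,
        headers={'Authorization': 'Bearer ' + token,
                 'Content-Type': 'application/json'})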

Update, 6/May/15

Following Kekito's suggestion, I moved the GCS write inside the loop, keeping the file handle open for the whole duration.

url = self.request.get('url')
RANGE = 30 * (1024 ** 2)  # chunk size, as above

# HEAD request to learn the total size up front
request = urllib2.Request(url)
request.get_method = lambda: 'HEAD'
response = urllib2.urlopen(request)
info = response.info()
content_length = int(info.get('Content-length'))
logging.debug('Downloading {}B video'.format(content_length))

# Drop the HEAD request's references before the download loop
del info, response, request

request = urllib2.Request(url)
start = 0
filename = '/MY-BUCKET/video/test_large.mp4'
f = gcs.open(filename, 'w', content_type='video/mp4')
while True:
    end = start + RANGE
    request.headers['Range'] = 'bytes={}-{}'.format(start, end)

    # Stream each chunk straight into the open GCS file handle
    f.write(urllib2.urlopen(request, timeout=60).read())

    # The last byte's index is content_length - 1, so a range reaching
    # it means this was the final chunk
    if end >= content_length - 1:
        break

    start = end + 1

f.close()

Following the advice here, I used top to monitor the Python process running the GAE local dev server, started an upload, and recorded the memory footprint across the download and upload cycles.
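As a hands-free alternative to watching top, the process could log its own peak RSS. A sketch using the stdlib resource module (Unix-only, so it suits the local dev server; it may not be importable in the production sandbox):

import logging
import resource

def log_peak_rss(label):
    # ru_maxrss is reported in kilobytes on Linux
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logging.debug('{}: peak RSS ~{} MB'.format(label, peak_kb // 1024))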

I also tried varying the size of the chunk handled at a time: reducing the chunk size from 30 MB to 20 MB cut peak memory usage by about 50 MB. In the chart below, a 560 MB file is being ingested, and I'm trying to track:

  1. **GC**: the troughs in memory usage, while urlopen() **G**ets a **C**hunk of data
  2. **WC**: the peaks in memory usage, while f.write() **W**rites the **C**hunk reference out to GCS

[chart: memory footprint while ingesting the 560 MB file]

The 20 MB chunk test peaked at 230 MB, while the 30 MB chunk test peaked at 281 MB. So I could get away with running a 256 MB instance, but I'd feel more comfortable at 512 MB. I could also try even smaller chunk sizes.
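If I do experiment further, something like this would let me try sizes without redeploying (the chunk_mb query parameter is hypothetical, not part of my handler above):

# Hypothetical tuning knob: read the chunk size from the request
chunk_mb = int(self.request.get('chunk_mb', 20))
RANGE = chunk_mb * (1024 ** 2)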