Question

I have a non-seekable file-like object. Specifically, it is a file of indeterminate size coming from an HTTP request.

import requests
# requests.get() returns a Response; its .raw attribute is the file-like stream
fileobj = requests.get(url, stream=True).raw

I am streaming this file into a call to an Amazon AWS SDK function that writes the contents to Amazon S3. This works fine.

import boto3
s3 = boto3.resource('s3')
s3.Bucket('my-bucket').upload_fileobj(fileobj, 'target-file-name')

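However, while the data is being streamed to S3, I also want to stream it to another process. This other process may not need the entire stream and might stop listening at some point; that is fine and should not affect the stream to S3.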

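It is important that I don't use too much memory, because some of these files could be enormous. For the same reason, I don't want to write anything to disk.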

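I don't mind if either sink is slowed down because the other one is slow, as long as S3 eventually gets the entire file and the data goes to both sinks (or rather, to each sink that still wants it).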

What is the best way to go about this in Python (3)? I know I can't just pass the same file object to both sinks, such as

s3.Bucket('my-bucket').upload_fileobj(fileobj, 'target-file-name')
# At the same time somehow as
process = subprocess.Popen(['myapp'], stdin=fileobj)

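I think I could write a wrapper for the file-like object that passes any data read not only to the caller (which would be the S3 sink) but also to the other process. Something like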

import subprocess

class MyFilewrapper(object):
    def __init__(self, fileobj):
        self._fileobj = fileobj
        # subprocess.PIPE gives us a writable stdin for the child process
        self._process = subprocess.Popen(['myapp'], stdin=subprocess.PIPE)
    def read(self, size=-1):
        data = self._fileobj.read(size)
        self._process.stdin.write(data)
        return data

filewrapper = MyFilewrapper(fileobj)
s3.Bucket('my-bucket').upload_fileobj(filewrapper, 'target-file-name')

But is there a better way to do it? Perhaps something like

streams = StreamDuplicator(fileobj, streams=2)
s3.Bucket('my-bucket').upload_fileobj(streams[0], 'target-file-name')
# At the same time somehow as
process = subprocess.Popen(['myapp'], stdin=streams[1])

Answer 1

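The discomfort with the MyFilewrapper solution arises because the IO loop inside upload_fileobj is now in charge of feeding data to a subprocess that, strictly speaking, has nothing to do with the upload.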

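A "proper" solution would involve an upload API that provides a file-like object for writing the upload stream from an outer loop. That would make it possible to feed the data to both target streams "cleanly".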

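The following example shows the basic concept. The fictional startupload method provides the file-like object for uploading. Of course, you will need to add proper error handling etc.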

fileobj = requests.get(url, stream=True).raw  # .raw is the file-like stream

# startupload() is fictional: boto3 offers no such method
upload_fd = s3.Bucket('my-bucket').startupload('target-file-name')
other_fd = ... # Popen or whatever

buf = memoryview(bytearray(4096))
while True:
    r = fileobj.readinto(buf)  # number of bytes read; 0 at end of stream
    if r == 0:
        break

    read_slice = buf[:r]
    upload_fd.write(read_slice)
    other_fd.write(read_slice)
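
The outer-loop idea above can be tried out without any upload API at all. The following is only a minimal sketch under stated assumptions: tee_stream, required_sink and optional_sinks are hypothetical names that do not come from the answer or from boto3, every sink is assumed to expose a blocking write() method, and a sink that has stopped listening is assumed to raise BrokenPipeError (as a closed pipe does). Errors from the required sink are deliberately left to propagate.

```python
import io

def tee_stream(source, required_sink, optional_sinks, bufsize=64 * 1024):
    """Copy source to required_sink and to every optional sink.

    An optional sink that raises BrokenPipeError is dropped silently.
    Returns the optional sinks that were still accepting data at the end.
    """
    buf = memoryview(bytearray(bufsize))  # one reusable buffer bounds memory use
    active = list(optional_sinks)
    while True:
        n = source.readinto(buf)
        if not n:  # 0 means end of stream
            break
        chunk = buf[:n]  # zero-copy view of the bytes just read
        required_sink.write(chunk)  # any failure here propagates to the caller
        still_active = []
        for sink in active:
            try:
                sink.write(chunk)
                still_active.append(sink)
            except BrokenPipeError:
                pass  # this sink stopped listening; keep serving the others
        active = still_active
    return active

# Small in-memory demonstration with stand-ins for the real sinks.
source = io.BytesIO(b"x" * 200_000)
upload = io.BytesIO()   # stands in for the upload stream
process = io.BytesIO()  # stands in for a subprocess's stdin
tee_stream(source, upload, [process])
print(len(upload.getvalue()), len(process.getvalue()))
```

Because the writes block, a slow sink slows down the whole loop, which the question explicitly accepts. Memory use stays at one buffer regardless of the file size.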

Answer 2

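Here is an implementation of StreamDuplicator with the requested functionality and usage model. I verified that it correctly handles the situation where one of the sinks stops consuming its stream halfway through.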

Usage:

./streamduplicator.py <sink1_command> <sink2_command> ...

Example:

$ seq 100000 | ./streamduplicator.py "sed -n '/0000/ {s/^/sed: /;p}'" "grep 1234"

Output:

sed: 10000
1234
11234
12340
12341
12342
12343
12344
12345
12346
12347
12348
12349
21234
sed: 20000
31234
sed: 30000
41234
sed: 40000
51234
sed: 50000
61234
sed: 60000
71234
sed: 70000
81234
sed: 80000
91234
sed: 90000
sed: 100000

streamduplicator.py:

#!/usr/bin/env python3

import sys
import os
from subprocess import Popen
from threading import Thread
from time import sleep
import shlex
import fcntl

WRITE_TIMEOUT=0.1

def write_or_timeout(stream, data, timeout):
    data_to_write = data[:]
    time_to_sleep = 1e-6
    time_remaining = 1.0 * timeout
    while time_to_sleep != 0:
        try:
            stream.write(data_to_write)
            return True
        except BlockingIOError as ex:
            data_to_write = data_to_write[ex.characters_written:]
            if ex.characters_written == 0:
                time_to_sleep *= 2
            else:
                time_to_sleep = 1e-6
                time_remaining = timeout
            time_to_sleep = min(time_remaining, time_to_sleep)
            sleep(time_to_sleep)
            time_remaining -= time_to_sleep
    return False

class StreamDuplicator(object):
    def __init__(self, stream, n, timeout=WRITE_TIMEOUT):
        self.stream = stream
        self.write_timeout = timeout
        self.pipereadstreams = []
        self.pipewritestreams = []
        for i in range(n):
            (r, w) = os.pipe()
            readStream = open(r, 'rb')
            self.pipereadstreams.append(readStream)
            old_flags = fcntl.fcntl(w, fcntl.F_GETFL)
            fcntl.fcntl(w, fcntl.F_SETFL, old_flags|os.O_NONBLOCK)
            self.pipewritestreams.append(os.fdopen(w, 'wb'))
        Thread(target=self).start()

    def __call__(self):
        while True:
            data = self.stream.read(1024*16)
            if len(data) == 0:
                break
            surviving_pipes = []
            for p in self.pipewritestreams:
                if write_or_timeout(p, data, self.write_timeout) == True:
                    surviving_pipes.append(p)
            self.pipewritestreams = surviving_pipes

    def __getitem__(self, i):
        return self.pipereadstreams[i]

if __name__ == '__main__':
    n = len(sys.argv)
    streams = StreamDuplicator(sys.stdin.buffer, n-1, 3)
    for (i,cmd) in zip(range(n-1), sys.argv[1:]):
        Popen(shlex.split(cmd), stdin=streams[i])

Implementation limitations:

Using fcntl to set the pipe's write file descriptor to non-blocking mode probably makes it unusable under Windows.

A closed/unsubscribed sink is detected via a write timeout.
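
The timeout-based detection depends on one behaviour of non-blocking pipes: once the reader stops draining the pipe and the kernel buffer fills up, a write fails with BlockingIOError instead of waiting. The sketch below only illustrates that behaviour and is not part of the answer's script. It assumes a Unix-like system, uses os.set_blocking from the standard library in place of the fcntl calls, and fill_pipe_nonblocking is a hypothetical helper name.

```python
import os

def fill_pipe_nonblocking(chunk_size=4096):
    """Write to a pipe that nobody reads until the kernel refuses more data.

    Returns (bytes_written, blocked), where blocked is True if a write
    raised BlockingIOError because the pipe buffer was full.
    """
    r, w = os.pipe()
    os.set_blocking(w, False)  # same effect as setting O_NONBLOCK via fcntl
    written = 0
    blocked = False
    try:
        while True:
            # The read end is never drained, so the buffer eventually fills.
            written += os.write(w, b"x" * chunk_size)
    except BlockingIOError:
        blocked = True  # a blocking descriptor would have hung here instead
    finally:
        os.close(r)
        os.close(w)
    return written, blocked

written, blocked = fill_pipe_nonblocking()
print("pipe accepted", written, "bytes before refusing; blocked =", blocked)
```

A sink that has merely fallen behind looks identical to one that has stopped reading for good, which is why the script waits for a timeout before dropping a pipe. The number of bytes accepted depends on the platform's pipe buffer size.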

Streaming a non-seekable file-like object to multiple sinks