Passing a large amount of data to stdin using subprocess.Popen

Time: 2011-05-06 12:24:23

Tags: python subprocess popen

I'm having a hard time understanding what the Pythonic way to solve this simple problem is.

My problem is quite simple. If you use the following code, it will hang. This is well documented in the subprocess module documentation.

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
for i in range(100000):
    proc.stdin.write('%d\n' % i)
output = proc.communicate()[0]
print output

While searching for a solution (there was a very insightful thread, but I have lost it now) I found this one (among others), which uses an explicit fork:

import os
import sys
from subprocess import Popen, PIPE

def produce(to_sed):
    for i in range(100000):
        to_sed.write("%d\n" % i)
        to_sed.flush()
    #this would happen implicitly, anyway, but is here for the example
    to_sed.close()

def consume(from_sed):
    while 1:
        res = from_sed.readline()
        if not res:
            sys.exit(0)
            #sys.exit(proc.poll())
        print 'received: ', [res]

def main():
    proc = Popen(['cat','-'],stdin=PIPE,stdout=PIPE)
    to_sed = proc.stdin
    from_sed = proc.stdout

    pid = os.fork()
    if pid == 0 :
        from_sed.close()
        produce(to_sed)
        return
    else :
        to_sed.close()
        consume(from_sed)

if __name__ == '__main__':
    main()

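While this solution is conceptually very easy to understand, it uses one more process and sits at too low a level compared to the subprocess module (which exists precisely to hide this kind of thing...).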

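I'm wondering: is there a simple and clean solution using the subprocess module that won't hang, or do I have to take a step back and implement an old-style select loop or an explicit fork to get this pattern?

Thanks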

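10 answers:

Answer 0: (score: 10)

If you want a pure Python solution, you need to put either the reader or the writer in a separate thread. The threading package is a lightweight way to do this, with convenient access to common objects and no messy forking.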

import subprocess
import threading
import sys

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
def writer():
    for i in range(100000):
        proc.stdin.write('%d\n' % i)
    proc.stdin.close()
thread = threading.Thread(target=writer)
thread.start()
for line in proc.stdout:
    sys.stdout.write(line)
thread.join()
proc.wait()

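It might be neat to see the subprocess module modernized to support streams and coroutines, which would allow pipelines that mix Python pieces and shell pieces to be constructed more elegantly.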
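For readers on Python 3, where the pipes carry bytes rather than str, the same writer-thread idea might look like the sketch below. The function name `echo_through_cat` and its `count` parameter are made up for illustration, and the sketch assumes a `cat` executable is on the PATH.

```python
import subprocess
import threading


def echo_through_cat(count):
    """Feed `count` numbered lines to `cat` and return everything it echoes back."""
    proc = subprocess.Popen(["cat", "-"],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

    def writer():
        # Runs in a background thread, so the main thread is free to drain stdout.
        for i in range(count):
            proc.stdin.write(b"%d\n" % i)
        proc.stdin.close()  # signals EOF to cat

    thread = threading.Thread(target=writer)
    thread.start()
    output = proc.stdout.read()  # returns once cat has closed its stdout
    thread.join()
    proc.stdout.close()
    proc.wait()
    return output


if __name__ == "__main__":
    data = echo_through_cat(100000)
    print(data.count(b"\n"), "lines echoed")
```

Reading everything with `proc.stdout.read()` keeps the whole output in memory; iterating over `proc.stdout` line by line, as in the answer above, avoids that.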

Answer 1: (score: 6)

If you don't want to keep all the data in memory, you have to use select. For example, something like:

import subprocess
from select import select
import os

proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

i = 0
while True:
    rlist, wlist, xlist = [proc.stdout], [], []
    if i < 100000:
        wlist.append(proc.stdin)
    rlist, wlist, xlist = select(rlist, wlist, xlist)
    if proc.stdout in rlist:
        out = os.read(proc.stdout.fileno(), 10)
        print out,
        if not out:
            break
    if proc.stdin in wlist:
        proc.stdin.write('%d\n' % i)
        i += 1
        if i >= 100000:
            proc.stdin.close()

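Answer 2: (score: 2)

This is what I use to load a 6G mysql dump file through subprocess. Stay away from shell=True: it is insecure and wastes resources by starting an extra process.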

import subprocess

fhandle = None

cmd = [mysql_path,
       "-u", mysql_user, "-p" + mysql_pass,
       "-h", host, database]

fhandle = open(dump_file, 'r')
p = subprocess.Popen(cmd, stdin=fhandle, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

(stdout,stderr) = p.communicate()

fhandle.close()
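The key point of this answer is that an open file object can be handed to the child as its stdin, so the data never passes through a Python-side pipe. A small Python 3 sketch of that idea follows; `run_with_file_stdin` is an illustrative name, and `wc -l` stands in for the mysql client, which is assumed not to be available here.

```python
import subprocess
import tempfile


def run_with_file_stdin(cmd, path):
    """Run `cmd` with the file at `path` attached directly to its stdin."""
    with open(path, "rb") as handle:
        # The child reads from the file descriptor itself; Python copies nothing.
        result = subprocess.run(cmd, stdin=handle,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
        tmp.write(b"".join(b"%d\n" % i for i in range(100000)))
        tmp.flush()
        code, out, err = run_with_file_stdin(["wc", "-l"], tmp.name)
        print(code, out.decode().strip())
```

Because stdout and stderr are still pipes collected in memory, this suits commands that read a lot but print little.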

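Answer 3: (score: 1)

For this kind of thing, the shell works a lot better than subprocess.

Write very simple Python applications that read from sys.stdin and write to sys.stdout.

Connect the simple applications together using a shell pipeline.

If you want, start the pipeline using subprocess, or simply write a one-line shell script.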

python part1.py | python part2.py

This is very, very efficient. It is also portable to all Linux (and Windows) systems, as long as you keep it very simple.
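If the pipeline is started from Python rather than from a shell script, the two stages can be connected without shell=True by passing one process's stdout as the other's stdin. The sketch below is only an illustration: the inline `PART1` and `PART2` programs are hypothetical stand-ins for part1.py and part2.py.

```python
import subprocess
import sys

# Hypothetical stand-ins for part1.py (producer) and part2.py (consumer).
PART1 = "for i in range(100000): print(i)"
PART2 = "import sys; print(sum(int(line) for line in sys.stdin))"


def run_pipeline():
    """Equivalent of `python part1.py | python part2.py`, without a shell."""
    p1 = subprocess.Popen([sys.executable, "-c", PART1], stdout=subprocess.PIPE)
    p2 = subprocess.Popen([sys.executable, "-c", PART2],
                          stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # drop our copy so p1 sees a broken pipe if p2 exits early
    out, _ = p2.communicate()
    p1.wait()
    return out.decode().strip()


if __name__ == "__main__":
    print(run_pipeline())
```

The data flows directly between the two children through an OS pipe, so the parent never has to read or write the stream itself.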

Answer 4: (score: 1)

Your code deadlocks as soon as cat's stdout OS pipe buffer is full. If you use stdout=PIPE, you have to consume the output in time; otherwise a deadlock like the one in your case can occur.

If you don't need the output while the process is running, you can redirect it to a temporary file:

#!/usr/bin/env python3
import subprocess
import tempfile

with tempfile.TemporaryFile('r+') as output_file:
    with subprocess.Popen(['cat'],
                          stdin=subprocess.PIPE,
                          stdout=output_file,
                          universal_newlines=True) as process:
        for i in range(100000):
            print(i, file=process.stdin)
    output_file.seek(0)  # rewind (and sync with the disk)
    print(output_file.readline(), end='')  # get  the first line of the output

If the input/output is small (fits in memory), you can pass the input all at once and get the output all at once using .communicate(), which reads and writes concurrently for you:

#!/usr/bin/env python3
import subprocess

cp = subprocess.run(['cat'], input='\n'.join(['%d' % i for i in range(100000)]),
                    stdout=subprocess.PIPE, universal_newlines=True)
print(cp.stdout.splitlines()[-1]) # print the last line

To read/write concurrently by hand, you can use threads, asyncio, fcntl, etc. @Jed provided a simple thread-based solution. Here is an asyncio-based solution:

#!/usr/bin/env python3
import asyncio
import sys
from subprocess import PIPE

async def pump_input(writer):
     try:
         for i in range(100000):
             writer.write(b'%d\n' % i)
             await writer.drain()
     finally:
         writer.close()

async def run():
    # start child process
    # NOTE: universal_newlines parameter is not supported
    process = await asyncio.create_subprocess_exec('cat', stdin=PIPE, stdout=PIPE)
    asyncio.ensure_future(pump_input(process.stdin)) # write input
    async for line in process.stdout: # consume output
        print(int(line)**2) # print squares
    return await process.wait()  # wait for the child process to exit


if sys.platform.startswith('win'):
    loop = asyncio.ProactorEventLoop() # for subprocess' pipes on Windows
    asyncio.set_event_loop(loop)
else:
    loop = asyncio.get_event_loop()
loop.run_until_complete(run())
loop.close()
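On newer interpreters (Python 3.7 or later is assumed here) the explicit event-loop handling can be replaced by `asyncio.run`. The following is a minimal sketch of that variant, not a drop-in replacement: the names `feed` and `sum_of_squares` are illustrative, and `cat` is assumed to be on the PATH.

```python
import asyncio
from asyncio.subprocess import PIPE


async def feed(writer, count):
    """Write `count` numbered lines, yielding whenever the pipe buffer is full."""
    try:
        for i in range(count):
            writer.write(b"%d\n" % i)
            await writer.drain()
    finally:
        writer.close()  # sends EOF to the child


async def sum_of_squares(count):
    proc = await asyncio.create_subprocess_exec("cat", stdin=PIPE, stdout=PIPE)
    feeder = asyncio.create_task(feed(proc.stdin, count))  # runs concurrently
    total = 0
    async for line in proc.stdout:  # consume output as it arrives
        total += int(line) ** 2
    await feeder
    await proc.wait()
    return total


if __name__ == "__main__":
    print(asyncio.run(sum_of_squares(1000)))
```

Awaiting `writer.drain()` after each write is what keeps the writer from buffering the whole input in memory while the reader is busy.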

Answer 5: (score: 0)

Here is an example (Python 3) of reading one record at a time from gzip using a pipe:

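from subprocess import Popen, PIPE

# Pass the command as a list: on POSIX, a plain string without shell=True
# is taken as the name of the program to run.
cmd = ['gzip', '-dc', 'compressed_file.gz']
pipe = Popen(cmd, stdout=PIPE).stdout

for line in pipe:
    print(":", line.decode(), end="")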

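I know there is a standard module for that; this is meant only as an example. You can read the whole output in one go using the communicate method (like shell back-ticks), but obviously you have to be careful about memory size.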
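For comparison, the standard module alluded to here is presumably `gzip`. A sketch of the same line-by-line read without a child process could look like this; the helper name `read_gzip_lines` is made up, and the demo writes its own small sample file.

```python
import gzip
import os
import tempfile


def read_gzip_lines(path):
    """Yield decoded lines from a gzip file without spawning a child process."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            yield line


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "compressed_file.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("first\nsecond\n")
    for line in read_gzip_lines(path):
        print(":", line, end="")
```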

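Here is an example (Python 3 again) of writing records to the lp(1) program on Linux:

from subprocess import Popen, PIPE

# A list, since a plain string is not split into arguments on POSIX without shell=True
cmd = ['lp', '-']
proc = Popen(cmd, stdin=PIPE)
proc.communicate(some_data.encode())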

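Answer 6: (score: 0)

Now, I know this is not going to satisfy the purist in you completely, since the input has to fit in memory and you have no option to work interactively with the input and output, but at least it works fine on your example. The communicate method optionally takes the input as an argument, and if you feed the process its input this way, it will work.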

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )

input = "".join('{0:d}\n'.format(i) for i in range(100000))
output = proc.communicate(input)[0]
print output

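For the larger problem, you can subclass Popen, rewrite __init__ to accept stream-like objects as arguments for stdin, stdout and stderr, and rewrite the _communicate method (hairy for cross-platform use: you need to do it twice, see the subprocess.py source) to call read() on the stdin stream and write() the output to the stdout and stderr streams. What bothers me about this approach is that, as far as I know, it hasn't been done already. When obvious things have not been done before, there is usually a reason (it doesn't work as intended), but I can't see why it shouldn't, apart from the fact that you need the streams to be thread-safe on Windows.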
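The stream-to-stream idea can be approximated without subclassing Popen. The sketch below is a different, simpler technique than the one described above: a helper thread copies a file-like source into stdin in chunks while the main thread copies stdout into a file-like sink. `run_with_streams` and its parameters are hypothetical names, stderr is left untouched, and `cat` is assumed to be on the PATH.

```python
import io
import shutil
import subprocess
import threading


def run_with_streams(cmd, source, sink, chunk_size=64 * 1024):
    """Pipe the binary stream `source` through `cmd` into the binary stream `sink`."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        try:
            # read() from the source and write() to the child, one chunk at a time
            shutil.copyfileobj(source, proc.stdin, chunk_size)
        finally:
            proc.stdin.close()

    feeder = threading.Thread(target=feed)
    feeder.start()
    shutil.copyfileobj(proc.stdout, sink, chunk_size)  # drain the child's output
    feeder.join()
    proc.stdout.close()
    return proc.wait()


if __name__ == "__main__":
    src = io.BytesIO(b"".join(b"%d\n" % i for i in range(100000)))
    dst = io.BytesIO()
    print(run_with_streams(["cat"], src, dst), len(dst.getvalue()))
```

Only about one chunk per direction is held in memory at a time, so the source and sink can be arbitrarily large files or sockets wrapped as binary streams.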

Answer 7: (score: 0)

Using aiofiles & asyncio in Python 3.5:

A bit complicated, but you need only 1024 bytes of memory to write to stdin!

import asyncio
import aiofiles
import sys
from os.path import dirname, join, abspath
import subprocess as sb


THIS_DIR = abspath(dirname(__file__))
SAMPLE_FILE = join(THIS_DIR, '../src/hazelnut/tests/stuff/sample.mp4')
DEST_PATH = '/home/vahid/Desktop/sample.mp4'


async def async_file_reader(f, buffer):
    async for l in f:
        if l:
            buffer.append(l)
        else:
            break
    print('reader done')


async def async_file_writer(source_file, target_file):
    length = 0
    while True:
        input_chunk = await source_file.read(1024)
        if input_chunk:
            length += len(input_chunk)
            target_file.write(input_chunk)
            await target_file.drain()
        else:
            target_file.write_eof()
            break

    print('writer done: %s' % length)


async def main():
    dir_name = dirname(DEST_PATH)
    remote_cmd = 'ssh localhost mkdir -p %s && cat - > %s' % (dir_name, DEST_PATH)

    stdout, stderr = [], []
    async with aiofiles.open(SAMPLE_FILE, mode='rb') as f:
        cmd = await asyncio.create_subprocess_shell(
            remote_cmd,
            stdin=sb.PIPE,
            stdout=sb.PIPE,
            stderr=sb.PIPE,
        )

        await asyncio.gather(*(
            async_file_reader(cmd.stdout, stdout),
            async_file_reader(cmd.stderr, stderr),
            async_file_writer(f, cmd.stdin)
        ))

        print('EXIT STATUS: %s' % await cmd.wait())

    stdout, stderr = '\n'.join(stdout), '\n'.join(stderr)

    if stdout:
        print(stdout)

    if stderr:
        print(stderr, file=sys.stderr)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Result:

writer done: 383631
reader done
reader done
EXIT STATUS: 0

Answer 8: (score: 0)

The simplest solution I can think of:

from subprocess import Popen, PIPE
from threading import Thread

s = ''.join(map(str, xrange(10000))) # a large string
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=1)
Thread(target=lambda: any((p.stdin.write(b) for b in s)) or p.stdin.close()).start()
print (p.stdout.read())

Buffered version:

from subprocess import Popen, PIPE
from threading import Thread

s = ''.join(map(str, xrange(10000))) # a large string (slicing a list here would make write() fail)
n = 1024 # buffer size
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=n)
Thread(target=lambda: any((p.stdin.write(c) for c in (s[i:i+n] for i in xrange(0, len(s), n)))) or p.stdin.close()).start()
print (p.stdout.read())

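Answer 9: (score: 0)

I was looking for example code that iterates over a process's output incrementally while the process consumes its input from a provided iterator (also incrementally). Basically: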

import string
import random

# That's what I consider a very useful function, though didn't
# find any existing implementations.
def process_line_reader(args, stdin_lines):
    # args - command to run, same as subprocess.Popen
    # stdin_lines - iterable with lines to send to process stdin
    # returns - iterable with lines received from process stdout
    pass

# Returns iterable over n random strings. n is assumed to be infinity if negative.
# Just an example of function that returns potentially unlimited number of lines.
def random_lines(n, M=8):
    while 0 != n:
        yield "".join(random.choice(string.letters) for _ in range(M))
        if 0 < n:
            n -= 1

# That's what I consider to be a very convenient use case for
# function proposed above.
def print_many_uniq_numbered_random_lines():
    i = 0
    for line in process_line_reader(["uniq", "-i"], random_lines(100500 * 9000)):
        # Key idea here is that `process_line_reader` will feed random lines into
        # `uniq` process stdin as lines are consumed from returned iterable.
        print "#%i: %s" % (i, line)
        i += 1

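Some of the solutions suggested here do this with threads (which is not always convenient) or with asyncio (which is not available in Python 2.x). Below is an example of a working implementation that does it.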

import errno
import subprocess
import os
import fcntl
import select

class nonblocking_io(object):
    def __init__(self, f):
        self._fd = -1
        if type(f) is int:
            self._fd = os.dup(f)
            os.close(f)
        elif type(f) is file:
            self._fd = os.dup(f.fileno())
            f.close()
        else:
            raise TypeError("Only accept file objects or integer file descriptors")
        flag = fcntl.fcntl(self._fd, fcntl.F_GETFL)
        fcntl.fcntl(self._fd, fcntl.F_SETFL, flag | os.O_NONBLOCK)
    def __enter__(self):
        return self
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def fileno(self):
        return self._fd
    def close(self):
        if 0 <= self._fd:
            os.close(self._fd)
            self._fd = -1

class nonblocking_line_writer(nonblocking_io):
    def __init__(self, f, lines, autoclose=True, buffer_size=16*1024, encoding="utf-8", linesep=os.linesep):
        super(nonblocking_line_writer, self).__init__(f)
        self._lines = iter(lines)
        self._lines_ended = False
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._buffer_offset = 0
        self._buffer = bytearray()
        self._encoding = encoding
        self._linesep = bytearray(linesep, encoding)
    # Returns False when `lines` iterable is exhausted and all pending data is written
    def continue_writing(self):
        while True:
            if self._buffer_offset < len(self._buffer):
                n = os.write(self._fd, self._buffer[self._buffer_offset:])
                self._buffer_offset += n
                if self._buffer_offset < len(self._buffer):
                    return True
            if self._lines_ended:
                if self._autoclose:
                    self.close()
                return False
            self._buffer[:] = []
            self._buffer_offset = 0
            while len(self._buffer) < self._buffer_size:
                line = next(self._lines, None)
                if line is None:
                    self._lines_ended = True
                    break
                self._buffer.extend(bytearray(line, self._encoding))
                self._buffer.extend(self._linesep)

class nonblocking_line_reader(nonblocking_io):
    def __init__(self, f, autoclose=True, buffer_size=16*1024, encoding="utf-8"):
        super(nonblocking_line_reader, self).__init__(f)
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._encoding = encoding
        self._file_ended = False
        self._line_part = ""
    # Returns (lines, more) tuple, where lines is iterable with lines read and more will
    # be set to False after EOF.
    def continue_reading(self):
        lines = []
        while not self._file_ended:
            data = os.read(self._fd, self._buffer_size)
            if 0 == len(data):
                self._file_ended = True
                if self._autoclose:
                    self.close()
                if 0 < len(self._line_part):
                    lines.append(self._line_part.decode(self._encoding))
                    self._line_part = ""
                break
            for line in data.splitlines(True):
                self._line_part += line
                if self._line_part.endswith(("\n", "\r")):
                    lines.append(self._line_part.decode(self._encoding).rstrip("\n\r"))
                    self._line_part = ""
            if len(data) < self._buffer_size:
                break
        return (lines, not self._file_ended)

class process_line_reader(object):
    def __init__(self, args, stdin_lines):
        self._p = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self._reader = nonblocking_line_reader(self._p.stdout)
        self._writer = nonblocking_line_writer(self._p.stdin, stdin_lines)
        self._iterator = self._communicate()
    def __iter__(self):
        return self._iterator
    def __enter__(self):
        return self._iterator
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def _communicate(self):
        read_set = [self._reader]
        write_set = [self._writer]
        while read_set or write_set:
            try:
                rlist, wlist, xlist = select.select(read_set, write_set, [])
            except select.error, e:
                if e.args[0] == errno.EINTR:
                    continue
                raise
            if self._reader in rlist:
                stdout_lines, more = self._reader.continue_reading()
                for line in stdout_lines:
                    yield line
                if not more:
                    read_set.remove(self._reader)
            if self._writer in wlist:
                if not self._writer.continue_writing():
                    write_set.remove(self._writer)
        self.close()
    def lines(self):
        return self._iterator
    def close(self):
        if self._iterator is not None:
            self._reader.close()
            self._writer.close()
            self._p.wait()
            self._iterator = None
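For Python 3, where the code above would need porting, the `process_line_reader(args, stdin_lines)` interface from the stub earlier in this answer can also be sketched with a feeder thread. This is a swapped-in technique, not the non-blocking I/O approach implemented above, and it assumes Python 3.7+ (for `text=True`) with `cat` and `uniq` on the PATH.

```python
import subprocess
import threading


def process_line_reader(args, stdin_lines):
    """Yield stdout lines of `args` while feeding it lines from `stdin_lines`."""
    proc = subprocess.Popen(args, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)

    def feed():
        try:
            for line in stdin_lines:
                proc.stdin.write(line + "\n")
        except BrokenPipeError:
            pass  # the child exited before reading all of its input
        finally:
            try:
                proc.stdin.close()
            except BrokenPipeError:
                pass

    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        feeder.join()
        proc.wait()


if __name__ == "__main__":
    numbered = enumerate(process_line_reader(["uniq"], ["a", "a", "b", "b", "a"]))
    for i, line in numbered:
        print("#%i: %s" % (i, line))
```

Input is pulled from the iterable only as fast as the pipe accepts it, so the input side never needs to fit in memory; the trade-off compared with the select-based classes is the extra thread.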