Python: feeding and parsing streams of data to and from an external program with additional input and output files

Date: 2015-07-27 13:34:26

Tags: python asynchronous io legacy

Question: I have a badly designed Fortran program (I can't change it; I'm stuck with it) that takes text input from stdin and from other input files, and writes text output to stdout and to other output files. The inputs and outputs are very large and I would like to avoid writing them to the hard drive (a slow operation). I already have a function that iterates over the lines of the several input files, and I also have parsers for the several outputs. I don't actually know whether the program reads all of its input before it starts writing output, or whether it starts writing output while it is still reading input.

Goal: Have a function that feeds the external program with the input it needs and parses its output inside my program, without writing the data to text files on the hard drive.

Research: The naive way, using files, is:

import shlex
from subprocess import PIPE, Popen

def execute_simple(cmd, stdin_iter, stdout_parser, input_files, output_files):

    for filename, file_iter in input_files.iteritems():
        with open(filename ,'w') as f:
            for line in file_iter:
                f.write(line + '\n')


    p_sub = Popen(
        shlex.split(cmd),
        stdin = PIPE,
        stdout = open('stdout.txt', 'w'),
        stderr = open('stderr.txt', 'w'),
        bufsize=1
    )
    for line in stdin_iter:
        p_sub.stdin.write(line + '\n')

    p_sub.stdin.close()
    p_sub.wait()

    data = {}
    for filename, parse_func in output_files.iteritems():
        # stdout.txt and stderr.txt are included here
        with open(filename,'r') as f:
            data[filename] = parse_func(
                    iter(f.readline, b'')
            )
    return data

I have tried to run the external program with the subprocess module, using named pipes and multiprocessing for the other input/output files. I want to feed stdin from an iterator (which returns the input lines), save stderr in a list, and parse stdout as it comes from the external program. The input and output can be very large, so using communicate is not feasible.
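Roughly, the stdin/stdout part on its own would be something like the sketch below (Python 2; the helper names are made up, and it ignores the extra files and stderr): stdin is fed from the iterator in a separate thread while the main thread hands stdout to the parser, so neither side has to fit in memory or touch the disk.

import shlex
import threading
from subprocess import PIPE, Popen

def execute_pipes_only(cmd, stdin_iter, stdout_parser):
    p = Popen(shlex.split(cmd), stdin=PIPE, stdout=PIPE, bufsize=1)

    def feed():
        # write the input lines from the iterator, then close stdin
        # so the program sees EOF
        for line in stdin_iter:
            p.stdin.write(line + '\n')
        p.stdin.close()

    # feed stdin from a separate thread so that a full stdout pipe
    # buffer cannot deadlock the writer
    feeder = threading.Thread(target=feed)
    feeder.start()

    # parse stdout line by line on the main thread
    data = stdout_parser(iter(p.stdout.readline, ''))

    feeder.join()
    p.wait()
    return data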

I have a parser of the form:

def parser(iterator):
    for line in iterator:
        # Do something
        if condition:
            break
    some_other_function(iterator)
    return data

I have looked at this solution using select to choose the appropriate stream, but I don't know how to combine it with my stdout parser or how to feed stdin.
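As far as I understand it, the select loop from that solution boils down to something like the sketch below (Unix only; it just collects the lines into lists, which is exactly the part I don't know how to replace with my parser):

import select

def read_with_select(p):
    # p is a Popen started with stdout=PIPE and stderr=PIPE
    stdout_lines, stderr_lines = [], []
    streams = [p.stdout, p.stderr]
    while streams:
        readable, _, _ = select.select(streams, [], [])
        for stream in readable:
            line = stream.readline()
            if line == '':
                # EOF on this stream
                streams.remove(stream)
            elif stream is p.stdout:
                stdout_lines.append(line.rstrip())
            else:
                stderr_lines.append(line.rstrip())
    return stdout_lines, stderr_lines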

I have also looked at the asyncio module, but as far as I can see I would have the same problem with parsing stdout.
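For reference, the asyncio variant I have in mind looks roughly like the sketch below (Python 3.4+ only; the coroutine names are made up, and again hooking my parser into read_stdout is the part I don't see):

import asyncio
import shlex
from asyncio.subprocess import PIPE

@asyncio.coroutine
def feed_stdin(proc, stdin_iter):
    # write the input lines, then close stdin so the program sees EOF
    for line in stdin_iter:
        proc.stdin.write((line + '\n').encode())
        yield from proc.stdin.drain()
    proc.stdin.close()

@asyncio.coroutine
def read_stdout(proc):
    # collect the stdout lines; this is where the parser would have to plug in
    lines = []
    while True:
        line = yield from proc.stdout.readline()
        if not line:
            break
        lines.append(line.decode().rstrip())
    return lines

@asyncio.coroutine
def run(cmd, stdin_iter):
    proc = yield from asyncio.create_subprocess_exec(
        *shlex.split(cmd), stdin=PIPE, stdout=PIPE)
    _, lines = yield from asyncio.gather(feed_stdin(proc, stdin_iter),
                                         read_stdout(proc))
    yield from proc.wait()
    return lines

loop = asyncio.get_event_loop()
print(loop.run_until_complete(run('cat', iter(['a', 'b', 'c']))))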

2 answers:

Answer 0 (score: 6)

You should use named pipes for all input and output to the Fortran program to avoid writing to disk. Then, in your consumer, you can use threads to read from each of the program's output sources and add the information to a Queue for in-order processing.

To model this, I created a Python app, daemon.py, that reads from standard input and returns the square root until EOF. It logs all input to a log file specified as a command-line argument, prints the square root to stdout, and prints all errors to stderr. I think it simulates your program (of course the number of output files is only one, but it can be scaled). You can view the source code for this test application here. Note the explicit call to stdout.flush(). By default, output written with print is buffered, which means it would all appear at the end and messages would not arrive in order. I hope your Fortran application does not buffer its output. I believe my sample application will probably not run on Windows, due to a Unix-only use of select, which shouldn't matter in your case.
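The linked source is not reproduced here; a rough sketch matching that description (not the original file, and it omits the select call mentioned above) could look like:

#!/usr/bin/env python
# rough sketch of the daemon.py test program described above: read numbers
# from stdin, print their square root to stdout, print errors to stderr,
# and log every input line to the file named as the first argument
import sys
import math

def main():
    logfile = open(sys.argv[1], 'w')
    for line in iter(sys.stdin.readline, ''):
        line = line.rstrip('\n')
        logfile.write(line + '\n')
        logfile.flush()
        try:
            print math.sqrt(float(line))
        except ValueError as e:
            sys.stderr.write('%s\n' % e)
            sys.stderr.flush()
        sys.stdout.flush()   # the explicit flush mentioned above
    logfile.close()

if __name__ == '__main__':
    main()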

My consumer application starts the daemon application as a subprocess, with stdin, stdout and stderr redirected to subprocess.PIPE. Each of these pipes, plus the log named pipe, is given to its own thread: one feeds input, and three handle the log file, the errors and the standard output respectively. They all add their messages to a shared Queue, which your main thread reads from and sends to your parser.

This is my consumer's code:

import os, random, time
import subprocess
import threading
import Queue
import atexit

def setup():
    # make a named pipe for every file the program should write
    logfilepipe='logpipe'
    os.mkfifo(logfilepipe)

def cleanup():
    # put your named pipes here to get cleaned up
    logfilepipe='logpipe'
    os.remove(logfilepipe)

# run our cleanup code no matter what - avoid leaving pipes laying around
# even if we terminate early with Ctrl-C
atexit.register(cleanup)

# My example iterator that supplies input for the program. You already have an iterator 
# so don't worry about this. It just returns a random input from the sample_data list
# until the maximum number of iterations is reached.
class MyIter():
    sample_data=[0,1,2,4,9,-100,16,25,100,-8,'seven',10000,144,8,47,91,2.4,'^',56,18,77,94]
    def __init__(self, numiterations=1000):
        self.numiterations=numiterations
        self.current = 0

    def __iter__(self):
        return self

    def next(self):
        self.current += 1
        if self.current > self.numiterations:
            raise StopIteration
        else:
            return random.choice(self.__class__.sample_data)

# Your parse_func function - I just print it out with a [tag] showing its source.
def parse_func(source,line):
    print "[%s] %s" % (source,line)

# Generic function for sending standard input to the problem.
# p - a process handle returned by subprocess
def input_func(p, queue):
    # run the command with output redirected
    for line in MyIter(30): # Limit for testing purposes
        time.sleep(0.1) # sleep a tiny bit
        p.stdin.write(str(line)+'\n')
        queue.put(('INPUT', line))
    p.stdin.close()
    p.wait()

    # Once our process has ended, tell the main thread to quit
    queue.put(('QUIT', True))

# Generic function for reading output from the program. source can either be a
# named pipe identified by a string, or subprocess.PIPE for stdout and stderr.
def read_output(source, queue, tag=None):
    print "Starting to read output for %r" % source
    if isinstance(source,str):
        # Is a file or named pipe, so open it
        source=open(source, 'r') # open file with string name
    line = source.readline()
    # enqueue and read lines until EOF
    while line != '':
        queue.put((tag, line.rstrip()))
        line = source.readline()

if __name__=='__main__':
    cmd='daemon.py'

    # set up our FIFOs instead of using files - put file names into setup() and cleanup()
    setup()

    logfilepipe='logpipe'

    # Message queue for handling all output, whether it's stdout, stderr, or a file output by our command
    lq = Queue.Queue()

    # open the subprocess for command
    print "Running command."
    p = subprocess.Popen(['/path/to/'+cmd,logfilepipe],
                                    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # Start threads to handle the input and output
    threading.Thread(target=input_func, args=(p, lq)).start()
    threading.Thread(target=read_output, args=(p.stdout, lq, 'OUTPUT')).start()
    threading.Thread(target=read_output, args=(p.stderr, lq, 'ERRORS')).start()

    # open a thread to read any other output files (e.g. log file) as named pipes
    threading.Thread(target=read_output, args=(logfilepipe, lq, 'LOG')).start()

    # Now combine the results from our threads to do what you want
    run=True
    while(run):
        (tag, line) = lq.get()
        if tag == 'QUIT':
            run=False
        else:
            parse_func(tag, line)

My iterator returns a random input value (some of which are junk, to cause errors). Yours should be a drop-in replacement. The program runs until the end of its input and then waits for the subprocess to complete before enqueueing a QUIT message for your main thread. My parse_func is obviously super simple, it just prints out the message and its source, but you should be able to adapt it to your parsers. The function that reads from an output source is designed to work with both PIPEs and strings: don't open the named pipes on your main thread, because opening a named pipe for reading blocks until the writer opens its end. So for file readers (e.g. reading log files) it's better to have the child thread open the file and block there. However, we spawn the subprocess on the main thread so we can pass the handles for stdin, stdout and stderr to their respective child threads.

Based partially on this Python implementation of multitail.

Answer 1 (score: 0)

If you wait for the end of one result before sending a new job, it is very important that the Fortran program calls flush at the end of each job (it can also do so more often). The command depends on the compiler, e.g. for GNU Fortran it is CALL FLUSH(unitnumber); it can also be simulated by closing the output file and opening it again for appending.

您还可以轻松地在末尾写入一些带有许多空白字符的空行,以填充缓冲区大小,并获得新的数据块。 5000个空白字符可能已经足够好了,但不会太多,它会阻塞Fortran一侧的管道。如果您在发送新作业后立即读取这些空行,则甚至不需要非阻塞读取。可以在数字应用程序中轻松识别作业的最后一行。如果你要写一个“聊天”应用程序,你需要其他人写的东西。