Question

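I am going to write a Python program that reads chunks from a file, processes those chunks, and then appends the processed data to a new file. It needs to read in chunks because the files to be processed will generally be larger than the available RAM. Greatly simplified, the pseudocode would be something like this: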

def read_chunk(file_object, chunksize):
    # read the next chunk from the file object (empty bytes at end of file)
    return file_object.read(chunksize)

def process_chunk(chunk):
    # process the chunk and return the processed data (placeholder: unchanged)
    return chunk

def write_chunk(data, outputfile):
    # write the data to the output file
    outputfile.write(data)

def main(infile, outfile, chunksize=1024 * 1024):
    # This will do the work
    with open(infile, "rb") as file_obj, open(outfile, "ab") as out_file:
        while True:
            chunk = read_chunk(file_obj, chunksize)
            if not chunk:
                break
            data = process_chunk(chunk)
            write_chunk(data, out_file)

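What I would like to know is: can I execute these three steps concurrently, and how would that work?

I.e. one thread to read the data, one thread to process the data, and one thread to write the data. Of course, the reading thread always needs to be one "step" ahead of the processing thread, which in turn needs to be one step ahead of the writing thread...

What would be really great would be to be able to run it concurrently and also split the work among processors...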

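More detail on the exact problem: I will be reading data from a raster file using the GDAL library. This will be read in chunks/lines into a numpy array. The processing will simply be some logical comparisons between the value of each raster cell and its neighbours (which neighbours have a lower value than the test cell, and which of those is the lowest). A new array of the same size (with the edges assigned arbitrary values) will be created to hold the result, and this array will be written to a new raster file. I anticipate that the only library other than GDAL will be numpy, which could make this routine a good candidate for 'cythonising' too.

Any tips on how to proceed?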

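Edit:

I should point out that we have implemented similar things before, and we know that the time spent processing will be significant compared to the I/O. Another point is that the library we will use for reading the files (GDAL) supports concurrent reads...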

Answer 1

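Coroutines for a data processing pipeline? This template should get you started in a way that keeps the memory profile to a minimum. You could add queueing and a virtual "thread" manager to this to handle multiple files.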

#!/usr/bin/env python3

from functools import wraps, partial, reduce

def coroutine(func_gen):
    """Prime a generator-based coroutine so it is ready for .send()."""
    @wraps(func_gen)
    def starter(*args, **kwargs):
        cr = func_gen(*args, **kwargs)
        next(cr)  # advance to the first yield
        return cr
    return starter


def read_chunk(file_object, chunksize, target):
    """
    Read from an object with a .read method until it is exhausted,
    pushing each chunk into target and yielding once per chunk.
    """
    while True:
        buf = file_object.read(chunksize)
        if not buf:
            break  # end of file; for an endless stream, sleep and retry instead
        target.send(buf)
        yield len(buf)

@coroutine
def process_chunk(target):
    def example_process(thing):
        _ = [None for _ in range(100000)]  # simulate some work
        value = type(thing).__name__
        print("%r -> %s" % (thing, value))
        return thing

    while True:
        chunk = (yield)
        data = example_process(chunk)
        target.send(data)

@coroutine
def write_chunk(file_object):
    while True:
        writable = (yield)
        file_object.write(writable)
        file_object.flush()


def main(src, dst):
    with open(src, 'rb') as r, open(dst, 'wb') as w:
        # build the pipeline back to front: reader -> processor -> writer
        g = reduce(lambda a, b: b(a),
                   [w, write_chunk, process_chunk,
                    partial(read_chunk, r, 16)]
                  )
        for _ in g:  # each step reads, processes and writes one chunk
            pass

if __name__ == "__main__":
    main("./stackoverflow.py", "retval.py")
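If you do want the three stages to overlap, the queueing idea could be sketched roughly like this with the standard library's threading and queue modules. This is only a minimal sketch, not a tested design: the names reader, processor, writer, run_pipeline and the SENTINEL convention are my own assumptions, plain binary files stand in for the GDAL reads, and error handling is left out (if one stage raises, the others will block).

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not part of any library

def reader(file_object, chunksize, out_q):
    # Read chunks until EOF, then signal the end of the stream.
    while True:
        buf = file_object.read(chunksize)
        if not buf:
            break
        out_q.put(buf)  # blocks while the queue is full
    out_q.put(SENTINEL)

def processor(in_q, out_q, func):
    # Apply func to every chunk and pass the result on.
    while True:
        chunk = in_q.get()
        if chunk is SENTINEL:
            break
        out_q.put(func(chunk))
    out_q.put(SENTINEL)

def writer(in_q, file_object):
    # Write results in the order they arrive.
    while True:
        data = in_q.get()
        if data is SENTINEL:
            break
        file_object.write(data)

def run_pipeline(src, dst, func, chunksize=1 << 20, depth=2):
    # Bounded queues keep the reader at most `depth` chunks ahead,
    # so memory use stays limited for files larger than RAM.
    q1 = queue.Queue(maxsize=depth)
    q2 = queue.Queue(maxsize=depth)
    with open(src, "rb") as r, open(dst, "wb") as w:
        threads = [
            threading.Thread(target=reader, args=(r, chunksize, q1)),
            threading.Thread(target=processor, args=(q1, q2, func)),
            threading.Thread(target=writer, args=(q2, w)),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

For example, run_pipeline("in.bin", "out.bin", bytes.upper) would copy a file while upper-casing it. Keep in mind that in CPython the GIL stops pure-Python processing from running on several cores at once, so threads mainly help to overlap I/O with processing; to spread the processing itself across processors you would have to look at something like multiprocessing instead.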

Answer 2

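My honest advice is not to worry about optimization for now (see premature optimization).

Unless you are doing a lot of operations (which doesn't seem to be the case from your description), the I/O wait time is very likely to be much, much larger than the processing time (i.e. I/O is the bottleneck).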

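In that case, doing the processing in multiple threads is useless. Splitting the I/O and the processing (as you suggested) will buy you at most n*proc_time, where n is the number of times you process and proc_time is the processing time per operation (not including I/O). If proc_time is much lower than the I/O time, you won't gain much.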
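To put rough numbers on that bound, here is a small back-of-the-envelope calculation. It assumes an idealised pipeline in which I/O and processing overlap perfectly and I/O is treated as a single stage; the function names and the timings are made up for illustration, not measurements.

```python
def sequential_time(n, io_time, proc_time):
    # Each chunk is read/written and then processed, one after the other.
    return n * (io_time + proc_time)

def overlapped_time(n, io_time, proc_time):
    # Idealised pipeline: the slower stage sets the pace, and the faster
    # stage is only paid for once while the pipeline fills or drains.
    return n * max(io_time, proc_time) + min(io_time, proc_time)

# Hypothetical figures, in seconds per chunk.
n, io_time, proc_time = 1000, 0.050, 0.005

seq = sequential_time(n, io_time, proc_time)
ovl = overlapped_time(n, io_time, proc_time)
saved = seq - ovl

print(f"sequential: {seq:.1f} s")
print(f"overlapped: {ovl:.1f} s")
print(f"saved:      {saved:.1f} s (upper bound n*proc_time = {n * proc_time:.1f} s)")
```

With these made-up figures the overlap saves about 5 seconds out of 55, which is the kind of gain that may not be worth the extra complexity. If your processing really does dominate, as your edit suggests, the same arithmetic works the other way round and the saving is bounded by the I/O time instead.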

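I would first implement this sequentially, check the I/O and CPU usage, and only then worry about optimization, if it looks like it could be worthwhile. You might also want to experiment with reading more chunks from the file at once (buffering).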
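One rough way to do that check is to time the stages of the sequential version separately. The sketch below is only an assumption of how you might go about it: profile_sequential is a name I made up, it works on plain binary files rather than GDAL rasters, and it measures wall-clock time per stage rather than true CPU usage.

```python
import time

def profile_sequential(src, dst, process, chunksize=1 << 20):
    # Run the plain sequential loop and return (io_seconds, proc_seconds).
    io_s = proc_s = 0.0
    with open(src, "rb") as r, open(dst, "wb") as w:
        while True:
            t0 = time.perf_counter()
            chunk = r.read(chunksize)
            t1 = time.perf_counter()
            io_s += t1 - t0          # time spent reading
            if not chunk:
                break
            data = process(chunk)
            t2 = time.perf_counter()
            proc_s += t2 - t1        # time spent processing
            w.write(data)
            t3 = time.perf_counter()
            io_s += t3 - t2          # time spent writing
    return io_s, proc_s
```

Calling it with a few different chunksize values also gives a quick feel for the buffering effect. Bear in mind that the operating system's file cache can make repeated reads of the same file look much faster than a first read, so the figures are only indicative.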

Concurrent I/O and processing of big data files in Python
