Question

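I have a very large delimited file. I need to apply a function to each line in this file, and each function call takes a while. So I have sharded the main file into subfiles such as <shard-dir>/lines_<start>_<stop>.tsv and am applying the function to each file via pool.starmap. Since I also want to keep the results, I write them to a corresponding output file, <output-shard-dir>/lines_<start>_<stop>_results.tsv.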

The function I am mapping looks something like this:

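# this is pseudo-code, but similar to what I am using
# (output_filename_from_shard_filename, heavy_computation_function and
# stringify are my own helpers and are not shown here)
def process_shard_file(file):
    output_file = output_filename_from_shard_filename(file)
    with open(file, 'r') as fi, open(output_file, 'w') as fo:
        # apply the slow function to every line of the shard
        for line in fi:
            result = heavy_computation_function(line)
            fo.write(stringify(result))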

The multiprocessing is then started with something like:

import os
from multiprocessing import Pool

shard_files = [...] # a lot of filenames

with Pool(processes=os.cpu_count()) as pool:
    # starmap expects one argument tuple per call
    sargs = [(fname,) for fname in shard_files]
    pool.starmap(process_shard_file, sargs)

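When I monitor the machine's resources with htop, I see that all cores are fully loaded. However, I notice that memory usage keeps growing until it reaches swap... and then until swap is full as well.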

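I do not understand why this happens, since multiple batches of n * cpu_cores files (calls to process_shard_file) complete successfully. So why is the memory not stable? Assume that heavy_computation_function uses essentially the same amount of memory regardless of the file, and that result is also about the same size each time.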

Update


def process_shard_file(file):
    output_file = output_filename_from_shard_filename(file)
    with open(file, 'r') as fi, open(output_file, 'w') as fo:
        for line in fi:
            # heavy computation bypassed: the input line is written back unchanged
            result = line  # heavy_computation_function(line)
            fo.write(result)

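The version above does not seem to cause this memory leak, where the result from heavy_computation_function can be thought of as basically another line to be written to the output file.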

So what does heavy_computation_function look like?


def heavy_computation_function(fileline):
    numba_input = convert_line_to_numba_input(fileline)
    result = cached_njitted_function(numba_input)
    return convert_to_more_friendly_format(result)

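I know this is still quite vague, but I am trying to see whether this is a general problem. I have also tried adding the maxtasksperchild=1 option to my Pool to really try to prevent the leak, to no avail.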

Answer 1

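Your program works, but only for a little while before it self-destructs due to a resource leak. One could choose to accept such an undiagnosed leak as a fact of life that is not going to change today.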
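
If you would rather watch the leak than accept it blindly, something like the following might help. It is only a rough sketch: peak_rss_mib and call_and_log_memory are names I made up for illustration, and it relies on the Unix-only resource module. It prints each worker's pid and peak resident memory after every call, so you can see whether the number keeps climbing inside the same worker.

```python
import os
import resource  # Unix only
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB, macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def call_and_log_memory(func, *args):
    """Call func(*args), then print this process's pid and peak RSS."""
    result = func(*args)
    print(
        f"pid={os.getpid()} {func.__name__} peak_rss={peak_rss_mib():.1f} MiB",
        flush=True,
    )
    return result
```

With your pool this could be called as pool.starmap(call_and_log_memory, [(process_shard_file, fname) for fname in shard_files]). Both functions live at module level, so they can be pickled and sent to the workers.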

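The documentation points out that leaks sometimes happen, and offers a maxtasksperchild parameter to help deal with them. Set it high enough that you benefit from amortizing the initial startup cost over several tasks, but low enough to avoid swapping. Let us know how it works out for you.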
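
A minimal sketch of what that tuning could look like. The helper name run_with_recycling, the tasks_per_worker parameter and its default of 50 are placeholders of mine, not anything from your code or from the library. As far as I understand the pool, the limit counts tasks as they are dispatched, and map/starmap may batch several items into one task, so the sketch passes chunksize=1 to make one item equal one task.

```python
import multiprocessing as mp
import os


def run_with_recycling(func, items, tasks_per_worker=50, processes=None, context=None):
    """Apply func to every item in a pool whose workers are replaced periodically.

    tasks_per_worker is a guess to be tuned: higher amortizes worker startup
    cost over more tasks, lower caps how much a leaking worker can accumulate.
    """
    ctx = context or mp.get_context()
    with ctx.Pool(
        processes=processes or os.cpu_count(),
        maxtasksperchild=tasks_per_worker,
    ) as pool:
        # chunksize=1: each item is its own task, so a worker is retired
        # after tasks_per_worker items rather than after that many chunks.
        # Results arrive in completion order, not input order.
        return list(pool.imap_unordered(func, items, chunksize=1))
```

For your case the call would be roughly run_with_recycling(process_shard_file, shard_files, tasks_per_worker=20), made from under an if __name__ == "__main__": guard. The 20 is an arbitrary starting point; watch htop and adjust it in whichever direction keeps you out of swap.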
