Question

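I am currently working on a problem that involves performing operations on the matrices formed by all possible combinations of rows selected from a matrix X.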

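Because the number of such combinations can be very large, I want to parallelize my code. Specifically, I generate the combinations with itertools.combinations and then use itertools.islice to cut that iterator into batches of a predetermined size, as follows: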

import itertools
import math

from scipy.special import comb


def construct_batches(n,k,batch_size):

    combinations_slices = []

    # Calculate number of batches
    n_batches = math.ceil(comb(n,k,exact=True)/batch_size)

    # Construct iterator for combinations
    combinations = itertools.combinations(range(n),k)

    # Each entry is a pair of iterators (from tee) over the same lazy slice
    while len(combinations_slices) < n_batches:
        combinations_slices.append(itertools.tee(itertools.islice(combinations,batch_size)))

    return combinations_slices
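
To make the batching concrete, here is a minimal sketch on a tiny input. It is an illustration only: iter_batches is a hypothetical helper, not part of the code above, and unlike construct_batches it materializes each batch as a list, which is only reasonable when batch_size is small.

```python
import itertools


def iter_batches(n, k, batch_size):
    """Yield lists of at most batch_size k-combinations of range(n).

    Hypothetical helper for illustration; each batch is built eagerly.
    """
    combinations = itertools.combinations(range(n), k)
    while True:
        # Pull the next batch_size combinations off the shared iterator
        batch = list(itertools.islice(combinations, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(5, 2, 4))
batch_lengths = [len(b) for b in batches]
```

With n=5 and k=2 there are 10 combinations, so a batch size of 4 gives three batches holding 4, 4 and 2 combinations.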

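After I obtain these slices (which are duplicated so that I can save one copy on disk), X is sliced using the combinations stored in each slice, and some computations are performed on the slices using the perform_computations function, inside the evaluate_sample_score function. This happens in parallel on all available cores:

def evaluate_sample_score(batch_index):

    # Storage structure
    struct = {}

    # Store combinations to be able to retrieve the correct sample
    struct['combs'] = batches[batch_index][0]

    # Perform calculations using the combinations of rows
    score_batch = scaling_const*perform_computations(X[list(batches[batch_index][1]),:])
    struct['scores'] = score_batch

    # Create data structure which stores the scores for each batch along with
    # the combinations that generated them
    max_util = np.max(score_batch)

    # save the slice object
    filename = "/"+base_filename_s + "_" + str(batch_index)
    save_batch_scores(struct,filename,directory)

    return (max_util,score_batch)

I save the results of perform_computations on disk for each batch because, in most cases, these arrays do not fit in memory and I need them later. The first step in using them is to compute a quantity that is a function of all the values returned by perform_computations across all processes. I could load all the pickled files and do the computation one file at a time, but I managed to find a (much faster) alternative:

def calculate_data_function(iterable):

    def func(scores_batch,max_score):
        return np.sum(np.exp(scores_batch[1]-max_score))

    iter_1,iter_2 = itertools.tee(iterable)
    ms = functools.reduce((lambda x,y:(max(x[0],y[0]),np.zeros(shape=y[1].shape))),iter_1)[0]
    result = sum(map(functools.partial(func,max_score=ms),iter_2))

    return result

This works because I start the batch computations with results = pool.imap(evaluate_sample_score,range(n_batches)), which returns an iterator object.

My question is the following: how does pool.imap() behave from a memory perspective? Will it store all the outputs of the evaluate_sample_score function and then wrap them in an iterator? In other words, am I at risk of running out of memory by returning the matrices from evaluate_sample_score, and would I be better off simply reading the data back from disk in the large cases?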
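
One point about calculate_data_function that is independent of pool.imap: itertools.tee buffers every item that one of its iterators has consumed and the other has not. Because functools.reduce exhausts iter_1 before iter_2 is touched, all score batches are held in memory at that point. As a minimal sketch of how the same quantity could be accumulated in a single pass, the following keeps a running maximum and rescales the partial sum whenever the maximum grows. The name streaming_sum_exp is hypothetical, and the sketch assumes each item is a (batch_max, score_batch) tuple with finite scores, like the return value of evaluate_sample_score.

```python
import math

import numpy as np


def streaming_sum_exp(results):
    """Return sum(exp(score - global_max)) over all batches in one pass.

    Assumes `results` yields (batch_max, score_batch) tuples with
    finite scores. Only one batch is referenced at a time.
    """
    running_max = -math.inf
    total = 0.0
    for batch_max, score_batch in results:
        new_max = max(running_max, batch_max)
        # Rescale the partial sum from the old maximum to the new one
        total *= math.exp(running_max - new_max)
        total += float(np.sum(np.exp(score_batch - new_max)))
        running_max = new_max
    return total


# Small check against the two-pass computation
rng = np.random.default_rng(0)
arrays = [rng.normal(size=5) for _ in range(4)]
one_pass = streaming_sum_exp((np.max(a), a) for a in arrays)
global_max = max(np.max(a) for a in arrays)
two_pass = sum(float(np.sum(np.exp(a - global_max))) for a in arrays)
```

This only removes the buffering caused by tee; whether pool.imap itself accumulates finished results that have not been consumed yet is a separate matter, and is what the question above asks.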

Python pool.imap memory management

0 answers: