Question

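I'm using the Python library

https://github.com/ficusss/PyGMNormalize

to normalize my datasets (scRNAseq), and the last line of the library's file utils.py

https://github.com/ficusss/PyGMNormalize/blob/master/pygmnormalize/utils.py

uses too much memory:

np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

Is there a good way to rewrite this line of code to improve its memory usage? I have access to 200 GB of RAM on the cluster, and with a matrix of around 20 GB this line fails to run, but I believe there should be a way to make it work.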

Answer 1

If all elements of matrix are >= 0, then you can do:

np.percentile(matrix[np.any(matrix, axis = 1)], p, axis = 0)

This uses the fact that any float or integer other than 0 is interpreted as True when viewed as a boolean (which np.any does internally). It saves you from building that big boolean matrix separately.

Since you're doing boolean indexing in matrix[...], you're creating a temporary copy, and you don't need to care whether it gets overwritten during the percentile computation. So you can use overwrite_input = True to save even more memory:

mat = matrix.copy()
perc = np.percentile(matrix[np.any(matrix, axis = 1)], p, axis = 0, overwrite_input = True)
np.array_equal(mat, matrix)  # is `matrix` still the same? -> True

Finally, depending on the rest of your architecture, I'd recommend looking into storing matrix in some flavor of sparse format, which should again reduce your memory usage substantially (although with some drawbacks depending on the type you use).
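A minimal sketch of the two points above, on a made-up non-negative array. The array contents and `p = 50` are assumptions for illustration only, not values from the question.

```python
import numpy as np

# Hypothetical small non-negative matrix standing in for the real data.
matrix = np.array([[3.0, 4.0],
                   [0.0, 0.0],
                   [1.0, 0.0],
                   [5.0, 2.0]])
p = 50

# For non-negative data, truthiness selects the same rows as `matrix > 0`.
mask_explicit = np.any(matrix > 0, axis=1)
mask_truthy = np.any(matrix, axis=1)

backup = matrix.copy()
# Boolean indexing returns a copy, so overwrite_input can only reorder
# that temporary copy, never `matrix` itself.
perc = np.percentile(matrix[mask_truthy], p, axis=0, overwrite_input=True)

print(np.array_equal(mask_explicit, mask_truthy))  # same row selection
print(np.array_equal(backup, matrix))              # input left untouched
print(perc)                                        # column-wise medians
```

Note that the equivalence of the two masks only holds when no entry is negative, which is the precondition stated at the start of this answer.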

Answer 2

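I'm posting this as an answer because there's more than fits in a comment, though it may not be complete. There are two suspicious things. First, percentile should run fine on a 20 GB matrix if your machine has 200 GB of RAM available. That's a lot of memory, so start looking into what else might be using it. Start with top: are there other processes, or is your Python program using all of it?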

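The second suspicious thing is that the documentation for utils.percentile doesn't match its actual behavior. Here's the relevant part of the code you linked to: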

def percentile(matrix, p):
    """
    Estimation of percentile without zeros.
    ....
    Returns
    -------
    float
        Calculated percentile.
    """
    return np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

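What it actually does is return the (column-wise) percentile computed over rows that are not all zero. Edit: over rows that contain at least one positive element. If the values are non-negative that's the same thing, but in general the results will be very different.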
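To make the distinction concrete, here is a small sketch with an invented matrix containing one row that is nonzero but has no positive entry. The data is purely illustrative.

```python
import numpy as np

# Hypothetical matrix: the last row is not all zero, yet has no positive value.
matrix = np.array([[ 3, 4],
                   [ 0, 0],
                   [-1, 0]])

has_positive = np.any(matrix > 0, axis=1)   # the test the library line performs
not_all_zero = np.any(matrix != 0, axis=1)  # what "without zeros" might suggest

print(has_positive)
print(not_all_zero)
```

The two masks disagree on the last row, so the percentile is computed over different sets of rows as soon as negative values are present.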

np.any(matrix > 0, axis=1) returns a boolean array that selects the rows containing at least one positive element. For example:

>>> np.any(np.array([[3, 4], [0, 0]]) > 0, axis=1)
array([ True, False])

>>> np.any(np.array([[3, 4], [1, 0]]) > 0, axis=1)
array([ True,  True])

>>> np.any(np.array([[3, 0], [1, 0]]) > 0, axis=1)
array([ True,  True])

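That array is then used to index matrix, which selects only the rows that are not all zero and returns them. If you're not familiar with this kind of indexing, you should read the NumPy docs on indexing.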

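The computation takes a lot of memory: matrix > 0 creates a boolean array with the same dimensions as matrix, and the indexing then creates a copy of matrix that probably contains most of the rows.
So the boolean array is probably 2-4 GB, and the copy is close to 20 GB.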
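The estimate can be checked with a back-of-the-envelope calculation. The shape, dtype and fraction of kept rows below are assumptions chosen so that the matrix comes to 20 GB; the real values are not given in the question. `temp_bytes` is a hypothetical helper, not part of NumPy or the library.

```python
import numpy as np

def temp_bytes(n_rows, n_cols, dtype, kept_fraction):
    """Rough size in bytes of the two temporaries the original line creates."""
    n = n_rows * n_cols
    bool_mask = n * np.dtype(bool).itemsize                         # matrix > 0
    row_copy = round(n * kept_fraction) * np.dtype(dtype).itemsize  # matrix[rows]
    return bool_mask, row_copy

# Assumed: 50,000 x 50,000 float64 (20 GB), with 95% of rows kept by the mask.
mask_b, copy_b = temp_bytes(50_000, 50_000, np.float64, kept_fraction=0.95)
print(mask_b / 1e9, "GB for the boolean array")
print(copy_b / 1e9, "GB for the indexed copy")
```

With a float32 matrix of the same byte size there are twice as many elements, so the boolean array doubles, which is consistent with the 2-4 GB range.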

This can be reduced:

# Flag rows with at least one positive entry, one row at a time to reduce memory
mask = np.array([np.any(r > 0) for r in matrix], dtype=bool)
# Find the percentile for each column, using only the flagged rows
perc = [np.percentile(c[mask], p) for c in matrix.T]

But, as noted above, this doesn't match the function's documentation.
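One way to gain confidence in the row-by-row and column-by-column version is to compare it against the original one-liner on a small array. The following is a sketch under assumptions: `percentile_low_memory` is a hypothetical name, and the random Poisson test data is invented for the comparison.

```python
import numpy as np

def percentile_low_memory(matrix, p):
    """Column-wise percentile over rows with at least one positive entry."""
    # One boolean per row; each `row > 0` temporary is only one row long.
    mask = np.fromiter((np.any(row > 0) for row in matrix),
                       dtype=bool, count=matrix.shape[0])
    # Each col[mask] is a single-column copy that can be freed before the next.
    return np.array([np.percentile(col[mask], p) for col in matrix.T])

# Small made-up count matrix with many zeros, for comparison only.
rng = np.random.default_rng(0)
small = rng.poisson(0.3, size=(200, 5)).astype(float)

reference = np.percentile(small[np.any(small > 0, axis=1)], 30, axis=0)
result = percentile_low_memory(small, 30)
print(np.allclose(reference, result))
```

The trade-off is speed: the Python-level loops are slower than the vectorized call, in exchange for temporaries that are one row or one column long rather than the size of the whole matrix.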

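There may be a reason for this logic, but it's odd. If you don't know the reason, you could just call np.percentile directly; just check that it returns a close value on a smaller subset of the data. There's also nanpercentile, which can be used the same way but ignores nan values.
You can use boolean indexing to replace the values you don't want included with nan (i.e. matrix[matrix < 0] = np.nan) and then call it.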
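A short sketch of the nan-masking approach on invented data. The copy is made only so the example leaves its input intact; on the real matrix the assignment could be done in place.

```python
import numpy as np

# Hypothetical data with a few values that should be excluded (the negatives).
matrix = np.array([[ 3.0,  4.0],
                   [-1.0,  2.0],
                   [ 5.0, -2.0],
                   [ 1.0,  6.0]])

masked = matrix.copy()        # copy only to keep the example's input unchanged
masked[masked < 0] = np.nan   # requires a float dtype: ints cannot hold NaN
perc = np.nanpercentile(masked, 50, axis=0)
print(perc)                   # column-wise medians over the non-NaN entries
```

Unlike the row mask, this drops individual entries rather than whole rows, so each column's percentile may be computed over a different number of values.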

Reduce the memory usage of a line of code that uses numpy
