Question

I am running into puzzlingly large memory requirements for a relatively simple problem.

I have an ordered array of length N (the index corresponds to the sample ID) containing integer values or NaN.

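I want to generate an N-by-N indicator matrix such that position (i, j) is 1 if samples i and j both have non-NaN values in the original list, and 0 otherwise (since the matrix is symmetric, I do not care about position (j, i)).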
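To make the definition concrete, here is a minimal sketch on a toy input. The 4-element array and the variable names are made up for illustration; they are not part of my real data.

```python
import numpy as np

# Hypothetical toy input: 4 samples, sample 1 has no value.
x = np.array([3.0, np.nan, 7.0, 1.0])

valid = ~np.isnan(x)                             # True where a value is present
full = np.outer(valid, valid).astype(np.uint8)   # N x N: 1 where both i and j are valid
upper = np.triu(full, k=1)                       # keep only positions with i < j

print(upper)
```

Only the pairs (0, 2), (0, 3) and (2, 3) are set to 1 here, because sample 1 is NaN.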

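To cut down on memory, I implemented the code below, which does not generate the square matrix but instead creates an array representing the condensed form of the square matrix (i.e. what squareform would produce). Yet for an initial list of 66,000 entries this script needs more than 80 GB of memory! I think it fails because of the map line in get_condensed_indeces, but I don't know how to fix it. If anyone has any suggestions for reducing the memory usage, please share!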
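For scale, a rough back-of-the-envelope calculation of what N = 66,000 implies. This is only a sketch: the 8-byte pointer size assumes 64-bit CPython, and the per-pair list size is measured with sys.getsizeof on whatever interpreter runs it, so the exact figures will vary.

```python
import sys

n = 66_000
n_pairs = n * (n - 1) // 2              # entries in the condensed array: 2,177,967,000

float32_bytes = n_pairs * 4             # the final np.float32 array
pointer_bytes = n_pairs * 8             # a plain Python list with one slot per pair (assumes 8-byte pointers)
per_pair = sys.getsizeof([n, 0, 1]) + 8 # one small [n, i, j] list plus its slot in the outer list
pair_list_bytes = n_pairs * per_pair    # worst case: no NaN at all, every pair is materialized

for label, size in [
    ("float32 condensed array", float32_bytes),
    ("Python list of that length", pointer_bytes),
    ("list of [n, i, j] lists (worst case)", pair_list_bytes),
]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

The condensed array alone is about 8.7 GB as float32, so the intermediate Python lists are the likely reason the total goes far beyond that.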

The code below should work for any input array.

import itertools

import numpy as np

def ind_matrix(x):
    # condensed (upper-triangle) form has len(x) * (len(x) - 1) / 2 entries;
    # integer division so the length is an int
    ind = np.array([0.] * (len(x) * (len(x) - 1) // 2), dtype=np.float32)
    mask = np.where(~np.isnan(x))[0]
    targets = get_condensed_indeces(len(x), mask)
    ind[targets] += 1
    return ind

def get_condensed_indeces(n, desired_elements):
    # args:
    # n - number of cells in the current cluster
    # desired_elements - list of numpy indeces that specify
    # cells in a given cluster
    # list(...) so the result can be used as an index on Python 3 as well
    return list(map(
        index_converter,
        [[n, x[0], x[1]] for x in itertools.combinations(desired_elements, 2)]
    ))

def index_converter(x):
    # mapping from position (i,j) in square matrix to index in squareform 1D array
    n, i, j = x[0], x[1], x[2]
    return n * i - (i * (i + 1)) // 2 + j - 1 - i
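One direction that might avoid the per-pair Python objects, offered as an untested-at-scale sketch rather than a confirmed fix: in the condensed layout, all pairs (i, i+1) ... (i, n-1) for a fixed i are contiguous, starting at n * i - i * (i + 1) / 2, so each row can be written as one slice. The function name ind_condensed and the uint8 dtype are my own choices for illustration.

```python
import numpy as np

def ind_condensed(x):
    """Condensed indicator: 1 where both samples of a pair (i < j) are non-NaN."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    valid = ~np.isnan(x)
    out = np.zeros(n * (n - 1) // 2, dtype=np.uint8)   # one byte per pair
    # Rows whose sample i is NaN stay all zero, so only valid rows are visited.
    for i in np.flatnonzero(valid[:-1]):
        i = int(i)                                     # plain Python int for the index arithmetic
        start = n * i - i * (i + 1) // 2               # condensed index of pair (i, i + 1)
        out[start:start + n - 1 - i] = valid[i + 1:]   # 1 where sample j is valid too
    return out

print(ind_condensed([3.0, float("nan"), 7.0, 1.0]))
```

This still allocates one byte per pair (roughly 2.2 GB for 66,000 samples), but it never builds a Python list of index pairs. If the matrix is only ever looked up entry by entry, valid[i] & valid[j] gives the same answer without storing anything beyond the length-N mask.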

How to create an indicator matrix from a very large dataset in Python

0 answers: