Question

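I've created a 127,545 x 1,234,620 array using np.memmap (I'm calling it a characteristic matrix). If my math is correct (127,545 * 1,234,620 * 4/8 / 1,000,000,000), my array should be 78.73 GB at 4 bits per cell. However, according to the file properties, it is 586 GB. I'm clearly missing something here. Still, I'm glad it works and that the array now exists. It took about 2 hours to create.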
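As a quick check of the arithmetic above, the sketch below compares the 4-bits-per-cell estimate with the size implied by the dtype=np.float32 used in the creation code further down. It assumes the file size is simply the number of cells times the dtype's item size (4 bytes for float32, not 4 bits), and that the file properties dialog reports binary gigabytes (GiB), which would be consistent with the 586 GB figure.

```python
import numpy as np

nrows, ncols = 127545, 1234620
n_cells = nrows * ncols

# Estimate from the question: 4 bits (half a byte) per cell, in decimal GB.
size_4bit_gb = n_cells * 4 / 8 / 1e9

# Size implied by float32: itemsize is 4 bytes per cell.
size_float32_bytes = n_cells * np.dtype(np.float32).itemsize
size_float32_gib = size_float32_bytes / 2**30

print(f"{n_cells:,} cells")
print(f"4 bits per cell:  {size_4bit_gb:.2f} GB")
print(f"float32 per cell: {size_float32_bytes / 1e9:.2f} GB "
      f"= {size_float32_gib:.2f} GiB")
```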

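Anyway, I now need to populate it, and I'm very worried about how long that will take. Basically, I have one unique integer representing each row (so 127,545 integers stored in a set()) and a set of integers representing each column (so 1,234,620 sets of integers stored as the values of a dictionary). If the row integer is in the column's set of integers, that row/column cell should be 1, otherwise 0.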

I would love some advice on how to do this as efficiently as possible.

Code to create the array:

import numpy as np
nrows, ncols = 127545, 1234620
characteristic_matrix = np.memmap(r"H:\Python\chararray.dat", dtype=np.float32, mode='w+', shape=(nrows, ncols))
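Since every cell only ever holds 0 or 1, a one-byte dtype is one option for shrinking the file. The following is a minimal sketch under stated assumptions, not the code from the question: it uses scaled-down dimensions, a temporary file with a hypothetical name (chararray_demo.dat), and np.uint8 in place of np.float32, to show that the file on disk is one byte per cell.

```python
import os
import tempfile

import numpy as np

# Scaled-down stand-ins for the real 127545 x 1234620 dimensions.
nrows, ncols = 100, 200

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this demonstration.
    path = os.path.join(tmpdir, "chararray_demo.dat")

    # uint8 stores each 0/1 cell in a single byte (float32 needs four).
    demo = np.memmap(path, dtype=np.uint8, mode="w+", shape=(nrows, ncols))
    demo[3, 7] = 1
    demo.flush()

    file_size = os.path.getsize(path)
    cell_value = int(demo[3, 7])

    # Release the mapping before the temporary directory is removed.
    del demo

print(file_size, "bytes for", nrows * ncols, "cells")
```

At the full dimensions this would come to roughly a quarter of the float32 file size, which is still large, so whether it is acceptable depends on the available disk space.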

The current code to populate it is:

for s, shingle_set in enumerate(make_shingle_sets.values()):  # the sets of values in the dictionary
    for r, row in enumerate(universal_shingle_set):  # the set() of unique integers representing the rows
        if row in shingle_set:
            characteristic_matrix[r, s] = 1
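One way the fill loop above might be sped up is to avoid testing every row against every column. The sketch below is an illustration under assumptions, not a benchmarked solution: the function name fill_characteristic_matrix is hypothetical, and it relies on the set of row integers not being modified between building the lookup and filling, so that row positions match the original loop. It builds a value-to-row-index dictionary once, then for each column visits only the integers actually present in that column's set and writes them in a single indexed assignment.

```python
import numpy as np


def fill_characteristic_matrix(matrix, universal_shingle_set, shingle_sets):
    """Set matrix[r, s] = 1 where the r-th row value is in the s-th column set.

    Hypothetical helper: `matrix` is any 2-D NumPy array (np.memmap included)
    that is already zero-filled.
    """
    # Build the value -> row index lookup once, using the same iteration
    # order the original nested loop would see.
    row_index = {value: r for r, value in enumerate(universal_shingle_set)}

    for s, shingle_set in enumerate(shingle_sets.values()):
        # Touch only the values present in this column's set.
        rows = [row_index[v] for v in shingle_set if v in row_index]
        if rows:
            matrix[rows, s] = 1


# Small in-memory usage example with made-up data.
universal = {10, 20, 30, 40}
columns = {"a": {10, 30}, "b": {40}, "c": set(), "d": {20, 99}}

fast = np.zeros((len(universal), len(columns)), dtype=np.uint8)
fill_characteristic_matrix(fast, universal, columns)

# Reference result from the straightforward nested loop.
slow = np.zeros_like(fast)
for s, shingle_set in enumerate(columns.values()):
    for r, row in enumerate(universal):
        if row in shingle_set:
            slow[r, s] = 1
```

Because the memmap in the question is created in the default row-major order, each column write touches widely separated positions in the file. Passing order='F' to np.memmap, or storing the transpose so that each shingle set fills one contiguous row, may reduce that scattered I/O, though this has not been measured here.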

Fastest way to create a numpy.memmap array and then populate it

0 answers: