Question

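I am writing a machine learning algorithm on huge and sparse data: my matrix has shape (347, 5 416 812 801) but is very sparse, with only 0.13% of the entries non-zero.

My sparse matrix is 105 000 bytes in size (< 1 MB) and is of csr type.

I am trying to separate the train/test sets by choosing a list of example indices for each. So I want to split my dataset in two using: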

training_set = matrix[train_indices]

of shape (len(train_indices), 5 416 812 801), still sparse

testing_set = matrix[test_indices]

of shape (347-len(train_indices), 5 416 812 801), also sparse

where train_indices and test_indices are two lists of int.
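For reference, the intended split can be sketched on a small matrix as follows. The toy shape (10, 50), the density, and the two index lists are made-up stand-ins for the real data, used only for illustration:

```python
from scipy import sparse

# Toy stand-in for the real data (assumed values): 10 samples, 50 features.
matrix = sparse.random(10, 50, density=0.1, format="csr", random_state=0)

# Hypothetical split of the 10 row indices into two lists of int.
train_indices = [0, 2, 3, 5, 6, 8]
test_indices = [1, 4, 7, 9]

# Row selection with a list of ints; the results stay sparse.
training_set = matrix[train_indices]
testing_set = matrix[test_indices]

print(training_set.shape, testing_set.shape)
print(training_set.format, testing_set.format)
```

At this size the indexing works as expected, which is the behaviour I want on the full matrix.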

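But training_set = matrix[train_indices] seems to fail and returns Segmentation fault (core dumped).

It might not be a memory problem, as I am running this code on a server with 64 GB of RAM.

Any clue as to what could be the cause?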

Answer 1

I think I've recreated the csr row indexing with the following:

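import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Selector of shape (len(indices), N): row i holds a single 1 in column indices[i]
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    shape = (len(indices), N)
    return sparse.csr_matrix((data, indices, indptr), shape=shape)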

Testing it on a csr matrix:

In [185]: M
Out[185]: 
<30x40 sparse matrix of type '<class 'numpy.float64'>'
    with 76 stored elements in Compressed Sparse Row format>

In [186]: indices=np.r_[0:20]

In [187]: M[indices,:]
Out[187]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

In [188]: extractor(indices, M.shape[0])*M
Out[188]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

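Like a number of other csr methods, it uses matrix multiplication to produce the final values, in this case multiplying by a sparse matrix with a 1 at the position of each selected row. The timing is actually a bit better.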

In [189]: timeit M[indices,:]
1000 loops, best of 3: 515 µs per loop
In [190]: timeit extractor(indices, M.shape[0])*M
1000 loops, best of 3: 399 µs per loop
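A self-contained way to check that the two approaches agree is sketched below. The random test matrix and the index list are arbitrary choices for illustration, not the matrix from the session above, and `@` is used as the matrix product in place of `*`:

```python
import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Same selector as above: one 1 per row, in column indices[i].
    indices = np.asarray(indices)
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    return sparse.csr_matrix((data, indices, indptr), shape=(len(indices), N))

# Arbitrary 30x40 csr test matrix (assumed density and seed).
M = sparse.random(30, 40, density=0.06, format="csr", random_state=0)
indices = [3, 0, 17, 29, 3]  # any order; repeats are allowed

by_indexing = M[indices, :]
by_product = extractor(indices, M.shape[0]) @ M

# Compare densely; fine at this size, not something to do on the real data.
print(np.allclose(by_indexing.toarray(), by_product.toarray()))
```

The selector has as many columns as M has rows, which is why the second argument is M.shape[0].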

In your case the extractor matrix has shape (len(train_indices), 347), with only len(train_indices) values, so it is not big.

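But if matrix is so large (or at least its second dimension is so big) that it triggers some error in the matrix multiplication routines, that could give rise to a segmentation fault without python/numpy trapping it.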

Does matrix.sum(axis=1) work? It also uses a matrix multiplication, though with a dense matrix of 1s. Or sparse.eye(347)*M, a matrix multiplication of similar size?
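These checks could be run roughly as sketched here. The small random matrix is only a stand-in so the snippet runs anywhere; on the real data, M would be your own matrix. The final dtype check is my own extra guess rather than something established above: it only reports whether the column count exceeds what a 32-bit index can hold.

```python
import numpy as np
from scipy import sparse

# Stand-in with 347 rows (assumed width and density); substitute the real matrix.
M = sparse.random(347, 1000, density=0.0013, format="csr", random_state=0)

# Check 1: row sums.
row_sums = M.sum(axis=1)
print(row_sums.shape)

# Check 2: product with a sparse identity of matching size.
same = sparse.eye(M.shape[0], format="csr") @ M
print(same.shape, same.nnz == M.nnz)

# Extra check (assumption, not from the answer): index dtypes and column count.
print(M.indices.dtype, M.indptr.dtype)
print(M.shape[1] > np.iinfo(np.int32).max)
```

If either of the first two checks also crashes on the full matrix, that would point at the multiplication routines rather than at the indexing itself.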

Sparse matrix slicing using list of int
