Question

I am building a sparse vector with scipy.sparse.csr_matrix like this:

csr_matrix((values, (np.zeros(len(indices)), indices)), shape = (1, max_index))

This works for most of my data, but occasionally I get ValueError: could not convert integer scalar.

The following reproduces the problem:

In [145]: inds

Out[145]:
array([ 827969148,  996833913, 1968345558,  898183169, 1811744124,
        2101454109,  133039182,  898183170,  919293479,  133039089])

In [146]: vals

Out[146]:
array([ 1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.])

In [147]: max_index

Out[147]:
2337713000

In [143]: csr_matrix((vals, (np.zeros(10), inds)), shape = (1, max_index+1))
...

    996         fn = _sparsetools.csr_sum_duplicates
    997         M,N = self._swap(self.shape)
--> 998         fn(M, N, self.indptr, self.indices, self.data)
    999 
    1000         self.prune()  # nnz may have changed

ValueError: could not convert integer scalar

inds is an np.int64 array and vals is an np.float64 array.

The relevant part of the scipy sum_duplicates code is here.

Note that this works:

In [235]: csr_matrix(([1,1], ([0,0], [1,2])), shape = (1, 2**34))
Out[235]:

<1x17179869184 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>

So the problem is not simply that one of the dimensions is > 2^31.
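To make the sizes concrete, here is a small check of the values above against the signed 32-bit limit. It is only a sketch using NumPy; the names INT32_MAX, shape_exceeds_int32 and indices_fit_int32 are introduced here for illustration and do not come from scipy.

```python
import numpy as np

# Values copied from the session above.
inds = np.array([827969148, 996833913, 1968345558, 898183169, 1811744124,
                 2101454109, 133039182, 898183170, 919293479, 133039089],
                dtype=np.int64)
max_index = 2337713000

# Largest value a signed 32-bit integer can hold (illustrative name).
INT32_MAX = int(np.iinfo(np.int32).max)

# Does the requested number of columns exceed the int32 range?
shape_exceeds_int32 = (max_index + 1) > INT32_MAX
# Do all of the column indices themselves still fit in int32?
indices_fit_int32 = int(inds.max()) <= INT32_MAX

print(shape_exceeds_int32, indices_fit_int32)
```

In this data the requested column count is above the int32 limit while every individual index is below it.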

Why would these particular values cause a problem?

Answer 1

Could it be that max_index > 2**31? Try this, just to make sure:

csr_matrix((vals, (np.zeros(10), inds/2)), shape = (1, max_index/2))
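On Python 3 the `/` operator produces floats, so a version of this experiment for a current setup would use integer division instead. The following is a minimal sketch assuming a recent NumPy and SciPy; half_inds and half_max are names made up for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

inds = np.array([827969148, 996833913, 1968345558, 898183169, 1811744124,
                 2101454109, 133039182, 898183170, 919293479, 133039089],
                dtype=np.int64)
vals = np.array([1., 1., 1., 1., 1., 2., 1., 1., 1., 1.])
max_index = 2337713000

# Integer division keeps the indices and the shape as integers.
half_inds = inds // 2
half_max = max_index // 2  # now below 2**31

# Integer row indices (all zero) for a single-row matrix.
rows = np.zeros(len(half_inds), dtype=np.int64)

m = csr_matrix((vals, (rows, half_inds)), shape=(1, half_max))
print(m.shape, m.nnz)
```

If the halved version builds without error while the original does not, that points at the size of the shape rather than at the data values.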

Answer 2

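The max index you are giving is less than the maximum index of the rows you are supplying.

This works fine for me: sparse.csr_matrix((vals, (np.zeros(10), inds)), shape = (1, np.max(inds)+1))

However, calling .todense() on it raises a memory error because of the sheer size of the matrix.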
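A sketch of this suggestion, assuming a recent NumPy and SciPy. The variable dense_bytes is introduced here only to estimate how much memory a dense copy would need; it is not part of scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix

inds = np.array([827969148, 996833913, 1968345558, 898183169, 1811744124,
                 2101454109, 133039182, 898183170, 919293479, 133039089],
                dtype=np.int64)
vals = np.array([1., 1., 1., 1., 1., 2., 1., 1., 1., 1.])

# Size the single row so that the largest index is the last valid column.
n_cols = int(np.max(inds)) + 1
rows = np.zeros(len(inds), dtype=np.int64)
m = csr_matrix((vals, (rows, inds)), shape=(1, n_cols))

# Rough size of a dense float64 copy: one 8-byte entry per column.
dense_bytes = m.shape[0] * m.shape[1] * m.dtype.itemsize
print(m.nnz, dense_bytes / 1e9, "GB")
```

The sparse matrix stores only the ten non-zero entries, whereas a dense copy would need roughly 16.8 GB, which is why .todense() fails on most machines.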

Answer 3

Uncommenting the sum_duplicates function will lead to other errors. But this fix, from "strange error when creating csr_matrix", also solves your problem. You can extend the version check to newer versions of scipy.

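import scipy
import scipy.sparse

# Patch only the releases listed here; extend the tuple for other versions.
if scipy.__version__ in ("0.14.0", "0.14.1", "0.15.1"):
    _get_index_dtype = scipy.sparse.sputils.get_index_dtype

    def _my_get_index_dtype(*a, **kw):
        # Remove the check_contents keyword before delegating to the original.
        kw.pop('check_contents', None)
        return _get_index_dtype(*a, **kw)

    # Make the sparse modules use the wrapper instead of the original function.
    scipy.sparse.compressed.get_index_dtype = _my_get_index_dtype
    scipy.sparse.csr.get_index_dtype = _my_get_index_dtype
    scipy.sparse.bsr.get_index_dtype = _my_get_index_dtype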
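One way to extend the version check beyond a fixed list of strings is to compare parsed version tuples against a range. This is only a sketch: version_tuple, needs_patch, FIRST_AFFECTED and LAST_AFFECTED are hypothetical names, and the range bounds are an assumption taken from the three versions listed above, not a statement about which scipy releases are actually affected.

```python
import re

# Assumed range of affected releases (hypothetical; adjust for your setup).
FIRST_AFFECTED = (0, 14, 0)
LAST_AFFECTED = (0, 15, 1)

def version_tuple(version):
    """Turn '0.15.1' or '1.0.0rc1' into a 3-tuple of ints such as (0, 15, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # leading digits only, ignores 'rc1' etc.
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '0.14'
        parts.append(0)
    return tuple(parts)

def needs_patch(version):
    """Return True if the version falls inside the assumed affected range."""
    return FIRST_AFFECTED <= version_tuple(version) <= LAST_AFFECTED

print(needs_patch("0.14.1"), needs_patch("1.11.4"))
```

The condition `if needs_patch(scipy.__version__):` could then stand in for the membership test on the tuple of version strings.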

Mysterious scipy "could not convert integer scalar" error
