Splitting a sparse matrix by rows

Asked: 2019-06-21 15:50:01

Tags: python numpy sparse-matrix

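I have a scipy.sparse.csr.csr_matrix of shape (8723, 1741277).

How can I efficiently split it by rows into n chunks?

Ideally the chunks should have roughly the same number of rows.

I say "roughly" because whether they can be exactly equal depends on whether (number of rows) / (number of chunks) leaves a remainder.

I believe this is easy to do for arrays with numpy.split, but it does not seem to work for sparse matrices.

Specifically, if I choose a number of chunks n that does not divide 8723 evenly, I get this error: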

ValueError: array split does not result in an equal division

If I choose a number of chunks n that does divide 8723 evenly, I get this error instead:

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

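The reason I want to split the sparse matrix into chunks is that I want to convert it to a (dense) array, but the matrix as a whole is too large to convert in one step.

1 answer:

Answer 0 (score: 0)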

In [6]: from scipy import sparse                                                                     
In [7]: M = sparse.random(12,3,.1,'csr')                                                             
In [8]: np.split?                                                                                    
In [9]: np.split(M,3)                                                                                
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57 

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
    687         else:
--> 688             raise AttributeError(attr + " not found")
    689 

AttributeError: swapaxes not found

During handling of the above exception, another exception occurred:

AxisError                                 Traceback (most recent call last)
<ipython-input-9-11a4dcdd89af> in <module>
----> 1 np.split(M,3)

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
    848             raise ValueError(
    849                 'array split does not result in an equal division')
--> 850     res = array_split(ary, indices_or_sections, axis)
    851     return res
    852 

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in array_split(ary, indices_or_sections, axis)
    760 
    761     sub_arys = []
--> 762     sary = _nx.swapaxes(ary, axis, 0)
    763     for i in range(Nsections):
    764         st = div_points[i]

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in swapaxes(a, axis1, axis2)
    583 
    584     """
--> 585     return _wrapfunc(a, 'swapaxes', axis1, axis2)
    586 
    587 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     64     # a downstream library like 'pandas'.
     65     except (AttributeError, TypeError):
---> 66         return _wrapit(obj, method, *args, **kwds)
     67 
     68 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     44     except AttributeError:
     45         wrap = None
---> 46     result = getattr(asarray(obj), method)(*args, **kwds)
     47     if wrap:
     48         if not isinstance(result, mu.ndarray):

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

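If you apply np.array to M, you get a 0-d object array: just a naive wrapper around the sparse object. That is why np.split fails with the AxisError above, since a 0-d array has no axis 0 to split along.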

In [10]: np.array(M)                                                                                 
Out[10]: 
array(<12x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>, dtype=object)
In [11]: _.shape                                                                                     
Out[11]: ()
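
To make the difference concrete, here is a minimal sketch contrasting np.array(M) with M.toarray(). The matrix values and the names dense_src, wrapped and dense are made up for illustration; they only mimic the 12x3 example from the session above.

```python
import numpy as np
from scipy import sparse

# Build a small 12x3 CSR matrix with three stored elements (illustrative values).
dense_src = np.zeros((12, 3))
dense_src[0, 1] = 0.6
dense_src[8, 1] = 0.9
dense_src[10, 2] = 0.02
M = sparse.csr_matrix(dense_src)

# np.array does not understand the sparse format: it wraps the whole
# matrix object in a 0-d array of dtype object.
wrapped = np.array(M)

# toarray() is the sparse matrix's own conversion to a real 2-D ndarray.
dense = M.toarray()

print(wrapped.shape, wrapped.dtype)
print(dense.shape, dense.dtype)
```

Any NumPy function that expects an array-like argument goes through the first kind of conversion, so it sees a 0-d object rather than a 12x3 array.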

Splitting the dense equivalent works as expected:

In [12]: np.split(M.A,3)                                                                             
Out[12]: 
[array([[0.        , 0.61858517, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]), array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]), array([[0.        , 0.89573059, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.02334738],
        [0.        , 0.        , 0.        ]])]

And a direct sparse split, using row slices:

In [13]: [M[i:j,:] for i,j in zip([0,4,8],[4,8,12])]                                                 
Out[13]: 
[<4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>]
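
The slicing approach above can be generalized to any number of chunks, including ones that do not divide the row count evenly. The following is a minimal sketch, not a SciPy function: split_rows and its parameter names are hypothetical, and the test matrix is random data chosen only for the demonstration.

```python
import numpy as np
from scipy import sparse


def split_rows(mat, n_chunks):
    """Split a sparse matrix into n_chunks row blocks of near-equal height.

    Hypothetical helper: block heights differ by at most one row.
    """
    n_rows = mat.shape[0]
    # Integer boundaries 0 = b[0] <= b[1] <= ... <= b[n_chunks] = n_rows.
    bounds = [i * n_rows // n_chunks for i in range(n_chunks + 1)]
    return [mat[start:stop, :] for start, stop in zip(bounds[:-1], bounds[1:])]


# Small demonstration matrix: 10 rows cannot be divided evenly into 3 chunks.
rng = np.random.default_rng(0)
values = rng.random((10, 3))
values[values < 0.7] = 0.0
M = sparse.csr_matrix(values)

chunks = split_rows(M, 3)
heights = [c.shape[0] for c in chunks]
print(heights)

# Convert one chunk at a time, so only one dense block is alive at once.
for chunk in chunks:
    block = chunk.toarray()
    # ... process block here ...
```

For the matrix in the question, and assuming float64 data, a single dense row takes 1741277 * 8 bytes, roughly 13.9 MB, so the number of chunks should be picked so that one dense block fits comfortably in memory.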

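On sparse matrices this kind of slicing is less efficient than on dense arrays. A dense slice is a view; a sparse slice has to be a copy. The only exception is the lil format, which has a getrowview method. Although there are many functions for constructing sparse matrices from pieces, there has not been much need for functions that break them apart.

sklearn may have some splitting functionality. It has a number of sparse utility functions that serve its own use of sparse matrices.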
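
The view-versus-copy distinction can be checked with np.shares_memory, and the row blocks can be reassembled with sparse.vstack, one of the construction functions mentioned above. This is a small sketch with made-up data; the variable names are for illustration only.

```python
import numpy as np
from scipy import sparse

dense = np.arange(12, dtype=float).reshape(4, 3)
M = sparse.csr_matrix(dense)

# A dense row slice is a view: it shares memory with the parent array.
dense_slice = dense[0:2, :]
dense_shares = np.shares_memory(dense_slice, dense)

# A sparse row slice builds new data/indices/indptr arrays for the result.
sparse_slice = M[0:2, :]
sparse_shares = np.shares_memory(sparse_slice.data, M.data)

print(dense_shares, sparse_shares)

# Row blocks can be stacked back into a single CSR matrix.
rebuilt = sparse.vstack([M[0:2, :], M[2:4, :]], format='csr')
print(rebuilt.shape)
```

Because every sparse slice allocates new arrays, splitting a large matrix into many blocks costs time and memory proportional to the number of stored elements, which is usually still small compared with the dense conversion that follows.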