Removing NaN rows from a scipy sparse matrix

Time: 2016-09-07 20:20:04

Tags: python numpy scipy sparse-matrix networkx

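I am given a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes were removed by another sanitizing function, some rows of the matrix contain NaNs. I want to find these rows and remove them together with their respective labels. This is the function I wrote: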

def sanitize_nan_rows(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)

    for i, row in enumerate(adj):
        # check if row all nans
        if np.all(np.isnan(row)):
            # print("Removing nan row label in %s" % i)
            # remove row index from labels
            del labels[i]
    # remove all nan rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]
    # return sanitized adj and labels_clean
    return adj, labels

labels is a plain Python list and adj has type <class 'scipy.sparse.lil.lil_matrix'> (containing elements of type <class 'numpy.float64'>). Both are the result of:

adj, labels = nx.attr_sparse_matrix(infected, normalized=True)

When I run it, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
      6     for i, row in enumerate(adj):
      7         # check if row all nans
----> 8         if np.all(np.isnan(row)):
      9             print("Removing nan row label in %s" % i)
     10             # remove row index from labels

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

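So I assumed that SciPy NaNs are different from numpy NaNs. I then tried to convert the sparse matrix to a numpy array (at the risk of flooding my RAM, because the matrix has about 40k rows and columns). The error stays the same when I run it. It seems that the np.array() call just wraps the sparse matrix instead of converting it, because type(row) inside the for loop still outputs <class 'scipy.sparse.lil.lil_matrix'>.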
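The wrapping behaviour described above can be reproduced on a small example. This is a minimal sketch, assuming a made-up 2x2 matrix (M, wrapped and dense are illustrative names, not from the original post): np.array() on a sparse matrix yields an object array that holds the matrix itself, which np.isnan cannot process, while .toarray() performs a real dense conversion.

```python
import numpy as np
from scipy import sparse

# Small illustrative matrix (hypothetical data, not the asker's adjacency matrix).
M = sparse.lil_matrix(np.array([[1.0, np.nan], [0.0, 2.0]]))

# np.array() does not densify a sparse matrix; it wraps it in an object array.
wrapped = np.array(M, ndmin=2)
print(wrapped.dtype, wrapped.shape)

# np.isnan has no implementation for object arrays, hence the TypeError.
raised = False
try:
    np.isnan(wrapped)
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# .toarray() is the actual dense conversion (memory permitting).
dense = M.toarray()
print(dense.dtype, dense.shape)
```

So the NaN values themselves are ordinary floats; the problem is the container that np.isnan receives.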

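So my question is how to solve this problem and whether there is a better way to get the job done. I am fairly new to numpy and scipy (as used in networkx), so I would appreciate an explanation. Thanks!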

Edit: after changing the conversion to what hpaulj suggested, I get a MemoryError:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
      1 def sanitize_nans(adj, labels):
----> 2     adj = adj.toarray()
      3 
      4     for i, row in enumerate(adj):
      5         # check if row all nans

/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
    348     def toarray(self, order=None, out=None):
    349         """See the docstring for `spmatrix.toarray`."""
--> 350         d = self._process_toarray_args(order, out)
    351         for i, row in enumerate(self.rows):
    352             for pos, j in enumerate(row):

/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
    697             return out
    698         else:
--> 699             return np.zeros(self.shape, dtype=self.dtype, order=order)
    700 
    701 

MemoryError: 

Apparently I have to stick with the sparse matrix to save RAM.
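A rough estimate shows why the dense conversion fails. This sketch assumes exactly 40,000 rows and columns (the question only says "about 40k") and float64 elements, as stated for adj:

```python
import numpy as np

n = 40_000  # assumed size; the question gives "about 40k rows and columns"
itemsize = np.dtype(np.float64).itemsize  # bytes per float64 element

# np.zeros(self.shape, dtype=float64) has to allocate n * n elements.
dense_bytes = n * n * itemsize
print(f"{dense_bytes / 1e9:.1f} GB")  # 12.8 GB
```

That is the allocation np.zeros attempts in the traceback, before a single value is copied.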

1 Answer:

Answer 0 (score: 1)

If I make a sample array:

In [328]: A=np.array([[1,0,0,np.nan],[0,np.nan,np.nan,0],[1,0,1,0]])
In [329]: A
Out[329]: 
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])

In [331]: M=sparse.lil_matrix(A)

This lil sparse matrix is stored in 2 arrays:

In [332]: M.data
Out[332]: array([[1.0, nan], [nan, nan], [1.0, 1.0]], dtype=object)
In [333]: M.rows
Out[333]: array([[0, 3], [1, 2], [0, 2]], dtype=object)

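With your function, no rows are removed, even though the middle row of the sparse matrix stores only nan values. In the dense array that row also contains zeros, so np.all(np.isnan(...)) is False for it: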

In [334]: A[~np.all(np.isnan(A), axis=1)]
Out[334]: 
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])

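I could test the rows of M for nan and identify the ones that contain only nan (apart from 0s). But it is probably easier to collect the ones we want to keep.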

In [346]: ll = [i for i,row in enumerate(M.data) if not np.all(np.isnan(row))]
In [347]: ll
Out[347]: [0, 2]
In [348]: M[ll,:]
Out[348]: 
<2x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in LInked List format>
In [349]: _.A
Out[349]: 
array([[  1.,   0.,   0.,  nan],
       [  1.,   0.,   1.,   0.]])

Each row of M.data is a list, but np.isnan(row) converts it to an array and performs the test elementwise.
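Putting the pieces together, the keep-list idea can be wrapped into a function that also filters the labels. This is only a sketch under some assumptions: the input is a scipy sparse matrix, a row counts as a "NaN row" when all of its stored entries are NaN, and rows with no stored entries (all zeros) are kept. The last point needs an explicit check, because np.all(np.isnan([])) is True for an empty list. Building a new label list also avoids the index shift that del labels[i] causes when several rows are removed in one loop. The sample matrix and labels are made up for illustration.

```python
import numpy as np
from scipy import sparse


def sanitize_nan_rows(adj, labels):
    """Drop rows whose stored entries are all NaN, together with their labels."""
    lil = adj.tolil()  # assumes adj is a scipy sparse matrix
    keep = [
        i for i, row in enumerate(lil.data)
        # Keep empty rows explicitly: np.all(np.isnan([])) would be True.
        if len(row) == 0 or not np.all(np.isnan(row))
    ]
    # Build a new list instead of deleting in place, so indices never shift.
    clean_labels = [labels[i] for i in keep]
    return lil[keep, :], clean_labels


# Hypothetical sample data: row 1 stores only NaN, row 2 is all zeros.
A = np.array([[1, 0, 0, np.nan],
              [0, np.nan, np.nan, 0],
              [0, 0, 0, 0],
              [1, 0, 1, 0]])
adj = sparse.lil_matrix(A)
labels = ["a", "b", "c", "d"]

clean_adj, clean_labels = sanitize_nan_rows(adj, labels)
print(clean_adj.shape, clean_labels)
```

The matrix stays sparse throughout, so no dense 40k x 40k array is ever allocated. Whether all-zero rows should be kept or dropped depends on the application; the check on len(row) is the place to change that.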