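I am given a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes have been removed by another sanitization function, some rows of the matrix contain NaNs. I want to find these rows and remove them along with their respective labels. Here is the function I wrote: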
def sanitize_nan_rows(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)
    for i, row in enumerate(adj):
        # check if row all nans
        if np.all(np.isnan(row)):
            # print("Removing nan row label in %s" % i)
            # remove row index from labels
            del labels[i]
    # remove all nan rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]
    # return sanitized adj and labels_clean
    return adj, labels
labels is a simple Python list, and adj is of type <class 'scipy.sparse.lil.lil_matrix'> (containing elements of type <class 'numpy.float64'>). Both are the result of:
adj, labels = nx.attr_sparse_matrix(infected, normalized=True)
When I run the function, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)
<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
6 for i, row in enumerate(adj):
7 # check if row all nans
----> 8 if np.all(np.isnan(row)):
9 print("Removing nan row label in %s" % i)
10 # remove row index from labels
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
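So I thought SciPy NaNs were different from numpy NaNs. After that, I tried converting the sparse matrix into a numpy array (at the risk of flooding my RAM, because the matrix has about 40k rows and columns). When I run it, the error stays the same. It seems that the np.array() call just wrapped the sparse matrix and did not convert it, because type(row) inside the for loop still outputs <class 'scipy.sparse.lil.lil_matrix'>.
So my question is how to resolve this issue and whether there is a better way to get the job done. I am fairly new to numpy and scipy (as used in networkx), so I would appreciate an explanation. Thanks!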
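A small sketch of what is probably going on here, assuming a recent NumPy and SciPy: np.array() does not know how to unpack a sparse matrix, so it stores the matrix as a single Python object in an array of dtype object, and np.isnan has no implementation for that dtype. The NaN values themselves are ordinary floating-point NaNs in both libraries. The 3x3 identity matrix below is made up for illustration.

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for the real adjacency matrix (illustrative data only).
M = sparse.lil_matrix(np.eye(3))

# np.array() wraps the sparse matrix as one opaque Python object.
wrapped = np.array(M)

# toarray() performs the real conversion and allocates a full dense array.
dense = M.toarray()

print(wrapped.dtype)  # object dtype: nothing was converted
print(dense.dtype, dense.shape)

# np.isnan cannot operate on an object array, hence the TypeError above.
try:
    np.isnan(wrapped)
    isnan_failed = False
except TypeError:
    isnan_failed = True
print(isnan_failed)
```

The explicit conversion works for a small matrix like this one, but it allocates every entry, zeros included.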
EDIT: After changing the conversion to what hpaulj proposed, I get a MemoryError:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)
<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
1 def sanitize_nans(adj, labels):
----> 2 adj = adj.toarray()
3
4 for i, row in enumerate(adj):
5 # check if row all nans
/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
348 def toarray(self, order=None, out=None):
349 """See the docstring for `spmatrix.toarray`."""
--> 350 d = self._process_toarray_args(order, out)
351 for i, row in enumerate(self.rows):
352 for pos, j in enumerate(row):
/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
697 return out
698 else:
--> 699 return np.zeros(self.shape, dtype=self.dtype, order=order)
700
701
MemoryError:
Obviously, I have to stick with sparse matrices to save RAM.
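A rough back-of-the-envelope estimate shows why the dense conversion fails. It assumes exactly 40,000 rows and columns (the question only says "about 40k") and float64 entries:

```python
import numpy as np

n = 40_000                                  # assumed: "about 40k" rows and columns
itemsize = np.dtype(np.float64).itemsize    # 8 bytes per float64 entry

# Memory np.zeros((n, n)) would need for the dense matrix.
bytes_needed = n * n * itemsize
print(bytes_needed / 1e9, "GB")
```

That is roughly 12.8 GB for the dense array alone, before any temporary copies made by np.isnan or boolean indexing.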
Answer 0 (score: 1)
If I make a sample array:
In [328]: A=np.array([[1,0,0,np.nan],[0,np.nan,np.nan,0],[1,0,1,0]])
In [329]: A
Out[329]:
array([[ 1., 0., 0., nan],
[ 0., nan, nan, 0.],
[ 1., 0., 1., 0.]])
In [331]: M=sparse.lil_matrix(A)
This lil sparse matrix is stored in 2 arrays:
In [332]: M.data
Out[332]: array([[1.0, nan], [nan, nan], [1.0, 1.0]], dtype=object)
In [333]: M.rows
Out[333]: array([[0, 3], [1, 2], [0, 2]], dtype=object)
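With your function's test, no rows are removed, even though the values stored for the middle row of the sparse matrix are all nan. In the dense array that row also contains zeros, so it is not all nan: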
In [334]: A[~np.all(np.isnan(A), axis=1)]
Out[334]:
array([[ 1., 0., 0., nan],
[ 0., nan, nan, 0.],
[ 1., 0., 1., 0.]])
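I could test the rows of M for nan and identify the ones that contain only nan (apart from zeros). But it is probably easier to collect the ones we want to keep: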
In [346]: ll = [i for i,row in enumerate(M.data) if not np.all(np.isnan(row))]
In [347]: ll
Out[347]: [0, 2]
In [348]: M[ll,:]
Out[348]:
<2x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in LInked List format>
In [349]: _.A
Out[349]:
array([[ 1., 0., 0., nan],
[ 1., 0., 1., 0.]])
Each row of M.data is a list, but np.isnan(row) converts it to an array and performs its array test.
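Putting the pieces together, one possible sparse-only version of the function might look like the sketch below. It is not a definitive implementation and rests on a few assumptions: adj is a SciPy sparse matrix, labels has one entry per row, and rows with no stored entries (all zeros) should be kept. That last case needs an explicit guard, because np.all() of an empty array is True. The sketch also builds a new label list instead of calling del labels[i] inside the loop, since deleting while enumerating shifts the indices of the remaining labels. The sample matrix and label names are made up for illustration.

```python
import numpy as np
from scipy import sparse


def sanitize_nan_rows(adj, labels):
    """Drop rows whose stored entries are all NaN, along with their labels."""
    adj = adj.tolil()  # LIL keeps one Python list of stored values per row

    # Keep a row unless it has stored entries and every one of them is NaN.
    # Rows without stored entries (all zeros) are kept by assumption.
    keep = [
        i for i, row in enumerate(adj.data)
        if len(row) == 0 or not np.all(np.isnan(row))
    ]

    # Select rows sparsely and build a fresh label list from the same indices.
    return adj[keep, :], [labels[i] for i in keep]


# Illustrative data: row 1 stores only NaN, row 3 stores nothing at all.
A = np.array([[1, 0, 0, np.nan],
              [0, np.nan, np.nan, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 0]])
M = sparse.lil_matrix(A)

clean, kept_labels = sanitize_nan_rows(M, ["a", "b", "c", "d"])
print(clean.shape, kept_labels)
```

Only the rows are filtered here, as in the question; the result stays sparse and the dense array is never allocated.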