I have a sparse matrix of shape (346679, 86).
<346679x86 sparse matrix of type '<type 'numpy.int8'>' with 470018 stored elements in COOrdinate format>
To train and evaluate my model, I need to split it into a training set and a test set.
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(sparse_matrix, test_size=0.2, random_state=11)
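After the split, I found that x_train and x_test had changed: some entire rows had become all zeros. I used the following code to count the all-zero rows in the original matrix: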
def get_zero_rows(sparse_matrix):
    sparse_matrix = sparse_matrix.tocsr()
    count = 0
    for index, each in enumerate(sparse_matrix):
        # getnnz() counts stored entries, so a row counts only if nothing is stored in it
        if each.getnnz() < 1:
            count += 1
    return count
For the original matrix it returns 0, but for the split matrices it returns a nonzero value. Why does this happen?
Answer 0 (score: 1)
Building on the example in the train_test_split documentation:
In [895]: X, y = sparse.random(50,10,.2,'csr'), range(50)
In [896]: X_train, X_test, y_train, y_test = train_test_split(
...: ... X, y, test_size=0.33, random_state=42)
...:
In [897]: X
Out[897]:
<50x10 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
In [898]: X_train
Out[898]:
<33x10 sparse matrix of type '<class 'numpy.float64'>'
with 68 stored elements in Compressed Sparse Row format>
In [899]: X_test
Out[899]:
<17x10 sparse matrix of type '<class 'numpy.float64'>'
with 32 stored elements in Compressed Sparse Row format>
The total number of nonzeros has not changed (68 + 32 = 100).
In [900]: np.count_nonzero(X.sum(1)==0)
Out[900]: 4
In [901]: np.count_nonzero(X_test.sum(1)==0)
Out[901]: 2
In [902]: np.count_nonzero(X_train.sum(1)==0)
Out[902]: 2
The number of zero-sum rows is also preserved (2 + 2 = 4).
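One caveat with the row-sum check: a row whose values cancel (for example 3 and -3) also sums to zero. The following is a minimal sketch of a stricter count, assuming a hypothetical helper name count_zero_rows that is not part of scipy or sklearn. It works on a copy, drops explicitly stored zeros, and then counts rows with no stored entries left.

```python
import numpy as np
from scipy import sparse


def count_zero_rows(matrix):
    """Count rows whose values are all zero, ignoring explicitly stored zeros.

    Hypothetical helper for illustration; not a scipy or sklearn function.
    """
    csr = sparse.csr_matrix(matrix, copy=True)  # copy so the caller's matrix is untouched
    csr.eliminate_zeros()                       # drop stored entries whose value is 0
    stored_per_row = csr.getnnz(axis=1)         # stored entries remaining in each row
    return int(np.count_nonzero(stored_per_row == 0))


# Row 1 is empty; row 2 sums to zero but is not an all-zero row.
dense = np.array([[1, 0, 2],
                  [0, 0, 0],
                  [3, 0, -3]])
demo = sparse.csr_matrix(dense)

zero_rows = count_zero_rows(demo)                        # counts only row 1
zero_sum_rows = int(np.count_nonzero(demo.sum(1) == 0))  # counts rows 1 and 2
```

For nonnegative data such as counts, the two checks agree, so the simpler row-sum test is fine there.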
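When I tried it with
X = (sparse.random(50,10,.2,'csr')*10).astype('int8')
the zero-row count stayed consistent, but I got fewer nnz elements. There might be a problem with sparse math on int8; the standard int or float dtypes may be safer.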
sparse does row indexing with matrix multiplication (using an extractor matrix), and I believe that code is compiled for 32/64-bit dtypes.
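To illustrate what an extractor matrix is, here is a rough sketch that selects rows by multiplying with a matrix holding a single 1 per output row. It only shows the idea; whether a given scipy version actually indexes this way internally is an assumption, and the names M, idx, and extractor are chosen for the example.

```python
import numpy as np
from scipy import sparse

# Small deterministic matrix with a mix of zero and nonzero values.
M = sparse.csr_matrix(np.arange(24).reshape(6, 4) % 3)
idx = np.array([4, 0, 2])  # rows to select, in this order

# Extractor: output row i has a single 1 in column idx[i].
ones = np.ones(len(idx), dtype=M.dtype)
extractor = sparse.csr_matrix(
    (ones, (np.arange(len(idx)), idx)),
    shape=(len(idx), M.shape[0]),
)

selected = extractor @ M   # same values as M[idx, :]
direct = M[idx, :]
```

Both routes return the same values; they may differ in whether explicitly stored zeros survive, which is the effect discussed in this answer.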
The 'problem' I saw was an artifact of how I constructed the integer sparse matrix: I had not properly eliminated zeros.
In [20]: from scipy import sparse
In [21]: M = sparse.random(100,10,.2,'csr')
In [22]: M
Out[22]:
<100x10 sparse matrix of type '<class 'numpy.float64'>'
with 200 stored elements in Compressed Sparse Row format>
In [23]: idx=np.arange(100)
In [24]: M[idx,:]
Out[24]:
<100x10 sparse matrix of type '<class 'numpy.float64'>'
with 200 stored elements in Compressed Sparse Row format>
Making a random integer matrix by scaling the floats:
In [25]: M1 = (M*10).astype(int)
In [26]: M1
Out[26]:
<100x10 sparse matrix of type '<class 'numpy.int64'>'
with 200 stored elements in Compressed Sparse Row format>
Indexing reduces the number of stored elements:
In [27]: M1[idx,:]
Out[27]:
<100x10 sparse matrix of type '<class 'numpy.int64'>'
with 183 stored elements in Compressed Sparse Row format>
But that is the same number that count_nonzero finds, and it is also what I get if I apply eliminate_zeros:
In [29]: M1.count_nonzero()
Out[29]: 183
In [30]: M1.eliminate_zeros()
In [31]: M1
Out[31]:
<100x10 sparse matrix of type '<class 'numpy.int64'>'
with 183 stored elements in Compressed Sparse Row format>
In [32]: M1[idx,:]
Out[32]:
<100x10 sparse matrix of type '<class 'numpy.int64'>'
with 183 stored elements in Compressed Sparse Row format>
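With this scaling construction, float values such as 0.04 become 0, but they are not removed from the sparsity structure until we do so explicitly. That would also explain the question: getnnz counts stored entries, so a row that holds only explicit zeros is not reported as a zero row until those entries are dropped.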
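The difference between stored entries and true nonzeros can be reproduced on a tiny matrix. This is a minimal sketch with made-up data: it builds a CSR matrix directly from (data, indices, indptr) so that one stored value is 0, then compares the counts before and after eliminate_zeros.

```python
import numpy as np
from scipy import sparse

# 3x3 matrix: row 0 stores an explicit 0, row 1 stores a 5, row 2 stores nothing.
data = np.array([0, 5], dtype=np.int8)
indices = np.array([0, 2])      # column index of each stored value
indptr = np.array([0, 1, 2, 2])  # row i owns data[indptr[i]:indptr[i + 1]]
M = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

stored_before = M.nnz              # stored entries, including the explicit 0
true_nonzero = M.count_nonzero()   # only values that are != 0
empty_rows_before = int(np.count_nonzero(M.getnnz(axis=1) == 0))

M.eliminate_zeros()                # in place: drops the stored 0
stored_after = M.nnz
empty_rows_after = int(np.count_nonzero(M.getnnz(axis=1) == 0))
```

Row 0 looks non-empty to getnnz before the cleanup and empty afterwards, which matches the jump in the zero-row count seen in the question. Calling eliminate_zeros on the original matrix before splitting should make the counts comparable.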