Malformed matrix when using hstack?

Asked: 2017-03-28 16:41:18

Tags: python python-3.x numpy scipy

I have the following matrices:

>>> X1
shape: (2399, 39999)
type: scipy.sparse.csr.csr_matrix

>>> X2
shape: (2399, 333534)
type: scipy.sparse.csr.csr_matrix

>>> X3.reshape(-1,1)
shape: (2399, 1)
type: <class 'numpy.ndarray'>

How can I concatenate X1, X2, and X3 side by side to produce a new matrix of shape (2399, 373534)? I know this can be done with scipy's hstack or vstack. However, when I try:

X_combined = sparse.hstack([X1,X2,X3.T])

I don't get a well-formed final matrix; I get this error instead:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

So how can I correctly concatenate them into a single matrix?

Update

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df=5)
X1 = count_vect.fit_transform(X)

from sklearn.feature_extraction.text import TfidfVectorizer
tdidf_vect = TfidfVectorizer()
X2 = tdidf_vect.fit_transform(X)

from hdbscan import HDBSCAN
clusterer = HDBSCAN().fit(X1)
X3 = clusterer.labels_
print(X3.shape)
print(type(X3))

Then:

In:

import scipy as sparse

X_combined = sparse.hstack([X1,X2,X3.reshape(-1,1)])

Out:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-14baa47e0993> in <module>()
      5 
      6 
----> 7 X_combined = sparse.hstack([X1,X2,X3.reshape(-1,1)])

/usr/local/lib/python3.5/site-packages/numpy/core/shape_base.py in hstack(tup)
    284     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    285     if arrs[0].ndim == 1:
--> 286         return _nx.concatenate(arrs, 0)
    287     else:
    288         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

2 answers:

Answer 0 (score: 2)

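Why X3.T? X3 is 1-D, so transposing it changes nothing. X3.reshape(-1,1) has a shape compatible with the others:

[(2399, 39999), (2399, 333534), (2399, 1)]

so

sparse.hstack([X1,X2,X3.reshape(-1,1)])

should work.

Using sparse.hstack is correct here, but the same rules about matching dimensions apply whether the inputs are sparse or dense.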
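To illustrate the point about X3.T, here is a minimal sketch using a small made-up array named labels as a stand-in for clusterer.labels_ (the name and values are assumptions, not data from the question). It only shows how the shape changes:

```python
import numpy as np

# Hypothetical stand-in for clusterer.labels_: a 1-D array of 5 labels.
labels = np.array([0, 1, -1, 0, 2])

print(labels.shape)                 # (5,)
print(labels.T.shape)               # (5,)   transposing a 1-D array is a no-op
print(labels.reshape(-1, 1).shape)  # (5, 1) an explicit column
print(labels[:, None].shape)        # (5, 1) equivalent spelling
```

Only the last two forms have a row count that can line up with a 2-D matrix of 5 rows.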

In [207]: M
Out[207]: 
<10x3 sparse matrix of type '<class 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>
In [208]: sparse.hstack((M,M))
Out[208]: 
<10x6 sparse matrix of type '<class 'numpy.int32'>'
    with 18 stored elements in COOrdinate format>

sparse.hstack converts a dense A to sparse before performing its own version of the concatenation.

In [209]: A=np.ones((10,1),int)
In [210]: sparse.hstack((M,M,A))
Out[210]: 
<10x7 sparse matrix of type '<class 'numpy.int32'>'
    with 28 stored elements in COOrdinate format>

Or you can convert it to sparse yourself first.

In [211]: As=sparse.csr_matrix(A)
In [212]: As
Out[212]: 
<10x1 sparse matrix of type '<class 'numpy.int32'>'
    with 10 stored elements in Compressed Sparse Row format>
In [213]: sparse.hstack((M,M,As))
Out[213]: 
<10x7 sparse matrix of type '<class 'numpy.int32'>'
    with 28 stored elements in COOrdinate format>

Starting from a 1-D A:

In [214]: A=np.ones((10),int)
In [215]: sparse.hstack([M,M,A.reshape(-1,1)])
Out[215]: 
<10x7 sparse matrix of type '<class 'numpy.int32'>'
    with 28 stored elements in COOrdinate format>
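
The session above can be condensed into a standalone script. This is only a sketch: the small X1, X2, and X3 below are made-up stand-ins for the question's matrices, and passing format="csr" is an optional choice to get CSR back instead of the COO result shown in the outputs above.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: two sparse CSR matrices and a 1-D dense label array,
# all with 6 rows (the question's real data has 2399 rows).
X1 = sparse.csr_matrix(np.arange(24).reshape(6, 4) % 3)
X2 = sparse.csr_matrix(np.arange(42).reshape(6, 7) % 2)
X3 = np.arange(6)

# Reshape the labels into a (6, 1) column, then stack everything horizontally.
X_combined = sparse.hstack([X1, X2, X3.reshape(-1, 1)], format="csr")

print(X_combined.shape)   # (6, 12)
print(X_combined.format)  # csr
```

The column count of the result is the sum of the input column counts (4 + 7 + 1 here), which matches 39999 + 333534 + 1 = 373534 for the shapes in the question.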

Answer 1 (score: 2)

The problem is your import. It should be

from scipy import sparse

The top-level scipy module (which you generally should not use directly) imports the numpy functions, so when you try your version:

>>> import scipy as sparse
>>> sparse.hstack
<function numpy.core.shape_base.hstack>

>>> # incorrect! Correct would be

>>> from scipy import sparse
>>> sparse.hstack
<function scipy.sparse.construct.hstack>

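All of this is mentioned in their documentation:

The scipy namespace itself only contains functions imported from numpy. These functions still exist for backwards compatibility, but should be imported from numpy directly.

Everything in the namespaces of scipy submodules is public. In general, it is recommended to import functions from submodule namespaces.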
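
A quick way to confirm which hstack a name is bound to is to inspect its __module__ attribute. The sketch below assumes only that scipy and numpy are installed; the exact module paths printed vary between library versions, so it checks just the prefix:

```python
import numpy as np
from scipy import sparse  # the sparse subpackage, not the top-level scipy module

# Inspect where each hstack is defined before relying on it.
sparse_origin = sparse.hstack.__module__
numpy_origin = np.hstack.__module__

print(sparse_origin)  # a path beginning with "scipy.sparse"
print(numpy_origin)   # a path beginning with "numpy"
```

If the name you call sparse reports a numpy path here, the import is the culprit rather than the shapes.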