Question

I have two sparse matrices as follows:

import numpy as np
from scipy.sparse import csr_matrix

m1_colnames = ['a', 'b', 'd', 'e', 't', 'y']
m1 = csr_matrix(np.array([[1, 2, 0, 4, 5, 0], [1, 2, 0, 4, 5, 0]]))

m2_colnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
m2 = csr_matrix(np.array([[1, 2, 0, 0, 4, 0, 4, 5, 0], [1, 2, 0, 0, 4, 0, 4, 5, 0]]))

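Essentially, what I want to do (in pandas terms) is merge by column name, so that I end up with a final sparse matrix of 4 rows by 11 columns (11 unique column names).

However, my real dataset is over 1,000,000 rows by 100,000 columns (sparse), so I cannot convert it to pandas.

How can this be done? I also need the final list of column names, so that I know the order of the columns in the merged sparse matrix.

Thanks, Jack

Edit:

Desired output: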

final_colnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 't', 'y']
final_m = csr_matrix(np.array([[1, 2, 0, 0, 4, 0, 4, 5, 0, 0, 0], [1, 2, 0, 0, 4, 0, 4, 5, 0, 0, 0], [1, 2, 0, 0, 4, 0, 0, 0, 0, 4, 5], [1, 2, 0, 0, 4, 0, 0, 0, 0, 4, 5]]))

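Although I'm looking for a way that avoids pandas, this is how I would do it in pandas:

import pandas as pd

df1 = pd.DataFrame(m1.toarray(), columns=m1_colnames)
df2 = pd.DataFrame(m2.toarray(), columns=m2_colnames)

# pd.concat takes a list of frames; sort=True orders the merged columns alphabetically
final_df = pd.concat([df1, df2], sort=True)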
final_df = final_df.fillna(0)

final_sparse = csr_matrix(final_df.values)
final_colnames = final_df.columns

final_sparse and final_colnames are what I want.

Answer 1

Basic sparse merge

In [503]: m1_colnames = ['a', 'b', 'd', 'e', 't', 'y']
     ...: m1 = sparse.coo_matrix(np.array([[1, 2, 0, 4, 5, 0], [1, 2, 0, 4, 5, 0]]))
     ...: m2_colnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
     ...: m2 = sparse.coo_matrix(np.array([[1, 2, 0, 0, 4, 0, 4, 5, 0], [1, 2, 0, 0, 4, 0, 4, 5, 0]]))

In [504]: m1
Out[504]: 
<2x6 sparse matrix of type '<class 'numpy.int64'>'
    with 8 stored elements in COOrdinate format>
In [505]: m2
Out[505]: 
<2x9 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in COOrdinate format>

The key attributes of m1 are:

In [506]: m1.data
Out[506]: array([1, 2, 4, 5, 1, 2, 4, 5])
In [508]: m1.row
Out[508]: array([0, 0, 0, 0, 1, 1, 1, 1], dtype=int32)
In [509]: m1.col
Out[509]: array([0, 1, 3, 4, 0, 1, 3, 4], dtype=int32)

The same holds for m2.
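To make the role of these three arrays concrete, here is a minimal sketch showing that data, row and col together fully define a COO matrix. The variable names `rebuilt` and `same` are only illustrative and do not come from the session above.

```python
import numpy as np
from scipy import sparse

m1 = sparse.coo_matrix(np.array([[1, 2, 0, 4, 5, 0], [1, 2, 0, 4, 5, 0]]))

# Rebuild the matrix from its three defining arrays plus the shape.
# The shape is passed explicitly because trailing all-zero columns
# cannot be inferred from the stored indices alone.
rebuilt = sparse.coo_matrix((m1.data, (m1.row, m1.col)), shape=m1.shape)

# Compare the dense forms of the original and the rebuilt matrix.
same = np.array_equal(rebuilt.toarray(), m1.toarray())
print(same)
```

Any merge strategy therefore comes down to producing new row and col index arrays for the existing data values.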

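To define the merged matrix, you just need to build a new set of data, row and col arrays according to your column-name criteria.

Since you are merging on columns, the row and data values stay the same and can simply be concatenated: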

m3row  = np.concatenate((m1.row, m2.row))
m3data = np.concatenate((m1.data, m2.data))

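Creating m3col is more complicated, because it depends on your column-name criteria. For illustration purposes I'll just append m2 after m1 (like hstack):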

In [515]: m3col = np.concatenate((m1.col, m2.col+6))
     ...: 
     ...: m3 = sparse.coo_matrix((m3data, (m3row, m3col)))

In [516]: m3
Out[516]: 
<2x14 sparse matrix of type '<class 'numpy.int64'>'
    with 18 stored elements in COOrdinate format>
In [517]: m3.A
Out[517]: 
array([[1, 2, 0, 4, 5, 0, 1, 2, 0, 0, 4, 0, 4, 5],
       [1, 2, 0, 4, 5, 0, 1, 2, 0, 0, 4, 0, 4, 5]])
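As a cross-check of the illustration above, the following sketch compares the manual column offset with `sparse.hstack`. It derives the offset from `m1.shape[1]` instead of hard-coding 6, and it passes an explicit shape, which is an addition of mine: without it, the trailing all-zero column of m2 is dropped and the result has 14 columns instead of 15.

```python
import numpy as np
from scipy import sparse

m1 = sparse.coo_matrix(np.array([[1, 2, 0, 4, 5, 0], [1, 2, 0, 4, 5, 0]]))
m2 = sparse.coo_matrix(np.array([[1, 2, 0, 0, 4, 0, 4, 5, 0], [1, 2, 0, 0, 4, 0, 4, 5, 0]]))

# Rows and data are unchanged; only m2's columns shift right by m1's width.
row = np.concatenate((m1.row, m2.row))
data = np.concatenate((m1.data, m2.data))
col = np.concatenate((m1.col, m2.col + m1.shape[1]))

# Explicit shape so that all-zero trailing columns are kept.
shape = (m1.shape[0], m1.shape[1] + m2.shape[1])
manual = sparse.coo_matrix((data, (row, col)), shape=shape)

# The library routine should give the same dense result.
builtin = sparse.hstack([m1, m2])
matches = np.array_equal(manual.toarray(), builtin.toarray())
print(manual.shape, matches)
```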

Corrected rows

On rereading, it looks like you want each matrix to occupy its own rows, so this may be closer to what you need:

In [520]: m3row  = np.concatenate((m1.row, m2.row+2))
     ...: m3data = np.concatenate((m1.data, m2.data))
     ...: m3col  = np.concatenate((m1.col, m2.col+2))
     ...: shape = (4,11)

In [522]: m3 = sparse.coo_matrix((m3data, (m3row, m3col)), shape=shape)
In [523]: m3
Out[523]: 
<4x11 sparse matrix of type '<class 'numpy.int64'>'
    with 18 stored elements in COOrdinate format>
In [524]: m3.A
Out[524]: 
array([[1, 2, 0, 4, 5, 0, 0, 0, 0, 0, 0],
       [1, 2, 0, 4, 5, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 2, 0, 0, 4, 0, 4, 5, 0],
       [0, 0, 1, 2, 0, 0, 4, 0, 4, 5, 0]])
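The row handling on its own can be sketched as follows. This is only an illustration of the row offset: the offset is taken from `m1.shape[0]` rather than hard-coded, and the column indices are deliberately left unmapped, so the columns are not yet aligned by name. The names `row_offset` and `stacked` are mine.

```python
import numpy as np
from scipy import sparse

m1 = sparse.coo_matrix(np.array([[1, 2, 0, 4, 5, 0], [1, 2, 0, 4, 5, 0]]))
m2 = sparse.coo_matrix(np.array([[1, 2, 0, 0, 4, 0, 4, 5, 0], [1, 2, 0, 0, 4, 0, 4, 5, 0]]))

# m2's rows are placed below m1's rows.
row_offset = m1.shape[0]
row = np.concatenate((m1.row, m2.row + row_offset))
data = np.concatenate((m1.data, m2.data))

# Columns are left as they are here (no name-based mapping yet).
col = np.concatenate((m1.col, m2.col))

# Total rows are summed; the width is that of the wider input.
shape = (m1.shape[0] + m2.shape[0], max(m1.shape[1], m2.shape[1]))
stacked = sparse.coo_matrix((data, (row, col)), shape=shape)
print(stacked.toarray())
```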

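As requested in the comments, it would help if you showed the desired matrix, so that we don't have to guess. Also, you should be doing the real work yourself.

Merging columns

It took a while, but I think I've come up with a reasonable way of grouping the columns. sparse and numpy don't have anything built in for this the way pandas does.

Running your pandas code produces: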

In [622]: final_sparse.A
Out[622]: 
array([[1., 2., 0., 0., 4., 0., 0., 0., 0., 5., 0.],
       [1., 2., 0., 0., 4., 0., 0., 0., 0., 5., 0.],
       [1., 2., 0., 0., 4., 0., 4., 5., 0., 0., 0.],
       [1., 2., 0., 0., 4., 0., 4., 5., 0., 0., 0.]])

First collect the names, then get the unique (sorted) list:

In [623]: colnames=[]
In [624]: for col in [m1_colnames, m2_colnames]:
     ...:     colnames.extend(col)
     ...:     
In [625]: unames = np.unique(colnames)
In [626]: unames
Out[626]: array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 't', 'y'], dtype='<U1')

This should be the same as what pandas gives:

In [627]: final_colnames
Out[627]: Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 't', 'y'], dtype='object')

I could look up each name from m1_colnames in unames with a list search, but fortunately np.searchsorted works just as well:

In [631]: np.searchsorted(unames, m1_colnames)
Out[631]: array([ 0,  1,  3,  4,  9, 10])

That result can then be used to map the original m1.col onto the columns of the new matrix:

In [632]: _[m1.col]
Out[632]: array([0, 1, 4, 9, 0, 1, 4, 9])
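The mapping step can be reproduced in isolation with the sketch below. It assumes every name in `m1_colnames` is present in `unames`, since `np.searchsorted` returns an insertion position even for a missing name. The check stored in `all_found` is an addition of mine to make that assumption visible.

```python
import numpy as np

m1_colnames = ['a', 'b', 'd', 'e', 't', 'y']
m2_colnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# Sorted array of the distinct column names across both matrices.
unames = np.unique(m1_colnames + m2_colnames)

# Position of each m1 column name within the merged, sorted name list.
lookup = np.searchsorted(unames, m1_colnames)

# Sanity check: looking the positions back up must return the same names.
all_found = np.array_equal(unames[lookup], np.asarray(m1_colnames))

# Old column indices of m1's stored values (m1.col in the session above).
old_col = np.array([0, 1, 3, 4, 0, 1, 3, 4])

# Fancy indexing translates old column indices into merged column indices.
new_col = lookup[old_col]
print(lookup, new_col, all_found)
```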

So, for all of the matrices:

In [633]: alist = []
In [634]: for n, col in zip([m1_colnames, m2_colnames],[m1.col, m2.col]):
     ...:     alist.append(np.searchsorted(unames, n)[col])  
In [635]: alist
Out[635]: [array([0, 1, 4, 9, 0, 1, 4, 9]), array([0, 1, 4, 6, 7, 0, 1, 4, 6, 7])]
In [636]: m3col = np.hstack(alist)
In [637]: m3data.shape
Out[637]: (18,)
In [638]: m3col.shape    # sanity check
Out[638]: (18,)

Build the sparse matrix as before:

In [639]: m3 = sparse.coo_matrix((m3data, (m3row, m3col)), shape=shape)
In [640]: m3.A
Out[640]: 
array([[1, 2, 0, 0, 4, 0, 0, 0, 0, 5, 0],
       [1, 2, 0, 0, 4, 0, 0, 0, 0, 5, 0],
       [1, 2, 0, 0, 4, 0, 4, 5, 0, 0, 0],
       [1, 2, 0, 0, 4, 0, 4, 5, 0, 0, 0]])

Test

In [641]: np.allclose(m3.A, final_sparse.A)
Out[641]: True
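The steps of this answer could be collected into a single helper along the following lines. This is a sketch, not a tested library routine: the function name `merge_by_colnames` and its signature are hypothetical, and it assumes column names are unique within each matrix (duplicate names would have their values summed when converting to CSR).

```python
import numpy as np
from scipy import sparse

def merge_by_colnames(mats, names_list):
    """Stack sparse matrices by rows, aligning columns by name (hypothetical helper)."""
    # Sorted union of all column names.
    unames = np.unique(np.concatenate([np.asarray(n) for n in names_list]))
    rows, cols, data = [], [], []
    row_offset = 0
    for m, names in zip(mats, names_list):
        m = m.tocoo()  # accepts any sparse format
        lookup = np.searchsorted(unames, names)  # old column -> merged column
        rows.append(m.row + row_offset)
        cols.append(lookup[m.col])
        data.append(m.data)
        row_offset += m.shape[0]  # next matrix starts below this one
    shape = (row_offset, len(unames))
    merged = sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=shape,
    )
    return merged.tocsr(), unames.tolist()

m1_colnames = ['a', 'b', 'd', 'e', 't', 'y']
m1 = sparse.csr_matrix(np.array([[1, 2, 0, 4, 5, 0], [1, 2, 0, 4, 5, 0]]))
m2_colnames = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
m2 = sparse.csr_matrix(np.array([[1, 2, 0, 0, 4, 0, 4, 5, 0], [1, 2, 0, 0, 4, 0, 4, 5, 0]]))

final_m, final_names = merge_by_colnames([m1, m2], [m1_colnames, m2_colnames])
print(final_names)
print(final_m.toarray())
```

Because only the index arrays are touched, nothing is ever densified, which is what matters for a matrix of the size described in the question.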
