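Fastest way from a logical matrix to a list of sets

Time: 2016-06-11 00:13:49

Tags: python numpy matrix scipy set

I need to convert a sparse logical matrix into a list of sets, where list[i] contains the set of rows that have nonzero values in column i. The following code works, but I would like to know whether there is a faster way to do this. The actual data I am using is about 6000x6000 and much sparser than this example.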

import numpy as np

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

rows,cols = A.shape

C = np.nonzero(A)
D = [set() for j in range(cols)]

for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])

print(D)

3 answers:

Answer 0 (score: 4)

If you represent the sparse array as a csc_matrix, you can use its indices and indptr attributes to build the sets.

For example,

In [93]: A
Out[93]: 
array([[1, 0, 0, 0, 0, 1],
       [0, 1, 1, 1, 1, 0],
       [1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 1],
       [1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0],
       [0, 0, 1, 0, 1, 0]])

In [94]: from scipy.sparse import csc_matrix

In [95]: C = csc_matrix(A)

In [96]: C.indptr
Out[96]: array([ 0,  5,  8, 12, 16, 20, 23], dtype=int32)

In [97]: C.indices
Out[97]: array([0, 2, 3, 4, 5, 1, 3, 4, 1, 2, 6, 7, 1, 3, 4, 6, 1, 2, 6, 7, 0, 2, 3], dtype=int32)

In [98]: D = [set(C.indices[C.indptr[i]:C.indptr[i+1]]) for i in range(C.shape[1])]

In [99]: D
Out[99]: 
[{0, 2, 3, 4, 5},
 {1, 3, 4},
 {1, 2, 6, 7},
 {1, 3, 4, 6},
 {1, 2, 6, 7},
 {0, 2, 3}]

For a list of arrays instead of sets, just don't call set():

In [100]: [C.indices[C.indptr[i]:C.indptr[i+1]] for i in range(len(C.indptr)-1)]
Out[100]: 
[array([0, 2, 3, 4, 5], dtype=int32),
 array([1, 3, 4], dtype=int32),
 array([1, 2, 6, 7], dtype=int32),
 array([1, 3, 4, 6], dtype=int32),
 array([1, 2, 6, 7], dtype=int32),
 array([0, 2, 3], dtype=int32)]
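
The comprehension works because, in CSC format, the slice indptr[i]:indptr[i+1] of indices holds the row numbers stored for column i. A minimal sketch that wraps this as a function follows. The name columns_to_sets is made up for illustration, and the sketch assumes the input is a dense 2-D array (for an already sparse input it assumes there are no explicitly stored zeros). Converting to Python lists first is an optional tweak that usually makes building the sets cheaper and yields plain Python ints.

```python
import numpy as np
from scipy.sparse import csc_matrix


def columns_to_sets(A):
    """Return a list whose i-th entry is the set of row indices of the
    stored (nonzero) entries in column i.
    Hypothetical helper for illustration, not part of SciPy."""
    C = csc_matrix(A)
    indices = C.indices.tolist()   # row index of every stored entry
    indptr = C.indptr.tolist()     # column i occupies indptr[i]:indptr[i+1]
    return [set(indices[indptr[i]:indptr[i + 1]])
            for i in range(C.shape[1])]


A = np.array([[1, 0, 0],
              [0, 0, 1],
              [1, 0, 1]])
print(columns_to_sets(A))  # column 1 has no nonzeros, so its set is empty
```

Because indptr has one entry per column boundary, an empty column simply produces an empty slice, so the position in the result always equals the column number.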

Answer 1 (score: 2)

Since you are already calling np.nonzero on A, see whether this is faster:

>>> from itertools import groupby
>>> C = np.transpose(np.nonzero(A.T))
>>> [{i[1] for i in g} for _, g in groupby(C, key=lambda x: x[0])]
[{0, 2, 3, 4, 5}, {1, 3, 4}, {1, 2, 6, 7}, {1, 3, 4, 6}, {1, 2, 6, 7}, {0, 2, 3}]

Some timings:

In [4]: %%timeit
   ...: C = np.transpose(np.nonzero(A.T))
   ...: [{i[1] for i in g} for _, g in groupby(C, key=lambda x: x[0])]
   ...:
10000 loops, best of 3: 39 µs per loop

In [7]: %%timeit
   ...: C=csc_matrix(A)
   ...: [set(C.indices[C.indptr[i]:C.indptr[i+1]]) for i in range(C.shape[1])]
   ...:
1000 loops, best of 3: 317 µs per loop
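
One caveat with grouping consecutive entries: a column with no nonzeros produces no group at all, so the positions in the resulting list no longer match the column numbers. A sketch of one way around that, assuming a dense 2-D input, is to preallocate one set per column and fill in only the groups that exist. The function name is hypothetical.

```python
from itertools import groupby

import numpy as np


def columns_to_sets_groupby(A):
    """Group nonzero entries by column, keeping empty columns as empty sets.
    Hypothetical helper for illustration."""
    A = np.asarray(A)
    # nonzero of the transpose is ordered by column of A, then by row
    cols, rows = np.nonzero(A.T)
    D = [set() for _ in range(A.shape[1])]
    pairs = zip(cols.tolist(), rows.tolist())
    for col, group in groupby(pairs, key=lambda p: p[0]):
        D[col] = {row for _, row in group}
    return D


A = np.array([[1, 0, 0],
              [0, 0, 1],
              [1, 0, 1]])
print(columns_to_sets_groupby(A))
```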

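Answer 2 (score: 1)

I don't know whether it adds much speed, but your iteration (over C = np.nonzero(A), as in the question) can be streamlined with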
for i,j in zip(*C):
    D[j].add(i)

defaultdict adds a nice touch for this task:

In [58]: from collections import defaultdict    
In [59]: D=defaultdict(set)
In [60]: for i,j in zip(*C):
   ....:     D[j].add(i)

In [61]: D
Out[61]: defaultdict(<class 'set'>, {0: {0, 2, 3, 4, 5}, 1: {1, 3, 4}, 2: {1, 2, 6, 7}, 3: {1, 3, 4, 6}, 4: {1, 2, 6, 7}, 5: {0, 2, 3}})

In [62]: dict(D)
Out[62]: 
{0: {0, 2, 3, 4, 5},
 1: {1, 3, 4},
 2: {1, 2, 6, 7},
 3: {1, 3, 4, 6},
 4: {1, 2, 6, 7},
 5: {0, 2, 3}}
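
Note that a defaultdict only holds keys for columns that were actually visited, so an all-zero column is missing from it. If a list indexed by column is needed, a small sketch like the following could convert it. The example matrix and the name as_list are chosen here for illustration.

```python
from collections import defaultdict

import numpy as np

A = np.array([[1, 0, 0],
              [0, 0, 1],
              [1, 0, 1]])

D = defaultdict(set)
for i, j in zip(*np.nonzero(A)):
    D[int(j)].add(int(i))   # int() stores plain Python ints

# D.get() does not create missing keys, unlike D[j] on a defaultdict
as_list = [D.get(j, set()) for j in range(A.shape[1])]
print(as_list)
```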

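A sparse matrix alternative is the lil format, which stores its data as a list of lists. Since you want to collect the data by column, create the matrix from A.T (the transpose), with sparse imported via from scipy import sparse: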
In [70]: M=sparse.lil_matrix(A.T)

In [71]: M.rows
Out[71]: 
array([[0, 2, 3, 4, 5], [1, 3, 4], [1, 2, 6, 7], [1, 3, 4, 6],
       [1, 2, 6, 7], [0, 2, 3]], dtype=object)

These are the same lists.

For this small case, direct iteration is faster than the sparse approaches:
In [72]: %%timeit 
   ....: D=defaultdict(set)
   ....: for i,j in zip(*C):
   ....:     D[j].add(i)
   ....: 
10000 loops, best of 3: 24.4 µs per loop

In [73]: %%timeit
   ....: D=[set() for j in range(A.shape[1])]
   ....: for i,j in zip(*C):
   ....:     D[j].add(i)
   ....: 
10000 loops, best of 3: 22.9 µs per loop

In [74]: %%timeit 
   ....: M=sparse.lil_matrix(A.T)
   ....: M.rows
   ....: 
1000 loops, best of 3: 588 µs per loop

In [75]: %%timeit
   ....: C=sparse.csc_matrix(A)
   ....: D = [set(C.indices[C.indptr[i]:C.indptr[i+1]]) for i in range(C.shape[1])]
   ....: 
1000 loops, best of 3: 476 µs per loop

For large arrays, the setup time of the sparse matrix matters less.
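
To check how the two approaches compare on a larger input, a rough timing sketch might look like this. The size, density, and seed are arbitrary assumptions, not the asker's data, and the numbers will vary by machine, so no results are claimed here.

```python
import timeit

import numpy as np
from scipy import sparse


def direct(A):
    """Plain Python loop over the nonzero coordinates."""
    D = [set() for _ in range(A.shape[1])]
    for i, j in zip(*np.nonzero(A)):
        D[j].add(i)
    return D


def via_csc(A):
    """Slice the CSC index array column by column."""
    C = sparse.csc_matrix(A)
    return [set(C.indices[C.indptr[i]:C.indptr[i + 1]])
            for i in range(C.shape[1])]


rng = np.random.default_rng(0)      # arbitrary seed
n, density = 1000, 0.01             # assumed test size and sparsity
A = (rng.random((n, n)) < density).astype(int)

assert direct(A) == via_csc(A)      # both must produce the same sets
for f in (direct, via_csc):
    t = min(timeit.repeat(lambda: f(A), number=1, repeat=3))
    print(f"{f.__name__}: {t * 1e3:.1f} ms")
```

Raising n toward 6000 and lowering density gives a closer stand-in for the data described in the question.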

==========================

Do we really need a set? A variation on the lil approach is to start from the nonzero of the transpose, i.e. by column:

In [90]: C=np.nonzero(A.T)

# (array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5], dtype=int32),
# array([0, 2, 3, 4, 5, 1, 3, 4, 1, 2, 6, 7, 1, 3, 4, 6, 1, 2, 6, 7, 0, 2, 3], dtype=int32))

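The numbers are all there; we just need to split the second array into pieces that correspond to the runs of equal values in the first: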

In [91]: i=np.nonzero(np.diff(C[0]))[0]+1

In [92]: np.split(C[1],i)
Out[92]: 
[array([0, 2, 3, 4, 5], dtype=int32),
 array([1, 3, 4], dtype=int32),
 array([1, 2, 6, 7], dtype=int32),
 array([1, 3, 4, 6], dtype=int32),
 array([1, 2, 6, 7], dtype=int32),
 array([0, 2, 3], dtype=int32)]

This is slower than direct iteration, but I suspect it scales better; possibly as well as any of the sparse alternatives:

In [96]: %%timeit 
   ....: C=np.nonzero(A.T)
   ....: i=np.nonzero(np.diff(C[0]))[0]+1
   ....: np.split(C[1],i)
   ....: 
10000 loops, best of 3: 55.2 µs per loop
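
Splitting at the points where the column index changes silently drops columns that contain no nonzeros. A possible variant, sketched here under the assumption of a dense 2-D input, derives the split points from per-column counts instead, so every column gets an entry. The function name is made up for illustration.

```python
import numpy as np


def columns_to_arrays(A):
    """Return one array of row indices per column, including empty columns.
    Hypothetical helper for illustration."""
    A = np.asarray(A)
    cols, rows = np.nonzero(A.T)                  # ordered by column of A
    counts = np.bincount(cols, minlength=A.shape[1])
    # cumulative counts mark where each column's run of rows ends
    return np.split(rows, np.cumsum(counts)[:-1])


A = np.array([[1, 0, 0],
              [0, 0, 1],
              [1, 0, 1]])
for col, r in enumerate(columns_to_arrays(A)):
    print(col, r.tolist())
```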