Question

举个例子，我有以下输入数据（我正在使用的点云更复杂）

Data = [1,1,1,1,1],[1,1,2,1,1],[2,2,2,1,1],[3,3,3,1,1],[4,4,4,1,1],[5,5,5,1,1],[50,50,50,1,1],[95,95,95,1,1],[96,96,96,1,1],[97,97,97,1,1],[98,98,98,1,1],[99,99,99,1,1],[2,2,3,1,1],[2,2,1,1,1],[2,2,4,1,1]

聚类算法给出二进制上三角矩阵（让我们称之为连接矩阵）。 A 1表示连接两个点。例如。点ID 0（第0行）连接到它自己（第0列）和1,2,3,12,13,14。但是第4点和第5点也可通过3,12,13和14到达。

[ 1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  1.  0.  1.]
[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]

我可以使用rowclustering（s）识别每行的簇，其中s是上面的二进制矩阵。

def rowclustering(s):
    r = 0
    idx = []
    while r < size(s,0):
        row = []
        for i in range(size(s,1)):
            if s[r][i] == 1:
                row = row + [i]
        r = r + 1
        idx = idx + [row]
    return idx

返回的idx是：

idx = [[0, 1, 2, 3, 12, 13, 14], [1, 2, 3, 12, 13, 14], [2, 3, 4, 12, 13, 14], [3, 4, 5, 12, 13, 14], [4, 5, 12, 14], [5], [6], [7, 8, 9], [8, 9, 10], [9, 10, 11], [10, 11], [11], [12, 13, 14], [13, 14], [14]]

现在，显然群集数量少于15个，因为有些行通过公共ID连接（例如，查看ID 4和5）。我想要的是：

result = [[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]

我试图创建一个函数（clustering（idx，f）），这样做，其中idx是rowclustering（s）的结果，f将是idx中的第一行，例如[0,1,2,3,12,13,14]。但是，此功能无法正常完成。在完成所有连接（idx ID）之后，打破函数的正确代码是什么？

def clustering(idx,f):
    for i in f:
        f = f + idx[i]
    f = list(set(f))
    clustering(idx,f)

    return

我试图解决的问题是一种自我增长的过程。函数聚类应调用自身，直到建立所有可能的点连接。这可以在idx上完成，或者（可能更好）在连接矩阵上进行（矩阵缩减？）。

非常感谢任何帮助！如果我应该澄清我的问题，请告诉我。感谢。

Answer 1

您的问题可以被视为查找连接的组件。您可以使用networkx来获得解决方案，或者您可以自己实现BFS（广度优先搜索）。

import networkx as nx
import numpy as np
x = """
[ 1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  1.  0.  1.]
[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
"""
mat = eval('[' + x.replace('.', '.,').replace(']', '],') + ']')
mat = np.array(mat)
G = nx.from_numpy_matrix(np.array(mat))
print(list(nx.connected_components(G)))
[{0, 1, 2, 3, 4, 5, 12, 13, 14}, {6}, {7, 8, 9, 10, 11}]

编辑：

实际上，关于这个问题的一些事情让我记得我以前读过的东西。这实际上可以仅使用矩阵运算来计算。它非常漂亮。你的初始矩阵是邻接矩阵（A），我们还需要指定一个度矩阵（D），它保持对角线上每个节点的度数。我们可以使用它们来定义拉普拉斯矩阵（L），然后使用一些谱图理论。（耶！）

# Make the undirected version of the graph (no self loops)
A = (mat + mat.T) * (1 - np.eye(mat.shape[0]))
# Make the degree matrix
D = np.diag(A.sum(axis=1) + A.sum(axis=0)) / 2
# thats all we need to define the laplacian
L = D - A

# The number of zeros eigenvalues of the Laplacian is exactly the number of CCs
np.isclose(np.linalg.eigvals(L), 0).sum()

3 

# The connected compoments themselves are identified by rows that have the same nullspace vector
u, s, vh = np.linalg.svd(L)
ns = vh[(s >= 1e-13).sum():].conj().T

array([[-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.19222441,  0.97663659,  0.09607676],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ]])

现在，我们计算了答案！解释起来有点奇怪。一点点处理可以将其转换为您想要的表示。

# the following is a little numpy trick to find unique rows 
# chopping off the last few decimal places to account for numerical errors
ns_ = np.ascontiguousarray(np.round(ns, 8)).view(np.dtype((np.void, ns.dtype.itemsize * ns.shape[1])))
ns_basis, row_to_cc_id = np.unique(ns_, return_inverse=True)
# Finally we can just use this to convert to the desired output format
groups = [[] for _ in range(len(ns_basis))]
for row, id in enumerate(row_to_cc_id):
    groups[id].append(row)
print(groups)
[[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]

Python中的点云聚类分析 - 从二进制矩阵

1 个答案: