Question

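I cluster the indices of a dataframe several times (trials) based on a similarity matrix and store the cluster assignments in a dataframe like this: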

        trial 0  trial 1  trial 2  trial 3
index 0    0        1        0        0
index 1    0        1        1        0
index 2    2        0        2        0
index 3    1        2        2        1

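Noise is added to the similarity matrix before each trial, so the cluster assignments are non-deterministic (hence the differences in assignments between trials). To be clear: each trial corresponds to one complete clustering run, and the values are the cluster labels from that trial.

In the example above, index 0 and index 1 co-occur in the same cluster three times.

What I want is a co-occurrence matrix like this: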

        index 0  index 1  index 2  index 3
index 0    4        3        1        0   
index 1    3        4        1        0
index 2    1        1        4        1
index 3    0        0        1        4

where each value is the number of times the two indices co-occur in the same cluster across all trials.

Is there an efficient way to do this in pandas? I can manage it easily with loops, but my trial dataframe has several thousand indices and trials.
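For reference, the loop-based baseline mentioned above might look like the following sketch. The function name `cooccurrence_loops` is hypothetical and not part of the original post; the code only pins down what the target matrix means, and with its double loop over rows it is far too slow for thousands of indices.

```python
import numpy as np
import pandas as pd

# Example data from the question.
df = pd.DataFrame(
    [[0, 1, 0, 0],
     [0, 1, 1, 0],
     [2, 0, 2, 0],
     [1, 2, 2, 1]],
    index=[f"index {i}" for i in range(4)],
    columns=[f"trial {j}" for j in range(4)],
)


def cooccurrence_loops(frame):
    """Hypothetical helper: count, for every pair of rows, the trials in
    which both rows carry the same cluster label."""
    rows = frame.to_numpy()
    n = rows.shape[0]
    counts = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(n):
            # Compare the two rows trial by trial and count the matches.
            counts[a, b] = int((rows[a] == rows[b]).sum())
    return pd.DataFrame(counts, index=frame.index, columns=frame.index)


baseline = cooccurrence_loops(df)
print(baseline)
```

The diagonal always equals the number of trials, since every index co-occurs with itself in each trial.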

Answer 1

Here is a solution that only needs to loop over the columns.

res = sum(df[c].transform(lambda x: x == df[c]) for c in df.columns)

However, if your data is fairly sparse, a loop or a graph-based approach may be faster.
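The same column-by-column idea can also be sketched with plain NumPy broadcasting, which avoids depending on how pandas expands a function that returns a Series. This is a minimal sketch on the example data from the question, not the answerer's own code; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Example data from the question.
df = pd.DataFrame(
    [[0, 1, 0, 0],
     [0, 1, 1, 0],
     [2, 0, 2, 0],
     [1, 2, 2, 1]],
    index=[f"index {i}" for i in range(4)],
    columns=[f"trial {j}" for j in range(4)],
)

vals = df.to_numpy()
counts = np.zeros((len(df), len(df)), dtype=int)

for j in range(vals.shape[1]):
    col = vals[:, j]
    # Boolean n-by-n matrix: True where two rows share a label in trial j.
    counts += col[:, None] == col[None, :]

res = pd.DataFrame(counts, index=df.index, columns=df.index)
print(res)
```

Each iteration builds one n-by-n boolean matrix, so peak memory stays at a few n-by-n arrays regardless of the number of trials.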

Answer 2

I figured out how to do this with some linear algebra.

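First, decompose the trial matrix into a sum of components, one per cluster number (cluster numbers should start at 1 so that the method can be formulated mathematically, although the implementation does not require this).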

That is:

        trial 0  trial 1  trial 2  trial 3
index 0    0        1        0        0
index 1    0        1        1        0
index 2    2        0        2        0
index 3    1        2        2        1

becomes

        trial 0  trial 1  trial 2  trial 3
index 0    1        2        1        1
index 1    1        2        2        1
index 2    3        1        3        1
index 3    2        3        3        2

(incremented by 1), which decomposes as follows:

T = 1  0  1  1  +  2 * 0  1  0  0  + 3 * 0  0  0  0
    1  0  0  1         0  1  1  0        0  0  0  0
    0  1  0  1         0  0  0  0        1  0  1  0
    0  0  0  0         1  0  0  1        0  1  1  0

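Then multiply each normalized (0/1) component matrix by its transpose and sum the products:

C1*C1.T + C2*C2.T + C3*C3.T

where Ci is the 0/1 component matrix of T for cluster i, exactly as it appears in the decomposition, and * denotes the matrix product. Entry (a, b) of Ci*Ci.T counts the trials in which indices a and b are both in cluster i, so no further scaling by i is needed.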

That sum is the resulting co-occurrence matrix. Below is an implementation and the result for the example above:

import numpy as np
import pandas as pd

test = pd.DataFrame(np.array([[0, 1, 0, 0],
                              [0, 1, 1, 0],
                              [2, 0, 2, 0],
                              [1, 2, 2, 1]]),
                    index=['index 0', 'index 1', 'index 2', 'index 3'],
                    columns=['trial 0', 'trial 1', 'trial 2', 'trial 3'])
test_val = test.values

# Base matrix that the per-cluster products are accumulated into.
curr_mat = np.zeros((test_val.shape[0], test_val.shape[0]), dtype=int)

# Largest cluster label (labels run 0..max_val, so there are max_val + 1 components).
max_val = np.max(test_val)

for n_clus in range(max_val + 1):

    # Indicator (0/1) component matrix for the current cluster label.
    clus_mem = (test_val == n_clus).astype(int)
    # Entry (i, j) counts the trials in which rows i and j both carry this label.
    curr_mat += np.dot(clus_mem, clus_mem.T)

res = pd.DataFrame(curr_mat, index=test.index, columns=test.index)

Result:

         index 0  index 1  index 2  index 3
index 0     4        3        1        0
index 1     3        4        1        0
index 2     1        1        4        1
index 3     0        0        1        4

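Unfortunately I had to use a for loop, but the number of iterations is now only the number of clusters, and I take advantage of numpy's efficient array operations.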
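If the remaining loop over clusters is a concern, one possible variation is to one-hot encode every (trial, cluster) pair and use a single matrix product. This is a sketch under the assumption that labels are non-negative integers starting at 0; it is not part of the answer above, and the intermediate array has n_index * n_trials * n_clusters entries, so it may not fit in memory for large inputs.

```python
import numpy as np
import pandas as pd

labels = np.array([[0, 1, 0, 0],
                   [0, 1, 1, 0],
                   [2, 0, 2, 0],
                   [1, 2, 2, 1]])
names = [f"index {i}" for i in range(labels.shape[0])]

n_clusters = labels.max() + 1

# one_hot[i, j, c] is 1 when row i has label c in trial j.
one_hot = (labels[:, :, None] == np.arange(n_clusters)).astype(int)

# Flatten the (trial, cluster) axes so that each row is one long indicator vector.
flat = one_hot.reshape(labels.shape[0], -1)

# The dot product of two rows counts the trials in which their labels agree.
cooc = pd.DataFrame(flat @ flat.T, index=names, columns=names)
print(cooc)
```

Because `flat` is mostly zeros, a sparse matrix representation would likely reduce the memory cost, at the price of an extra dependency.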

Pandas: count pairwise occurrences of indices with the same value in a dataframe
