I am trying to compute the Silhouette coefficient for 20,000 clustered data points

Asked: 2018-02-14 06:21:10

Tags: python cluster-computing

[Note: Stack forced me to indent some non-code blocks that it detected as code]

I am trying to compute the quantity in bold for each xi and store it in a list.

For each data point xi:

**A = average distance of xi to other points in same cluster.**  
B = Minimum (the average distances of xi to other points in each different cluster)  
S = (B-A) / max(A,B)  
SC = SC + S  
SC = SC / number of data points.
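
As a reference point, the steps above can be written out directly in plain Python. This is only a minimal sketch: the names `euclid`, `mean_dist` and `silhouette_coefficient` are hypothetical, `points` is assumed to be a list of (x, y) pairs, `clusters` a list of index lists with at least two non-empty clusters, and a point that is alone in its cluster is assumed to contribute S = 0.

```python
import math

def euclid(p, q):
    # Euclidean distance between two 2-D points
    return math.hypot(p[0] - q[0], p[1] - q[1])

def mean_dist(p, points, members):
    # Average distance from p to the points whose indices are in members
    return sum(euclid(p, points[j]) for j in members) / len(members)

def silhouette_coefficient(points, clusters):
    total, count = 0.0, 0
    for ci, members in enumerate(clusters):
        for i in members:
            count += 1
            # A: average distance of xi to the *other* points in its cluster
            others = [j for j in members if j != i]
            if not others:
                continue  # assumption: a singleton cluster contributes S = 0
            a = mean_dist(points[i], points, others)
            # B: minimum over the other clusters of the average distance
            b = min(mean_dist(points[i], points, other)
                    for cj, other in enumerate(clusters)
                    if cj != ci and other)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / count
```

This version is quadratic in the number of points, so it is only meant for checking results on small inputs.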

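The problem is that my code works when the data set is small; however, when it has to go through 20,000 points, it runs indefinitely. Each list in "data" represents: [#, not_important, x-value, y-value].
Each element in clusters maps to a # in data.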

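import numpy as np
from scipy import linalg as la

# Each row of data is [#, not_important, x-value, y-value]
data = [[0, 0, 1, 2], [1, 0, 5, 7], [2, 0, 1, 9], [3, 1, 2, 0], [4, 0, 22, 5]]
# Each cluster lists the # of its member points
clusters = [[1, 2], [3, 4]]

A = list()
for c in clusters:
    A_i = list()
    for i in c:
        d_i = np.array(data[i][2:])  # coordinates of xi
        a = 0
        for j in c:
            d_j = np.array(data[j][2:])
            a += la.norm(d_i - d_j)  # the j == i term adds 0
        # Note: dividing by len(c) counts xi itself; an average over
        # the *other* points would divide by len(c) - 1
        a = a / len(c)
        A_i.append(a)
    A.append(A_i)
print(A)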


# Output: [[2.23606797749979, 2.23606797749979], [10.307764064044152, 10.307764064044152]]

^ This works.

However, with the real data, which has more than 20,000 observations, it never finishes running.
The real data looks like this:

data = [[0.0, 5.0, -13.9383184, -20.94943593], [1.0, 0.0, -26.2837148, 16.83670948], [2.0, 4.0, 36.89961228, -19.90378864], ...]

clusters = [[5, 6, 8, 14, ...], [0, 1, 7, 10,...], [2, 3, 4, 9, ...]]

where len(data) = 20000  
and len(clusters[0]) + len(clusters[1]) + len(clusters[2]) = 20000
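
One likely reason the loop stalls at this size is that it makes on the order of n² Python-level calls, each building two small arrays. A possible alternative, sketched below under stated assumptions, is to let NumPy compute the distances in row blocks. The full 20000 × 20000 float64 distance matrix would need about 3.2 GB, so only `chunk` rows are held at a time. The function names `labels_from_clusters` and `silhouette_samples_chunked` and the `chunk` parameter are hypothetical, and the sketch assumes every point belongs to exactly one cluster and that a singleton cluster gets S = 0.

```python
import numpy as np

def labels_from_clusters(clusters, n):
    # Turn the list-of-index-lists layout into one cluster label per point
    labels = np.full(n, -1)
    for c, members in enumerate(clusters):
        labels[members] = c
    if (labels < 0).any():
        raise ValueError("every point must belong to a cluster")
    return labels

def silhouette_samples_chunked(xy, labels, chunk=256):
    # xy: (n, 2) coordinates; labels: (n,) ints in 0..k-1
    xy = np.asarray(xy, dtype=float)
    labels = np.asarray(labels)
    n = len(xy)
    k = labels.max() + 1
    sizes = np.bincount(labels, minlength=k)   # points per cluster
    onehot = np.eye(k)[labels]                 # (n, k) membership matrix
    s = np.zeros(n)
    for start in range(0, n, chunk):
        rows = slice(start, start + chunk)
        # (m, n) block of distances from this chunk to every point
        d = np.linalg.norm(xy[rows, None, :] - xy[None, :, :], axis=2)
        sums = d @ onehot                      # (m, k) summed distance per cluster
        own = labels[rows]
        idx = np.arange(len(own))
        own_size = sizes[own]
        # A: the self-distance is 0, so divide the own-cluster sum by size - 1
        a = sums[idx, own] / np.maximum(own_size - 1, 1)
        # B: smallest mean distance to any other non-empty cluster
        means = np.where(sizes > 0, sums / np.maximum(sizes, 1), np.inf)
        means[idx, own] = np.inf
        b = means.min(axis=1)
        denom = np.maximum(a, b)
        val = np.where(denom > 0, (b - a) / np.where(denom > 0, denom, 1.0), 0.0)
        val[own_size == 1] = 0.0               # assumption: singleton gets S = 0
        s[rows] = val
    return s
```

With the layout from the question, and assuming the # in each row equals its position in `data`, this could be called as `xy = np.array(data)[:, 2:]`, `labels = labels_from_clusters(clusters, len(data))`, and then `SC = silhouette_samples_chunked(xy, labels).mean()`. A smaller `chunk` trades speed for lower peak memory.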

0 answers:

No answers