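[Note: Stack Overflow forced me to indent some non-code blocks that it detected as code.]
I am trying to compute the part in bold for each xi and store the results in a list.
For each data point xi: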
**A = average distance of xi to other points in same cluster.**
B = Minimum (the average distances of xi to other points in each different cluster)
S = (B-A) / max(A,B)
SC=SC+S
SC = SC / number of data points.
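The problem is that my code works when the data is small; however, when it has to go through 20000 data points, it runs indefinitely.
Each list in "data" represents: [#, not_important, x-value, y-value]
Each element in a cluster maps to the # in data.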
import numpy as np
from scipy import linalg as la

data = [[0, 0, 1, 2], [1, 0, 5, 7], [2, 0, 1, 9], [3, 1, 2, 0], [4, 0, 22, 5]]
clusters = [[1, 2], [3, 4]]

A = list()
for c in clusters:
    A_i = list()
    for i in c:
        x = data[i]
        a = 0
        # Sum the distances from point i to every point in its cluster
        for j in range(len(c)):
            d_i = np.array(x[2:])
            d_j = np.array(data[c[j]][2:])
            a += la.norm(d_i - d_j)
        a = a / len(c)
        A_i.append(a)
    A.append(A_i)
print(A)
# Output: [[2.23606797749979, 2.23606797749979], [10.307764064044152, 10.307764064044152]]
^ This works.
However, with real data of more than 20000 observations, it never finishes running.
The real data looks like this:
data = [[0.0, 5.0, -13.9383184, -20.94943593], [1.0, 0.0, -26.2837148, 16.83670948], [2.0, 4.0, 36.89961228, -19.90378864], ...]
cluster = [[5, 6, 8, 14, ...], [0, 1, 7, 10,...], [2, 3, 4, 9, ...]]
where len(data) = 20000
and len(cluster[0]) + len(cluster[1]) + len(cluster[2]) = 20000
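One possible direction, offered only as a sketch: the nested loops build small NumPy arrays and call `norm` once per pair of points, which means on the order of a hundred million Python-level calls for clusters of several thousand points. Converting the coordinates once and letting `scipy.spatial.distance.cdist` compute each cluster's pairwise distances in a single call avoids that. The name `coords` is hypothetical, and the sketch assumes, as the loop version does, that row i of `data` is the point with # i.

```python
import numpy as np
from scipy.spatial.distance import cdist

data = [[0, 0, 1, 2], [1, 0, 5, 7], [2, 0, 1, 9], [3, 1, 2, 0], [4, 0, 22, 5]]
clusters = [[1, 2], [3, 4]]

# Convert once, outside every loop; columns 2 and 3 hold the x and y values.
coords = np.asarray(data, dtype=float)[:, 2:]

A = []
for c in clusters:
    pts = coords[c]            # coordinates of this cluster's members
    dists = cdist(pts, pts)    # all pairwise Euclidean distances at once
    # Same normalisation as the loop version: divide by len(c), self included.
    A.append((dists.sum(axis=1) / len(c)).tolist())

print(A)
```

The full distance matrix for a cluster of n points takes n * n * 8 bytes, so for very large clusters it may be necessary to process the rows of `pts` in chunks instead of all at once.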