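I am using the Silhouette Index to choose an appropriate number of clusters for KMeans. The code for the Silhouette Index is here. Based on that code, I wrote my own (see below). The problem is that for any data set, the preferred number of clusters always equals the maximum, which is 15 in this case. Is there a bug in my code?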
private double getSilhouetteIndex(double[][] distanceMatrix, ClusterEvaluation ceval)
{
    double si_index = 0;
    double[] ca = ceval.getClusterAssignments();
    double[] d_arr = new double[ca.length];
    List<Double> si_indexes = new ArrayList<Double>();
    for (int i=0; i<ca.length; i++)
    {
        // STEP 1. Compute the average distance between the i-th point and all other points of a given cluster
        double a = averageDist(distanceMatrix,ca,i,1);
        // STEP 2. Compute the average distance between the i-th point and all points of other clusters
        for (int j=0; j<ca.length; j++)
        {
            double d = averageDist(distanceMatrix,ca,j,2);
            d_arr[j] = d;
        }
        // STEP 3. Compute the distance from the i-th point to the nearest cluster to which it does not belong
        double b = d_arr[0];
        for (Double _d : d_arr)
        {
            if (_d < b)
                b = _d;
        }
        // STEP 4. Compute the Silhouette index for the i-th point
        double si = (b - a)/Math.max(a,b);
        si_indexes.add(si);
    }
    // STEP 5. Compute the average index over all observations
    double sum = 0;
    for(Double _si : si_indexes)
    {
        sum += _si;
    }
    si_index = sum/si_indexes.size();
    return si_index;
}

private double averageDist(double[][] distanceMatrix, double[] ca, int id, int calc)
{
    double avgDist = 0;
    double sum = 0;
    int len = 0;
    // Distances inside the cluster
    if (calc == 1)
    {
        for (int i = 0; i<ca.length; i++)
        {
            if (ca[i] == ca[id] && i != id)
            {
                sum += distanceMatrix[id][i];
                len++;
            }
        }
    }
    // Distances outside the cluster
    else
    {
        for (int i = 0; i<ca.length; i++)
        {
            if (ca[i] != ca[id] && i != id)
            {
                sum += distanceMatrix[id][i];
                len++;
            }
        }
    }
    avgDist = sum/len;
    return avgDist;
}
Answer 0 (score: 0)
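For the Silhouette Index, as far as I understand, when you compute the average distance to points outside the cluster, it should be the average distance to the points of the nearest neighboring cluster, not to all points outside the cluster.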
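To make that concrete, here is a minimal sketch of how the per-point computation could look. For each point i, a(i) is the mean distance to the other points of its own cluster, b(i) is the smallest mean distance to the points of any single other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). Note also that STEP 2 in the question calls averageDist with j rather than i, so the b it computes does not depend on the point being scored. The class name SilhouetteSketch and the method name silhouetteIndex are made up for illustration, and the choice to score points in singleton clusters as 0 is the usual convention rather than something taken from the question's code, so treat this as one possible fix, not the definitive one.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of Weka or of the original code.
final class SilhouetteSketch {

    /**
     * Mean silhouette value over all points.
     * distanceMatrix[i][j] is the distance between points i and j;
     * ca[i] is the cluster id of point i (stored as double, as in the question).
     */
    static double silhouetteIndex(double[][] distanceMatrix, double[] ca) {
        int n = ca.length;
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            // For every cluster: [sum of distances from point i, number of points].
            Map<Integer, double[]> perCluster = new TreeMap<>();
            for (int j = 0; j < n; j++) {
                if (j == i) {
                    continue;
                }
                double[] acc = perCluster.computeIfAbsent((int) ca[j], k -> new double[2]);
                acc[0] += distanceMatrix[i][j];
                acc[1] += 1;
            }
            int ownId = (int) ca[i];
            double[] own = perCluster.get(ownId);
            if (own == null) {
                // Point i is alone in its cluster: assumed convention s(i) = 0.
                continue;
            }
            double a = own[0] / own[1];
            // b is the minimum over the OTHER clusters of the mean distance to that cluster.
            double b = Double.POSITIVE_INFINITY;
            for (Map.Entry<Integer, double[]> e : perCluster.entrySet()) {
                if (e.getKey().intValue() == ownId) {
                    continue;
                }
                b = Math.min(b, e.getValue()[0] / e.getValue()[1]);
            }
            if (Double.isInfinite(b)) {
                // Only one cluster exists: the silhouette is undefined, count it as 0.
                continue;
            }
            double max = Math.max(a, b);
            total += (max == 0.0) ? 0.0 : (b - a) / max;
        }
        return total / n;
    }
}
```

With the question's variables, the call would be something like SilhouetteSketch.silhouetteIndex(distanceMatrix, ceval.getClusterAssignments()). Grouping the sums by cluster id is what keeps b tied to a single neighboring cluster instead of to everything outside the point's own cluster.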