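I searched here and on Google, to no avail. When clustering in Weka there is a handy option, classes to clusters, which matches the clusters produced by the algorithm (e.g. simple k-means) against the 'ground truth' class labels you supply as the class attribute. That way we can see the clustering accuracy (% incorrect).
Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], into the same indices as the supplied ground truth label vector, e.g. [2, 2, 2, 3, 1, 3]?
I think it might be based on the cluster centers and label centers, but I'm not sure how to implement it!
Any help is much appreciated.
Vincent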
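Answer 0 (score: 4)
This is based on trying out all possible rearrangements of the labels to see which one best fits the truth vector. That is, if the clustering result yte = [3 3 2 1] has the ground truth y = [1 1 2 3], the script tries to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] against y to find the best match.
It relies on the built-in function perms(), which cannot handle more than 10 unique clusters. The code may also be slow for 7-10 unique clusters, since the number of permutations grows factorially.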
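To make the enumeration concrete, here is a minimal Python sketch (standard library only, not the MATLAB code from this answer) that lists every relabeling of the example above and picks the best match. The helper names `candidate_relabelings` and `best_relabeling` are hypothetical, and the sketch assumes the clustering uses exactly as many distinct labels as the ground truth.

```python
from itertools import permutations


def candidate_relabelings(pred, truth):
    """Yield every relabeling of pred obtained by mapping its cluster names
    onto a permutation of the distinct truth labels.

    Assumption: pred and truth contain the same number of distinct labels.
    """
    cluster_names = sorted(set(pred))
    for perm in permutations(sorted(set(truth))):
        mapping = dict(zip(cluster_names, perm))
        yield [mapping[p] for p in pred]


def best_relabeling(pred, truth):
    """Return (accuracy, relabeled prediction) for the best-matching candidate."""
    scored = (
        (sum(a == b for a, b in zip(cand, truth)) / len(truth), cand)
        for cand in candidate_relabelings(pred, truth)
    )
    return max(scored, key=lambda item: item[0])


yte = [3, 3, 2, 1]
y = [1, 1, 2, 3]
candidates = list(candidate_relabelings(yte, y))
accuracy, relabeled = best_relabeling(yte, y)
print(len(candidates))      # 6
print(accuracy, relabeled)  # 1.0 [1, 1, 2, 3]
```

With three labels there are 3! = 6 candidates, which is why the approach becomes impractical as the number of clusters grows.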
function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found
%# labels with true labels. Assumes yte and y both are Nx1 vectors with
%# clustering labels and the same number of unique labels. Does not
%# support fuzzy clustering.
%#
%# Algorithm is based on trying out all reorderings of cluster labels,
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which fits
%# the truth vector the best. Since this approach makes use of perms(),
%# the code will not run for unique(yte) greater than 10, and it will slow
%# down significantly for number of clusters greater than 7.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    - Overall accuracy for entire clustering (OA). For
%#                 overall error, use OE = 1 - OA.
%#   true_labels - Vector giving the label rearrangement which best
%#                 matches the truth vector (y).
%#   CM          - Confusion matrix. If yte has 4 unique labels, produce a
%#                 4x4 matrix of the number of different errors and
%#                 correct clusterings done.

N = length(y);

cluster_names = unique(yte);
class_names = unique(y);
accuracy = 0;
maxInd = 1;

%# One row per candidate assignment of class labels to cluster names
perm = perms(class_names);
[pN, pM] = size(perm);

true_labels = y;

for i = 1:pN
    flipped_labels = zeros(1,N);
    for cl = 1:pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end

    testAcc = sum(flipped_labels == y')/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end
end

%# Rows are true classes, columns are relabeled clusters. Comparing against
%# class_names avoids assuming that the labels are exactly 1..pM.
CM = zeros(pM,pM);
for rc = 1:pM
    for cc = 1:pM
        CM(rc,cc) = sum( (y'==class_names(rc)) .* (true_labels==class_names(cc)) );
    end
end
Example:
[acc newLabels CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')
acc =
0.6667
newLabels =
1 2 2 3 2 1
CM =
1 0 0
0 2 0
1 1 1
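Answer 1 (score: 0)
You may want to look into more flexible ways of evaluating clusterings, for example pair counting metrics.
The "class = cluster" assumption is typical for people coming to clustering from machine learning. But you should rather assume that some classes may consist of multiple clusters, or that multiple classes actually form a single cluster. These are the interesting situations that a clustering algorithm should actually detect.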
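As one illustration of a pair counting metric, the sketch below computes the (unadjusted) Rand index in plain Python. The answer does not name a specific metric, so choosing the Rand index here is an assumption, and the function name `rand_index` is hypothetical. The index counts the fraction of point pairs on which two labelings agree, so it needs no label matching and works when the numbers of clusters and classes differ.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    together, or both put them apart. Label names themselves never matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


clusters = [3, 2, 2, 1, 2, 3]
classes = [1, 2, 2, 3, 3, 3]
print(rand_index(clusters, classes))        # 0.6
print(rand_index([1, 1, 2], [2, 2, 1]))     # 1.0 (same grouping, renamed labels)
```

This loops over all N(N-1)/2 pairs, so it is only meant for small inputs; library implementations typically work from a contingency table instead.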
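Answer 2 (score: 0)
I needed exactly this for Python, so I converted the code posted by Vidar (the accepted answer). I'm sharing the code for anyone interested. I renamed the variables and removed the confusion matrix (most machine learning libraries have a built-in function for that anyway). I noticed too late the faster implementation linked by Vincent (http://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement). It would probably be better to adapt that one to Python.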
# tested with python 3.6
from itertools import permutations

import numpy as np


def remap_labels(pred_labels, true_labels):
    """Rename prediction labels (clustered output) to best match true labels."""
    pred_labels, true_labels = np.array(pred_labels), np.array(true_labels)
    assert pred_labels.ndim == 1 == true_labels.ndim
    assert len(pred_labels) == len(true_labels)
    cluster_names = np.unique(pred_labels)
    accuracy = 0
    remapped_labels = true_labels
    # Try every assignment of true label names to cluster names.
    for perm in permutations(np.unique(true_labels)):
        # Keep the dtype of the true labels so the result is not cast to float.
        flipped_labels = np.zeros(len(true_labels), dtype=true_labels.dtype)
        for label_index, label in enumerate(cluster_names):
            flipped_labels[pred_labels == label] = perm[label_index]
        test_acc = np.sum(flipped_labels == true_labels) / len(true_labels)
        if test_acc > accuracy:
            accuracy = test_acc
            remapped_labels = flipped_labels
    return accuracy, remapped_labels
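Since this port drops the confusion matrix, here is a minimal standard-library sketch of how one could be rebuilt from the true labels and the remapped labels. The function name `confusion_matrix` and its argument names are hypothetical, not part of the answer's code; rows index true classes and columns index remapped cluster labels, as in the MATLAB answer. The inputs are taken from the example in Answer 0.

```python
from collections import Counter


def confusion_matrix(true_labels, remapped_labels):
    """Count (true, remapped) label pairs.

    Rows index true classes, columns index remapped cluster labels,
    both in sorted order of the label values.
    """
    class_names = sorted(set(true_labels) | set(remapped_labels))
    counts = Counter(zip(true_labels, remapped_labels))
    return [[counts[(t, p)] for p in class_names] for t in class_names]


y = [1, 2, 2, 3, 3, 3]
remapped = [1, 2, 2, 3, 2, 1]
cm = confusion_matrix(y, remapped)
for row in cm:
    print(row)
# [1, 0, 0]
# [0, 2, 0]
# [1, 1, 1]
```

The diagonal holds the correctly matched points, so the accuracy is the trace divided by the number of points (4/6 here).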