Question

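As the title says, I am building a visualization tool that has to work with any dataset it is given. What is the best value of K to choose, and how should I choose it?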

Answer 1

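You can use the Calinski criterion from the vegan package, although the way your question is phrased is somewhat debatable. I hope this is what you are expecting; if not, please leave a comment.

For example, you can do the following: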

n = 100
g = 6 
set.seed(g)
d <- data.frame(
  x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
  y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))

require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")

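This results in a value of 5, which means you can use 5 clusters. The criterion is based on the within-cluster and between-cluster variation of the k-means partition. You can also write the code by hand on the same basis.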
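To make the "within versus between" idea concrete, here is a minimal base-R sketch of the Calinski-Harabasz index, CH = (BSS / (k - 1)) / (WSS / (n - k)), computed from the output of kmeans(). The helper name calinski_harabasz and the toy data are my own assumptions for illustration; this is not how vegan implements it internally, so small differences from cascadeKM are possible.

```r
# Calinski-Harabasz index for one k-means fit:
# CH = (BSS / (k - 1)) / (WSS / (n - k))
# (hypothetical helper, not part of vegan or base R)
calinski_harabasz <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

set.seed(1)
# Hypothetical toy data: three well-separated 2-D blobs of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# The index is undefined for k = 1, so start at 2
ks <- 2:8
ch <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  calinski_harabasz(km, nrow(toy))
})

# Pick the k with the largest index
best_k <- ks[which.max(ch)]
cat("Calinski-Harabasz optimal number of clusters:", best_k, "\n")
```

The index rewards partitions with a large between-groups sum of squares and a small within-groups sum of squares, while the (k - 1) and (n - k) terms penalise simply adding more clusters.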

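From the documentation here:

criterion: The criterion that will be used to select the best partition. The default value is "calinski", which refers to the Calinski-Harabasz (1974) criterion. The simple structure index ("ssi") is also available. Other indices are available in function clustIndex (package cclust). In our experience, the two indices that work best and are most likely to return their maximum value at or near the optimal number of clusters are "calinski" and "ssi".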

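The manual code looks like this:

For the first iteration (a single cluster) there is no between-groups sum of squares (SSB), so the within-groups sum of squares equals the total sum of squares: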

wss <- (nrow(d)-1)*sum(apply(d,2,var))
# With a single cluster there is no between-groups variation, so the total sum of squares (TSS) equals the within-groups sum of squares (WSS).
for (i in 2:15) wss[i] <- sum(kmeans(d,centers=i)$withinss) # From k = 2 onward TSS stays constant, so as the between-groups sum of squares increases, WSS decreases correspondingly.
# Plot WSS for k = 1 to 15. The upper limit of 15 is not fixed; you have to decide how many values of k to try.
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares",col="mediumseagreen",pch=12)
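
The comments in the loop above rely on the decomposition TSS = WSS + BSS. If you want to convince yourself of it, a quick check along these lines should do; the toy data here is made up, and totss, tot.withinss, betweenss and withinss are the components returned by kmeans().

```r
set.seed(42)
# Hypothetical toy data: 100 points in 2-D
toy <- matrix(rnorm(200), ncol = 2)

# Total sum of squares, computed the same way as wss[1] above
tss <- (nrow(toy) - 1) * sum(apply(toy, 2, var))

km <- kmeans(toy, centers = 3, nstart = 10)

# kmeans() reports the same quantities directly
check_total <- all.equal(tss, km$totss)
check_split <- all.equal(km$totss, km$tot.withinss + km$betweenss)
check_wss   <- all.equal(km$tot.withinss, sum(km$withinss))

print(c(check_total, check_split, check_wss))
```

Because TSS depends only on the data, any increase in the between-groups part has to come out of the within-groups part, which is why the curve can only go down as k grows (up to the randomness of the k-means starts).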

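In the resulting plot of within-groups sum of squares, choose the point after which the line flattens out; that is the optimal number of clusters, in this case 5.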
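
Since the tool has to work without someone looking at the plot, you could try to pick the elbow programmatically. The following is only a rough heuristic sketch, not a standard method: it takes the first k at which adding one more cluster removes less than a given fraction of the total sum of squares. The 5% threshold, the toy data and the variable names are all assumptions you would need to tune.

```r
set.seed(1)
# Hypothetical toy data: three well-separated 2-D blobs of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

max_k <- 8
wss <- numeric(max_k)
# k = 1: WSS equals the total sum of squares
wss[1] <- (nrow(toy) - 1) * sum(apply(toy, 2, var))
for (k in 2:max_k) {
  wss[k] <- kmeans(toy, centers = k, nstart = 25)$tot.withinss
}

# gain[k]: fraction of the total sum of squares removed
# by going from k to k + 1 clusters
gain <- -diff(wss) / wss[1]

# Assumed cut-off: stop when one more cluster removes less than 5%
threshold <- 0.05
elbow_k <- which(gain < threshold)[1]  # NA if no k qualifies
cat("Elbow heuristic suggests k =", elbow_k, "\n")
```

If no gain falls below the threshold, elbow_k is NA, so a real tool would need a fallback such as raising max_k or using the Calinski criterion instead.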

What should be the optimal value of K in K-means clustering, to be implemented on ANY dataset?
