Question

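My data are represented as multiple different histograms of a single variable. I want to determine which histograms are similar using unsupervised clustering. I would also like to know the optimal number of clusters to use.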

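I have read about the Earth Mover's Distance as a measure of distance between histograms, but I do not know how to use it in common clustering algorithms (e.g., k-means).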

Primary: Which packages and functions do I use to cluster histograms?

Secondary: How do I determine the "optimal" number of clusters?

Example dataset 1 (3 unimodal clusters):

v1 <- rnorm(n=100, mean = 10, sd = 1)  # cluster 1 (around 10)
v2 <- rnorm(n=100, mean = 50, sd = 5)  # cluster 2 (around 50)
v3 <- rnorm(n=100, mean = 100, sd = 10) # cluster 3 (around 100)
v4 <- rnorm(n=100, mean = 12, sd = 2)  # cluster 1
v5 <- rnorm(n=100, mean = 45, sd = 6)  # cluster 2
v6 <- rnorm(n=100, mean = 95, sd = 6)  # cluster 3

Example dataset 2 (3 bimodal clusters):

b1  <- c(rnorm(n=100, mean=9, sd=2) , rnorm(n=100, mean=200, sd=20))   # cluster 1 (around 10 and 200)
b2  <- c(rnorm(n=100, mean=50, sd=5), rnorm(n=100, mean=100, sd=10))  # cluster 2 (around 50 and 100)
b3  <- c(rnorm(n=100, mean=99, sd=8), rnorm(n=100, mean=175, sd=17)) # cluster 3 (around 100 and 175)
b4  <- c(rnorm(n=100, mean=12, sd=2), rnorm(n=100, mean=180, sd=40))  # cluster 1
b5  <- c(rnorm(n=100, mean=45, sd=6), rnorm(n=100, mean=80, sd=30))  # cluster 2
b6  <- c(rnorm(n=100, mean=95, sd=6), rnorm(n=100, mean=170, sd=25))  # cluster 3
b7  <- c(rnorm(n=100, mean=10, sd=1), rnorm(n=100, mean=210, sd=30))   # cluster 1 (around 10 and 200)
b8  <- c(rnorm(n=100, mean=55, sd=5), rnorm(n=100, mean=90, sd=15))  # cluster 2 (around 50 and 100)
b9  <- c(rnorm(n=100, mean=89, sd=9), rnorm(n=100, mean=165, sd=20)) # cluster 3 (around 100 and 175)
b10 <- c(rnorm(n=100, mean=8, sd=2), rnorm(n=100, mean=160, sd=30))  # cluster 1
b11 <- c(rnorm(n=100, mean=55, sd=6), rnorm(n=100, mean=110, sd=10))  # cluster 2
b12 <- c(rnorm(n=100, mean=105, sd=6), rnorm(n=100, mean=185, sd=21))  # cluster 3

Answer 1

Clustering solution for example dataset 1:

library(HistDAWass)

# create lists of histogram distributions
lod<-vector("list",6)
lod[[1]] <- data2hist(v1, type = "regular")
lod[[2]] <- data2hist(v2, type = "regular")
lod[[3]] <- data2hist(v3, type = "regular")
lod[[4]] <- data2hist(v4, type = "regular")
lod[[5]] <- data2hist(v5, type = "regular")
lod[[6]] <- data2hist(v6, type = "regular")

# combine separate lists into a matrix of histogram objects
mymat <- new("MatH", nrows=6, ncols=1, ListOfDist=lod, names.rows=c(1:6), names.cols="density")

# calculate clusters pre-specifying number of clusters (k)
WH_kmeans(mymat, k=3)

# the output of this gives the expected 3 clusters

Clustering solution for example dataset 2:

lod<-vector("list",12)
lod[[1]] <- data2hist(b1, type = "regular")
lod[[2]] <- data2hist(b2, type = "regular")
lod[[3]] <- data2hist(b3, type = "regular")
lod[[4]] <- data2hist(b4, type = "regular")
lod[[5]] <- data2hist(b5, type = "regular")
lod[[6]] <- data2hist(b6, type = "regular")
lod[[7]] <- data2hist(b7, type = "regular")
lod[[8]] <- data2hist(b8, type = "regular")
lod[[9]] <- data2hist(b9, type = "regular")
lod[[10]] <- data2hist(b10, type = "regular")
lod[[11]] <- data2hist(b11, type = "regular")
lod[[12]] <- data2hist(b12, type = "regular")

mymat2 <- new("MatH", nrows=12, ncols=1, ListOfDist=lod, names.rows=c(1:12), names.cols="density")

WH_kmeans(mymat2, k=3)

# the output of this also gives the expected 3 clusters

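Determining the "optimal" number of clusters:

I am not sure what the best metric is, but this package reports a quality metric in its output. So, although computing several solutions and then evaluating them is inefficient, using it is my initial solution.

Optimal number of clusters for example dataset 1: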

library(ggplot2)  # needed for the plots below

df = data.frame()
for(i in 2:5) {
  df = rbind(df, data.frame(n_clust = i, quality = WH_kmeans(mymat, k=i)$quality))
}

ggplot(df, aes(x=n_clust, y=quality)) + geom_point(size=4) + geom_line()

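The plot shows a clear increase in "quality" between 2 clusters and 3 clusters, and almost no improvement beyond 3 clusters. So I choose 3 as "optimal". This makes sense, because I deliberately created the original example data to have 3 clusters.

Example dataset 2: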
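The visual reading of the plot can also be approximated in code. The following is only a minimal sketch: `largest_gain_k` is a hypothetical helper (not part of HistDAWass), and the toy quality values are made up for illustration rather than taken from `WH_kmeans`. It simply returns the k reached by the largest jump in quality.

```r
# Sketch: pick k at the largest jump in a quality curve.
# `largest_gain_k` is a hypothetical helper, not a HistDAWass function.
largest_gain_k <- function(n_clust, quality) {
  ord <- order(n_clust)
  n_clust <- n_clust[ord]
  quality <- quality[ord]
  gain <- diff(quality)          # improvement from each k to the next one
  n_clust[-1][which.max(gain)]   # the k reached by the biggest improvement
}

# Made-up values for illustration only (NOT real WH_kmeans output)
toy <- data.frame(n_clust = 2:5, quality = c(0.60, 0.95, 0.96, 0.97))
best <- largest_gain_k(toy$n_clust, toy$quality)
print(best)
```

With the data frame built in the loop above, the call would be `largest_gain_k(df$n_clust, df$quality)`. Note that this rule can never select the smallest k that was tried, so it is a rough aid to reading the plot, not a replacement for it.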

df2 = data.frame()
for(i in 2:11) {
  df2 = rbind(df2, data.frame(n_clust = i, quality = WH_kmeans(mymat2, k=i)$quality))
  # this loop errors out after k=6 for me but the answer is already clear.
}

ggplot(df2) + geom_line(aes(x=n_clust, y=quality))

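Again, the largest increase in quality is from 2 clusters to 3 clusters.

Does anyone have an alternative? This takes a long time to compute for my real dataset of more than 2,500 histograms. Likewise, I suspect it may take too long on other datasets with histograms of multiple variables.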
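One possible alternative, offered as a sketch rather than a tested solution for large data: for a single variable, the Earth Mover's (1-Wasserstein) distance between two distributions equals the area between their quantile functions. If every sample is summarised by the same grid of quantiles, the pairwise distances reduce to a scaled Manhattan distance, which base R's `dist` computes in compiled code. The result can be passed to `hclust`, and the average silhouette width (from the `cluster` package, assumed to be installed) gives one way to compare values of k. The quantile grid size, the average linkage, and the silhouette criterion are all assumptions of this sketch, not something HistDAWass does.

```r
# Sketch: approximate 1D earth mover's distances via quantiles,
# then hierarchical clustering and a silhouette-based choice of k.
set.seed(1)
samples <- list(
  v1 = rnorm(100, mean = 10,  sd = 1),
  v2 = rnorm(100, mean = 50,  sd = 5),
  v3 = rnorm(100, mean = 100, sd = 10),
  v4 = rnorm(100, mean = 12,  sd = 2),
  v5 = rnorm(100, mean = 45,  sd = 6),
  v6 = rnorm(100, mean = 95,  sd = 6)
)

# Summarise every sample on the same grid of 100 quantile midpoints
probs <- seq(0.005, 0.995, by = 0.01)
qmat <- t(sapply(samples, function(x) quantile(x, probs = probs, names = FALSE)))

# Mean absolute difference between quantile vectors approximates the
# 1-Wasserstein (earth mover's) distance between the two distributions
emd <- dist(qmat, method = "manhattan") / length(probs)

# Hierarchical clustering on the distance matrix (average linkage is an
# arbitrary choice here)
hc <- hclust(emd, method = "average")
clusters <- cutree(hc, k = 3)
print(clusters)

# Compare candidate k by average silhouette width (larger is better)
ks <- 2:5
avg_sil <- sapply(ks, function(k) {
  sil <- cluster::silhouette(cutree(hc, k = k), emd)
  mean(sil[, "sil_width"])
})
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = avg_sil))
```

Because the tree is built once and only `cutree` is repeated per k, trying many values of k is cheap compared with rerunning a k-means fit for each one. Whether this is fast enough for 2,500 histograms, and whether it agrees with `WH_kmeans`, would need to be checked on the real data.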
