Clustering with PAM finds no isolated clusters, but t-SNE shows well-formed clusters

Asked: 2017-08-18 21:25:37

Tags: r cluster-analysis unsupervised-learning dimensionality-reduction stochastic-process

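I'm building a clustering algorithm for data I haven't seen yet, so in the meantime I'm working with some fake data. The PAM results show that I don't have any isolated clusters, but a ggplot of the t-SNE embedding shows nicely formed clusters. I suspect this is due to my fake data. Does anyone have an idea why this happens?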

Here is the data. Note that Age and howOld represent different things:

library(dplyr)
library(cluster)
library(Rtsne)
library(ggplot2)

set.seed(1987)
n = 350
clust_dat <- 
data.frame(personId = 1:n,
         networkPref = sample(c("topic", "jobtitle", "orgtype"),
                             size = n, replace = TRUE,
                             prob = c(0.56, 0.20, 0.24)),
         Age = sample(23:65, size = n, replace = TRUE),
         familyImp = sample(c(1, 2, 3, 4, 5), size = n, replace = TRUE, 
                            prob = c(0.02, 0.01, 0.10, 0.4, 0.83)),
         howOld = sample(25:30, size = n, replace = TRUE,
                         prob = c(.40, .30, .20, .05, .03, .02)),
         horror = sample(c("Yes", "No"), size = n, replace = TRUE, 
                         prob = c(0.27, 0.73)),
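         sailBoat = sample(c("Yes", "No"), size = n, replace = TRUE, 
                           prob = c(0.58, 0.42)),
         # R >= 4.0 no longer converts strings to factors by default;
         # daisy() needs factor (not character) columns
         stringsAsFactors = TRUE)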

After first defining the levels of my ordinal variable, here is my model building:

clust_dat$familyImp <- factor(clust_dat$familyImp, 
                          levels = c("1", "2", "3", "4", "5"), 
                          ordered = TRUE)

gower_dist <- daisy(clust_dat[, -1], metric = "gower")
gower_matrix <- as.matrix(gower_dist)

#find silhouette width for many PAM models
sil_width <- c(NA)
for (i in 2:ceiling(nrow(clust_dat)/9)) {
  pam_fit <- pam(gower_dist, 
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

#build PAM model with best silhouette width
pam_fit <- pam(gower_dist, diss = TRUE, k = which.max(sil_width))

When I pull the isolation information from the PAM fit, I get:

pam_fit$isolation

 1  2  3  4  5  6  7  8  9 10 11 12 
no no no no no no no no no no no no 
Levels: no L L*
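
For reference, the `cluster` documentation defines a cluster as an L*-cluster if its diameter is smaller than its separation, and as an L-cluster if, for every observation in it, the largest dissimilarity to another member is smaller than the smallest dissimilarity to any observation outside the cluster. A value of `no` means neither condition holds. The sketch below is one way to inspect the numbers behind that flag. The data frame `toy` and the choice `k = 6` are assumptions made for illustration: a reduced stand-in for `clust_dat`, not the original draws.

```r
library(cluster)

set.seed(1987)
n <- 350

# Hypothetical reduced stand-in for clust_dat: two nominal columns, one numeric
toy <- data.frame(
  networkPref = factor(sample(c("topic", "jobtitle", "orgtype"), n,
                              replace = TRUE, prob = c(0.56, 0.20, 0.24))),
  Age         = sample(23:65, n, replace = TRUE),
  horror      = factor(sample(c("Yes", "No"), n,
                              replace = TRUE, prob = c(0.27, 0.73)))
)

toy_dist <- daisy(toy, metric = "gower")
toy_fit  <- pam(toy_dist, diss = TRUE, k = 6)  # k = 6 is an arbitrary choice

# clusinfo has one row per cluster:
# size, max_diss, av_diss, diameter, separation
info <- as.data.frame(toy_fit$clusinfo)

# The L* condition: diameter strictly smaller than separation
info$diameter_lt_separation <- info$diameter < info$separation
info$isolation <- toy_fit$isolation
print(info)
```

Comparing the `diameter` and `separation` columns side by side shows how far each cluster is from meeting the isolation criteria, which is more informative than the `no` label alone.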

But the plot shows some well-formed clusters:

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- 
  tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = clust_dat$personId)

ggplot(tsne_data, aes(x = X, y = Y)) +
  geom_point(aes(color = cluster))

Any ideas? If I remove all the continuous variables, I get very well-defined clusters, but some of them are then flagged as isolated...

1 Answer:

Answer 0: (score: 0)

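Given the way you generate your data, there should not be any clusters beyond artifacts of the categorical labels you use. Based on the frequencies you use, I would expect 8 "clusters" that correspond to simple combinations of the attributes.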
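
One way to check this is to tabulate the attribute combinations directly. The sketch below regenerates only the three nominal columns, so the draws will not match `clust_dat` exactly (the random stream is consumed in a different order); the name `cats` is made up for illustration. Note that these three variables allow 3 × 2 × 2 = 12 combinations, which is also the number of clusters in the `pam_fit$isolation` output in the question. Whether the PAM clusters actually line up with those combinations is something to verify on the real fit, not something this sketch establishes.

```r
set.seed(1987)
n <- 350

# Hypothetical re-draw of the three nominal attributes only
cats <- data.frame(
  networkPref = sample(c("topic", "jobtitle", "orgtype"), n,
                       replace = TRUE, prob = c(0.56, 0.20, 0.24)),
  horror      = sample(c("Yes", "No"), n,
                       replace = TRUE, prob = c(0.27, 0.73)),
  sailBoat    = sample(c("Yes", "No"), n,
                       replace = TRUE, prob = c(0.58, 0.42))
)

# One row per attribute combination, with its count, largest first
combos <- as.data.frame(table(cats), responseName = "n")
combos <- combos[order(-combos$n), ]
print(combos)
```

On the original objects, a cross-tabulation such as `table(pam_fit$clustering, interaction(clust_dat$networkPref, clust_dat$horror, clust_dat$sailBoat))` would show how closely the clusters follow these combinations.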

If you generate i.i.d. data, it is not supposed to cluster!

So I would rather assume that the problem lies in your visualization.

See, e.g., this answer on the problems of "seeing" clusters in t-SNE.
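
A rough way to see how much the picture depends on t-SNE settings is to embed data that has no cluster structure by construction and vary the perplexity. The following is a minimal sketch under assumptions: the matrix `noise`, its five columns, and the perplexity values 5 and 50 are all arbitrary choices for illustration and do not come from the question.

```r
library(Rtsne)

set.seed(1987)
n <- 350

# i.i.d. uniform noise: no cluster structure by construction
noise <- matrix(runif(n * 5), nrow = n)

# Same data, two perplexity settings (both satisfy 3 * perplexity < n - 1)
emb_low  <- Rtsne(noise, perplexity = 5)$Y
emb_high <- Rtsne(noise, perplexity = 50)$Y

# Side-by-side scatter plots of the two embeddings
op <- par(mfrow = c(1, 2))
plot(emb_low,  main = "perplexity = 5",  xlab = "X", ylab = "Y")
plot(emb_high, main = "perplexity = 50", xlab = "X", ylab = "Y")
par(op)
```

If visually distinct blobs appear in either panel, they cannot reflect real groups, since the input has none. The same comparison can be repeated on `gower_dist` with `is_distance = TRUE` to see how stable the apparent clusters are.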