Simple KMeans model
Data
library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data <- tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
               y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))
Model
kmeans_k2 <- kmeans(data, centers = 2, iter.max = 9)
Tidymodel results
tidy(kmeans_k2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 9.60 10.8 18 255. 1
2 5.22 5.20 22 76.6 2
glance(kmeans_k2)
# A tibble: 1 x 4
totss tot.withinss betweenss iter
<dbl> <dbl> <dbl> <int>
1 836. 331. 505. 1
augment(kmeans_k2, data)
# A tibble: 40 x 3
x y .cluster
<dbl> <dbl> <fct>
1 6.37 5.21 2
2 4.44 4.64 2
3 5.36 5.76 2
4 5.63 4.27 2
5 5.40 3.63 2
6 4.89 5.43 2
7 6.51 4.19 2
8 4.91 6.44 2
9 7.02 4.57 2
10 4.94 5.66 2
# ... with 30 more rows
Silhouette calculation
sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
summary(sil_k2)$si.summary[4]
Mean
0.5466503
Using purrr's map function to create results for several different values of k in a nested tibble
kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
k km_model tidydata glancedata augmentdata
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3 3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k123, 3, 2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 5.63 5.10 26 117. 1
2 11.3 10.7 14 176. 2
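The question is: how do I add the silhouette scores to the nested tibble? The silhouette function needs the cluster assignments from each model, and I'm not sure how to supply them. Obviously, I can pick out a single instance, e.g.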
data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
But when I try to use it with map, it doesn't work
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.
I can create a function, which again works
my_fn <- function(f_cluster, f_data) silhouette(f_cluster, dist(f_data))
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
But it fails when I use it with map.
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.
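I suspect the problem has to do with how I'm trying to retrieve $cluster from the nested models, because I also tried extracting it into its own column and couldn't get that to work either.
Answer 0 (score: 1)
Posting this as an answer because comments don't really accommodate much code.
The following works for me: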
kmeans_k123 <- tibble(k = 1:3) %>%
  mutate(km_model = map(k, ~kmeans(data, .x)),
         tidydata = map(km_model, tidy),
         glancedata = map(km_model, glance),
         augmentdata = map(km_model, augment, data),
         silhouettedata = map(augmentdata, ~ silhouette(as.numeric(levels(.x$.cluster))[.x$.cluster], dist(data))))
unnest(kmeans_k123, silhouettedata)
# A tibble: 81 x 6
k km_model tidydata glancedata augmentdata silhouettedata[,"cluster"] [,"neighbor"] [,"sil_width"]
<int> <list> <list> <list> <list> <dbl> <dbl> <dbl>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> NA NA NA
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.743
3 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.776
4 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.772
5 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.771
6 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.742
7 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.794
8 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.723
9 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.713
10 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]> 2 1 0.683
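As a possible alternative, the cluster assignments can also be read straight from each fitted model instead of from augmentdata. The sketch below is only an illustration under stated assumptions: it rebuilds the question's data so it runs on its own, the names models, sil_by_k and avg_sil_width are made up here, and it is restricted to k = 2:3 because silhouette() yields NA for a single cluster (as the k = 1 row above shows). It also shows why the original map2(km_model$cluster, ...) attempt came back with size 0: km_model is a list of models, so `$cluster` is looked up on the outer list and returns NULL.

```r
library(tidyverse)
library(cluster)

# Same simulated data as in the question.
set.seed(42)
data <- tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
               y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))

# Hypothetical name: one fitted model per k, k >= 2 only.
models <- tibble(k = 2:3) %>%
  mutate(km_model = map(k, ~ kmeans(data, .x)))

# `$` on the list column searches the outer (unnamed) list, not each model,
# so there is nothing called "cluster" to find.
is.null(models$km_model$cluster)

# Map over the models and take $cluster from each element instead.
# The stored integer cluster vector is already numeric, so no factor
# conversion is needed on this route.
sil_by_k <- models %>%
  mutate(sildata = map(km_model, ~ silhouette(.x$cluster, dist(data))),
         avg_sil_width = map_dbl(sildata, ~ mean(.x[, "sil_width"])))

select(sil_by_k, k, avg_sil_width)
```

Keeping the average silhouette width as a plain numeric column makes it easy to compare the candidate values of k side by side, while the full silhouette objects stay available in sildata.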
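Regarding the use of as.numeric(levels(.x$.cluster))[.x$.cluster]: this is needed because broom::augment(), like broom::tidy(), returns the cluster variable as a factor (the <fct> columns in the output above), whereas cluster::silhouette() requires the cluster variable to be numeric. This answer explains why that particular line of code is used to convert a factor with numeric labels into a numeric variable.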
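The difference between the two conversions can be seen on a small factor. This is a minimal sketch with made-up labels ("10" and "20"), not the cluster labels from the models above: as.numeric() on a factor returns the internal integer codes, while indexing the numeric levels by the factor recovers the labels themselves.

```r
# Hypothetical factor whose labels are numbers stored as text.
f <- factor(c("10", "20", "10"))

# Direct conversion returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Converting the levels first and indexing by the factor returns the labels.
values <- as.numeric(levels(f))[f]

codes
values
```

Here codes is 1 2 1 and values is 10 20 10, which is why the levels-based form is the safer idiom whenever the labels carry meaning.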