Let's say I have a data table:
data = data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2)
)
I apply k-means to it and get the centroids and labels. After some processing I have:
data = data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2),
centroids = c(9.5, 9.5, 1.5, 1.5)
)
where
cluster_centroids <- c(9.5, 1.5)
cluster_labels <- c(1, 2)
How can I relabel the cluster column according to the centroid order, so that the desired result looks like this:
data = data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(2, 2, 1, 1),
centroids = c(9.5, 9.5, 1.5, 1.5)
)
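I want to order the labels by the values of the centroids.
I want the labels to follow the centroids: the higher the population, the higher the label. (Please consider the general case, where there are k clusters and the values in the cluster column are in no particular order. For example, in the original cluster column, label 1 might be the most populated cities, label k the second most populated cities, and so on.)
I don't want to sort the rows of the data table. I want to change NYC's label from 1 to 2 and Hawaii's label from 2 to 1, that is, map (1, 2) to (2, 1), so that the most populated cities get the highest label and the least populated cities get label 1.
The number of clusters in the real problem is not 2. I just wanted to keep it simple.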
Answer 0 (score: 3)
library(data.table)
library(dplyr)
dt <- data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2)
) %>% group_by(cluster) %>% # create the centroid variable
mutate(centroid = mean(population)) %>% ungroup()
# rank the centroids: frankv() ranks in ascending order, so larger centroids get larger integers
# assign the result as the cluster
dt %>% mutate("cluster" = frankv(centroid, ties.method = "dense"))
# A tibble: 4 x 4
city population cluster centroid
<chr> <dbl> <int> <dbl>
1 NYC 10 2 9.5
2 LA 9 2 9.5
3 Hawaii 1 1 1.5
4 Essex 2 1 1.5
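For the general case with k clusters, the same ranking idea can be applied to the vector of centroids itself rather than to a per-row centroid column. The following is only a sketch in base R, assuming the labels come from stats::kmeans; the data and the starting centers are made up for illustration, and the starting centers are fixed only so that the initial labels are deliberately out of order.

```r
# Hypothetical one-dimensional data with three well-separated groups
population <- c(10, 9, 1, 2, 50, 52)

# Fixed starting centers (an assumption for this example) so that
# label 1 is the middle group, label 2 the largest, label 3 the smallest
km <- kmeans(population, centers = matrix(c(10, 50, 1), ncol = 1))

# rank() of the centroids gives, for each old label, its new label:
# the smallest centroid gets 1, the largest gets k
new_label_of <- rank(km$centers[, 1])

# Look up the new label for every observation; row order is untouched
relabeled <- unname(new_label_of[km$cluster])
relabeled
```

Because the lookup is indexed by the old label, nothing depends on how the original labels happened to be numbered.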
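You can reverse-code any variable with the formula max(x) + min(x) - x. The code below uses max(.) + 1 - ., which is equivalent when the smallest label is 1: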
dt <- data.table(city = c("NYC", "LA", "Hawaii", "Essex"),
population = c(10, 9, 1, 2),
cluster = c(1, 1, 2, 2)
)
dt %>% mutate_at("cluster", ~max(.)+1-.)
city population cluster
1 NYC 10 2
2 LA 9 2
3 Hawaii 1 1
4 Essex 2 1
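Note that reverse coding only flips the existing label order, so it matches the centroid order only when the original labels happen to be in exactly the opposite order. A small sketch with made-up centroids for three clusters (all names and values here are hypothetical) shows the difference from ranking the centroids:

```r
# Hypothetical centroids, indexed by the original label 1, 2, 3
centroid_by_label <- c(9.5, 1.5, 5)
labels <- c(1, 1, 2, 2, 3, 3)

# Reverse coding: 1 -> 3, 2 -> 2, 3 -> 1, regardless of the centroids
reverse_coded <- max(labels) + min(labels) - labels

# Ranking the centroids: 1 -> 3, 2 -> 1, 3 -> 2
rank_based <- rank(centroid_by_label)[labels]

rbind(reverse_coded, rank_based)
```

Here the two results disagree for labels 2 and 3, because reverse coding never looks at the centroid values.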
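Another solution, if you have more than two categories and you don't want to "reverse code" them as such, is to use case_when and spell out the mapping by hand: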
dt %>% mutate("cluster" = case_when(cluster == 2 ~ 1, cluster == 1 ~ 2))
city population cluster
1 NYC 10 2
2 LA 9 2
3 Hawaii 1 1
4 Essex 2 1
Answer 1 (score: 2)
"I want to order them by the values of the centroids"
Another way:
# OP's input
clusterDT = data.table(old_label = 1:2, centroid = c(9.5, 1.5))
# overwrite labels by sorting and assigning row number
clusterDT[order(centroid), new_label := .I]
# update data
data[, cluster := clusterDT[.SD, on=.(old_label = cluster), x.new_label]]
city population cluster
1: NYC 10 2
2: LA 9 2
3: Hawaii 1 1
4: Essex 2 1
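For completeness, here is a self-contained sketch of the same lookup-table idea with three clusters whose labels are in no particular order. It is a variation rather than the exact code above: the centroids are computed from the data, frank() assigns the new labels, and an update join writes them back. The extra cities and all population values are made up for illustration.

```r
library(data.table)

# Hypothetical data: label 2 has the largest centroid, label 3 the smallest
data <- data.table(
  city = c("NYC", "LA", "Hawaii", "Essex", "Austin", "Boise"),
  population = c(10, 9, 1, 2, 5, 6),
  cluster = c(2L, 2L, 3L, 3L, 1L, 1L)
)

# One row per old label, with its centroid
clusterDT <- data[, .(centroid = mean(population)), by = .(old_label = cluster)]

# Smallest centroid gets new label 1, largest gets k
clusterDT[, new_label := frank(centroid, ties.method = "dense")]

# Update join: overwrite each row's label in place, keeping the row order
data[clusterDT, on = .(cluster = old_label), cluster := i.new_label]
data
```

The rows of data are never sorted; only the values in the cluster column change.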