Dendextend: how to color the labels of a dendrogram according to defined groups

Time: 2017-07-20 14:18:47

Tags: r hierarchical-clustering dendextend

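I am trying to use the awesome R package dendextend to plot a dendrogram and color its branches and labels according to a set of previously defined group labels. I have read your answers on Stack Overflow, as well as the FAQ in the dendextend vignette, but I am still not sure how to achieve my goal.

Suppose I have a data frame whose first column contains the names of the individuals to be clustered, followed by several columns containing the factors to analyze, and a last column containing the group of each individual (see the table below).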

individual  282856  282960  283275  283503  283572  283614  284015  group
pat15612    0   0   0   0   0   0   0   g2
pat38736    0   0   0   0   0   0   0   g2
pat38740    0   0   0   0   0   1   0   g2
pat38742    0   0   0   0   0   1   0   g4
pat38743    0   0   1   0   0   1   0   g3
pat38745    0   0   1   0   1   0   0   g4
pat38750    0   0   0   1   0   1   0   g4
pat38753    0   0   0   1   0   0   0   g3
pat40120    0   0   0   0   1   0   0   g4
pat40124    0   0   0   0   1   0   0   g4
pat40125    0   0   0   0   1   1   0   g4
pat40126    0   0   0   1   0   0   0   g4
pat40137    1   0   0   0   0   0   0   g4
pat40142    0   1   0   0   0   0   0   g5
pat46903    0   0   0   0   0   1   0   g1
pat67612    1   0   0   0   1   0   0   g1
pat67621    0   0   0   0   0   0   0   g2
pat67630    0   0   1   0   0   0   0   g2
pat67634    0   0   0   0   0   0   0   g5
pat67657    0   1   0   1   0   0   0   g5
pat67680    0   0   0   0   0   1   0   g5
pat67683    0   0   1   1   0   0   0   g6

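How can I color the branch and label of each individual according to the group they belong to, even if members of the same group end up in different parts of the tree?

If that is possible, is there a way to define the color assigned to each group?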

2 answers:

Answer 0: (score: 2)

I'm glad you worked this out yourself. A simpler solution is to use the order_value = TRUE argument of the set function. For example:

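library(dendextend)
# drop the Species column and keep only the numeric measurements
iris2 <- iris[,-5]
# repeat the species name three times in each row label
rownames(iris2) <- paste(iris[,5],iris[,5],iris[,5], rownames(iris2))
dend <- iris2 %>% dist %>% hclust %>% as.dendrogram
# order_value = TRUE: the colors are given in the order of the original data
dend <- dend %>% set("labels_colors", as.numeric(iris[,5]), order_value = TRUE) %>%
        set("labels_cex", .5)
par(mar = c(4,1,0,8))
plot(dend, horiz = TRUE)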

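This results in the plot below (as you can see, the label colors are based on another variable, "Species", from the iris data set):

[Image: horizontal dendrogram of the iris data with labels colored by species]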

(P.S.: I repeated the species name three times in each label to make it easier to see how the color relates to the label.)
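Applied to the kind of data shown in the question, the same idea might look like the sketch below. The `dataset` data frame is a hypothetical six-row, three-column excerpt of the question's table, the `group_colors` palette is an arbitrary choice, and `method = "binary"` is an assumption about how the 0/1 factors should be compared. It is only meant to show how one explicit color per group can be passed to `set()` with `order_value = TRUE`.

```r
library(dendextend)

# Hypothetical excerpt of the question's table (six patients, three factors)
dataset <- data.frame(
  individual = c("pat15612", "pat38740", "pat38743",
                 "pat38745", "pat40142", "pat46903"),
  `283275` = c(0, 0, 1, 1, 0, 0),
  `283572` = c(0, 0, 0, 1, 0, 0),
  `283614` = c(0, 1, 1, 0, 0, 1),
  group = c("g2", "g2", "g3", "g4", "g5", "g1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only the factor columns and label the rows by patient
features <- dataset[, c("283275", "283572", "283614")]
rownames(features) <- dataset$individual

# One explicit color per group (arbitrary palette, chosen for illustration)
group_colors <- c(g1 = "green", g2 = "gold", g3 = "pink",
                  g4 = "purple", g5 = "blue", g6 = "red")

# Colors in the order of the rows of the data, not of the dendrogram
label_cols <- unname(group_colors[dataset$group])

dend <- as.dendrogram(hclust(dist(features, method = "binary")))

# order_value = TRUE lets set() reorder the colors to match the dendrogram
dend <- set(dend, "labels_colors", label_cols, order_value = TRUE)

par(mar = c(4, 1, 0, 8))
plot(dend, horiz = TRUE)
```

Because the palette is a named vector, changing the color of a group only requires editing one entry of `group_colors`.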

Answer 1: (score: 1)

I was able to do it with another package called "sparcl", based on an earlier post (How to colour the labels of a dendrogram by an additional factor variable in R).

Here is my code:

# load the dataset ...
# dataset2 is assumed to hold only the 0/1 factor columns of dataset
# calculate distances: stats::dist() has no "Jaccard" method;
# for 0/1 data, method = "binary" computes the Jaccard distance
d <- dist(dataset2, method = "binary")
# hierarchically cluster the data
hc <- hclust(d)
# convert to a dendrogram and plot it
hcd <- as.dendrogram(hc)
plot(hcd, cex = 0.6)
# create labels
labs <- dataset$individual
# factor variable for colours
Var <- dataset$group
# convert group names to colours
varCol <- gsub("g1.*", "green", Var)
varCol <- gsub("g2.*", "gold", varCol)
varCol <- gsub("g3.*", "pink", varCol)
varCol <- gsub("g4.*", "purple", varCol)
varCol <- gsub("g5.*", "blue", varCol)
varCol <- gsub("g6.*", "red", varCol)
# colour-code dendrogram branches by a factor
library(sparcl)
ColorDendrogram(hc, y = varCol, branchlength = 0.9, labels = labs,
                xlab = "", ylab = "", sub = "")
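A note on the distance step: base R's `dist()` does not accept a method named "Jaccard" (that name is offered by other packages, for example `proxy`). For 0/1 data, its `method = "binary"` computes the Jaccard distance. The following base-R check, with two made-up binary profiles, compares the built-in result with the value computed by hand:

```r
# Two binary profiles (hypothetical values, for illustration only)
x <- rbind(a = c(1, 0, 1, 1, 0),
           b = c(1, 1, 0, 1, 0))

# Base R: method = "binary" treats non-zero entries as "on"
d_builtin <- as.numeric(dist(x, method = "binary"))

# By hand: 1 - (columns on in both) / (columns on in at least one)
both   <- sum(x["a", ] == 1 & x["b", ] == 1)
either <- sum(x["a", ] == 1 | x["b", ] == 1)
d_manual <- 1 - both / either

# The two values agree
print(all.equal(d_builtin, d_manual))
```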

Finally, I managed to work out a solution with the "dendextend" package, based on the example in this post (How to colour the labels of a dendrogram by an additional factor variable in R):

# install.packages("dendextend")
library(dendextend)

# load the dataset ...
# keep only the factor columns (same dataset as in the example);
# the name and group columns must not enter the distance calculation
dataset2 <- dataset[, !(names(dataset) %in% c("individual", "group"))]

# calculate the dendrogram
dend <- as.dendrogram(hclust(dist(dataset2)))

# capture the colors from the "group" column;
# factor() is needed when the column was read as character
colors_to_use <- as.numeric(factor(dataset$group))
colors_to_use

# sort the colors based on their order in dend:
colors_to_use <- colors_to_use[order.dendrogram(dend)]
colors_to_use

# apply the colors
labels_colors(dend) <- colors_to_use

# patient labels now have a color based on their group
labels_colors(dend)
plot(dend, main = "Color in labels")
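An alternative to reordering by position is to look the colors up by label name, which also makes it easy to choose the color of each group. The sketch below is only an illustration: the six patients are a hypothetical excerpt of the question's table, and `group_of` and `group_colors` are names introduced here, with an arbitrary palette. It checks that the positional and the name-based lookups give the same colors before applying them.

```r
library(dendextend)

# Hypothetical excerpt of the question's table
individual <- c("pat15612", "pat38740", "pat38743",
                "pat38745", "pat40142", "pat46903")
group      <- c("g2", "g2", "g3", "g4", "g5", "g1")
features   <- rbind(c(0, 0, 0), c(0, 0, 1), c(1, 0, 1),
                    c(1, 1, 0), c(0, 0, 0), c(0, 0, 1))
rownames(features) <- individual

dend <- as.dendrogram(hclust(dist(features)))

# Named lookups: individual -> group, group -> color (arbitrary palette)
group_of     <- setNames(group, individual)
group_colors <- c(g1 = "green", g2 = "gold", g3 = "pink",
                  g4 = "purple", g5 = "blue", g6 = "red")

# Positional approach: reorder the data-order colors with order.dendrogram()
by_position <- unname(group_colors[group][order.dendrogram(dend)])

# Name-based approach: index by the labels in plotting order
by_label <- unname(group_colors[group_of[labels(dend)]])

# Both approaches yield the same colors
stopifnot(identical(by_position, by_label))

labels_colors(dend) <- by_label
plot(dend, main = "Labels colored by group")
```

The name-based lookup does not depend on the row order of the data, so it keeps working if the data frame is sorted or subset before clustering, as long as the labels are unique.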