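I have a dataset with two different clustering solutions (one run externally, one done by myself). I want to compare them using the tanglegram and entanglement commands from the dendextend package, but I keep getting errors about the labels and I can't figure out why. To illustrate, I put together a simple example using mtcars: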
df1 <- mtcars
df1$ID <- row.names(mtcars)
clusts <- 1:3
# simulate two different cluster algorithms as columns containing cluster group
df1$cl1 <- sample(clusts, nrow(df1), replace = TRUE)
df1$cl2 <- sample(clusts, nrow(df1), replace = TRUE)
table(df1$cl1, df1$cl2)
# Make a copy
df2 <- df1
# Use data.tree to convert df's to data.trees
library(data.tree)
df1$pathString <- paste("Tree1", df1$cl1, df1$ID, sep = "/")
df2$pathString <- paste("Tree2", df2$cl2, df2$ID, sep = "/")
node1 <- as.Node(df1)
node2 <- as.Node(df2)
# Convert to dendrograms and compare using dendextend
library(dendextend)
dend1 <- as.dendrogram(node1)
dend2 <- as.dendrogram(node2)
tanglegram(dend1, dend2)
entanglement(dend1, dend2)
This produces the following errors:
> tanglegram(dend1, dend2)
Error in dend12[[1]] : subscript out of bounds
In addition: Warning message:
In intersect_trees(dend1, dend2, warn = TRUE) :
The two trees had no common labels!
> entanglement(dend1, dend2)
Error in match_order_by_labels(dend2, dend1) :
labels do not match in both trees. Please make sure to fix the labels names!
(make sure also that the labels of BOTH trees are 'character')
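I don't understand why these errors occur, and inspecting the data structures hasn't given me an answer. Any helpful insight would be much appreciated!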
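Since both error messages point at the leaf labels, one way to narrow this down is to compare the labels of the two trees directly. The sketch below uses base R only; compare_leaf_labels is a hypothetical helper name (it is not part of dendextend), and it is demonstrated on two hclust-derived dendrograms rather than on the data.tree ones, so it only shows what a healthy comparison looks like.

```r
# Hypothetical helper (not a dendextend function): compare the leaf labels
# of two dendrograms and report their types and overlap.
compare_leaf_labels <- function(d1, d2) {
  l1 <- labels(d1)
  l2 <- labels(d2)
  list(
    types   = c(class(l1)[1], class(l2)[1]),  # both should be "character"
    common  = intersect(l1, l2),              # labels present in both trees
    only_d1 = setdiff(l1, l2),                # labels missing from d2
    only_d2 = setdiff(l2, l1)                 # labels missing from d1
  )
}

# Demonstration on two hclust-derived dendrograms of the same data.
d_a <- as.dendrogram(hclust(dist(mtcars), method = "complete"))
d_b <- as.dendrogram(hclust(dist(mtcars), method = "average"))
res <- compare_leaf_labels(d_a, d_b)
str(res)
```

Calling compare_leaf_labels(dend1, dend2) on the data.tree-derived objects should show whether labels() returns anything usable for them; an empty common element would be consistent with the "no common labels" warning.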
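Edit, regarding @emilliman5's answer below: I understand that my dendrograms are unresolved. I am not using hierarchical clustering, so I do want to compare unresolved dendrograms. Furthermore, I adapted some code from this question: How do I manually create a dendrogram (or "hclust") object? (in R) to build the dendrograms myself, and although they are just as unresolved, this code does produce a tanglegram. However, it is not a solution, because it is too hard to generalize to different parameters (my trees vary in depth/resolution, and trying to write a function that encodes trees with varying levels of nesting is a road to madness!).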
tree1 <- list()
attributes(tree1) <- list(members=nrow(df1), height=3)
class(tree1) <- "dendrogram"
# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
leaves[[i]] <- which(df1$cl1 == (i) )
}
for(i in 1:length(clusts)){
tree1[[i]] <- list()
attributes(tree1[[i]]) <- list(members=length(which(df1$cl1==i)), height=2, edgetext=i)
for( j in 1:length(leaves[[i]]) ){
tree1[[i]][[j]] <- list()
tree1[[i]][[j]] <- leaves[[i]]
attributes(tree1[[i]][[j]]) <- list(members = 1, height = 1,
label = as.character(leaves[[i]][j]),
leaf = TRUE)
}
}
plot(tree1, center=TRUE)
tree2 <- list()
attributes(tree2) <- list(members=nrow(df2), height=3)
class(tree2) <- "dendrogram"
# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
leaves[[i]] <- which(df2$cl2 == (i) )
}
for(i in 1:length(clusts)){
tree2[[i]] <- list()
attributes(tree2[[i]]) <- list(members=length(which(df2$cl2==i)), height=2, edgetext=i)
for( j in 1:length(leaves[[i]]) ){
tree2[[i]][[j]] <- list()
tree2[[i]][[j]] <- leaves[[i]]
attributes(tree2[[i]][[j]]) <- list(members = 1, height = 1,
label = as.character(leaves[[i]][j]),
leaf = TRUE)
}
}
plot(tree2, center=TRUE)
tanglegram(tree1, tree2)
It's ugly, but it is exactly what I want/need.
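For the fixed two-level case (root, clusters, leaves), the duplicated loops above can be folded into one function. This is only a sketch under assumptions: clusters_to_dendrogram is a hypothetical name, the heights 3/2/1 simply mirror the code above, and each leaf stores a single item index, whereas the loops above assign the whole leaves[[i]] vector to every leaf. It does not address trees with deeper or varying nesting.

```r
# Hypothetical helper: build a two-level dendrogram (root -> clusters -> leaves)
# from item labels and a parallel vector of cluster assignments.
clusters_to_dendrogram <- function(item_labels, clusters) {
  stopifnot(length(item_labels) == length(clusters))
  # Item indices grouped by cluster; drop = TRUE skips empty factor levels
  groups <- split(seq_along(item_labels), clusters, drop = TRUE)
  branches <- lapply(names(groups), function(g) {
    idx <- groups[[g]]
    leaf_nodes <- lapply(idx, function(k) {
      # Assumption: one integer index per leaf, plus the usual leaf attributes
      structure(k, members = 1L, height = 1,
                label = as.character(item_labels[k]), leaf = TRUE)
    })
    structure(leaf_nodes, members = length(idx), height = 2, edgetext = g)
  })
  structure(branches, members = length(item_labels), height = 3,
            class = "dendrogram")
}

# Simulated cluster assignment, as in the example at the top
set.seed(42)
cl <- sample(1:3, nrow(mtcars), replace = TRUE)
tree <- clusters_to_dendrogram(rownames(mtcars), cl)
labels(tree)
```

Two trees built this way from cl1 and cl2 could then be passed to tanglegram() in the same way as tree1 and tree2; I have not checked this against every dendextend version.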
If I look inside the dendrogram to try to figure out why this works:
> str(unclass(tree1[[1]][[1]]))
atomic [1:12] 1 8 9 10 11 13 16 22 25 27 ...
- attr(*, "members")= num 1
- attr(*, "height")= num 1
- attr(*, "label")= chr "1"
- attr(*, "leaf")= logi TRUE
Notice that the leaf holds a vector. Peeking into an hclust-derived dendrogram, we see that its leaf also holds a vector/atomic:
> str(unclass(as.dendrogram(hclust(dist(df1))))[[1]][[1]])
atomic [1:1] 31
- attr(*, "members")= int 1
- attr(*, "height")= num 0
- attr(*, "label")= chr "Maserati Bora"
- attr(*, "leaf")= logi TRUE
However, peeking into the dendrogram created via data.tree, I notice that there is no vector/atomic:
> str(unclass(dend1[[1]][[1]]))
list()
- attr(*, "label")= chr "Mazda RX4"
- attr(*, "members")= num 1
- attr(*, "height")= num 0
- attr(*, "leaf")= logi TRUE
Could this missing atomic be what causes the problem?
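To check this for every leaf rather than only the first one, the three str() calls can be generalized into a small recursive summary. leaf_summary is a hypothetical helper written in base R; it only reports what each leaf contains and does not by itself establish whether the empty list() leaves are the cause of the errors.

```r
# Hypothetical helper: walk a dendrogram-like nested list and report, for each
# leaf, its label, the storage type of its contents, and its length.
leaf_summary <- function(node) {
  node <- unclass(node)
  if (isTRUE(attr(node, "leaf"))) {
    label <- attr(node, "label")
    if (is.null(label)) label <- NA_character_
    return(data.frame(label  = as.character(label),
                      type   = typeof(node),   # e.g. "integer" or "list"
                      length = length(node),   # 0 for an empty list() leaf
                      stringsAsFactors = FALSE))
  }
  # Internal node: recurse into the children and stack the results
  do.call(rbind, lapply(node, leaf_summary))
}

# Demonstration on an hclust-derived dendrogram
hc_leaves <- leaf_summary(as.dendrogram(hclust(dist(mtcars))))
head(hc_leaves)
table(hc_leaves$type, hc_leaves$length)
```

Running leaf_summary(dend1) and leaf_summary(tree1) side by side should make the structural difference between the data.tree leaves and the hand-built leaves visible for the whole tree.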