Difficulty comparing dendrograms

Asked: 2017-02-22 15:56:06

Tags: r hierarchical-clustering dendextend

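I have a dataset with two different clustering solutions (one run externally, one done myself). I want to compare them using the tanglegram and entanglement commands from the dendextend package, but I keep getting errors about labels and I can't figure out why. To illustrate, I wrote a simple example using mtcars: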

df1 <- mtcars
df1$ID <- row.names(mtcars)
clusts <- 1:3

# simulate two different cluster algorithms as columns containing cluster group
df1$cl1 <- sample(clusts, nrow(df1), replace = TRUE)
df1$cl2 <- sample(clusts, nrow(df1), replace = TRUE)
table(df1$cl1, df1$cl2)

# Make a copy
df2 = df1

# Use data.tree to convert df's to data.trees
library(data.tree)
df1$pathString <- paste("Tree1", df1$cl1, df1$ID, sep = "/")
df2$pathString <- paste("Tree2", df2$cl2, df2$ID, sep = "/")

node1 <- as.Node(df1)
node2 <- as.Node(df2)

# Convert to dendrograms and compare using dendextend
library(dendextend)
dend1 <- as.dendrogram(node1)
dend2 <- as.dendrogram(node2)

tanglegram(dend1, dend2)
entanglement(dend1, dend2)

This produces the following errors:

> tanglegram(dend1, dend2)
Error in dend12[[1]] : subscript out of bounds
In addition: Warning message:
In intersect_trees(dend1, dend2, warn = TRUE) :
  The two trees had no common labels!
> entanglement(dend1, dend2)
Error in match_order_by_labels(dend2, dend1) : 
  labels do not match in both trees.  Please make sure to fix the labels    names!
(make sure also that the labels of BOTH trees are 'character')

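I don't understand why these errors occur, and inspecting the data structures hasn't given me an answer. Any helpful pointers would be much appreciated!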
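Since both errors complain about labels, one first diagnostic is to compare the two label sets directly. The following is a minimal sketch in base R; the helper name `compare_labels` is hypothetical, and the two hclust-derived trees are only stand-ins with known-good labels. The same call could be applied to `dend1` and `dend2` to see how their labels differ.

```r
# Hypothetical helper: summarise how the leaf labels of two dendrograms relate.
compare_labels <- function(d1, d2) {
  l1 <- labels(d1)
  l2 <- labels(d2)
  list(
    class1    = class(l1),                  # the error message asks for 'character'
    class2    = class(l2),
    n_common  = length(intersect(l1, l2)),  # 0 would match the "no common labels" warning
    only_in_1 = setdiff(l1, l2),
    only_in_2 = setdiff(l2, l1)
  )
}

# Stand-in trees built with base R; both carry the mtcars row names as labels.
ref1 <- as.dendrogram(hclust(dist(mtcars), method = "complete"))
ref2 <- as.dendrogram(hclust(dist(mtcars), method = "average"))

res <- compare_labels(ref1, ref2)
str(res)
```

If `n_common` is smaller than the number of leaves, or either class is not "character", that would be consistent with the messages above.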

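Edit: Regarding @emilliman5's answer below: I understand that my dendrograms are not resolved. I did not use hierarchical clustering, so I want to compare unresolved dendrograms. In addition, I adapted some code from this question: How do I manually create a dendrogram (or "hclust") object ? (in R) to build the dendrograms by hand. Although these trees are not resolved either, this code does produce a tanglegram. However, it is not a solution, because it is too hard to generalize to different parameters (my tree depth/resolution varies, and trying to write a function that encodes trees with varying levels of nesting is a road to madness!).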

tree1 <- list()
attributes(tree1) <- list(members=nrow(df1), height=3)
class(tree1) <- "dendrogram"

# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
    leaves[[i]] <- which(df1$cl1 == (i) )
}
for(i in 1:length(clusts)){
    tree1[[i]] <- list()
    attributes(tree1[[i]]) <- list(members=length(which(df1$cl1==i)), height=2, edgetext=i)
    for( j in 1:length(leaves[[i]]) ){
        tree1[[i]][[j]] <- list()
        tree1[[i]][[j]] <- leaves[[i]]
        attributes(tree1[[i]][[j]]) <- list(members = 1, height = 1,
                                       label = as.character(leaves[[i]][j]),
                                       leaf = TRUE)
    }
}
plot(tree1, center=TRUE)

tree2 <-list();
attributes(tree2) <- list(members=nrow(df2), height=3)
class(tree2) <- "dendrogram"

# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
    leaves[[i]] <- which(df2$cl2 == (i) )
}
for(i in 1:length(clusts)){
    tree2[[i]] <- list()
    attributes(tree2[[i]]) <- list(members=length(which(df2$cl2==i)), height=2, edgetext=i)
    for( j in 1:length(leaves[[i]]) ){
        tree2[[i]][[j]] <- list()
        tree2[[i]][[j]] <- leaves[[i]]
        attributes(tree2[[i]][[j]]) <- list(members = 1, height = 1,
                                        label = as.character(leaves[[i]][j]),
                                        leaf = TRUE)
    }
}
plot(tree2, center=TRUE)

tanglegram(tree1, tree2)

[Image: ugly tanglegram]

It's ugly, but it is exactly what I want/need.
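For the generalization problem mentioned above, one possible approach is a recursive builder that takes the nesting levels as a vector of column names. This is only a sketch in base R: the names `build_dend`, `make_node` and `toy` are hypothetical, the heights are arbitrary (leaves at 0, one unit per level), each leaf stores a single integer row index, and whether dendextend accepts the result for every input has not been verified here.

```r
# Hypothetical helper: build a (possibly non-binary) dendrogram from nested
# grouping columns. `groups` lists column names from outermost to innermost;
# `label_col` names the column holding the leaf labels.
build_dend <- function(df, groups, label_col) {
  make_node <- function(idx, groups) {
    if (length(groups) == 0) {
      # No grouping level left: every row in idx becomes a leaf.
      children <- lapply(idx, function(i) {
        structure(i, members = 1L, height = 0,
                  label = as.character(df[[label_col]][i]), leaf = TRUE)
      })
    } else {
      # Split the rows by the outermost remaining level and recurse.
      parts <- split(idx, df[[groups[1]]][idx], drop = TRUE)
      children <- lapply(names(parts), function(g) {
        child <- make_node(parts[[g]], groups[-1])
        attr(child, "edgetext") <- g
        child
      })
    }
    structure(children, members = length(idx), height = length(groups) + 1)
  }
  structure(make_node(seq_len(nrow(df)), groups), class = "dendrogram")
}

# Toy data mirroring the example: two random cluster assignments.
set.seed(1)
toy <- data.frame(ID  = row.names(mtcars),
                  cl1 = sample(1:3, nrow(mtcars), replace = TRUE),
                  cl2 = sample(1:3, nrow(mtcars), replace = TRUE),
                  stringsAsFactors = FALSE)

tree_a <- build_dend(toy, "cl1", "ID")            # one level of nesting
tree_b <- build_dend(toy, "cl2", "ID")
tree_c <- build_dend(toy, c("cl1", "cl2"), "ID")  # two levels of nesting
```

Because both trees take their labels from the same `ID` column, they share one label set; with dendextend loaded, `tanglegram(tree_a, tree_b)` could then be tried.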

If I look inside the dendrogram to try to work out how this works:

> str(unclass(tree1[[1]][[1]]))
 atomic [1:12] 1 8 9 10 11 13 16 22 25 27 ...
 - attr(*, "members")= num 1
 - attr(*, "height")= num 1
 - attr(*, "label")= chr "1"
 - attr(*, "leaf")= logi TRUE

You'll notice that there is a vector. Peeking into an hclust-derived dendrogram, we see that there is also a vector/atomic:

> str(unclass(as.dendrogram(hclust(dist(df1))))[[1]][[1]])
 atomic [1:1] 31
 - attr(*, "members")= int 1
 - attr(*, "height")= num 0
 - attr(*, "label")= chr "Maserati Bora"
 - attr(*, "leaf")= logi TRUE

However, peeking into the dendrogram created by data.tree, I notice there is no vector/atomic:

> str(unclass(dend1[[1]][[1]]))
 list()
 - attr(*, "label")= chr "Mazda RX4"
 - attr(*, "members")= num 1
 - attr(*, "height")= num 0
 - attr(*, "leaf")= logi TRUE

Could this missing atomic be causing the problem?

1 answer:

Answer 0 (score: 1)

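The problem is that your tree is not bifurcating; that is, there are more than two branches at each node you can traverse. In hierarchical clustering, each node should have exactly two branches. See the following two examples: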

This is the tree from your example:

[Image: the tree from the example]

And this is what a resolved tree should look like:

plot(hclust(dist(df1[, 1:11])))

[Image: dendrogram produced by the hclust call above]
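To check whether a given tree is bifurcating in this sense, a small recursive test can be written in base R. This is a sketch only; the names `is_binary_dend`, `leaf` and `star` are hypothetical, and the three-leaf star is a hand-made stand-in for an unresolved tree.

```r
# Hypothetical helper: TRUE if every internal node has exactly two children.
is_binary_dend <- function(node) {
  if (is.leaf(node)) return(TRUE)
  n_children <- length(node)
  if (n_children != 2) return(FALSE)
  all(vapply(seq_len(n_children),
             function(k) is_binary_dend(node[[k]]),
             logical(1)))
}

# A resolved tree from hierarchical clustering.
resolved <- as.dendrogram(hclust(dist(mtcars[, 1:11])))

# A hand-made unresolved tree: one root with three leaves.
leaf <- function(i, lab) {
  structure(i, members = 1L, height = 0, label = lab, leaf = TRUE)
}
star <- structure(list(leaf(1L, "a"), leaf(2L, "b"), leaf(3L, "c")),
                  members = 3L, height = 1, class = "dendrogram")

is_binary_dend(resolved)  # TRUE
is_binary_dend(star)      # FALSE
```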