Using APCluster on pre-classified data - colored dendrograms and nicely formatted output

Time: 2017-06-09 16:18:15

Tags: colors reporting

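Question 1: I am trying to use the plot() function on an AggExResult object. The clustering example from the documentation (https://cran.r-project.org/web/packages/apcluster/apcluster.pdf) works as expected.

In my own data, the input has an additional column that provides a predefined "target" for classification purposes. I would like to know whether there is a way to color the dendrogram labels by target class (e.g. red = class 0, blue = class 1). The target class is a factor (or character). Ultimately, I am trying to show visually how many clusters contain "pure" versus "mixed" classes. Below is slightly modified code from the online documentation that roughly reflects my input data: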

library(apcluster)

## class labels: 0 for the first cloud, 1 for the second (plain numbers,
## because a factor assigned into a matrix is stored as its integer code)
cl1Targ <- matrix(0, nrow=50, ncol=1)
cl2Targ <- matrix(1, nrow=50, ncol=1)

## create two Gaussian clouds
#cl1 <- cbind(rnorm(50,0.2,0.05),rnorm(50,0.8,0.06))
#cl2 <- cbind(rnorm(50,0.7,0.08),rnorm(50,0.3,0.05))
cl1 <- cbind(rnorm(50,0.2,0.05),rnorm(50,0.8,0.06),cl1Targ)
cl2 <- cbind(rnorm(50,0.7,0.08),rnorm(50,0.3,0.05),cl2Targ)
x <- rbind(cl1,cl2)
colnames(x) <- c('Column 1','Column 2','Class_ID')

## compute similarity matrix (negative squared Euclidean)
sim <- negDistMat(x, r=2)
## run affinity propagation
apres <- apcluster(sim, q=0.7)
## compute agglomerative clustering from scratch
aggres1 <- aggExCluster(sim)
## plot dendrogram
plot(aggres1, main='aggres1 w/ target')

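How can I color the dendrogram according to the targets defined in the input?

Question 2: When I show() the APResult for the example data, I see the following: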

show(apres)     
APResult object

Number of samples     =  100
Number of iterations  =  165
Input preference      =  -0.01281384
Sum of similarities   =  -0.1222309
Sum of preferences    =  -0.1409522
Net similarity        =  -0.2631832
Number of clusters    =  11

Exemplars:
   8 17 24 37 43 52 58 68 92 95 99
Clusters:
   Cluster 1, exemplar 8:
      7 8 9 25 31 36 39 42 47 48
   Cluster 2, exemplar 17:
      6 11 13 15 17 18 19 23 32 35
   Cluster 3, exemplar 24:
      2 5 10 24 45

When I use my own data, I see the following (the row names are drugs; the values being clustered are mean fold changes in gene expression):

show(apclr2q05_mean)

APResult object

Number of samples     =  1045
Number of iterations  =  429
Input preference      =  -390.0822
Sum of similarities   =  -89326.99
Sum of preferences    =  -83477.58
Net similarity        =  -172804.6
Number of clusters    =  214

Exemplars:
   amantadine_58mg6h_fc amiodarone_147mg3d_fc clarithromycin_56mg1d_fc fluconazole_394mg5d_fc ketoconazole_114mg5d_fc ketoconazole_2274mg1d_fc
   pantoprazole_1100mg1d_fc pantoprazole_1100mg3d_fc quetiapine_500mg5d_fc roxithromycin_312mg5d_fc torsemide_3mg3d_fc acetazolamide_250mg3d_fc
Clusters:
   Cluster 1, exemplar amantadine_58mg6h_fc:
      amantadine_58mg6h_fc promazine_100mg1d_fc cyproteroneAcetate_2500mg6h_fc danazol_2g5d_fc ivermectin_7500ug1d_fc letrozole_250mg6h_fc
      mefenamicAcid_93mg3d_fc olanzapine_23mg1d_fc secobarbital_20mg6h_fc zaleplon_100mg3d_fc
   Cluster 2, exemplar amiodarone_147mg3d_fc:
      amiodarone_147mg3d_fc amiodarone_147mg5d_fc aspirin_375mg5d_fc betaNapthoflavone_80mg5d_fc clofibrate_130mg3d_fc finasteride_800mg5d_fc
   Cluster 3, exemplar clarithromycin_56mg1d_fc:
      ciprofloxacin_72mg5d_fc ciprofloxacin_450mg6h_fc clarithromycin_56mg1d_fc clarithromycin_56mg3d_fc clarithromycin_56mg5d_fc
   Cluster 4, exemplar fluconazole_394mg5d_fc:
      fluconazole_394mg5d_fc

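In terms of content this is what I expect, but I would like to format it for reporting purposes. I tried exporting it with dput(), but the output file contains a lot of extra, unnecessary information. I would like to know how to export the same kind of information as shown above, together with the name of the object and the target classifier, into a table like the following: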

Name of object        =  apclr2q05_mean
Number of samples     =  1045
Number of iterations  =  429
Input preference      =  -390.0822
Sum of similarities   =  -89326.99
Sum of preferences    =  -83477.58
Net similarity        =  -172804.6
Number of clusters    =  214

Exemplars:                    Target
    amantadine_58mg6h_fc       1
    amiodarone_147mg3d_fc      1
    clarithromycin_56mg1d_fc   1
    fluconazole_394mg5d_fc     0
    ketoconazole_114mg5d_fc    0
    ketoconazole_2274mg1d_fc   0

Clusters:
   Cluster 1, exemplar amantadine_58mg6h_fc:
     Drug                            Target
     amantadine_58mg6h_fc            1
     promazine_100mg1d_fc            1
     cyproteroneAcetate_2500mg6h_fc  1
     danazol_2g5d_fc                 0
     ivermectin_7500ug1d_fc          0

   Cluster 2, exemplar amiodarone_147mg3d_fc:
     Drug                            Target
     Etc…

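Many thanks to Ulrich for responding to these questions so quickly by email. We wanted to share our discussion with the community, so I will let him post his solutions as answers so that he gets the credit he deserves :-)

As an update: I tried to implement the answer to Question 1. The example code works as expected, but I cannot get it to work with my data. The input data has two parts. The first is a matrix of numeric measurements with column and row labels: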

> fci[1:3,1:3]
                      M30596_PROBE1 AI231309_PROBE1 NM_012489_PROBE1
amantadine_58mg1d_fc     0.05630744     -0.10441722       0.41873201
amantadine_58mg6h_fc    -0.42780274     -0.26222322       0.02703001
amantadine_220mg1d_fc    0.35260779     -0.09902214       0.04067055

The second is the "target" values as a factor, where each value corresponds to the same row in fci above:

> targs[1:3]
 amantadine_58mg1d_fc  amantadine_58mg6h_fc amantadine_220mg1d_fc 
                    0                     0                     0 
Levels: 0 1

From here, the tree is built as follows:

# build the AggExResult:
aglomr1 <- aggExCluster(negDistMat(r=2), fci)

# convert the data
tree <- as.dendrogram(aglomr1)

# assign the color codes
colorCodes <- c("0"="red", "1"="green")
names(targs)  <- rownames(fci)
xColor <- colorCodes[as.character(targs)]
names(xColor) <- rownames(fci)

# plot the colored tree
labels_colors(tree) <- xColor[order.dendrogram(tree)]
plot(tree, main="Colored Tree")

The tree is generated, but the leaves are not colored. Doing some digging:

> head(xColor)
    0     0     0     0     0     0 
"red" "red" "red" "red" "red" "red" 

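This part seems to work as expected, with the correct color assigned to each target, but the row names are not in xColor, and the line labels_colors(tree) <- xColor[order.dendrogram(tree)] does not return the matching labels. Instead, it returns row numbers or NAs: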

> head(order.dendrogram(tree))
[1] "295" "929" "488" "493" "233" "235"

> head(labels_colors(tree))
295 929 488 493 233 235 

> head(xColor[order.dendrogram(tree)])
<NA> <NA> <NA> <NA> <NA> <NA> 
 NA   NA   NA   NA   NA   NA 

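How can I get the line labels_colors(tree) <- xColor[order.dendrogram(tree)] to behave the same way as in the example provided? Specifically, I want leaf labels such as amantadine_58mg1d_fc to be highlighted in the color corresponding to their target (0/1).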
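One possible reading of the NA output above, offered as an assumption rather than something confirmed in this thread: order.dendrogram(tree) appears to return character strings here ("295", "929", ...), and R matches a character index against the names of a vector, not against its positions. The sketch below uses base R only, with made-up sample names (drugA, drugB, drugC) and made-up targets, to show the difference between the two kinds of lookup.

```r
## hypothetical sample names and targets, for illustration only
sampleNames <- c("drugA", "drugB", "drugC")
targs <- factor(c("0", "0", "1"))

## named color vector, built the same way as in the question
colorCodes <- c("0" = "red", "1" = "green")
xColor <- colorCodes[as.character(targs)]
names(xColor) <- sampleNames

## a character index is matched against names(xColor);
## no element is named "2", "3" or "1", so every lookup yields NA
byString <- xColor[c("2", "3", "1")]

## an integer index is positional, so the same values find the colors
byPosition <- xColor[as.integer(c("2", "3", "1"))]

print(byString)
print(byPosition)
```

If this is indeed the cause, wrapping the index in as.integer() may be enough. Alternatively, when the leaf labels of the tree are the row names, indexing by labels(tree) avoids relying on positions altogether.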

2 Answers:

Answer 0: (score: 0)

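Here is my answer to Question 1: The plot() method for AggExResult objects uses the plot.dendrogram() method internally. Since this method does not allow coloring the leaves of the dendrogram, this does not work directly. However, the 'dendextend' package provides such functionality. (By the way, I found the solution in another thread: Label and color leaf dendrogram in r.) Since 'apcluster' provides casts to 'hclust' and 'dendrogram' objects, the functionality of that package can be used more or less directly.

So, here is some example code: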

library(apcluster)

## create two Gaussian clouds along with class labels 0/1
cl1 <- cbind(rnorm(50, 0.2, 0.05), rnorm(50, 0.8, 0.06))
cl2 <- cbind(rnorm(50, 0.7, 0.08), rnorm(50, 0.3, 0.05))
x <- cbind(Columns=data.frame(rbind(cl1, cl2)),
           "Class_ID"=factor(as.character(c(rep(0, 50), rep(1, 50)))))

## compute similarity matrix (negative squared Euclidean)
sim <- negDistMat(x[, 1:2], r=2)

## compute agglomerative clustering from scratch
aggres1 <- aggExCluster(sim)

## load 'dendextend' package
## install.packages("dendextend") ## if not yet installed
library(dendextend)

## convert object
tree <- as.dendrogram(aggres1)

## assign color codes
colorCodes <- c("0"="red", "1"="green")
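## look up colors by level name (indexing by the factor itself would use
## its internal integer codes)
xColor <- colorCodes[as.character(x$Class_ID)]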
names(xColor) <- rownames(x)

## plot color-labeled tree
labels_colors(tree) <- xColor[order.dendrogram(tree)]
plot(tree)
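For readers who prefer not to depend on 'dendextend', leaf labels can also be colored with base R by setting the nodePar attribute of each leaf via dendrapply(). The following is only a sketch under stated assumptions: it uses hclust() on toy data as a stand-in for as.dendrogram(aggres1), the sample names s1..s20 are made up, and colorLeaf() and collectLeafColors() are hypothetical helper names.

```r
## toy data: two Gaussian clouds with known class labels 0/1
set.seed(1)
pts <- rbind(cbind(rnorm(10, 0.2, 0.05), rnorm(10, 0.8, 0.06)),
             cbind(rnorm(10, 0.7, 0.08), rnorm(10, 0.3, 0.05)))
rownames(pts) <- paste0("s", seq_len(nrow(pts)))
classId <- factor(rep(c("0", "1"), each = 10))
names(classId) <- rownames(pts)

## plain hierarchical clustering as a stand-in for as.dendrogram(aggres1)
tree <- as.dendrogram(hclust(dist(pts)))

colorCodes <- c("0" = "red", "1" = "green")

## set the label color of a leaf from its class, looked up by leaf label;
## inner nodes are returned unchanged
colorLeaf <- function(node) {
    if (is.leaf(node)) {
        leafClass <- as.character(classId[attr(node, "label")])
        attr(node, "nodePar") <- list(lab.col = colorCodes[[leafClass]],
                                      pch = NA)
    }
    node
}
coloredTree <- dendrapply(tree, colorLeaf)

## walk the tree left to right and collect the colors that were assigned
collectLeafColors <- function(node) {
    if (is.leaf(node))
        return(attr(node, "nodePar")$lab.col)
    unlist(lapply(seq_along(node),
                  function(i) collectLeafColors(node[[i]])))
}
leafColors <- collectLeafColors(coloredTree)

plot(coloredTree, main = "Leaf labels colored by class")
```

Because the color is looked up by the leaf's own label, this variant does not depend on the order returned by order.dendrogram().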

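Answer 1: (score: 0)

Here is my answer to Question 2: I am sorry, but no such functionality is implemented in the 'apcluster' package. Since this is a very special request, I am reluctant to include it in the package (not to mention the fact that show() methods cannot have additional arguments). So, as an alternative, I would like to offer you a custom function that allows for labeling the exemplars and samples: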

library(apcluster)

## create two Gaussian clouds along with class labels 0/1
cl1 <- cbind(rnorm(50, 0.2, 0.05), rnorm(50, 0.8, 0.06))
cl2 <- cbind(rnorm(50, 0.7, 0.08), rnorm(50, 0.3, 0.05))
x <- cbind(Columns=data.frame(rbind(cl1, cl2)),
           "Class_ID"=factor(as.character(c(rep(0, 50), rep(1, 50)))))

## compute similarity matrix (negative squared Euclidean)
sim <- negDistMat(x[, 1:2], r=2)

## special show() function with labeled data
show.ExClust.labeled <- function(object, labels=NULL)
{
    if (!is(object, "ExClust"))
        stop("'object' is not of class 'ExClust'")

    if (is.null(labels))
    {
        show(object)
        return(invisible(NULL))
    }

    cat("\n", class(object), " object\n", sep="")

    if (!is.finite(object@l) || !is.finite(object@it))
        stop("object is not result of an affinity propagation run; ",
             "it is pointless to create 'APResult' objects yourself.")

    cat("\nNumber of samples     = ", object@l, "\n")
    if (length(object@sel) > 0)
    {
        cat("Number of sel samples = ", length(object@sel),
            paste("   (", round(100*length(object@sel)/object@l,1),
                  "%)\n", sep=""))
        cat("Number of sweeps      = ", object@sweeps, "\n")
    }
    cat("Number of iterations  = ", object@it, "\n")
    cat("Input preference      = ", object@p, "\n")
    cat("Sum of similarities   = ", object@dpsim, "\n")
    cat("Sum of preferences    = ", object@expref, "\n")
    cat("Net similarity        = ", object@netsim, "\n")
    cat("Number of clusters    = ", length(object@exemplars), "\n\n")

    if (length(object@exemplars) > 0)
    {
        if (length(names(object@exemplars)) == 0)
        {
            cat("Exemplars:\n")
            df <- data.frame("Sample"=object@exemplars,
                             Label=labels[object@exemplars])
            print(df, row.names=FALSE)

            for (i in 1:length(object@exemplars))
            {
                cat("\nCluster ", i, ", exemplar ",
                    object@exemplars[i], ":\n", sep="")

                df <- data.frame(Sample=object@clusters[[i]],
                                 Label=labels[object@clusters[[i]]])
                print(df, row.names=FALSE)
            }
        }
        else
        {
            df <- data.frame("Exemplars"=names(object@exemplars),
                             Label=labels[names(object@exemplars)])
            print(df, row.names=FALSE)

            for (i in 1:length(object@exemplars))
            {
                cat("\nCluster ", i, ", exemplar ",
                    names(object@exemplars)[i], ":\n", sep="")

                df <- data.frame(Sample=names(object@clusters[[i]]),
                               Label=labels[names(object@clusters[[i]])])
                print(df, row.names=FALSE)
            }
        }
    }
    else
    {
        cat("No clusters identified.\n")
    }
}


## create label vector (with proper names)
label <- x$Class_ID
names(label) <- rownames(x)

## run apcluster()
apres <- apcluster(sim, q=0.3)

## show with labels
show.ExClust.labeled(apres, label)
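The question also asks for the name of the object in the output and for a way to export the report. One way this might be done, sketched with base R only: deparse(substitute()) recovers the name of the variable that was passed in, and capture.output() collects the printed text so that it can be written to a file. showWithName() is a hypothetical helper, and toyResult is a stand-in data frame rather than a real APResult.

```r
## hypothetical helper: print a header line with the name of the object
## that was passed in, then delegate to a printing function
showWithName <- function(object, printer = print, ...) {
    objName <- deparse(substitute(object))
    cat("Name of object        = ", objName, "\n")
    printer(object, ...)
    invisible(objName)
}

## stand-in object for illustration; not a real clustering result
toyResult <- data.frame(Sample = c("a", "b"), Label = c("1", "0"))

## capture everything that would be printed to the console
reportLines <- capture.output(showWithName(toyResult, row.names = FALSE))

## write the captured lines to a plain-text report
reportFile <- file.path(tempdir(), "report.txt")
writeLines(reportLines, reportFile)
```

With the function defined above, the equivalent call would presumably be capture.output(showWithName(apres, printer = show.ExClust.labeled, labels = label), file = "report.txt"). I have not tested this against real clustering results.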