Aggregating a categorical table in R (with percentages)

Date: 2018-03-16 09:47:56

Tags: r dplyr aggregation

I have the following table in R:

Sample             Cluster  CellType  Condition  Genotype  Lane
Sample1            1        A         Mut        XXXX      1
Sample2            2        B         Mut        YYYY      1
Sample3            2        A         Mut        YYYY      2
Sample4            1        A         Mut        ZZZZ      1
Sample5            2        B         Mut        YYYY      3
Sample6            1        B         Mut        YYYY      1
Sample7            1        A         Mut        XXXX      2

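I would like to:

  • aggregate the table by the Cluster column
  • have every other column show the dominant (most frequent) value within that cluster
  • and attach a "confidence level": the percentage of rows in the cluster that carry that dominant value

Like this: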

Cluster      CellType  Condition  Genotype     Lane
1            A (75%)   Mut (100%) XXXX (50%)   1 (75%)
2            B (66%)   Mut (100%) YYYY (100%)  1 (33%)

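I tried the aggregate function as shown below. It gets close, since it returns the dominant value per cluster, but it does not give the percentages: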

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
aggregate(. ~ Cluster, clustering_report, Mode)

2 answers:

Answer 0: (score: 3)

Here is a base R solution:

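# For each cluster, take the most frequent value of columns 3:6 (CellType to Lane)
# and format it together with its share of the cluster's rows.
m1 <- do.call(rbind, 
        lapply(split(df, df$Cluster), 
               function(i) sapply(i[3:6], 
                                  function(j) {
                                    t1 <- prop.table(table(j))       # relative frequencies
                                    nms <- names(t1)[which.max(t1)]  # first most frequent value
                                    paste0(nms, ' (', round(max(t1) * 100), '%)')
                                  })))

# split() orders the groups by sorted cluster value, so label the rows the same way
cbind.data.frame(Cluster = sort(unique(df$Cluster)), m1)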

which gives:

  Cluster CellType  Condition    Genotype    Lane
1       1  A (75%) Mut (100%)  XXXX (50%) 1 (75%)
2       2  B (67%) Mut (100%) YYYY (100%) 1 (33%)
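
To see what the inner function does, it can be pulled out and applied to a single column. The following is only a minimal sketch: `dominant_pct` is a hypothetical helper name, not part of the answer above, and the input vector is the CellType column of the four Cluster 1 rows from the question.

```r
# Hypothetical helper (the name is illustrative): returns the most frequent
# value of x together with its share of the group, e.g. "A (75%)".
dominant_pct <- function(x) {
  shares <- prop.table(table(x))   # relative frequency of each distinct value
  top <- which.max(shares)         # position of the first maximum
  paste0(names(shares)[top], " (", round(shares[[top]] * 100), "%)")
}

# CellType values of the four Cluster 1 rows in the sample data
cluster1_celltype <- c("A", "A", "B", "A")
dominant_pct(cluster1_celltype)
# [1] "A (75%)"
```

On ties, `which.max()` returns the first maximum and `table()` lists values in sorted order, so the smallest tied value is reported. That is why Lane for Cluster 2, where 1, 2 and 3 each occur once, comes out as 1 (33%).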

Answer 1: (score: 2)

Hope this helps!

library(dplyr)

df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane), funs(val=names(which(table(.) == max(table(.)))[1]),
                                         rate=(max(table(.))[1]/n())*100))

The output is:

  Cluster CellType_val Condition_val Genotype_val Lane_val CellType_rate Condition_rate Genotype_rate Lane_rate
1       1 A            Mut           XXXX         1                 75.0            100          50.0      75.0
2       2 B            Mut           YYYY         1                 66.7            100         100        33.3

Or, to get the value and the percentage in a single column:

df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane), funs(paste0(names(which(table(.) == max(table(.)))[1]), 
                                                " (",
                                                round((max(table(.))[1]/n())*100),
                                                "%)")))

#  Cluster CellType Condition  Genotype    Lane   
#1       1 A (75%)  Mut (100%) XXXX (50%)  1 (75%)
#2       2 B (67%)  Mut (100%) YYYY (100%) 1 (33%)
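
`funs()` has been deprecated since dplyr 0.8.0, and `summarise_at()` has been superseded by `across()`. On a recent dplyr (1.0.0 or later is assumed here), the same table could be sketched as below. `mode_label` is a hypothetical helper name, and the data frame is the sample data from this thread, rebuilt with `data.frame()` so that the block runs on its own.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: dominant value plus its share of the group's rows
mode_label <- function(x) {
  counts <- table(x)
  top <- which.max(counts)   # first maximum; ties go to the first sorted value
  paste0(names(counts)[top], " (", round(counts[[top]] / length(x) * 100), "%)")
}

result <- df %>%
  group_by(Cluster) %>%
  summarise(across(CellType:Lane, mode_label))

result
```

As in the `funs()` version, the denominator is the number of rows in the group, so missing values (which `table()` drops by default) would still count toward the total.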

Sample data:

df <- structure(list(Sample = c("Sample1", "Sample2", "Sample3", "Sample4", 
"Sample5", "Sample6", "Sample7"), Cluster = c(1L, 2L, 2L, 1L, 
2L, 1L, 1L), CellType = c("A", "B", "A", "A", "B", "B", "A"), 
    Condition = c("Mut", "Mut", "Mut", "Mut", "Mut", "Mut", "Mut"
    ), Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", 
    "XXXX"), Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L)), .Names = c("Sample", 
"Cluster", "CellType", "Condition", "Genotype", "Lane"), class = "data.frame", row.names = c(NA, 
-7L))