I have the following table in R:
Sample Cluster CellType Condition Genotype Lane
Sample1 1 A Mut XXXX 1
Sample2 2 B Mut YYYY 1
Sample3 2 A Mut YYYY 2
Sample4 1 A Mut ZZZZ 1
Sample5 2 B Mut YYYY 3
Sample6 1 B Mut YYYY 1
Sample7 1 A Mut XXXX 2
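I want to summarize it by Cluster, showing the most frequent value in each column together with its percentage, like this: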
Cluster CellType Condition Genotype Lane
1 A (75%) Mut (100%) XXXX (50%) 1 (75%)
2 B (66%) Mut (100%) YYYY (100%) 1 (33%)
I tried aggregate() with the function below. It gets close to the result, but it is not quite there yet:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
aggregate(. ~ Cluster, clustering_report, Mode)
Answer 0 (score: 3)
Here is a base R solution:
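# For each cluster, find the most frequent value of columns 3 to 6
# (CellType to Lane) and format it as "value (pct%)"
m1 <- do.call(rbind,
              lapply(split(df, df$Cluster),
                     function(i) sapply(i[3:6],
                                        function(j) {
                                          t1 <- prop.table(table(j))
                                          nms <- names(t1[which.max(t1)])
                                          paste0(nms, ' (', round(max(t1) * 100), '%', ')')
                                        })))
# unique(df[2]) lists clusters in order of first appearance; that matches
# the sorted order used by split() here because cluster 1 appears first
cbind.data.frame(unique(df[2]), m1)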
which gives:
  Cluster CellType Condition Genotype Lane
1 1 A (75%) Mut (100%) XXXX (50%) 1 (75%)
2 2 B (67%) Mut (100%) YYYY (100%) 1 (33%)
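Note that the question's expected output shows B (66%) for cluster 2, while this answer prints B (67%). The difference comes down to rounding: 2 of 3 rows is 66.67%, which round() turns into 67. As a minimal sketch (the variable names here are illustrative, and using floor() is only an assumption about what the asker intended), the two variants can be compared on the CellType values of cluster 2:

```r
# CellType values for the three rows in cluster 2 (Sample2, Sample3, Sample5)
x <- c("B", "A", "B")

p <- prop.table(table(x))       # proportions: A = 1/3, B = 2/3
nm <- names(p)[which.max(p)]    # most frequent value
pct <- max(p) * 100             # its share in percent (about 66.67)

rounded   <- paste0(nm, " (", round(pct), "%)")   # "B (67%)"
truncated <- paste0(nm, " (", floor(pct), "%)")   # "B (66%)"
```

Swapping round() for floor() in the answer's paste0() call would reproduce the 66% shown in the question, if truncation is what is wanted.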
Answer 1 (score: 2)
Hope this helps!
library(dplyr)
df %>%
  group_by(Cluster) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list of lambdas does the same job
  summarise_at(vars(CellType:Lane),
               list(val = ~ names(which(table(.) == max(table(.)))[1]),
                    rate = ~ (max(table(.))[1] / n()) * 100))
Output:
Cluster CellType_val Condition_val Genotype_val Lane_val CellType_rate Condition_rate Genotype_rate Lane_rate
1 1 A Mut XXXX 1 75.0 100 50.0 75.0
2 2 B Mut YYYY 1 66.7 100 100 33.3
Or, to get the formatted strings directly:
df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane),
               ~ paste0(names(which(table(.) == max(table(.)))[1]),
                        " (",
                        round((max(table(.))[1] / n()) * 100),
                        "%)"))
# Cluster CellType Condition Genotype Lane
#1 1 A (75%) Mut (100%) XXXX (50%) 1 (75%)
#2 2 B (67%) Mut (100%) YYYY (100%) 1 (33%)
Sample data:
df <- structure(list(Sample = c("Sample1", "Sample2", "Sample3", "Sample4",
"Sample5", "Sample6", "Sample7"), Cluster = c(1L, 2L, 2L, 1L,
2L, 1L, 1L), CellType = c("A", "B", "A", "A", "B", "B", "A"),
Condition = c("Mut", "Mut", "Mut", "Mut", "Mut", "Mut", "Mut"
), Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY",
"XXXX"), Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L)), .Names = c("Sample",
"Cluster", "CellType", "Condition", "Genotype", "Lane"), class = "data.frame", row.names = c(NA,
-7L))
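With dplyr 1.0.0 or later, the same summary can also be written with across() instead of summarise_at(). The following is only a sketch under that version assumption: mode_pct is a hypothetical helper name that does not appear in the answers above, and the data frame is rebuilt in compact form so the snippet runs by itself.

```r
library(dplyr)

# Same sample data as above, rebuilt compactly
df <- data.frame(
  Sample = paste0("Sample", 1:7),
  Cluster = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = "Mut",
  Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value of x plus its share, e.g. "A (75%)".
# On ties, the first value in table() order wins.
mode_pct <- function(x) {
  counts <- table(x)
  top <- which.max(counts)
  paste0(names(counts)[top], " (", round(100 * counts[[top]] / length(x)), "%)")
}

res <- df %>%
  group_by(Cluster) %>%
  summarise(across(CellType:Lane, mode_pct))
```

Here res has one row per cluster with the columns Cluster, CellType, Condition, Genotype and Lane, holding the same formatted strings as the second variant above. Pulling the logic into a named function also makes the tie-breaking rule easy to change in one place.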