I have a data frame of gene expression data that looks like this:
row.names symbol Sample1 Sample2 Sample3 Sample4
Probe1 Gene1 1.5 2.8 1.8 3.2
Probe2 Gene2 2.7 4.5 3.2 5.1
Probe3 Gene3 1.1 4.7 2.3 5.3
Probe4 Gene2 1.2 0.9 0.8 1.1
Probe5 Gene1 3.1 6.1 6.2 4.2
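I would like to subset the data so that each gene appears only once, keeping for each gene the probe with the highest median across samples. The data above would then become: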
row.names symbol Sample1 Sample2 Sample3 Sample4
Probe2 Gene2 2.7 4.5 3.2 5.1
Probe3 Gene3 1.1 4.7 2.3 5.3
Probe5 Gene1 3.1 6.1 6.2 4.2
The data frame has about 40,000 individual probes and ~100 samples.
Does anyone know which R commands are suited to this task?
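Answer 0: (score: 3)
I wouldn't calculate the medians row by row; I would use the vectorized rowMedians function from the matrixStats package instead. Then I would reorder by the result and use the unique function from the data.table package to keep the first row for each symbol: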
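library(data.table)
library(matrixStats)
# Median of each row over the sample columns (all columns except the first two)
df$Medians <- rowMedians(as.matrix(df[-(1:2)]))
# Sort by decreasing median, then keep the first row per symbol
# (note: setDT converts df to a data.table by reference)
unique(setDT(df)[order(-Medians)], by = "symbol")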
# row.names symbol Sample1 Sample2 Sample3 Sample4 Medians
# 1: Probe5 Gene1 3.1 6.1 6.2 4.2 5.15
# 2: Probe2 Gene2 2.7 4.5 3.2 5.1 3.85
# 3: Probe3 Gene3 1.1 4.7 2.3 5.3 3.50
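For readers without matrixStats or data.table installed, the same "sort by median, then keep the first row per gene" idea can be sketched in base R. This is only an illustrative sketch, not the approach benchmarked in this answer: it rebuilds the question's example data by hand, and the column name `probe` (standing in for the `row.names` column) is an assumption. It uses `apply`, so it will be slower than `rowMedians` on large data.

```r
# Rebuild the example data from the question.
# Assumption: probe IDs are stored as a regular column, here called `probe`.
df <- data.frame(
  probe   = c("Probe1", "Probe2", "Probe3", "Probe4", "Probe5"),
  symbol  = c("Gene1", "Gene2", "Gene3", "Gene2", "Gene1"),
  Sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  Sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  Sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  Sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  stringsAsFactors = FALSE
)

# Row-wise medians over the sample columns only (drop the first two columns).
med <- apply(as.matrix(df[-(1:2)]), 1, median)

# Sort probes by decreasing median, then keep the first row seen for each gene.
sorted <- df[order(-med), ]
result <- sorted[!duplicated(sorted$symbol), ]
print(result)
```

Because `duplicated` marks every occurrence after the first, exactly one probe per gene survives even when two probes tie on the median.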
Some benchmarks
library(data.table)
library(matrixStats)
library(dplyr)
set.seed(123)
bigdf <- data.frame(A = paste0("Probe", 1:1e5),
                    symbol = paste0("Gene", sample(1e2, 1e5, replace = TRUE)),
                    matrix(sample(1e2, 1e6, replace = TRUE), ncol = 100))
bigdf2 <- copy(bigdf)
bigdf3 <- copy(bigdf2)
system.time({
  bigdf$Medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
  unique(setDT(bigdf)[order(-Medians)], by = "symbol")
})
# user system elapsed
# 0.22 0.05 0.26
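# Inside `by = symbol`, .SD no longer contains the symbol column, so only the
# probe ID column (the first column of .SD) has to be dropped before taking medians
system.time(setDT(bigdf2)[, .SD[which.max(apply(.SD[, -1, with = FALSE], 1, median)), ], by = symbol])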
# user system elapsed
# 5.17 0.01 5.33
system.time({
  bigdf3$medianCol <- apply(bigdf3[-(1:2)], 1, FUN = median)
  grouped_df <- group_by(bigdf3, symbol)
  filtered_df <- filter(grouped_df, medianCol == max(medianCol))
})
# user system elapsed
# 5.15 0.00 5.15
Answer 1: (score: 1)
Alternatively, using dplyr:
library(dplyr)
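# Assumes the probe IDs are row names, so that columns 2:5 are Sample1..Sample4;
# if row.names is a regular column, use 3:6 instead
df$medianCol <- apply(df[,2:5],1,FUN = median)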
grouped_df <- group_by(df,symbol)
filtered_df <- filter(grouped_df, medianCol == max(medianCol))
filtered_df$medianCol <- NULL
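One caveat with the filter approach is that `medianCol == max(medianCol)` keeps every probe that ties for the highest median, so a gene can still appear more than once. A minimal sketch of one way around this, assuming dplyr 1.0.0 or later (which provides `slice_max`), is shown below. The `toy` data frame and its probe names are hypothetical and only there to force a tie.

```r
library(dplyr)

# Hypothetical toy data: ProbeA and ProbeB tie on the median for Gene1.
toy <- data.frame(
  probe   = c("ProbeA", "ProbeB", "ProbeC"),
  symbol  = c("Gene1", "Gene1", "Gene2"),
  Sample1 = c(1, 2, 5),
  Sample2 = c(3, 2, 7),
  stringsAsFactors = FALSE
)
toy$medianCol <- apply(toy[, c("Sample1", "Sample2")], 1, median)

# filter() keeps both tied Gene1 probes, so Gene1 appears twice.
with_ties <- toy %>%
  group_by(symbol) %>%
  filter(medianCol == max(medianCol)) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps exactly one row per gene.
one_per_gene <- toy %>%
  group_by(symbol) %>%
  slice_max(medianCol, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Here `with_ties` has three rows while `one_per_gene` has two, one for each gene.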