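How do I select a single row from duplicates based on the highest median in R?

Time: 2015-08-03 08:20:12

Tags: r

I have a data frame containing gene expression data, as shown below: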

row.names     symbol     Sample1     Sample2     Sample3     Sample4
Probe1        Gene1      1.5         2.8         1.8         3.2
Probe2        Gene2      2.7         4.5         3.2         5.1
Probe3        Gene3      1.1         4.7         2.3         5.3
Probe4        Gene2      1.2         0.9         0.8         1.1
Probe5        Gene1      3.1         6.1         6.2         4.2

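I would like to subset the data so that only unique genes are kept, retaining in each case the probe with the highest median. That is, the data above would become: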

row.names     symbol     Sample1     Sample2     Sample3     Sample4
Probe2        Gene2      2.7         4.5         3.2         5.1
Probe3        Gene3      1.1         4.7         2.3         5.3
Probe5        Gene1      3.1         6.1         6.2         4.2

The data frame has about 40,000 individual probes and ~100 samples.

Does anyone know which R commands are suited to this task?

2 Answers:

Answer 0: (score: 3)

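I wouldn't calculate the median row by row, but rather use the vectorized rowMedians function from the matrixStats package. Then I would reorder by the result and select the unique entries using data.table:
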
library(data.table)
library(matrixStats)
df$Medians <- rowMedians(as.matrix(df[-(1:2)]))
unique(setDT(df)[order(-Medians)], by = "symbol")
#    row.names symbol Sample1 Sample2 Sample3 Sample4 Medians
# 1:    Probe5  Gene1     3.1     6.1     6.2     4.2    5.15
# 2:    Probe2  Gene2     2.7     4.5     3.2     5.1    3.85
# 3:    Probe3  Gene3     1.1     4.7     2.3     5.3    3.50
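
The same "sort by median, then keep the first row per gene" idea can also be sketched in base R without extra packages. This is only an illustrative sketch: the column name `probe` is a hypothetical stand-in for the probe ID column (a column literally named `row.names` cannot be passed to `data.frame()` directly, because that is an argument name), and the sample columns are assumed to start with "Sample".

```r
# Example data from the question; `probe` is a hypothetical name for the ID column.
df <- data.frame(
  probe   = paste0("Probe", 1:5),
  symbol  = c("Gene1", "Gene2", "Gene3", "Gene2", "Gene1"),
  Sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  Sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  Sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  Sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  stringsAsFactors = FALSE
)

# Per-row median over the sample columns only (assumed to be named "Sample...").
sample_cols <- grep("^Sample", names(df))
med <- apply(as.matrix(df[sample_cols]), 1, median)

# Sort rows by median, highest first, then keep the first row seen for each symbol.
sorted <- df[order(-med), ]
result <- sorted[!duplicated(sorted$symbol), ]
result
```

For the example data this keeps Probe5, Probe2 and Probe3, in that order. On a large data frame the `apply` call is likely to be the slow part, which is the reason for preferring `rowMedians` above.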

Some benchmarks

library(data.table)
library(matrixStats)
library(dplyr)

set.seed(123)
bigdf <- data.frame(A = paste0("Probe", 1:1e5),
                    symbol = paste0("Gene", sample(1e2, 1e5, replace = TRUE)),
                    matrix(sample(1e2, 1e6, replace = TRUE), ncol = 100))
bigdf2 <- copy(bigdf)
bigdf3 <- copy(bigdf2)

system.time({
  bigdf$Medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
  unique(setDT(bigdf)[order(-Medians)], by = "symbol")
  })

# user  system elapsed 
# 0.22    0.05    0.26 

# Within by = symbol, .SD already excludes the symbol column, so only the ID column (A) is dropped.
system.time(setDT(bigdf2)[, .SD[which.max(apply(.SD[, -1, with = FALSE], 1, median)), ], by = symbol])
# user  system elapsed 
# 5.17    0.01    5.33 
system.time({
  bigdf3$medianCol <- apply(bigdf3[-(1:2)], 1, FUN = median)
  grouped_df <- group_by(bigdf3, symbol)
  filtered_df <- filter(grouped_df, medianCol == max(medianCol))
})
# user  system elapsed 
# 5.15    0.00    5.15 

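Answer 1: (score: 1)

Alternatively, using dplyr:

library(dplyr)
# Select the sample columns by name, so this works whether or not the probe IDs are a regular column
sample_cols <- grep("^Sample", names(df))
df$medianCol <- apply(df[, sample_cols], 1, FUN = median)
grouped_df <- group_by(df, symbol)
# Keep the row(s) with the highest median per gene; tied rows are all kept
filtered_df <- filter(grouped_df, medianCol == max(medianCol))
filtered_df$medianCol <- NULL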
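
One caveat with filtering on `medianCol == max(medianCol)` is that two probes of the same gene with identical medians would both be kept. If exactly one row per gene is required, a possible variation is `slice_max()` with `with_ties = FALSE`, available in dplyr 1.0.0 and later. The toy data below is hypothetical and constructed only to show a tie; it is not from the question.

```r
library(dplyr)

# Hypothetical toy data with a deliberate tie: both Gene1 probes have median 2.
df <- data.frame(
  probe   = c("Probe1", "Probe2", "Probe3"),
  symbol  = c("Gene1", "Gene1", "Gene2"),
  Sample1 = c(1, 2, 5),
  Sample2 = c(2, 2, 6),
  Sample3 = c(3, 2, 7),
  stringsAsFactors = FALSE
)
df$medianCol <- apply(df[grep("^Sample", names(df))], 1, median)

# filter(medianCol == max(medianCol)) would return both Gene1 rows here;
# slice_max(..., with_ties = FALSE) returns exactly one row per group.
one_per_gene <- df %>%
  group_by(symbol) %>%
  slice_max(medianCol, n = 1, with_ties = FALSE) %>%
  ungroup()

one_per_gene
```

Which of the tied probes survives is an arbitrary choice, so a secondary criterion (for example the row mean) may be worth adding if that matters for the analysis.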