Quickly subset a data frame to the minimum value within each factor level

Time: 2016-07-13 00:49:02

Tags: r

I have a very large dataset that looks like this:

y <- structure(list(external_gene_name = structure(c(1L, 1L, 1L, 6L, 
      6L, 4L, 3L, 5L, 5L, 2L), .Label = c("FAM87B", "ISG15", "KLHL17", 
      "NOC2L", "PLEKHN1", "SAMD11"), class = "factor"), shortestDistance = c(99L, 
      2L, 1552L, 885L, 1677L, 2160L, 882L, 421L, 497L, 1246L), A = c(8.388, 
      9.913, 22.876, 19.826, 25.163, 122.766, 122.766, 40.414, 16.013, 
      155.554), B = c(3.403, 0.851, 0.851, 33.179, 26.373, 80.821, 
      80.821, 8.507, 17.015, 165.045), C = c(0.541, 1.622, 11.892, 
      2.162, 3.243, 17.298, 17.298, 0.541, 1.081, 33.515)), .Names = c("external_gene_name", 
      "shortestDistance", "A", "B", "C"), row.names = c(5L, 7L, 8L, 
      19L, 20L, 21L, 22L, 23L, 25L, 31L), class = "data.frame")


# external_gene_name shortestDistance       A       B      C
# 5              FAM87B               99   8.388   3.403  0.541
# 7              FAM87B                2   9.913   0.851  1.622
# 8              FAM87B             1552  22.876   0.851 11.892
# 19             SAMD11              885  19.826  33.179  2.162
# 20             SAMD11             1677  25.163  26.373  3.243
# 21              NOC2L             2160 122.766  80.821 17.298
# 22             KLHL17              882 122.766  80.821 17.298
# 23            PLEKHN1              421  40.414   8.507  0.541
# 25            PLEKHN1              497  16.013  17.015  1.081
# 31              ISG15             1246 155.554 165.045 33.515

I need to subset this data so that, for each factor level of external_gene_name, only the entry with the smallest shortestDistance is kept. I found this implementation:

library(plyr)
ddply(y, .(external_gene_name), summarise, shortestDistance=min(shortestDistance))
# external_gene_name shortestDistance
# 1             FAM87B                2
# 2              ISG15             1246
# 3             KLHL17              882
# 4              NOC2L             2160
# 5            PLEKHN1              421
# 6             SAMD11              885

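However, this keeps only the grouping factor column and the summarised column. I need to keep every column of the original dataset. To achieve that, I also came up with this implementation: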

do.call(rbind, by(y, y$external_gene_name, function(z) z[which.min(z$shortestDistance), ] ))
# external_gene_name shortestDistance       A       B      C
# FAM87B              FAM87B                2   9.913   0.851  1.622
# ISG15                ISG15             1246 155.554 165.045 33.515
# KLHL17              KLHL17              882 122.766  80.821 17.298
# NOC2L                NOC2L             2160 122.766  80.821 17.298
# PLEKHN1            PLEKHN1              421  40.414   8.507  0.541
# SAMD11              SAMD11              885  19.826  33.179  2.162

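This returns the data with all original columns intact. However, both implementations are very slow. My full dataset has a variable number of sample columns (like A to C above, but 50+) and ~15,000 external_gene_name factor levels, each with 1-20+ observations. Processing a single dataset this way takes 15 minutes, and I need to process more than 500 such datasets. Is there a better way to do this? I have seen data.table implementations for similar use cases that are reportedly much faster, but I could not work out how to apply them here.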

3 answers:

Answer 0 (score: 2)

Here is one data.table approach you can try:

library(data.table)
setDT(y)
y[, .SD[shortestDistance == min(shortestDistance)], .(external_gene_name)]
#    external_gene_name shortestDistance       A       B      C
# 1:             FAM87B                2   9.913   0.851  1.622
# 2:             SAMD11              885  19.826  33.179  2.162
# 3:              NOC2L             2160 122.766  80.821 17.298
# 4:             KLHL17              882 122.766  80.821 17.298
# 5:            PLEKHN1              421  40.414   8.507  0.541
# 6:              ISG15             1246 155.554 165.045 33.515

Another option that may be faster:

setkey(y, external_gene_name, shortestDistance)
y[, head(.SD, 1), .(external_gene_name)]
#    external_gene_name shortestDistance       A       B      C
# 1:             FAM87B                2   9.913   0.851  1.622
# 2:              ISG15             1246 155.554 165.045 33.515
# 3:             KLHL17              882 122.766  80.821 17.298
# 4:              NOC2L             2160 122.766  80.821 17.298
# 5:            PLEKHN1              421  40.414   8.507  0.541
# 6:             SAMD11              885  19.826  33.179  2.162
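
The two snippets above can differ when a group has tied minima: the `==` filter keeps every tied row, while taking the first row of a sorted group keeps exactly one. The sketch below illustrates this on a hypothetical toy table (`toy`, `gene`, and `dist` are made-up names, not part of the question's data), using `which.min()` as the one-row variant.

```r
library(data.table)

# Hypothetical toy data: gene "g1" has two rows tied at the minimum distance.
toy <- data.table(
  gene = c("g1", "g1", "g1", "g2"),
  dist = c(5L, 5L, 9L, 3L),
  A    = c(1.1, 2.2, 3.3, 4.4)
)

# Equality filter: keeps every row that attains the group minimum.
all_ties <- toy[, .SD[dist == min(dist)], by = gene]

# which.min() returns only the first position of the minimum,
# so each group contributes exactly one row.
one_row <- toy[, .SD[which.min(dist)], by = gene]

nrow(all_ties)
nrow(one_row)
```

Which behaviour is wanted depends on whether tied entries should all be retained or collapsed to a single representative.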

Answer 1 (score: 1)

Here is a dplyr solution:

library(dplyr)
y %>%
  group_by(external_gene_name) %>%
  do(.[which.min(.$shortestDistance), ])
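
In dplyr 1.0.0 and later, `do()` is superseded, and `slice_min()` can express the same per-group selection. The following is a minimal sketch on a hypothetical toy data frame (`toy`, `gene`, `dist` are illustrative names), assuming a dplyr version that provides `slice_min()`.

```r
library(dplyr)

# Hypothetical toy data standing in for the question's data frame.
toy <- data.frame(
  gene = c("g1", "g1", "g2"),
  dist = c(7L, 2L, 4L),
  A    = c(0.5, 1.5, 2.5)
)

closest <- toy %>%
  group_by(gene) %>%
  # with_ties = FALSE guarantees a single row per group even if minima tie.
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()

closest
```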

Answer 2 (score: 1)

We can `order` by 'external_gene_name' and 'shortestDistance', then use `unique` with `by = "external_gene_name"` to keep the first row of each group.

library(data.table)
unique(setDT(y)[order(external_gene_name, shortestDistance)], by = "external_gene_name")
#   external_gene_name shortestDistance       A       B      C
#1:             FAM87B                2   9.913   0.851  1.622
#2:              ISG15             1246 155.554 165.045 33.515
#3:             KLHL17              882 122.766  80.821 17.298
#4:              NOC2L             2160 122.766  80.821 17.298
#5:            PLEKHN1              421  40.414   8.507  0.541
#6:             SAMD11              885  19.826  33.179  2.162

Or another option is to use `.I`:

setDT(y)[y[, .I[shortestDistance == min(shortestDistance)], .(external_gene_name)]$V1]
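
To see what the `.I` idiom is doing: `.I` holds row numbers of the original table, so the inner call collects the row index of each group's minimum in a column named `V1`, and the outer subset then pulls those rows in a single step. A small sketch on a hypothetical table (`toy`, `gene`, `dist` are made-up names), using `which.min()` so that one index per group is returned:

```r
library(data.table)

# Hypothetical toy data; groups are deliberately interleaved.
toy <- data.table(
  gene = c("b", "a", "b", "a"),
  dist = c(4L, 8L, 1L, 6L)
)

# Step 1: one original row number per group (returned in column V1).
idx <- toy[, .I[which.min(dist)], by = gene]

# Step 2: subset the full table by those row numbers.
closest <- toy[idx$V1]

idx
closest
```

Splitting the one-liner this way makes it easy to inspect the intermediate indices before subsetting.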

Or using dplyr:

library(dplyr)
y %>%
  group_by(external_gene_name) %>%
  filter(shortestDistance == min(shortestDistance))

Or:

y %>%
  group_by(external_gene_name) %>%
  slice(which.min(shortestDistance))

A base R option:

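y_sorted <- y[with(y, order(external_gene_name, shortestDistance)), ]  # sort so each gene's minimum comes first
y_sorted[!duplicated(y_sorted$external_gene_name), ]                   # keep the first row per gene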
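
Another base R route, offered here as a sketch rather than as part of the original answer, avoids sorting altogether: `ave()` computes the per-group minimum for every row, and comparing against it gives a logical mask. The data frame `toy` and its columns are hypothetical stand-ins for the question's data.

```r
# Hypothetical toy data frame with a factor grouping column.
toy <- data.frame(
  gene = factor(c("g1", "g1", "g2", "g2", "g3")),
  dist = c(99L, 2L, 885L, 1677L, 421L),
  A    = c(8.4, 9.9, 19.8, 25.2, 40.4)
)

# ave() returns, for each row, the minimum dist within that row's gene.
is_min <- toy$dist == ave(toy$dist, toy$gene, FUN = min)

# Rows keep their original order; ties (if any) are all retained.
closest <- toy[is_min, ]
closest
```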