I have a very large dataset that looks like this:
y <- structure(list(external_gene_name = structure(c(1L, 1L, 1L, 6L,
6L, 4L, 3L, 5L, 5L, 2L), .Label = c("FAM87B", "ISG15", "KLHL17",
"NOC2L", "PLEKHN1", "SAMD11"), class = "factor"), shortestDistance = c(99L,
2L, 1552L, 885L, 1677L, 2160L, 882L, 421L, 497L, 1246L), A = c(8.388,
9.913, 22.876, 19.826, 25.163, 122.766, 122.766, 40.414, 16.013,
155.554), B = c(3.403, 0.851, 0.851, 33.179, 26.373, 80.821,
80.821, 8.507, 17.015, 165.045), C = c(0.541, 1.622, 11.892,
2.162, 3.243, 17.298, 17.298, 0.541, 1.081, 33.515)), .Names = c("external_gene_name",
"shortestDistance", "A", "B", "C"), row.names = c(5L, 7L, 8L,
19L, 20L, 21L, 22L, 23L, 25L, 31L), class = "data.frame")
# external_gene_name shortestDistance A B C
# 5 FAM87B 99 8.388 3.403 0.541
# 7 FAM87B 2 9.913 0.851 1.622
# 8 FAM87B 1552 22.876 0.851 11.892
# 19 SAMD11 885 19.826 33.179 2.162
# 20 SAMD11 1677 25.163 26.373 3.243
# 21 NOC2L 2160 122.766 80.821 17.298
# 22 KLHL17 882 122.766 80.821 17.298
# 23 PLEKHN1 421 40.414 8.507 0.541
# 25 PLEKHN1 497 16.013 17.015 1.081
# 31 ISG15 1246 155.554 165.045 33.515
I need to subset this data so that, for each factor level of external_gene_name, only the entry with the smallest shortestDistance is kept. I found this implementation:
library(plyr)
ddply(y, .(external_gene_name), summarise, shortestDistance=min(shortestDistance))
# external_gene_name shortestDistance
# 1 FAM87B 2
# 2 ISG15 1246
# 3 KLHL17 882
# 4 NOC2L 2160
# 5 PLEKHN1 421
# 6 SAMD11 885
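However, this keeps only the grouping factor column and the summarised column. I need to keep all columns from the original dataset. To do that, I also came up with this implementation: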
do.call(rbind, by(y, y$external_gene_name, function(z) z[which.min(z$shortestDistance), ] ))
# external_gene_name shortestDistance A B C
# FAM87B FAM87B 2 9.913 0.851 1.622
# ISG15 ISG15 1246 155.554 165.045 33.515
# KLHL17 KLHL17 882 122.766 80.821 17.298
# NOC2L NOC2L 2160 122.766 80.821 17.298
# PLEKHN1 PLEKHN1 421 40.414 8.507 0.541
# SAMD11 SAMD11 885 19.826 33.179 2.162
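This returns the data with the original columns intact. However, both implementations are very slow. My full dataset contains a variable number of sample columns (A to C here, 50+ in practice) and ~15,000 factor levels of external_gene_name, each with 1-20+ observations. Processing a single dataset this way takes 15 minutes, and I need to process more than 500 such datasets. Is there a better way to do this? I have seen data.table implementations for similar use cases that are reportedly faster, but I could not figure out how to implement them.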
Answer 0 (score: 2)
Here is one data.table approach you can try:
library(data.table)
setDT(y)
y[, .SD[shortestDistance == min(shortestDistance)], .(external_gene_name)]
# external_gene_name shortestDistance A B C
# 1: FAM87B 2 9.913 0.851 1.622
# 2: SAMD11 885 19.826 33.179 2.162
# 3: NOC2L 2160 122.766 80.821 17.298
# 4: KLHL17 882 122.766 80.821 17.298
# 5: PLEKHN1 421 40.414 8.507 0.541
# 6: ISG15 1246 155.554 165.045 33.515
Another option that may be faster:
setkey(y, external_gene_name, shortestDistance)
y[, head(.SD, 1), .(external_gene_name)]
# external_gene_name shortestDistance A B C
# 1: FAM87B 2 9.913 0.851 1.622
# 2: ISG15 1246 155.554 165.045 33.515
# 3: KLHL17 882 122.766 80.821 17.298
# 4: NOC2L 2160 122.766 80.821 17.298
# 5: PLEKHN1 421 40.414 8.507 0.541
# 6: SAMD11 885 19.826 33.179 2.162
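One difference between the two forms is worth checking on your own data: filtering with == min(...) keeps every row tied for the minimum, while taking the first row per group keeps exactly one. A minimal sketch on a made-up table (d, g, dist and v are hypothetical names, not part of the question's data):

```r
library(data.table)

# Hypothetical toy table: group "a" has two rows tied for the minimum distance
d <- data.table(g = c("a", "a", "b"), dist = c(1L, 1L, 5L), v = c(10, 20, 30))

# Keeps all rows whose distance equals the group minimum (ties included)
ties_kept <- d[, .SD[dist == min(dist)], by = g]

# Keeps only the first minimal row in each group
first_only <- d[, .SD[which.min(dist)], by = g]

nrow(ties_kept)   # 3
nrow(first_only)  # 2
```

If exactly one row per gene is required, the second form (or the keyed head(.SD, 1) approach) is the safer choice.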
Answer 1 (score: 1)
Here is a dplyr solution:
library(dplyr)
y %>%
group_by(external_gene_name) %>%
do(.[which.min(.$shortestDistance), ])
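As a possible alternative to do(), and assuming dplyr 1.0.0 or later is installed, slice_min() expresses the same idea directly. The sketch below uses a small made-up data frame (d, gene, dist and A are hypothetical names) rather than the question's y:

```r
library(dplyr)

# Hypothetical toy data: gene "g2" has two rows tied for the smallest distance
d <- data.frame(
  gene = c("g1", "g1", "g2", "g2", "g2"),
  dist = c(99L, 2L, 885L, 1677L, 885L),
  A    = c(8.4, 9.9, 19.8, 25.2, 30.1)
)

closest <- d %>%
  group_by(gene) %>%
  # with_ties = FALSE returns exactly one row per group even when minima tie
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()

closest
```

Leaving with_ties at its default of TRUE would instead return all tied rows, which matches the behaviour of a filter on dist == min(dist).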
Answer 2 (score: 1)
We can order by 'shortestDistance' within each 'external_gene_name' and then use unique to get the first row of each group.
library(data.table)
unique(setDT(y)[order(external_gene_name, shortestDistance)], by = "external_gene_name")
# external_gene_name shortestDistance A B C
#1: FAM87B 2 9.913 0.851 1.622
#2: ISG15 1246 155.554 165.045 33.515
#3: KLHL17 882 122.766 80.821 17.298
#4: NOC2L 2160 122.766 80.821 17.298
#5: PLEKHN1 421 40.414 8.507 0.541
#6: SAMD11 885 19.826 33.179 2.162
Another option is to use .I:
setDT(y)[y[, .I[shortestDistance == min(shortestDistance)], .(external_gene_name)]$V1]
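To judge whether the .I form is worth using at the scale described in the question, one could time it against the .SD form on synthetic data. This is only a rough sketch: the table big, the gene names and the S1 to S50 columns are made up to mimic ~15,000 groups of 1-20 rows with 50 sample columns, and timings will vary by machine and data.table version.

```r
library(data.table)
set.seed(1)

# Synthetic data (assumption): 15,000 genes, 1-20 rows each, 50 sample columns
n_genes <- 15000
sizes   <- sample(1:20, n_genes, replace = TRUE)
big <- data.table(
  external_gene_name = rep(paste0("gene", seq_len(n_genes)), times = sizes),
  shortestDistance   = sample.int(5000L, sum(sizes), replace = TRUE)
)
for (j in seq_len(50)) set(big, j = paste0("S", j), value = runif(nrow(big)))

# Subset .SD inside every group
t_sd <- system.time(
  res_sd <- big[, .SD[which.min(shortestDistance)], by = external_gene_name]
)

# Collect row numbers per group first, then subset the table once
t_i <- system.time(
  res_i <- big[big[, .I[which.min(shortestDistance)], by = external_gene_name]$V1]
)

rbind(SD = t_sd, I = t_i)[, "elapsed"]
```

Both calls use which.min here so that each returns exactly one row per gene and the results are directly comparable.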
Or using dplyr:
library(dplyr)
y %>%
group_by(external_gene_name) %>%
filter(shortestDistance == min(shortestDistance))
Or:
y %>%
group_by(external_gene_name) %>%
slice(which.min(shortestDistance))
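Or a base R option:
# sort first, then drop duplicates from the sorted copy so the row indices line up
y1 <- y[with(y, order(external_gene_name, shortestDistance)), ]
y1[!duplicated(y1$external_gene_name), ]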