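I have two data frames, a and b. For each row in b, I want to find all start,end ranges in a that fall within that row's start,end range, sum end - start over that subset of a, and store the result as a new column in b. I am currently using a for loop, but is there a more efficient way to do this in R using apply?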
# data.frame a
a <- data.frame(chrom=1L, start=as.integer(c(2,4,7,11)), end=as.integer(c(3,6,9,15)))
# chrom start end
# 1 2 3
# 1 4 6
# 1 7 9
# 1 11 15
# data.frame b
b <- data.frame(chr=1L, start=as.integer(c(2,11)), end=as.integer(c(10,20)))
# chr start end
# 1 2 10
# 1 11 20
# code
result = c()
for (i in seq_len(nrow(b))) {
  # find start,end in a that are within row i of b
  # (b names its chromosome column chr, not chrom)
  a_subset = a[which(a$chrom == b$chr[i] &
                     a$start >= b$start[i] &
                     a$end <= b$end[i]), ]
  # total length of the contained ranges
  result = append(result, sum(a_subset$end - a_subset$start))
}
c = cbind(b, result)
# data.frame c
# chr start end result
# 1 2 10 5
# 1 11 20 4
Answer 0 (score: 3)
This is easy with sqldf and annoying in base R:
R>require(sqldf)
R>b$id <- 1:nrow(b)
R>sqldf("select id, b.chr, sum(a.end - a.start) as diff
from a, b where a.chrom = b.chr and a.start >= b.start and b.end >= a.end group by id")
id chr diff
1 1 1 5
2 2 1 4
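For comparison, here is a minimal base R sketch of the same containment sum without an explicit for loop. It is only one possible approach, not part of the sqldf answer above: it assumes the sample data from the question (with b's chromosome column named chr), and the column name result is simply carried over from the question's loop. It still does one pass over a per row of b, so it is more of a tidier form of the loop than a faster algorithm.

```r
# Sample data as in the question (b names its chromosome column `chr`)
a <- data.frame(chrom = 1L, start = c(2L, 4L, 7L, 11L), end = c(3L, 6L, 9L, 15L))
b <- data.frame(chr = 1L, start = c(2L, 11L), end = c(10L, 20L))

# For each row of b, sum (end - start) over the rows of a that lie on the
# same chromosome and are fully contained in that row's [start, end] range
b$result <- vapply(seq_len(nrow(b)), function(i) {
  inside <- a$chrom == b$chr[i] & a$start >= b$start[i] & a$end <= b$end[i]
  sum(a$end[inside] - a$start[inside])
}, numeric(1))

print(b)
```

vapply is used instead of sapply so that the result is always a numeric vector, even when b has zero rows.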