I have two data frames, X and Y.
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850))
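I want to build a data.frame Z that contains the information from X plus, for each row of X, a count of the rows of Y that fall within that row's range. For example, three rows of Y on chr1 fall within the range of the first row of X, so that row of Z has a count of 3.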
Z <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950),
Count = c(3, 0, 0, 1, 1, 1))
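I would appreciate some help. So far I have only managed to print the row count when the X dataset has a single row, and I don't know how to reach my goal. I think I need some conditional statements plus a for loop over the rows of X, but I don't know how to write it.
What I have tried:
I tried counting the rows of Y that match the condition: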
nrow(Y[Y$Start >= X$Start & Y$Stop <= X$Stop, ])
This works when X has only one row, but not when I try to put it inside a for loop.
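For reference, the attempt above fails with several rows in X because `Y$Start >= X$Start` compares the two vectors element by element (row 1 with row 1, row 2 with row 2, and so on) instead of comparing one row of X against every row of Y. It also ignores V1. A minimal base R sketch of the row-by-row version follows, assuming that "within the range" means same V1, `Start >= Start` and `Stop <= Stop`. The helper name `count_in_row` is made up for illustration.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950),
                stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850),
                stringsAsFactors = FALSE)

# Hypothetical helper: count the rows of Y that lie inside row i of X
count_in_row <- function(i) {
  sum(Y$V1 == X$V1[i] &        # same chromosome label
      Y$Start >= X$Start[i] &  # starts at or after the X start
      Y$Stop <= X$Stop[i])     # ends at or before the X stop
}

# Apply the helper to every row index of X and attach the result
Z <- X
Z$Count <- vapply(seq_len(nrow(X)), count_in_row, integer(1))
print(Z)
```

An explicit for loop over `seq_len(nrow(X))` would do the same thing; `vapply` just collects the counts into a vector directly.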
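Answer 0 (score: 3)
You can do this with the tidyverse packages.
First, I suggest setting the option stringsAsFactors = FALSE, so that V1 is stored as character rather than factor.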
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950), stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850), stringsAsFactors = FALSE)
library(tidyverse)
# For each row of X, filter Y to the matching rows and count them
X %>%
  mutate(count = pmap_int(list(V1, Start, Stop),
                          ~ filter(Y, V1 == ..1, Start >= ..2, Stop <= ..3) %>% nrow))
V1 Start Stop count
1 chr1 0 230 3
2 chr1 540 720 0
3 chr1 920 1270 0
4 chr2 0 350 1
5 chr2 582 635 1
6 ch2 715 950 1
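Answer 1 (score: 3)
Since you mention that you are a bioinformatician, I will point you to Bioconductor and the GenomicRanges package, which is built specifically for this kind of problem.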
library(GenomicRanges)
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850))
x <- GRanges(X$V1, ranges = IRanges(X$Start, X$Stop))
y <- GRanges(Y$V1, ranges = IRanges(Y$Start, Y$Stop))
countOverlaps(x, y)
z <- GRanges(x, count = countOverlaps(x, y))
as.data.frame(z)
# seqnames start end width strand count
#1 chr1 0 230 231 * 3
#2 chr1 540 720 181 * 0
#3 chr1 920 1270 351 * 0
#4 chr2 0 350 351 * 1
#5 chr2 582 635 54 * 1
#6 ch2 715 950 236 * 1
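One caveat: by default `countOverlaps()` counts any overlap, including Y ranges that only partly overlap an X range. It gives the requested numbers on this data, but the question asks for Y ranges fully contained in an X range. A sketch of the stricter version, assuming that containment is what is wanted: `findOverlaps()` with `type = "within"` requires the query to lie entirely inside the subject, so Y is passed as the query and X as the subject.

```r
library(GenomicRanges)

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850))

x <- GRanges(X$V1, ranges = IRanges(X$Start, X$Stop))
y <- GRanges(Y$V1, ranges = IRanges(Y$Start, Y$Stop))

# Query = y, subject = x: keep only pairs where y lies entirely inside x
hits <- findOverlaps(y, x, type = "within")

# Tally how many contained y ranges each x range received
X$Count <- tabulate(subjectHits(hits), nbins = length(x))
print(X)
```

Note that `countOverlaps(x, y, type = "within")` would test the opposite direction (each X range inside a Y range), which is why the query and subject are swapped here.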