R: Trying to count the occurrences in one data frame based on the positions in another data frame

Asked: 2019-10-25 13:14:29

Tags: r dataframe merge

I have two data frames, X and Y:

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850))

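I want to obtain a new data.frame Z that contains the information from X plus, for each row of X, the number of rows of Y that fall within that row's range on the same chromosome. For example, three rows of Y fall within the range of the first chr1 row of X, so that row of Z has a count of 3.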

Z <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950),
                Count = c(3, 0, 0, 1, 1, 1))

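I would appreciate some help. So far I have only managed to print the row count when the "X" data set has a single row, and I do not know how to get from there to my goal. I think I need some conditional statements and a for loop that iterates over the rows of "X", but I do not know how to write it.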

What I have tried:

  1. Counting the rows that match the condition when "X" has only one row:

    nrow(Y[Y$Start >= X$Start & Y$Stop <= X$Stop, ])

This works when "X" has only one row, but not when I try to use it inside a for loop.
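One likely reason the one-liner breaks with several rows is that `Y$Start >= X$Start` compares the two columns element by element instead of comparing every row of Y against a single row of X. A minimal base R sketch of the row-by-row version is shown here. The helper name `count_within` is hypothetical, and the sketch assumes that "within" means same `V1`, `Start >=` and `Stop <=`, as in the attempt above.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850))

# Hypothetical helper: number of rows of `y` on chromosome `chrom`
# that lie entirely inside the range [start, stop].
count_within <- function(chrom, start, stop, y) {
  sum(as.character(y$V1) == chrom & y$Start >= start & y$Stop <= stop)
}

# Apply the helper to one row of X at a time.
Z <- X
Z$Count <- vapply(
  seq_len(nrow(X)),
  function(i) count_within(as.character(X$V1[i]), X$Start[i], X$Stop[i], Y),
  integer(1)
)

print(Z)
```

`vapply` plays the role of the for loop here: each iteration passes scalar values from one row of X, so the comparison against the columns of Y is no longer element by element.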

2 Answers:

Answer 0: (score: 3)

You can do this with the tidyverse packages.

First, I suggest setting the option stringsAsFactors = FALSE:

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950), stringsAsFactors = FALSE)

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850), stringsAsFactors = FALSE)



library(tidyverse)
X %>%
  mutate(count = pmap_int(list(V1, Start, Stop),
                          ~ filter(Y, V1 == ..1, Start >= ..2, Stop <= ..3) %>% nrow()))

    V1 Start Stop count
1 chr1     0  230     3
2 chr1   540  720     0
3 chr1   920 1270     0
4 chr2     0  350     1
5 chr2   582  635     1
6  ch2   715  950     1
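Since the question is tagged `merge`, the same counts can also be sketched in base R without a per-row filter: join X and Y on `V1`, keep the pairs where the Y range lies inside the X range, and tally the surviving pairs per X row. The `id` column and the names `Xi`, `pairs` and `inside` are introduced here for illustration only and are not part of the original answer.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950),
                stringsAsFactors = FALSE)

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850),
                stringsAsFactors = FALSE)

# Tag each X row with an id, because merge() does not preserve row order.
Xi <- cbind(id = seq_len(nrow(X)), X)

# All X/Y pairs that share a chromosome; clashing columns get .x / .y suffixes.
pairs <- merge(Xi, Y, by = "V1", suffixes = c(".x", ".y"))

# Keep only the pairs where the Y range lies entirely inside the X range.
inside <- pairs$Start.y >= pairs$Start.x & pairs$Stop.y <= pairs$Stop.x

# Count the surviving pairs per X row; rows with no match get 0.
Z <- X
Z$Count <- tabulate(pairs$id[inside], nbins = nrow(X))

print(Z)
```

The join builds every same-chromosome pair before filtering, so this sketch trades memory for vectorization; with very large inputs a dedicated interval package is probably the better choice.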

Answer 1: (score: 3)

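Since you mention that you are a bioinformatician, let me point you to Bioconductor and the GenomicRanges package, which was built specifically for this kind of problem.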

library(GenomicRanges)
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850))


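# Convert both data frames to GRanges objects (chromosome + interval)
x <- GRanges(X$V1, ranges = IRanges(X$Start, X$Stop))
y <- GRanges(Y$V1, ranges = IRanges(Y$Start, Y$Stop))

# For each range in x, count the ranges in y that overlap it
countOverlaps(x, y)
# Attach the counts as a metadata column and convert back to a data.frame
z <- GRanges(x, count = countOverlaps(x, y))
as.data.frame(z)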
#  seqnames start  end width strand count
#1     chr1     0  230   231      *     3
#2     chr1   540  720   181      *     0
#3     chr1   920 1270   351      *     0
#4     chr2     0  350   351      *     1
#5     chr2   582  635    54      *     1
#6      ch2   715  950   236      *     1
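One caveat worth checking: with its default arguments, `countOverlaps()` counts any overlap (`type = "any"`), whereas the question asks for Y ranges that lie entirely inside an X range. The two criteria agree on the data above, but they can differ. The base R sketch below illustrates the difference on made-up ranges; all values and variable names are hypothetical and not taken from the question.

```r
# Hypothetical ranges, chosen so that the two criteria disagree.
x_start <- 0
x_stop  <- 230
y_start <- c(3, 200, 300)
y_stop  <- c(15, 260, 320)

# Any overlap: the two closed intervals share at least one position.
any_overlap <- y_start <= x_stop & y_stop >= x_start

# Containment: the Y range lies entirely inside the X range.
contained <- y_start >= x_start & y_stop <= x_stop

n_any <- sum(any_overlap)     # the range 200-260 overlaps but sticks out
n_contained <- sum(contained)

print(c(any = n_any, contained = n_contained))
```

If strict containment matters, the `type` argument of `findOverlaps()` in the GenomicRanges documentation (for example `type = "within"`) is the place to look; which object is the query and which is the subject determines the direction of "within", so it is worth verifying on a small example first.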