How do I find the overlapping values of these ranges in R?

Time: 2016-10-06 04:48:32

Tags: r dataframe range

I have a data frame, df1, named ranges, like this:

1    bin chrom chromStart  chromEnd    name score
2     12  chr1   836780    856723    -5.7648   599
3    116  chr1   1693001   1739032   -4.8403   473
4    117  chr1   1750780   1880930   -5.3036   536
5    121  chr1   2020123   2108890   -4.4165   415

I also have a data.frame named viable, like this:

   chrom   chromStart  chromEnd        N
chr1      840000       890000       1566
chr1      1690000      1740000      1566
chr1      1700000      1750000      1566
chr1      1710000      1760000      1566
chr1      1720000      1770000      1566
chr1      1730000      1780000      1566
chr1      1740000      1790000      1566
chr1      1750000      1800000      1566
chr1      1760000      1810000      1566

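Basically, each row of ranges is an interval running from chromStart to chromEnd. I also have a list of intervals in a second data frame, df2, named viable. The intervals in viable are much smaller. I want to test each interval in ranges and make sure the whole interval falls within the intervals in viable. How can I do this?

The output I want is a data.frame like this: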

1    bin chrom chromStart  chromEnd    name score
2     12  chr1   840000    856723    -5.7648   599
3    116  chr1   1693001   1739032   -4.8403   473
6    133  chr1   1750780   1880930   -4.8096   469

1 answer:

Answer 0: (score: 2)

You could try the GenomicRanges package.

library(dplyr)
library(GenomicRanges)

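Here we load the example input data. (This is not an elegant way to do it, I know... but I'm lazy, and Sublime's multi-line editing makes it easy.) Note: I don't know what the "1" column means, but I kept it in the data.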

ranges <-
  rbind(
    c("2","12","chr1","836780","856723","-5.7648","599"),
    c("3","116","chr1","1693001","1739032","-4.8403","473"),
    c("4","117","chr1","1750780","1880930","-5.3036","536"),
    c("5","121","chr1","2020123","2108890","-4.4165","415")
  ) %>% 
  as.data.frame()
colnames(ranges) <-
  c("1","bin","chrom","chromStart","chromEnd","name","score")

viable <-
  rbind(
    c("chr1","840000","890000","1566"),
    c("chr1","1690000","1740000","1566"),
    c("chr1","1700000","1750000","1566"),
    c("chr1","1710000","1760000","1566"),
    c("chr1","1720000","1770000","1566"),
    c("chr1","1730000","1780000","1566"),
    c("chr1","1740000","1790000","1566"),
    c("chr1","1750000","1800000","1566"),
    c("chr1","1760000","1810000","1566")
  ) %>%
  as.data.frame()
colnames(viable) <-
  c("chrom","chromStart","chromEnd","N")

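## The coordinate columns need to be integers.
## as.character() comes first so the conversion is also correct if the
## columns were created as factors (the default before R 4.0).
ranges <-
  ranges %>%
  as_tibble() %>%   # tbl_df() is deprecated in current dplyr
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
    )
viable <-
  viable %>%
  as_tibble() %>%
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
    )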
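
The same sample data can also be built directly with typed columns, which avoids the character-to-integer round trip. This is only a sketch: the names ranges_alt and viable_alt are hypothetical (chosen so they do not overwrite the objects above), and unlike the code above it stores bin, name and score as numbers rather than strings.

```r
# Sketch: build the sample data with typed columns from the start.
# ranges_alt / viable_alt are hypothetical names, not part of the original answer.
ranges_alt <- data.frame(
  `1`        = c(2L, 3L, 4L, 5L),
  bin        = c(12L, 116L, 117L, 121L),
  chrom      = "chr1",
  chromStart = c(836780L, 1693001L, 1750780L, 2020123L),
  chromEnd   = c(856723L, 1739032L, 1880930L, 2108890L),
  name       = c(-5.7648, -4.8403, -5.3036, -4.4165),
  score      = c(599L, 473L, 536L, 415L),
  check.names = FALSE,        # keep the column literally named "1"
  stringsAsFactors = FALSE
)

viable_alt <- data.frame(
  chrom      = "chr1",
  # one isolated window, then eight windows stepping by 10 kb
  chromStart = c(840000L, seq(1690000L, 1760000L, by = 10000L)),
  chromEnd   = c(890000L, seq(1740000L, 1810000L, by = 10000L)),
  N          = 1566L,
  stringsAsFactors = FALSE
)
```

Either pair of data frames should be usable with makeGRangesFromDataFrame() in the steps that follow, since it only needs the chrom, chromStart and chromEnd columns to be present.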

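This is where my answer starts:

  1. Convert the data frames to GenomicRanges objects
  2. Find the shared regions with intersect
  3. Use findOverlaps to add the bin, name and score columns back. (Note: this information is dropped during the intersect, because the mapping is not necessarily 1:1.)
  4. Convert the output back to a data frame
  5. Done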

    gr.ranges <-
      makeGRangesFromDataFrame(ranges,
                               keep.extra.columns = TRUE,
                               seqnames.field = "chrom",
                               start.field = "chromStart",
                               end.field = "chromEnd")
    gr.viable <-
      makeGRangesFromDataFrame(viable,
                               keep.extra.columns = TRUE,
                               seqnames.field = "chrom",
                               start.field = "chromStart",
                               end.field = "chromEnd")
    
    # To find the intersects
    gr.intersect <-
      GenomicRanges::intersect(gr.ranges, gr.viable)
    
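    # Map each intersected region back to the original range it came from,
    # so the metadata columns (everything except chrom, start, end) can be restored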
    gr.hits <-
      GenomicRanges::findOverlaps(gr.intersect, gr.ranges)
    
    output <-
      gr.intersect[queryHits(gr.hits)]
    mcols(output) <-
      mcols(gr.ranges[subjectHits(gr.hits)])
    output
    
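    # Reformat to a data frame. The code refers to the `1` column as X1 after
    # as.data.frame(), so it is renamed back here; dplyr:: is spelled out in
    # case another attached package also defines select().
    output %>%
      as.data.frame() %>%
      dplyr::select(`1` = X1, bin, chrom = seqnames, chromStart = start, chromEnd = end, name, score)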
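
As a cross-check that does not need Bioconductor, the same clipping idea can be sketched in base R. This is not how GenomicRanges implements intersect; merge_windows and clip_to_windows are hypothetical helpers written for illustration. The sketch assumes a single chromosome and closed integer intervals, and it merges overlapping or adjacent viable windows before clipping.

```r
# Sketch under stated assumptions: one chromosome, closed integer intervals.
# merge_windows() and clip_to_windows() are hypothetical helper names.

# Collapse overlapping or directly adjacent windows into maximal blocks.
merge_windows <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  n <- length(start)
  run_end <- cummax(end)                        # furthest end seen so far
  new_block <- c(TRUE, start[-1] > run_end[-n] + 1L)
  block <- cumsum(new_block)
  data.frame(start = as.vector(tapply(start, block, min)),
             end   = as.vector(tapply(end, block, max)))
}

# Clip every range to the merged windows; ranges with no overlap are dropped.
clip_to_windows <- function(df, windows) {
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    s <- pmax(df$chromStart[i], windows$start)
    e <- pmin(df$chromEnd[i], windows$end)
    keep <- s <= e                              # non-empty overlaps only
    if (!any(keep)) return(NULL)
    data.frame(bin = df$bin[i], chromStart = s[keep], chromEnd = e[keep])
  })
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

# Sample data from the question (chr1 only)
rng <- data.frame(
  bin        = c(12L, 116L, 117L, 121L),
  chromStart = c(836780L, 1693001L, 1750780L, 2020123L),
  chromEnd   = c(856723L, 1739032L, 1880930L, 2108890L)
)
via_start <- c(840000L, seq(1690000L, 1760000L, by = 10000L))
via_end   <- c(890000L, seq(1740000L, 1810000L, by = 10000L))

merged  <- merge_windows(via_start, via_end)
clipped <- clip_to_windows(rng, merged)
clipped
```

On the sample data the viable windows collapse into two blocks (840000 to 890000 and 1690000 to 1810000), and three of the four ranges survive. Bin 117 is clipped to end at 1810000, because nothing in viable extends beyond that position, and bin 121 is dropped because it overlaps no viable window.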