I have a data frame, df1, named ranges, like this:
1 bin chrom chromStart chromEnd name score
2 12 chr1 836780 856723 -5.7648 599
3 116 chr1 1693001 1739032 -4.8403 473
4 117 chr1 1750780 1880930 -5.3036 536
5 121 chr1 2020123 2108890 -4.4165 415
I also have a data.frame named viable, like this:
chrom chromStart chromEnd N
chr1 840000 890000 1566
chr1 1690000 1740000 1566
chr1 1700000 1750000 1566
chr1 1710000 1760000 1566
chr1 1720000 1770000 1566
chr1 1730000 1780000 1566
chr1 1740000 1790000 1566
chr1 1750000 1800000 1566
chr1 1760000 1810000 1566
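Basically, ranges holds ranges of values running from chromStart to chromEnd. I also have a list of ranges in df2, viable. The ranges in viable are much smaller. I want to test each range in ranges and make sure that the entire range falls within the ranges in viable. How can I do this?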
The output I want is a data.frame like this:
1 bin chrom chromStart chromEnd name score
2 12 chr1 840000 856723 -5.7648 599
3 116 chr1 1693001 1739032 -4.8403 473
6 133 chr1 1750780 1880930 -4.8096 469
Answer 0: (score: 2)
You could try the GenomicRanges package.
library(dplyr)
library(GenomicRanges)
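Here we load the example input data. (This is a rather inelegant way to do it, I know, but I'm lazy, and Sublime's multi-line editing made it easy.) Note: I don't know what the "1" column means, but I kept it in the data.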
ranges <-
  rbind(
    c("2","12","chr1","836780","856723","-5.7648","599"),
    c("3","116","chr1","1693001","1739032","-4.8403","473"),
    c("4","117","chr1","1750780","1880930","-5.3036","536"),
    c("5","121","chr1","2020123","2108890","-4.4165","415")
  ) %>%
  as.data.frame()
colnames(ranges) <-
  c("1","bin","chrom","chromStart","chromEnd","name","score")
viable <-
  rbind(
    c("chr1","840000","890000","1566"),
    c("chr1","1690000","1740000","1566"),
    c("chr1","1700000","1750000","1566"),
    c("chr1","1710000","1760000","1566"),
    c("chr1","1720000","1770000","1566"),
    c("chr1","1730000","1780000","1566"),
    c("chr1","1740000","1790000","1566"),
    c("chr1","1750000","1800000","1566"),
    c("chr1","1760000","1810000","1566")
  ) %>%
  as.data.frame()
colnames(viable) <-
  c("chrom","chromStart","chromEnd","N")
## Need columns to be integers
ranges <-
  ranges %>%
  as_tibble() %>%  # tbl_df() is deprecated in current dplyr
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
  )
viable <-
  viable %>%
  as_tibble() %>%  # tbl_df() is deprecated in current dplyr
  mutate(
    chromStart = chromStart %>% as.character %>% as.integer,
    chromEnd = chromEnd %>% as.character %>% as.integer
  )
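Next, convert both data frames to GRanges objects, take their intersection, and then use findOverlaps to add the bin, name, and score columns back. (Note: this information is dropped during the intersection, because the mapping is not necessarily 1:1.)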
gr.ranges <-
  makeGRangesFromDataFrame(ranges,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chrom",
                           start.field = "chromStart",
                           end.field = "chromEnd")
gr.viable <-
  makeGRangesFromDataFrame(viable,
                           keep.extra.columns = TRUE,
                           seqnames.field = "chrom",
                           start.field = "chromStart",
                           end.field = "chromEnd")
# To find the intersects
gr.intersect <-
  GenomicRanges::intersect(gr.ranges, gr.viable)
# For linking up the non- chrom,start,end columns
gr.hits <-
  GenomicRanges::findOverlaps(gr.intersect, gr.ranges)
output <-
  gr.intersect[queryHits(gr.hits)]
mcols(output) <-
  mcols(gr.ranges[subjectHits(gr.hits)])
output
# Reformat to dataframe
output %>%
  as.data.frame() %>%
  select(`1` = X1, bin, chrom = seqnames, chromStart = start, chromEnd = end, name, score)
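If it helps to see what the intersection step is doing, here is a minimal base-R sketch of the same idea, without GenomicRanges. It is only an illustration under some assumptions: everything is on chr1 (so the chromosome column is ignored), coordinates are treated as closed integer intervals, and the names merge_intervals, merged, clipped and fully_viable are made up for this example. It first merges the overlapping viable windows, then clips each range to the merged windows, and finally flags the ranges that lie entirely inside a merged window, which is the stricter test the question describes.

```r
# Example data copied from the question (chr1 only, so chrom is omitted).
ranges <- data.frame(
  bin        = c(12L, 116L, 117L, 121L),
  chromStart = c(836780L, 1693001L, 1750780L, 2020123L),
  chromEnd   = c(856723L, 1739032L, 1880930L, 2108890L)
)
viable <- data.frame(
  chromStart = c(840000L, 1690000L, 1700000L, 1710000L, 1720000L,
                 1730000L, 1740000L, 1750000L, 1760000L),
  chromEnd   = c(890000L, 1740000L, 1750000L, 1760000L, 1770000L,
                 1780000L, 1790000L, 1800000L, 1810000L)
)

# Hypothetical helper: collapse overlapping or adjacent closed integer
# intervals into a set of disjoint intervals.
merge_intervals <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  m_start <- start[1]
  m_end <- end[1]
  for (i in seq_along(start)[-1]) {
    k <- length(m_end)
    if (start[i] <= m_end[k] + 1L) {
      m_end[k] <- max(m_end[k], end[i])  # extend the current interval
    } else {
      m_start <- c(m_start, start[i])    # open a new interval
      m_end <- c(m_end, end[i])
    }
  }
  data.frame(start = m_start, end = m_end)
}

merged <- merge_intervals(viable$chromStart, viable$chromEnd)

# Pair every range with every merged window and clip to the shared part.
idx <- expand.grid(r = seq_len(nrow(ranges)), m = seq_len(nrow(merged)))
clip_start <- pmax(ranges$chromStart[idx$r], merged$start[idx$m])
clip_end   <- pmin(ranges$chromEnd[idx$r], merged$end[idx$m])
keep <- clip_start <= clip_end           # drop pairs with no overlap
clipped <- data.frame(
  bin        = ranges$bin[idx$r][keep],
  chromStart = clip_start[keep],
  chromEnd   = clip_end[keep]
)

# Stricter test: is the whole range inside a single merged window?
ranges$fully_viable <- vapply(
  seq_len(nrow(ranges)),
  function(i) any(merged$start <= ranges$chromStart[i] &
                  ranges$chromEnd[i] <= merged$end),
  logical(1)
)

clipped
ranges
```

On the example data, the clipped result keeps bins 12, 116 and 117, with the third range cut off at 1810000 where the viable windows stop, and only bin 116 is flagged as lying entirely inside viable. Which of the two behaviours you want (clipping versus all-or-nothing) depends on how you read "the entire range".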