I need to keep a row of "ref" only when its interval (start/end) contains at least one "position" (start) from the "map" table.
Example of the "ref" table:
ref<-"chr start end
chr1 1 10
chr1 20 30
chr1 30 40
chr1 40 50
chr2 20 30
chr2 40 50
chr2 80 90"
ref<-read.table(text=ref,header=T)
Example of the "map" table:
map<-"chr start
chr1 1
chr1 3
chr1 5
chr1 31
chr1 32
chr2 1
chr2 2
chr2 89"
map<-read.table(text=map,header=T)
I need a final table like this (only the intervals that contain at least one value from the "map" table):
final<-"chr start end
chr1 1 10
chr1 30 40
chr2 80 90"
final<-read.table(text=final,header=T)
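Note that I also take the chromosome into account. The values to consider are all positions between "start" and "end" in "ref", not just the "start" and "end" values themselves.
To handle the chromosome, I thought of combining chr + start and chr + end into "tag" and "tag1" respectively: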
ref$tag <- paste0(ref$chr, "-", ref$start)
ref$tag1 <- paste0(ref$chr, "-", ref$end)
map$tag <- paste0(map$chr, "-", map$start)
Answer 0 (score: 2)
ref[ref$start %in% map$start | ref$end %in% map$start, ]
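In more detail (the output shown below was produced with different example data than the "ref" table above):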
rows_to_keep <- ref$start %in% map$start | ref$end %in% map$start
rows_to_keep
# [1] TRUE TRUE FALSE TRUE
ref[rows_to_keep, ]
# chr start end
# 1 chr1 1 2
# 2 chr2 2 10
# 4 chr2 6 10
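The expression above keeps only rows whose "start" or "end" equals a map position, and it does not compare "chr". A minimal base R sketch of the containment test described in the question follows. It assumes the interval endpoints are inclusive, and the helper name contains_position is hypothetical, introduced only for illustration:

```r
# Example data copied from the question
ref <- read.table(text = "chr start end
chr1 1 10
chr1 20 30
chr1 30 40
chr1 40 50
chr2 20 30
chr2 40 50
chr2 80 90", header = TRUE, stringsAsFactors = FALSE)

map <- read.table(text = "chr start
chr1 1
chr1 3
chr1 5
chr1 31
chr1 32
chr2 1
chr2 2
chr2 89", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if any map position on the same chromosome
# lies inside [start, end] (inclusive endpoints are an assumption)
contains_position <- function(chr, start, end) {
  any(map$chr == chr & map$start >= start & map$start <= end)
}

# Apply the test to every row of ref
rows_to_keep <- mapply(contains_position, ref$chr, ref$start, ref$end,
                       USE.NAMES = FALSE)
final <- ref[rows_to_keep, ]
final
```

This compares every interval against every position, which is fine for small tables; for genome-scale data a dedicated overlap tool is likely a better fit.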
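Answer 1 (score: 0)
According to the thread "Finding overlapping ranges between two interval data": "In general, it is very appropriate to use the Bioconductor package IRanges to deal with problems related to intervals." So, here you go: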
library("GenomicRanges")
# ref intervals as genomic ranges; chromosome levels listed explicitly
gr1 <- with(ref, GRanges(Rle(factor(chr, levels = c("chr1", "chr2"))),
                         IRanges(start, end)))
# map positions as width-1 ranges (start == end)
gr2 <- with(map, GRanges(Rle(factor(chr, levels = c("chr1", "chr2"))),
                         IRanges(start, start)))
# keep the ref intervals that overlap at least one map position
olaps <- as.data.frame(subsetByOverlaps(gr1, gr2))
# rename the columns returned by as.data.frame() (seqnames, start, end, width, strand)
names(olaps) <- c("chr", "start", "end", "width", "strand")
final <- subset(olaps, select = c("chr", "start", "end"))
> final
chr start end
1 chr1 1 10
2 chr1 30 40
3 chr2 80 90