I have 2 files:
"query.tab"
grp pos
1 10
1 45
2 6
3 12
"data.tab"
grp start end info
1 1 15 blue
1 23 60 red
2 1 40 green
3 20 30 black
I am trying to add $info from file "data" to file "query", but only where $grp from "query" matches $grp from "data" and $pos of query.tab lies between $start and $end of data.tab.
To get:
grp pos info
1 10 blue
1 45 red
2 6 green
3 12 NA
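(NB: for non-overlapping rows, $info can be "NA" or blank, it does not matter. This should not happen anyway.)
So far I am using findOverlaps(), but I am having trouble understanding how to manipulate its output: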
library(IRanges)
query =data.frame(grp = as.numeric(c("1", "1", "2", "3")), pos = as.numeric(c("10", "45", "6", "12")))
data = data.frame(grp=as.numeric(c("1", "1", "2", "3")), start=as.numeric(c("1", "23", "1", "20")), end=as.numeric(c("15", "60", "40", "30")), info=c("blue", "red", "green", "black"))
query.ir <- IRanges(start = query$pos, end = query$pos, names = query$grp)
data.ir <- IRanges(start = data$start, end = data$end, names = data$grp)
o <- findOverlaps(query.ir, data.ir, type = "within")
o
Hits object with 7 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 3
[2] 1 1
[3] 2 2
[4] 3 3
[5] 3 1
[6] 4 3
[7] 4 1
-------
queryLength: 4 / subjectLength: 4
Can I retrieve the $info field from this output, or am I on the wrong track?
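Answer 0 (score: 0)
Based on the desired output you provided, I think this should work. It could be generalized, but I prefer this version to avoid any confusion: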
# merge the two data frames (defined at the end of this answer) so that
# every query position is paired with every interval of its group
df <- merge(query.tab,data.tab, by = "grp")
# build the rows that must end up with NA, in three steps:
#   1. keep (grp, pos) pairs that appear only once after the merge
#   2. of those, keep the ones where pos lies outside [start, end]
#   3. overwrite their info with NA
df.nas <- df[!(duplicated(df[,c(1,2)]) | duplicated(df[,c(1,2)], fromLast = TRUE)), ]
df.nas <- df.nas[df.nas$pos>df.nas$end | df.nas$pos<df.nas$start, ]
df.nas$info <- NA
#only keep the rows that are overlapping (position between start and end)
df.cnd <- df[df$pos<=df$end & df$pos>=df$start, ]
#merge overlapped and non-overlapped data.frames
df.mrg <- rbind(df.cnd, df.nas)
#remove start and end columns and sort based on group and position
df.final <- df.mrg[with(df.mrg,order(grp, pos)),c(1,2,5)]
#output:
df.final
# grp pos info
# 1 1 10 blue
# 4 1 45 red
# 5 2 6 green
# 6 3 12 <NA>
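The steps above only produce an NA row when a group has a single interval. If a group had several intervals and none of them contained the position, that query row would be dropped. A more general sketch, assuming base R only and the same column names as in the question, filters the merged pairs first and then left-joins them back onto the query. The names `hits` and `res` are introduced here purely for illustration.

```r
# Same example data as in the question, rebuilt so this block runs on its own
query.tab <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data.tab  <- data.frame(grp   = c(1, 1, 2, 3),
                        start = c(1, 23, 1, 20),
                        end   = c(15, 60, 40, 30),
                        info  = c("blue", "red", "green", "black"),
                        stringsAsFactors = FALSE)

# Pair every query row with every interval of the same group
df <- merge(query.tab, data.tab, by = "grp")

# Keep only the pairs where pos falls inside [start, end]
hits <- df[df$pos >= df$start & df$pos <= df$end, c("grp", "pos", "info")]

# Left join back onto the query: positions without a hit get info = NA
res <- merge(query.tab, hits, by = c("grp", "pos"), all.x = TRUE)

# Sort explicitly rather than relying on the row order merge() returns
res <- res[order(res$grp, res$pos), ]
rownames(res) <- NULL
res
```

Note that a position lying inside several intervals of the same group would yield one output row per matching interval. Whether that is desirable depends on the data.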
Data:
read.table(text='grp pos
1 10
1 45
2 6
3 12', header=TRUE, quote='"') -> query.tab
read.table(text='grp start end info
1 1 15 blue
1 23 60 red
2 1 40 green
3 20 30 black', header=TRUE, quote='"') -> data.tab
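As for whether $info can be recovered from the findOverlaps() output itself: it probably can. The seven hits in the question arise because the ranges carry no group information (the names are not used for matching), so the hits need to be filtered by $grp afterwards. The sketch below is base R only: the hit pairs are transcribed from the Hits object printed in the question, and the assumption is that with IRanges loaded, `queryHits(o)` and `subjectHits(o)` would return the same two vectors. The names `qh`, `sh` and `same.grp` are illustrative.

```r
# Data frames as defined in the question
query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data  <- data.frame(grp   = c(1, 1, 2, 3),
                    start = c(1, 23, 1, 20),
                    end   = c(15, 60, 40, 30),
                    info  = c("blue", "red", "green", "black"),
                    stringsAsFactors = FALSE)

# Hit pairs copied from the printed Hits object.
# Assumption: with IRanges loaded these would be queryHits(o) and subjectHits(o).
qh <- c(1L, 1L, 2L, 3L, 3L, 4L, 4L)
sh <- c(3L, 1L, 2L, 3L, 1L, 3L, 1L)

# The ranges ignore grp, so keep only the hits whose groups agree
same.grp <- query$grp[qh] == data$grp[sh]

# Start with NA everywhere, then fill in info for the surviving hits
query$info <- NA_character_
query$info[qh[same.grp]] <- data$info[sh[same.grp]]
query
```

On the example data this leaves one hit each for the first three query rows and none for the fourth, which therefore keeps its NA.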