R: if position is within interval

Time: 2017-06-20 18:08:05

Tags: r overlap

I have two files:

"query.tab"

    grp   pos
    1     10
    1     45
    2     6
    3     12

"data.tab"

    grp   start   end   info
    1     1       15    blue
    1     23      60    red
    2     1       40    green
    3     20      30    black

I am trying to add the $info from file "data" to file "query", but only if:

  1. $grp from "query" matches $grp from "data"

  2. $pos from query.tab is between $start and $end of data.tab

in order to get:

       grp  pos   info
       1    10    blue
       1    45    red
       2    6     green
       3    12    NA
    

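NB: for a non-overlapping position, $info can be "NA" or blank, it does not matter (this should not happen anyway).

So far I am using findOverlaps(), but I am having trouble understanding how to manipulate its output: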

    library(IRanges)
    
    query =data.frame(grp = as.numeric(c("1", "1", "2", "3")), pos = as.numeric(c("10", "45", "6", "12")))
    data = data.frame(grp=as.numeric(c("1", "1", "2", "3")), start=as.numeric(c("1", "23", "1", "20")), end=as.numeric(c("15", "60", "40", "30")), info=c("blue", "red", "green", "black"))
    
    query.ir <- IRanges(start = query$pos, end = query$pos, names = query$grp)
    data.ir <- IRanges(start = data$start, end = data$end, names = data$grp)
    
    o <- findOverlaps(query.ir, data.ir, type = "within")
    o
    Hits object with 7 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         1           3
      [2]         1           1
      [3]         2           2
      [4]         3           3
      [5]         3           1
      [6]         4           3
      [7]         4           1
      -------
    
    queryLength: 4 / subjectLength: 4
    

Can I retrieve the $info field from this output, or am I on the wrong track?
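For what it is worth, the Hits object can be indexed back into the two data frames. The ranges built above carry no group information (the `names` argument does not affect overlap detection), which is why 7 hits are reported: every position is compared against every interval regardless of $grp. The following is only a sketch, assuming IRanges is installed; `q`, `s` and `same.grp` are illustrative names, and `stringsAsFactors = FALSE` is added so that `info` is a character vector on any R version.

```r
library(IRanges)

query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data <- data.frame(grp = c(1, 1, 2, 3),
                   start = c(1, 23, 1, 20),
                   end = c(15, 60, 40, 30),
                   info = c("blue", "red", "green", "black"),
                   stringsAsFactors = FALSE)

# Each query position is a width-1 range
query.ir <- IRanges(start = query$pos, end = query$pos)
data.ir <- IRanges(start = data$start, end = data$end)

o <- findOverlaps(query.ir, data.ir, type = "within")
q <- queryHits(o)    # row indices into query
s <- subjectHits(o)  # row indices into data

# Keep only the hits where the groups agree (illustrative filter)
same.grp <- query$grp[q] == data$grp[s]

# Start with NA everywhere, then fill in the group-matched hits
query$info <- NA_character_
query$info[q[same.grp]] <- data$info[s[same.grp]]
query
# info column: "blue" "red" "green" NA
```

If several intervals of the same group contain one position, this sketch keeps only the last matching hit, so it assumes intervals within a group do not overlap each other.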

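1 answer:

Answer 0: (score: 0)

Based on the desired output you provided, I think this should work. It could be generalized, but I prefer this version to avoid any confusion: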

    # merge the two data.frames to get info for all groups and positions
    df <- merge(query.tab, data.tab, by = "grp")

    # get the rows that are not duplicated but may be non-overlapping
    # first: non-duplicates
    # second: non-overlapping
    # third: remove info and replace it by NA, as these are non-overlapping rows
    df.nas <- df[!(duplicated(df[, c(1, 2)]) | duplicated(df[, c(1, 2)], fromLast = TRUE)), ]
    df.nas <- df.nas[df.nas$pos > df.nas$end | df.nas$pos < df.nas$start, ]
    df.nas$info <- NA

    # only keep the rows that are overlapping (position between start and end)
    df.cnd <- df[df$pos <= df$end & df$pos >= df$start, ]

    # combine the overlapping and non-overlapping data.frames
    df.mrg <- rbind(df.cnd, df.nas)

    # remove the start and end columns and sort by group and position
    df.final <- df.mrg[with(df.mrg, order(grp, pos)), c(1, 2, 5)]

    # output:
    df.final

    #   grp pos  info
    # 1   1  10  blue
    # 4   1  45   red
    # 5   2   6 green
    # 6   3  12  <NA>

Data:

    read.table(text='grp   pos
           1   10
           1   45
           2   6
           3   12', header=TRUE, quote='"') -> query.tab

    read.table(text='grp   start   end   info
           1   1   15   blue
           1   23   60   red
           2   1   40   green
           3   20   30   black', header=TRUE, quote='"') -> data.tab
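As a possible generalization (a sketch, not the answer's own method), the overlapping rows can be joined back onto the query with `all.x = TRUE`. In the version above, a position whose group has several intervals but falls inside none of them appears to be dropped from the result entirely, because its merged rows count as duplicates. The left join keeps every query row and fills `info` with NA when nothing matches. The name `hits` is illustrative, and `stringsAsFactors = FALSE` is an added assumption to keep `info` as character.

```r
query.tab <- read.table(text = "grp pos
1 10
1 45
2 6
3 12", header = TRUE)

data.tab <- read.table(text = "grp start end info
1 1 15 blue
1 23 60 red
2 1 40 green
3 20 30 black", header = TRUE, stringsAsFactors = FALSE)

# All (position, interval) combinations that share a group
df <- merge(query.tab, data.tab, by = "grp")

# Keep only combinations where the position lies inside the interval
hits <- df[df$pos >= df$start & df$pos <= df$end, c("grp", "pos", "info")]

# Left join: every query row survives; unmatched rows get info = NA
df.final <- merge(query.tab, hits, by = c("grp", "pos"), all.x = TRUE)
df.final <- df.final[order(df.final$grp, df.final$pos), ]
df.final
# info column: "blue" "red" "green" NA
```

If a position lies inside more than one interval of its group, this returns one row per matching interval rather than choosing one of them.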