Question

I have 2 files:

"query.tab"

grp   pos
1   10
1   45
2   6
3   12

"data.tab"

grp   start   end   info
1   1   15   blue
1   23   60   red
2   1   40   green
3   20   30   black

I am trying to add $info from file "data" to file "query", but only if:

$grp from "query" matches $grp from "data"

$pos from query.tab is between $start and $end from data.tab.

To obtain:

   grp  pos   info
   1    10    blue
   1    45    red
   2    6     green
   3    12    NA

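(NB: for non-overlapping rows, $info can be "NA" or blank, it doesn't matter. It shouldn't happen anyway.)

So far I am using findOverlaps(), but I'm having trouble understanding how to manipulate its output: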

library(IRanges)

query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data <- data.frame(grp = c(1, 1, 2, 3), start = c(1, 23, 1, 20), end = c(15, 60, 40, 30), info = c("blue", "red", "green", "black"))

query.ir <- IRanges(start = query$pos, end = query$pos, names = query$grp)
data.ir <- IRanges(start = data$start, end = data$end, names = data$grp)

o <- findOverlaps(query.ir, data.ir, type = "within")
o
Hits object with 7 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           3
  [2]         1           1
  [3]         2           2
  [4]         3           3
  [5]         3           1
  [6]         4           3
  [7]         4           1
  -------

queryLength: 4 / subjectLength: 4

Can I retrieve the $info field from this output, or am I on the wrong track?

Answer 1

Based on the desired output you provided, I think this works. It could also be generalized, but I prefer this version to avoid any confusion:

#merge two data.frame to get info for all groups and positions
df <- merge(query.tab,data.tab, by = "grp")

#get the rows that are not duplicated but may be non-overlapping
#first non-duplicates
#second non-overlapping
#third remove info and replace by NA as it's non-overlapping rows
df.nas <- df[!(duplicated(df[,c(1,2)]) | duplicated(df[,c(1,2)], fromLast = TRUE)), ]
df.nas <- df.nas[df.nas$pos>df.nas$end | df.nas$pos<df.nas$start, ]
df.nas$info <- NA

#only keep the rows that are overlapping (position between start and end)
df.cnd <- df[df$pos<=df$end & df$pos>=df$start, ]

#merge overlapped and non-overlapped data.frames
df.mrg <- rbind(df.cnd, df.nas)

#remove start and end columns and sort based on group and position
df.final <- df.mrg[with(df.mrg,order(grp, pos)),c(1,2,5)]

#output:
df.final

#   grp pos  info 
# 1   1  10  blue 
# 4   1  45   red 
# 5   2   6 green 
# 6   3  12  <NA>
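
One caveat with the steps above: a query row whose group has several intervals but whose position falls inside none of them ends up in neither df.cnd nor df.nas, so it is dropped from the result. A more general version could look like the sketch below. It is only an illustration in base R, not the code from this answer; the helper name lookup_info is made up for the example, and it assumes that taking the first matching interval is acceptable when several overlap.

```r
# Self-contained copy of the example data
query.tab <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data.tab <- data.frame(grp = c(1, 1, 2, 3),
                       start = c(1, 23, 1, 20),
                       end = c(15, 60, 40, 30),
                       info = c("blue", "red", "green", "black"),
                       stringsAsFactors = FALSE)

# lookup_info is a hypothetical helper: for one (grp, pos) pair, return the
# info of the first interval in the same group that contains pos, else NA
lookup_info <- function(grp, pos) {
  hit <- data.tab$grp == grp & data.tab$start <= pos & pos <= data.tab$end
  if (any(hit)) data.tab$info[which(hit)[1]] else NA_character_
}

# Apply the helper row by row; every query row is kept
query.tab$info <- mapply(lookup_info, query.tab$grp, query.tab$pos)
query.tab
```

Because it scans data.tab once per query row, this is fine for small tables but will be slow on large ones.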

Data:

read.table(text='grp   pos
       1   10
       1   45
       2   6
       3   12', header=TRUE, quote='"') -> query.tab

read.table(text='grp   start   end   info
       1   1   15   blue
       1   23   60   red
       2   1   40   green
       3   20   30   black', header=TRUE, quote='"') -> data.tab
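
On the findOverlaps() part of the question: the names given to IRanges() are only labels and are not used when matching, which is why the Hits object pairs positions with intervals from other groups (7 hits instead of 3). The $info field can still be recovered by filtering the hits on grp afterwards. This is a rough sketch assuming the IRanges package is installed; it reuses the query and data frames from the question, and if a position overlaps several intervals in its group, the last hit wins.

```r
library(IRanges)

query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data <- data.frame(grp = c(1, 1, 2, 3),
                   start = c(1, 23, 1, 20),
                   end = c(15, 60, 40, 30),
                   info = c("blue", "red", "green", "black"),
                   stringsAsFactors = FALSE)

# Positions as width-1 ranges, intervals as start/end ranges
query.ir <- IRanges(start = query$pos, width = 1)
data.ir <- IRanges(start = data$start, end = data$end)

# Overlaps ignore grp, so hits across groups are included here
o <- findOverlaps(query.ir, data.ir, type = "within")
q <- queryHits(o)    # row index into query
s <- subjectHits(o)  # row index into data

# Keep only the hits where both rows belong to the same group
keep <- query$grp[q] == data$grp[s]

# Start from NA and fill in info for the query rows that have a valid hit
query$info <- NA_character_
query$info[q[keep]] <- data$info[s[keep]]
query
```

Filtering after the overlap step keeps the code close to what the question already has; for large inputs it may be more efficient to run the overlap separately per group.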
