Question

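I would like to do something similar to the solution in this thread: I have two data frames, and I want to find the overlapping regions and then merge the corresponding data for each hit.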

>x1
  chr start stop CN
1   1    10  140  G
2   1   100 1000  G
3   1  1500 5000  L



>x2
  chr start stop gene
1   1     1  100    a
2   1   100  150    b
3   1   190 1000    c
4   1  1000 2000    d
5   1  2000 5000    e

I can find the overlapping regions with the following code:

library(GenomicRanges)
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
gr2 = with(x2, GRanges(chr, IRanges(start=start, end=stop)))

hits = findOverlaps(gr1, gr2)

hits shows which regions in x1 overlap regions in x2, for example:

> hits
Hits of length 8
queryLength: 3
subjectLength: 5
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           1 
 2         1           2 
 3         2           1 
 4         2           2 
 5         2           3 
 6         2           4 
 7         3           4 
 8         3           5

What I want is an output that includes the gene and CN information from both x1 and x2. The output would look like this:

 x1chr x1start x1stop x1CN x2chr x2start x2stop x2gene
1     1      10    140    g     1       1    100      a
2     1      10    140    g     1     100    150      b
3     1     100   1000    g     1       1    100      a
4     1     100   1000    g     1     100    150      b
5     1     100   1000    g     1     190   1000      c
6     1     100   1000    g     1    1000   2000      d
7     1    1500   5000    l     1    1000   2000      d
8     1    1500   5000    l     1    2000   5000      e

Answer 1

You can use foverlaps from the data.table package:

library(data.table)
setkey(setDT(x1), start, stop)
setkey(setDT(x2), start, stop)
foverlaps(x2, x1)
#   chr start stop CN i.chr i.start i.stop gene
#1:   1    10  140  G     1       1    100    a
#2:   1   100 1000  G     1       1    100    a
#3:   1    10  140  G     1     100    150    b
#4:   1   100 1000  G     1     100    150    b
#5:   1   100 1000  G     1     190   1000    c
#6:   1   100 1000  G     1    1000   2000    d
#7:   1  1500 5000  L     1    1000   2000    d
#8:   1  1500 5000  L     1    2000   5000    e
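
Note that the key above contains only start and stop, so chr is not compared when overlaps are computed. The following is a minimal sketch, assuming the same x1 and x2 as in the question (rebuilt here as data.tables so the block runs on its own), that adds chr to the key and renames the columns toward the layout asked for in the question. The x1/x2 prefixes are illustrative choices, not something foverlaps produces by itself.

```r
library(data.table)

# Example data copied from the question.
x1 <- data.table(chr = 1L,
                 start = c(10L, 100L, 1500L),
                 stop = c(140L, 1000L, 5000L),
                 CN = c("G", "G", "L"))
x2 <- data.table(chr = 1L,
                 start = c(1L, 100L, 190L, 1000L, 2000L),
                 stop = c(100L, 150L, 1000L, 2000L, 5000L),
                 gene = c("a", "b", "c", "d", "e"))

# chr goes first in the key so that only intervals on the same
# chromosome are compared; the last two key columns define the interval.
setkey(x1, chr, start, stop)
setkey(x2, chr, start, stop)

res <- foverlaps(x2, x1, type = "any")

# Rename by old name, so the renaming does not depend on column order.
setnames(res,
         old = c("start", "stop", "CN", "i.start", "i.stop", "gene"),
         new = c("x1start", "x1stop", "x1CN", "x2start", "x2stop", "x2gene"))
print(res)
```

With chr in the key, the result carries a single chr column shared by both sides instead of separate chr and i.chr columns.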

Answer 2

I managed to find a very simple solution, using this code:

x<-cbind(x1[queryHits(hits),],x2[subjectHits(hits),])

This gives the desired output.
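
For reference, here is a self-contained sketch of the same idea, assuming the x1 and x2 data from the question (chr is stored as character here, which is an assumption made to keep the GRanges call unambiguous). It also prefixes the column names with x1 and x2 so that the two halves of the result can be told apart; the names q, s, and out are illustrative.

```r
library(GenomicRanges)

# Example data copied from the question.
x1 <- data.frame(chr = "1",
                 start = c(10, 100, 1500),
                 stop = c(140, 1000, 5000),
                 CN = c("G", "G", "L"),
                 stringsAsFactors = FALSE)
x2 <- data.frame(chr = "1",
                 start = c(1, 100, 190, 1000, 2000),
                 stop = c(100, 150, 1000, 2000, 5000),
                 gene = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

gr1 <- with(x1, GRanges(chr, IRanges(start = start, end = stop)))
gr2 <- with(x2, GRanges(chr, IRanges(start = start, end = stop)))
hits <- findOverlaps(gr1, gr2)

# Pick the matching rows from each data frame, one row per hit.
q <- x1[queryHits(hits), ]
s <- x2[subjectHits(hits), ]

# Prefix the column names so the source of each column stays visible.
names(q) <- paste0("x1", names(q))
names(s) <- paste0("x2", names(s))

out <- cbind(q, s)
rownames(out) <- NULL
print(out)
```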

Answer 3

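If you are on a Linux or Mac system, you can install bedtools (http://bedtools.readthedocs.org/en/latest/index.html). Then use the command "intersectBed -a fileA.txt -b fileB.txt -wa -wb > youroutputfile.txt". You will get a result file containing the rows of both data frame A and data frame B. For high-throughput datasets, bedtools is faster and more widely used.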
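
To feed the data frames from the question to that command, they first have to be written out as tab-delimited files without a header. The sketch below shows one way to do that from R. It assumes the coordinates in x1 and x2 are 1-based and inclusive (as IRanges treats them) and converts them to the 0-based, half-open convention of the BED format by subtracting 1 from start; if your coordinates are already in BED convention, skip that step. The helper to_bed is hypothetical, and the file names are the ones used in the command above.

```r
# Example data copied from the question.
x1 <- data.frame(chr = 1L,
                 start = c(10L, 100L, 1500L),
                 stop = c(140L, 1000L, 5000L),
                 CN = c("G", "G", "L"),
                 stringsAsFactors = FALSE)
x2 <- data.frame(chr = 1L,
                 start = c(1L, 100L, 190L, 1000L, 2000L),
                 stop = c(100L, 150L, 1000L, 2000L, 5000L),
                 gene = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: convert 1-based inclusive starts to
# 0-based half-open BED starts (the end coordinate stays the same).
to_bed <- function(df) {
  df$start <- df$start - 1L
  df
}

# bedtools expects tab-separated columns: chromosome, start, end,
# followed by any extra columns, with no header and no quoting.
write.table(to_bed(x1), file = "fileA.txt", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(to_bed(x2), file = "fileB.txt", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The file written by the intersectBed command can then be read back into R with read.table(..., sep = "\t"); remember to add 1 to the start columns again if you want 1-based coordinates.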

Overlapping genomic intervals and merging datasets

3 answers: