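How to calculate the percentage of overlap in R

Asked: 2015-04-17 14:59:56

Tags: r bioinformatics

I am trying to calculate the percentage of overlap between two data sets of genomic coordinates, subject to certain criteria.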

seg2

ID   chrom loc.start   loc.end num.mark seg.mean
AB    1   3010000 173490000     8430   0.0039
AB    1 173510000 173590000        5  -17.738
AB    1 173610000 173830000       12    0.011
AB    1 173850000 173970000        6  -16.121
AB    2   3090000 181990000     8434    0.011
BB   12   3090000  68990000     2950   -0.2022
BB   12  69010000  87790000      889    0.0267
BB   12  88010000  98550000      507   -0.3337
BB   12  98570000 115090000      800    0.0586
BB   12 115110000 119350000      197   -0.2031
BB   12 119370000 119430000        4   -20.671

over

chr     start       end  CNA   sample.ID
  1  68580000  68640000  loss  1-68580000-68640000
  3  15360000  16000000  loss  3-15360000-16000000
  4 122660000 123500000  gain  4-122660000-123500000
  7  48320000  48400000  loss  7-48320000-48400000
 12 115860000 115980000  loss  12-115860000-115980000
 12 113560000 114920000  gain  12-113560000-114920000

Expected output

ID   chrom loc.start   loc.end num.mark seg.mean  lm(percentage of overlap)
AB    1   3010000 173490000     8430   0.0039         %
AB    1 173510000 173590000        5  -17.738     
AB    1 173610000 173830000       12    0.011     
AB    1 173850000 173970000        6  -16.121     
AB    2   3090000 181990000     8434    0.011     
BB   12   3090000  68990000     2950   -0.2022     
BB   12  69010000  87790000      889    0.0267
BB   12  88010000  98550000      507   -0.3337
BB   12  98570000 115090000      800    0.0586
BB   12 115110000 119350000      197   -0.2031
BB   12 119370000 119430000        4   -20.671

I tried this script, but it did not work.

for (i in 1:nrow(seg2)) {
    seg2$lm <- if((seg2$chrom[i] == over$chr[i]) |
    (seg2$loc.start[i] <= over$start[i] & seg2$loc.end[i] >= over$end[i]) |
    (over$seg.mean[i] >= 0.459 & seg2$CNA[i] == "gain") |
    (over$seg.mean[i] <= -0.678 & seg2$CNA[i] == "loss"), 
    (over$end[i]-over$start[i])/(seg2$loc.end[i]-seg2$loc.start[i])*100)
    }

I am aware of the GenomicRanges package, but I would appreciate your suggestions.

1 answer:

Answer 0 (score: 2)

I strongly recommend using GenomicRanges to do this efficiently. If you already know how to create your own GRanges objects, these two steps give you the overlap lengths:

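library(GenomicRanges)

# to find overlaps
overlapping.index = findOverlaps(object1, object2)

# to get the overlap length of each overlapping pair
width(pintersect(ranges(object1)[queryHits(overlapping.index)],
                 ranges(object2)[subjectHits(overlapping.index)]))

Here "object1" and "object2" are GRanges objects holding the coordinates, and "overlapping.index" is the index of the overlapping pairs. Once you have the lengths, the percentage is easy to compute.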
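To make the last step concrete, here is a minimal sketch that goes from the two tables to a per-segment percentage. It assumes a hand-typed subset of the question's data, that the coordinates are 1-based and inclusive (so a width is end - start + 1, slightly different from the end - start used in the question), and that the regions in "over" do not overlap each other. The seg.mean and CNA criteria from the question are not applied here, and the variable names hits, overlap.width and covered are illustrative only.

```r
library(GenomicRanges)

# Subset of the question's data, typed in by hand for illustration.
seg2 <- data.frame(
  ID        = c("AB", "BB", "BB"),
  chrom     = c("1", "12", "12"),
  loc.start = c(3010000, 98570000, 115110000),
  loc.end   = c(173490000, 115090000, 119350000)
)
over <- data.frame(
  chr   = c("1", "3", "12", "12"),
  start = c(68580000, 15360000, 115860000, 113560000),
  end   = c(68640000, 16000000, 115980000, 114920000)
)

# Build one GRanges object per table.
object1 <- GRanges(seqnames = seg2$chrom,
                   ranges = IRanges(start = seg2$loc.start, end = seg2$loc.end))
object2 <- GRanges(seqnames = over$chr,
                   ranges = IRanges(start = over$start, end = over$end))

# Pairs (segment, region) that share a chromosome and overlap.
hits <- findOverlaps(object1, object2)

# Width of the intersection of each overlapping pair.
overlap.width <- width(pintersect(ranges(object1)[queryHits(hits)],
                                  ranges(object2)[subjectHits(hits)]))

# Total overlapped bases per segment (0 when a segment has no hit).
covered <- sapply(seq_along(object1),
                  function(i) sum(overlap.width[queryHits(hits) == i]))

# Percentage of each segment covered by regions in "over".
seg2$lm <- 100 * covered / width(object1)
seg2
```

With this subset, the three segments come out at roughly 0.035%, 8.23% and 2.83%. If regions in "over" can overlap one another, summing the widths would double-count; calling reduce() on object2 before findOverlaps() is one way to avoid that.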