How can I extract the non-overlapping ranges of the field 'region' using R?

Time: 2015-08-12 15:00:58

Tags: r

Given a file whose three columns are the ID, the left end of the region, and the right end of the region:

region1 1 100
region2 20 120
region3 101 200
region4 220 280

How can I extract the regions that do not overlap? The expected result is:

region1 1 100
region3 101 200
region4 220 280

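1 answer:

Answer 0: (score: 0)

Here is a solution that uses a loop to compare each row/range with the previous one, plus a small function that spots an overlap.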

# example dataset
dt = data.frame(region = 1:4,
                min = c(1,20,101,220),
                max = c(100,120,200,280))

# order data based on minimum value of range (in case you don't have an order already)
dt = dt[order(dt$min),]

dt

# region min max
# 1      1   1 100
# 2      2  20 120
# 3      3 101 200
# 4      4 220 280


# function that spots overlap: TRUE when the previous range's max
# reaches (or passes) the next range's min
overlap = function(prev_max, next_min) {
  prev_max >= next_min
}


# set starting point (row)
i = 2

# a loop that compares each row with the previous (kept) row
# and deletes the row when it finds overlap
while (i <= nrow(dt)) {

  if (overlap(dt$max[i-1], dt$min[i])) {
    # drop row i; the next row moves into position i, so i stays put
    dt = dt[-i, ]
  } else {
    i = i + 1
  }

}

dt

# region min max
# 1      1   1 100
# 3      3 101 200
# 4      4 220 280

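Note that this procedure depends on the first (fixed) range against which overlaps are computed. So if you have the ranges [1,100], [5,10], [15,30], [32,60], it will return only [1,100], because all the others overlap with it.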
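
To make that caveat concrete, here is a minimal sketch that wraps the same greedy idea in a function and runs it on those four ranges. The names `keep_non_overlapping` and `sort_by` are hypothetical, introduced only for this illustration. Sorting by `min` reproduces the behavior described above. Sorting by `max` instead is a different technique (the usual greedy approach to interval scheduling), not the method of the answer; it is shown only as one possible way to keep more ranges when a single wide range covers several narrow ones.

```r
# Hypothetical helper: greedy filter over ranges sorted by the column `sort_by`.
# A row is kept only if it overlaps none of the rows already kept.
keep_non_overlapping = function(dt, sort_by = "min") {
  dt = dt[order(dt[[sort_by]]), ]
  keep = logical(nrow(dt))
  for (i in seq_len(nrow(dt))) {
    kept = dt[keep, ]
    # closed ranges overlap when each one starts no later than the other ends
    clash = any(kept$min <= dt$max[i] & dt$min[i] <= kept$max)
    keep[i] = !clash
  }
  dt[keep, ]
}

# the ranges from the note above
dt2 = data.frame(region = 1:4,
                 min = c(1, 5, 15, 32),
                 max = c(100, 10, 30, 60))

keep_non_overlapping(dt2, sort_by = "min")   # only [1,100] survives
keep_non_overlapping(dt2, sort_by = "max")   # [5,10], [15,30], [32,60] survive
```

Which ordering is appropriate depends on what "non-overlapping" should mean for the data: keeping the earliest-starting range, or keeping as many ranges as possible.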