Question

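I'm trying to merge a bunch of overlapping time periods in R using data.table. I've called foverlaps on the table against itself, and that works efficiently.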

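My problem is this: say period A overlaps period B, and B overlaps period C, but A does not overlap C. In that case A is not grouped with C, even though all three must eventually be merged.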

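Currently I loop, finding overlaps and merging until no more merges occur, but this doesn't really scale. One solution I can see is to apply the group index to itself repeatedly until it stabilizes, but that still seems to need a loop, and I'd like a fully vectorized solution.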

library(data.table)

dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))
setkeyv(dt,c("start","end"))

# Self-join: for each row, the index of the first row it overlaps
f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which=TRUE)

#Needs to return [1,1,3,3,3,3]
print(f)
#1 1 3 3 4 5
print(f[f])
#1 1 3 3 3 4
print(f[f][f])
#1 1 3 3 3 3
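
To make the "apply the index to itself until it stabilizes" idea concrete, here is a minimal sketch of that loop. It is only an illustration of the approach described in the question, not a vectorized answer. The helper name `stabilize` is hypothetical, and `f` is hard-coded to the vector printed above so the snippet runs without data.table.

```r
# Index vector as printed by foverlaps() above (hard-coded for this sketch)
f <- c(1L, 1L, 3L, 3L, 4L, 5L)

# Hypothetical helper: re-index the vector by itself until nothing changes.
# Each pass composes the mapping with itself, so long chains collapse quickly.
stabilize <- function(idx) {
  repeat {
    nxt <- idx[idx]
    if (identical(nxt, idx)) return(idx)
    idx <- nxt
  }
}

grp <- stabilize(f)
print(grp)
#1 1 3 3 3 3
```

This still loops, which is exactly the limitation the question is about; it just stops by itself once the index no longer changes.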

Can anyone help me with some ideas on how to vectorize this procedure?

Edit with IDs:

dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))
setkeyv(dt,c("id","eventStart","eventEnd"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which=TRUE)

#Needs to return [1 1 3 3 3 6 6 8] or similar

Answer 1

The IRanges package on Bioconductor, which inspired data.table's foverlaps(), has some convenient functions for this kind of problem.

reduce() may be the function you are looking for to merge all overlapping periods:

library(data.table)
dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))

library(IRanges)
ir <- IRanges(dt$start, dt$end)

ir

IRanges object with 6 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         2         2
  [2]         2         3         2
  [3]         4         6         3
  [4]         6         8         3
  [5]         8        10         3
  [6]        10        12         3

reduce(ir, min.gapwidth = 0L)

IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        12         9

as.data.table(reduce(ir, min.gapwidth = 0L))

   start end width
1:     1   3     3
2:     4  12     9

A comprehensive Introduction to IRanges is available on Bioconductor.

Edit: The OP has supplied a second sample data set that includes an id column and has asked whether IRanges supports merging intervals by id.

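Adding data to IRanges seems to lead quickly into the specialized domain of genome research, which is terra incognita for me. However, I've found the following approaches using IRanges: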

Grouping with `IRanges`

library(data.table)
# 2nd sample data set provided by the OP
dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))

library(IRanges)
# set names when constructing IRanges object
ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)

lapply(split(ir, names(ir)), reduce, min.gapwidth = 0L)

$A
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        10         7

$B
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        14         5
  [2]        15        16         2

Converting this back to a data.table results in a rather clumsy piece of code:

ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)
rbindlist(lapply(split(ir, names(ir)), 
                 function(x) as.data.table(reduce(x, min.gapwidth = 0L))), 
          idcol = "id")

   id start end width
1:  A     1   3     3
2:  A     4  10     7
3:  B    10  14     5
4:  B    15  16     2

Grouping within `data.table`

If we do the grouping inside the data.table and apply reduce() to each chunk, we get the same result with less convoluted code:

dt[, as.data.table(reduce(IRanges(eventStart, eventEnd), min.gapwidth = 0L)), id]
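
If a per-row group index is wanted (as in the question's "[1 1 3 3 3 6 6 8] or similar") and IRanges is not available, a running-maximum sweep in base R is one possible alternative. This is a different technique from reduce(), shown only as a sketch: all names (`df`, `run_end`, `prev_end`, `new_grp`, `grp`, `merged`) are made up for illustration, and it assumes that periods sharing an end point count as overlapping, as in the outputs above.

```r
# Second sample data set from the question, as a plain data.frame
df <- data.frame(id = c("A","A","A","A","A","B","B","B"),
                 eventStart = c(1,2,4,6,8,10,11,15),
                 eventEnd   = c(2,3,6,8,10,12,14,16))
df <- df[order(df$id, df$eventStart, df$eventEnd), ]

# Within each id: the largest end point seen so far
run_end <- ave(df$eventEnd, df$id, FUN = cummax)
# Same value shifted down one row within each id (NA on the first row of an id)
prev_end <- ave(run_end, df$id, FUN = function(x) c(NA, head(x, -1)))

# A new group starts on the first row of an id, or when a period starts
# strictly after every earlier period of that id has ended
new_grp <- is.na(prev_end) | df$eventStart > prev_end
df$grp <- cumsum(new_grp)
print(df$grp)
#1 1 2 2 2 3 3 4

# Collapse each group to a single merged period
merged <- do.call(rbind, lapply(split(df, df$grp), function(g)
  data.frame(id = g$id[1], start = min(g$eventStart), end = max(g$eventEnd))))
print(merged)
```

The group numbers differ from the ones in the question, but the partition is the same, and `merged` has the same start and end values as the reduce() result above.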
