How to merge overlapping integer vector elements of a list in R

Time: 2014-02-21 20:10:42

Tags: r

I have a list of vectors:

l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20, 20:21, 21:22, 
        23:24, 23:25, 26:27, 30:31, 31:32, 33:34, 35:36, 38:39, 42:43, 
        44:45, 46:47, 50:51, 54:55, 55:56, 57:58, 59:60, 64:65, 66:67, 
        68:69, 69:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91, 
        92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)

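If the vectors overlap with each other (inclusive), I need to merge them (thereby reducing the length of the list) so that each merged vector covers the widest possible range.

For example, taking the first 7 elements of l1,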

l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20,...

I would like to turn this into:

l2 <- list(2:3, 4:5, 6:7, 8:9, 15:20,...

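How can I do this efficiently?

3 answers:

Answer 0: (score: 4)

Here is a solution: first adjust the end values of each vector so that separate vectors sit slightly further apart, then unlist everything and split wherever neighbouring entries differ by more than 1: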

# create a sorted vector with adjusted end values
vec <- sort(unlist(lapply(l1, function(x) c(x[1] + 0.1,
                                            head(tail(x, -1), -1),
                                            tail(x, 1) - 0.1))))

# split vector if the difference between values is greater than 1
# then convert back to integer and remove the duplicates
lapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(x) unique(round(x)))

Result:

$`0`
[1] 2 3

$`1`
[1] 4 5

$`2`
[1] 6 7

$`3`
[1] 8 9

$`4`
[1] 15 16 17 18 19 20 21 22

$`5`
[1] 23 24 25

$`6`
[1] 26 27

$`7`
[1] 30 31 32

$`8`
[1] 33 34

$`9`
[1] 35 36

$`10`
[1] 38 39

$`11`
[1] 42 43

$`12`
[1] 44 45

$`13`
[1] 46 47

$`14`
[1] 50 51

$`15`
[1] 54 55 56

$`16`
[1] 57 58

$`17`
[1] 59 60

$`18`
[1] 64 65

$`19`
[1] 66 67

$`20`
[1] 68 69 70

$`21`
[1] 73 74

$`22`
[1] 77 78

$`23`
[1] 80 81

$`24`
[1] 82 83

$`25`
[1] 84 85

$`26`
[1] 88 89

$`27`
[1] 90 91

$`28`
[1] 92 93

$`29`
[1] 94 95

$`30`
[1] 96 97

$`31`
[1] 100 101

$`32`
[1] 103 104

$`33`
[1] 105 106

$`34`
[1] 107 108
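To see why the 0.1 shift works, here is a minimal trace on the first seven elements of l1. The names `l_small`, `groups` and `merged` are only illustrative and do not appear in the answer. After the shift, the gap between two vectors that merely abut (such as 2:3 and 4:5) becomes 1.2, while values inside overlapping vectors stay at most 1 apart, so `diff(vec) > 1` marks exactly the places to split.

```r
# First seven elements of l1 from the question
l_small <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20)

# Shift each vector's first value up and its last value down by 0.1
vec <- sort(unlist(lapply(l_small, function(x) {
    c(x[1] + 0.1, head(tail(x, -1), -1), tail(x, 1) - 0.1)
})))

# Start a new group wherever neighbouring values are more than 1 apart,
# e.g. between 2.9 and 4.1
groups <- c(0, cumsum(diff(vec) > 1))

# Round back to whole numbers and drop duplicates within each group
merged <- unname(lapply(split(vec, groups), function(x) unique(round(x))))
merged
```

Note that `round()` returns doubles, so the merged vectors are numeric rather than integer; wrapping the result in `as.integer()` would restore the original type if that matters.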

Answer 1: (score: 4)

The data seem to be represented more compactly as ranges than as explicit vectors

rng = matrix(c(sapply(l1, min), sapply(l1, max)), ncol=2,
             dimnames=list(NULL, c("start", "end")))

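This seems like a better representation, even for the result, so we use it throughout, in contrast to the wording of the original question. A pure R solution for relatively dense ranges whose maximum value is not too large (say, in the millions) is to tabulate the occurrences of ends and starts across the whole range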

ends = tabulate(rng[,"end"])
starts = tabulate(rng[,"start"], length(ends))

Find the 'coverage', where the cumulative number of starts is greater than the cumulative number of ends

coverage = cumsum(starts - ends) != 0

and calculate the beginning and end of these ranges

change = diff(coverage)
beg = 1 + which(change == 1)
end = 1 + which(change == -1)

leading to

f0 = function(rng) {
    ends <- tabulate(rng[, "end"])
    starts <- tabulate(rng[, "start"], length(ends))
    coverage <- cumsum(starts - ends)
    change <- diff(c(0, coverage) != 0)
    beg <- which(change == 1)
    end <- which(change == -1)
    matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}

> head(f0(rng))
     start end
[1,]     2   3
[2,]     4   5
[3,]     6   7
[4,]     8   9
[5,]    15  22
[6,]    23  25
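As a small trace of the tabulate-and-cumsum idea, here is a sketch on three made-up ranges. The matrix `rng_small` and the variable names `open` and `as_vectors` are assumptions for illustration, not part of the answer. The last step also shows one way to expand merged ranges back into explicit integer vectors, which is the form the question asked for.

```r
# Three small ranges: 2:4 and 4:6 share the value 4, 8:9 stands alone
rng_small <- matrix(c(2L, 4L, 8L, 4L, 6L, 9L), ncol = 2,
                    dimnames = list(NULL, c("start", "end")))

# Count how many ranges end and start at each position 1..9
ends   <- tabulate(rng_small[, "end"])
starts <- tabulate(rng_small[, "start"], length(ends))

# Number of ranges that have started but not yet ended at each position
open <- cumsum(starts - ends)

# A merged range begins where coverage switches on and ends where it switches off
change <- diff(c(0, open) != 0)
beg <- which(change == 1)
end <- which(change == -1)

# Expand each merged range back into an explicit integer vector
as_vectors <- Map(seq, beg, end)
as_vectors
```

Here `open` stays positive at position 4, because one range starts exactly where another ends, and that is why 2:4 and 4:6 merge while 8:9 stays separate.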

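It may be that the ranges are sparse or non-integer, in which case the strategy might be to tag the ordered start and end coordinates with 1 or -1, and to use a similar calculation of coverage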

f1 <- function(rng) {
    o <- order(rng)
    bounds <- c(rep(1, nrow(rng)), rep(-1, nrow(rng)))[o]
    coverage <- cumsum(bounds)
    change <- diff(c(0, coverage != 0))
    orng <- rng[o]
    beg <- orng[change == 1]
    end <- orng[change == -1]
    matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}    
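The following sketch walks through the same steps as f1 on a few made-up non-integer ranges (`rng_real` is a hypothetical input). It relies on two details: `order()` treats the matrix as a single vector with all starts listed before all ends, and ties keep their original order, so a start at 2.25 is counted before an end at 2.25 and the two touching ranges merge.

```r
# Hypothetical ranges: [0.5, 2.25] and [2.25, 3] touch, [10, 12.5] is isolated
rng_real <- matrix(c(0.5, 2.25, 10, 2.25, 3, 12.5), ncol = 2,
                   dimnames = list(NULL, c("start", "end")))

# Order all six coordinates; the first nrow entries are starts, the rest ends
o <- order(rng_real)

# +1 for every start, -1 for every end, arranged in coordinate order
bounds <- c(rep(1, nrow(rng_real)), rep(-1, nrow(rng_real)))[o]

# Running count of open ranges along the sorted coordinates
coverage <- cumsum(bounds)

# Coverage switching on marks a merged start, switching off a merged end
change <- diff(c(0, coverage != 0))
orng <- rng_real[o]
beg <- orng[change == 1]
end <- orng[change == -1]
cbind(start = beg, end = end)
```

Nothing here depends on the maximum coordinate, which is why this variant suits sparse or real-valued data where `tabulate()` does not apply.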

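Rather than these ad hoc solutions, the Bioconductor IRanges package offers a tested alternative that performs 'reduce' on ranges (exactly the operation of interest: reducing overlapping ranges to their maximal enclosing ranges).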

library(IRanges)
f2 <- function(rng) {
    r <- reduce(IRanges(rng[,1], rng[,2]), min.gapwidth=0)
    matrix(c(start(r), end(r)), ncol=2,
           dimnames=list(NULL, c("start", "end")))
}

I guess none of these solutions is exactly right, since apparently the ranges 18:20, 20:21, ... are not supposed to overlap...

As a validity check, we have

> identical(f0(rng), f1(rng))
[1] TRUE
> identical(f0(rng), f2(rng))
[1] TRUE

The results of the other solutions are not exactly comparable, but the implementations are

f3 <- function(l2) {
    for(i in seq_along(l2)[-length(l2)]) {
        if(length(intersect(l2[[i]], l2[[i+1]])) > 0) { 
            l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
            l2[[i]] <- as.list(NULL)
        }   
    }
    Filter(function(x) length(x) > 0, l2)
}

f4 <- function(l1) {
    vec <- sort(unlist(lapply(l1, function(x) {
        c(x[1] + 0.1, head(tail(x, -1), -1), tail(x, 1) - 0.1)
    })))
    lapply(split(vec, c(0, cumsum(diff(vec) > 1))),
           function(x) unique(round(x)))
}

with timings

> library(microbenchmark)
> microbenchmark(f0(rng), f1(rng), f2(rng), f3(l1), f4(l1))
Unit: microseconds
    expr      min         lq    median         uq       max neval
 f0(rng)  168.740   184.8365   196.598   206.9565   235.353   100
 f1(rng)  478.184   518.8550   565.973   594.1910   681.029   100
 f2(rng)  906.578   969.1530  1026.590  1119.5225  1201.842   100
  f3(l1) 4341.560  4600.6330  4644.767  4696.1170  5225.190   100
  f4(l1) 9652.549 10220.5320 10275.517 10364.2365 11439.372   100

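Solutions f0-f2 suit different domains. The IRanges solution in particular is likely to be both robust and flexible (it offers more than just 'reduce'!) and to perform well on large data sets.

Answer 2: (score: 3)

A naive approach might be this: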

l2 <- l1
for(i in seq_along(l2)[-length(l2)]) {
   if(length(intersect(l2[[i]], l2[[i+1]])) > 0) { 
      l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
      l2[[i]] <- as.list(NULL)
   }   
}
dput(Filter(function(x) length(x) > 0, l2))
list(2:3, 4:5, 6:7, 8:9, 15:22, 23:25, 26:27, 30:32, 33:34, 35:36, 
    38:39, 42:43, 44:45, 46:47, 50:51, 54:56, 57:58, 59:60, 64:65, 
    66:67, 68:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91, 
    92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)
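The loop above only compares each element with its immediate successor, so it relies on the list already being ordered by starting value, as l1 nearly is. For input in arbitrary order, a minimal sketch is to sort by each vector's smallest value first. The function name `merge_adjacent` and the test input are hypothetical, and the sketch assumes a non-empty list of non-empty contiguous integer vectors.

```r
# Hypothetical helper: the naive loop, preceded by a sort on each vector's minimum
merge_adjacent <- function(l) {
    l <- l[order(sapply(l, min))]
    for (i in seq_along(l)[-length(l)]) {
        if (length(intersect(l[[i]], l[[i + 1]])) > 0) {
            # Carry the union forward and blank out the current element
            l[[i + 1]] <- sort.int(unique(c(l[[i]], l[[i + 1]])))
            l[[i]] <- list()
        }
    }
    # Drop the blanked-out elements
    Filter(function(x) length(x) > 0, l)
}

# Unordered input: 5:7 and 6:9 overlap, 1:2 is separate
res <- merge_adjacent(list(5:7, 1:2, 6:9))
res
```

Because the union is carried forward into the next element, a long vector that swallows several later ones is still handled after sorting.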