I have a list of vectors:
l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20, 20:21, 21:22,
23:24, 23:25, 26:27, 30:31, 31:32, 33:34, 35:36, 38:39, 42:43,
44:45, 46:47, 50:51, 54:55, 55:56, 57:58, 59:60, 64:65, 66:67,
68:69, 69:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91,
92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)
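If vectors overlap each other (inclusive), I need to merge them (thereby reducing the length of the list) so that each merged vector covers the widest possible range.
For example, taking the first 7 elements of l1,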
l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20,...
I would like to turn this into:
l2 <- list(2:3, 4:5, 6:7, 8:9, 15:20,...
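How can I do this efficiently?
Answer 0 (score: 4)
Here is one solution: first shift the two end values of each vector inward by 0.1, so that vectors that merely sit next to each other end up more than 1 apart, then unlist everything and split wherever consecutive values differ by more than 1: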
# create a sorted vector with adjusted end values
vec <- sort(unlist(lapply(l1, function(x) c(x[1] + 0.1,
head(tail(x, -1), -1),
tail(x, 1) - 0.1))))
# split vector if the difference between values is greater than 1
# then convert back to integer and remove the duplicates
lapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(x) unique(round(x)))
Result:
$`0`
[1] 2 3
$`1`
[1] 4 5
$`2`
[1] 6 7
$`3`
[1] 8 9
$`4`
[1] 15 16 17 18 19 20 21 22
$`5`
[1] 23 24 25
$`6`
[1] 26 27
$`7`
[1] 30 31 32
$`8`
[1] 33 34
$`9`
[1] 35 36
$`10`
[1] 38 39
$`11`
[1] 42 43
$`12`
[1] 44 45
$`13`
[1] 46 47
$`14`
[1] 50 51
$`15`
[1] 54 55 56
$`16`
[1] 57 58
$`17`
[1] 59 60
$`18`
[1] 64 65
$`19`
[1] 66 67
$`20`
[1] 68 69 70
$`21`
[1] 73 74
$`22`
[1] 77 78
$`23`
[1] 80 81
$`24`
[1] 82 83
$`25`
[1] 84 85
$`26`
[1] 88 89
$`27`
[1] 90 91
$`28`
[1] 92 93
$`29`
[1] 94 95
$`30`
[1] 96 97
$`31`
[1] 100 101
$`32`
[1] 103 104
$`33`
[1] 105 106
$`34`
[1] 107 108
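To see why the 0.1 shift works, here is a minimal walk-through on a made-up three-element list (the name `toy` and its values are only for illustration, not from the question). Neighbouring vectors such as 2:3 and 4:5 end up 1.2 apart and are split, while 4:5 and 5:7 share a value and stay in one group:

```r
# Hypothetical toy input: 2:3 and 4:5 only touch, 4:5 and 5:7 overlap
toy <- list(2:3, 4:5, 5:7)

# Shift the first value up and the last value down by 0.1
vec <- sort(unlist(lapply(toy, function(x) {
  c(x[1] + 0.1, head(tail(x, -1), -1), tail(x, 1) - 0.1)
})))
# vec is now 2.1 2.9 4.1 4.9 5.1 6.0 6.9

# A gap larger than 1 starts a new group
grp <- c(0, cumsum(diff(vec) > 1))

# Round back to whole numbers and drop duplicates within each group
merged <- lapply(split(vec, grp), function(x) unique(round(x)))
```

Note that `round()` returns doubles, so the merged vectors are numeric rather than integer; wrapping the result in `as.integer()` would restore the original type if that matters.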
Answer 1 (score: 4)
The data seem to be represented more compactly as ranges than as explicit vectors
rng = matrix(c(sapply(l1, min), sapply(l1, max)), ncol=2,
dimnames=list(NULL, c("start", "end")))
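This seems to be a better representation, even for the result, so we use it throughout, in contrast to the wording of the original question. A pure R solution for relatively dense ranges whose maximum value is not too large (e.g., in the millions) is to tabulate the occurrences of ends and starts across the entire range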
ends = tabulate(rng[,"end"])
starts = tabulate(rng[,"start"], length(ends))
find the 'coverage', i.e., the positions where the cumulative number of starts is greater than the cumulative number of ends
coverage = cumsum(starts - ends) != 0
and calculate the beginnings and ends of these covered ranges
change = diff(coverage)
beg = 1 + which(change == 1)
end = 1 + which(change == -1)
leading to
f0 = function(rng) {
ends <- tabulate(rng[, "end"])
starts <- tabulate(rng[, "start"], length(ends))
coverage <- cumsum(starts - ends)
change <- diff(c(0, coverage) != 0)
beg <- which(change == 1)
end <- which(change == -1)
matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}
and
> head(f0(rng))
start end
[1,] 2 3
[2,] 4 5
[3,] 6 7
[4,] 8 9
[5,] 15 22
[6,] 23 25
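As a rough illustration of the intermediate values, here are the same steps on a made-up three-row matrix (`rng_toy` and its numbers are assumptions for the example, not part of the original data):

```r
# Hypothetical toy ranges: 2-3, 4-6 and 5-8 (the last two overlap)
rng_toy <- matrix(c(2L, 4L, 5L,    # starts
                    3L, 6L, 8L),   # ends
                  ncol = 2, dimnames = list(NULL, c("start", "end")))

ends   <- tabulate(rng_toy[, "end"])                  # 0 0 1 0 0 1 0 1
starts <- tabulate(rng_toy[, "start"], length(ends))  # 0 1 0 1 1 0 0 0

# Starts seen so far minus ends seen so far, per position
coverage <- cumsum(starts - ends)                     # 0 1 0 1 2 1 1 0

# +1 where coverage switches on, -1 where it switches off
change <- diff(c(0, coverage) != 0)
beg <- which(change == 1)    # 2 4
end <- which(change == -1)   # 3 8
```

One caveat, as far as I can tell: a range of width one (say 5:5) adds a start and an end to the same bin, so the two cancel and the range drops out of the coverage unless another range spans it.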
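It may be that the ranges are sparse or non-integer, in which case the strategy might be to tag the ordered start and end coordinates with 1 or -1 and apply a similar coverage calculation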
f1 <- function(rng) {
o <- order(rng)
bounds <- c(rep(1, nrow(rng)), rep(-1, nrow(rng)))[o]
coverage <- cumsum(bounds)
change <- diff(c(0, coverage != 0))
orng <- rng[o]
beg <- orng[change == 1]
end <- orng[change == -1]
matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}
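A small sketch of the tagging idea on non-integer input, with the body of f1 unrolled so the intermediate vectors are visible (`rng_toy` and its values are made up for illustration):

```r
# Hypothetical toy ranges: [0.5, 2], [2, 3.5] and [7.25, 9]
rng_toy <- matrix(c(0.5, 2.0, 7.25,   # starts
                    2.0, 3.5, 9.0),   # ends
                  ncol = 2, dimnames = list(NULL, c("start", "end")))

# Order all coordinates; the matrix is treated as one vector,
# starts first, then ends
o <- order(rng_toy)

# +1 for a start, -1 for an end, in coordinate order
bounds <- c(rep(1, nrow(rng_toy)), rep(-1, nrow(rng_toy)))[o]
coverage <- cumsum(bounds)               # 1 2 1 0 1 0

# Coordinates where coverage switches on or off
change <- diff(c(0, coverage != 0))
orng <- rng_toy[o]
beg <- orng[change == 1]                 # 0.5 7.25
end <- orng[change == -1]                # 3.5 9
```

Because `order()` leaves ties in their original order and starts come before ends in the matrix, the start at 2 is counted before the end at 2, which is why the two touching ranges are merged inclusively.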
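Rather than these ad hoc solutions, the Bioconductor IRanges package provides a well-tested alternative that performs a 'reduce' on ranges (exactly the operation of interest: reducing overlapping ranges to their largest enclosing range).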
library(IRanges)
f2 <- function(rng) {
r <- reduce(IRanges(rng[,1], rng[,2]), min.gapwidth=0)
matrix(c(start(r), end(r)), ncol=2,
dimnames=list(NULL, c("start", "end")))
}
I guess none of these solutions is exactly right, since apparently the ranges 18:20, 20:21, ... are not supposed to overlap...
As a validity check, we have
> identical(f0(rng), f1(rng))
[1] TRUE
> identical(f0(rng), f2(rng))
[1] TRUE
The results of the other solutions are not exactly comparable, but their implementations are
f3 <- function(l2) {
for(i in seq_along(l2)[-length(l2)]) {
if(length(intersect(l2[[i]], l2[[i+1]])) > 0) {
l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
l2[[i]] <- as.list(NULL)
}
}
Filter(function(x) length(x) > 0, l2)
}
f4 <- function(l1) {
vec <- sort(unlist(lapply(l1, function(x) {
c(x[1] + 0.1, head(tail(x, -1), -1), tail(x, 1) - 0.1)
})))
lapply(split(vec, c(0, cumsum(diff(vec) > 1))),
function(x) unique(round(x)))
}
with timings
> library(microbenchmark)
> microbenchmark(f0(rng), f1(rng), f2(rng), f3(l1), f4(l1))
Unit: microseconds
expr min lq median uq max neval
f0(rng) 168.740 184.8365 196.598 206.9565 235.353 100
f1(rng) 478.184 518.8550 565.973 594.1910 681.029 100
f2(rng) 906.578 969.1530 1026.590 1119.5225 1201.842 100
f3(l1) 4341.560 4600.6330 4644.767 4696.1170 5225.190 100
f4(l1) 9652.549 10220.5320 10275.517 10364.2365 11439.372 100
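Solutions f0-f2 are suited to different domains; in particular, the IRanges solution is likely to be both robust and flexible (it does much more than just 'reduce'!) and to perform well on large data sets.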
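If the list-of-vectors form from the question is needed in the end, a start/end matrix can be expanded back. A minimal sketch, assuming integer bounds; `merged` here is a hand-written two-row stand-in for the output of f0, f1 or f2:

```r
# Hypothetical merged ranges in the start/end layout used above
merged <- matrix(c(2L, 15L,    # starts
                   3L, 22L),   # ends
                 ncol = 2, dimnames = list(NULL, c("start", "end")))

# Expand each row back into an explicit integer vector
as_vectors <- unname(Map(function(s, e) s:e,
                         merged[, "start"], merged[, "end"]))
```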
Answer 2 (score: 3)
A naive approach could be this:
l2 <- l1
for(i in seq_along(l2)[-length(l2)]) {
if(length(intersect(l2[[i]], l2[[i+1]])) > 0) {
l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
l2[[i]] <- as.list(NULL)
}
}
dput(Filter(function(x) length(x) > 0, l2))
list(2:3, 4:5, 6:7, 8:9, 15:22, 23:25, 26:27, 30:32, 33:34, 35:36,
38:39, 42:43, 44:45, 46:47, 50:51, 54:56, 57:58, 59:60, 64:65,
66:67, 68:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91,
92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)
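The loop only compares each element with its immediate successor, so it relies on overlapping vectors sitting next to each other in the list. If that is not guaranteed, one option is to sort by the smallest value first. A sketch under that assumption (the helper name `merge_sorted` is made up here, and it expects a non-empty list of non-empty vectors):

```r
# Hypothetical helper: order by minimum, then merge neighbours as above
merge_sorted <- function(l) {
  l <- l[order(sapply(l, min))]
  for (i in seq_along(l)[-length(l)]) {
    if (length(intersect(l[[i]], l[[i + 1]])) > 0) {
      # Carry the union forward and blank out the current slot
      l[[i + 1]] <- sort.int(unique(c(l[[i]], l[[i + 1]])))
      l[i] <- list(NULL)
    }
  }
  Filter(function(x) length(x) > 0, l)
}

# 1:2 and 2:3 overlap but are not neighbours in the input
res <- merge_sorted(list(1:2, 5:6, 2:3))
```

Without the sort, 1:2 and 2:3 in this input would be left unmerged because 5:6 sits between them.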