检查列表是否包含R中的另一个列表

时间:2016-09-05 22:32:18

标签: r list subset

我想检查一个列表(或一个矢量,等价)是否包含在另一个列表中,而不是它是否是它的子集。我们假设我们有

r <- c(1,1)
s <- c(5,2)
t <- c(1,2,5)

该功能的行为如下:

is.contained(r,t) 
[1] FALSE
# as (1,1) is not contained in (1,2,5) since the former 
# contains two 1 whereas the latter only one.

is.contained(s,t)
[1] TRUE

运算符%in%检查子集,因此在两种情况下都会返回TRUE,同样allany。我确信有一个单行,但我只是看不到它。

4 个答案:

答案 0 :(得分:2)

如何使用循环。我迭代第一个向量并检查它是否存在于第二个向量中。如果它在那里我从第二个向量中删除它。这个过程还在继续。

is.contained=function(vec1,vec2){
    x=vector(length = length(vec1))
    for (i in 1:length(vec1)) {
        x[i] = vec1[i] %in% vec2
        if(length(which(vec1[i] %in% vec2)) == 0) vec2 else 
            vec2=vec2[-match(vec1[i], vec2)]
    }
    y=all(x==T)
    return(y)
}

答案 1 :(得分:1)

递归方法检查每个列表的重复项长度怎么样?

fun.contains <- function(b, s){
  all(s %in% b) && length(s[duplicated(s)]) <= length(b[duplicated(b)]) &&
    (if(length(s[duplicated(s)])>0) fun.contains(b[duplicated(b)],s[duplicated(s)]) else 1 )
}

这个想法是列表被包含在另一个列表中,当且仅当相应的重复项列表为时,除非没有重复项(在这种情况下,递归默认为TRUE)。

答案 2 :(得分:0)

另一个自定义函数版本,检查不相等元素(length())的元素数(setdiff)是否等于向量长度的差异:

# Does vector x contain vector y?
is.contained <- function(x, y) {
  z <- x[x %in%setdiff(x, y)]
  length(z) == length(x) - length(y)
}

r <- c(1,1)
s <- c(1,1,5)
t <- c(1,2,5)

is.contained(r, t)
#> [1] FALSE
is.contained(s, r)
#> [1] TRUE
is.contained(r, s)
#> [1] FALSE

答案 3 :(得分:0)

来自sets的{​​{1}}函数(例如,交集,并集等)给出与集合论一致的结果。从技术上讲,集合没有重复元素,因此向量base Rc(1,1,2)在集合中被认为是相同的(参见Set (Mathematics))。这是这个问题面临的主要问题,也就是为什么这里发布的一些解决方案失败(包括我以前的尝试)。 OP问题的解决方案位于理解集和sequences之间。虽然序列允许重复,但是顺序很重要,这里我们不关心(顺序在集合中无关紧要)。

下面,我提供了一个向量交叉函数(c(1,2)),它返回两个向量之间的所有公共元素,而不管重复的顺序或存在。还提供了一个名为VectorIntersect的包含函数,它调用is.contained,它将确定一个向量中的所有元素是否都存在于另一个向量中。

VectorIntersect

让我们看一个简单的例子:

VectorIntersect <- function(v,z) {
    unlist(lapply(unique(v[v%in%z]), function(x) rep(x,min(sum(v==x),sum(z==x)))))
}
is.contained <- function(v,z) {length(VectorIntersect(v,z))==length(v)}

现在,让我们来看看@Gennaro聪明的递归方法,它给出了正确的结果(许多道歉和许多赞誉...在早期的测试中,我的印象是它正在检查是否包含在s中而不是反过来说):

r <- c(1, 1)
s <- c(rep(1, 5), rep("a", 5))
s
[1] "1" "1" "1" "1" "1" "a" "a" "a" "a" "a"

VectorIntersect(r, s)
[1] 1 1
is.contained(r, s)  ## r is contained in s
[1] TRUE
is.contained(s, r)  ## s is not contained in r
[1] FALSE
is.contained(s, s)  ## s is contained in itself.. more on this later
[1] TRUE

我们现在将逐步介绍其他基于集合的算法,看看它们如何处理上面的fun.contains(s, r) ## s contains r [1] TRUE fun.contains(r, s) ## r does not contain s [1] FALSE fun.contains(s, s) ## s contains s [1] TRUE r。为清楚起见,我在每个函数中添加了print语句。首先,@ Jilber的功能。

s

这是@ Simon的:

is.containedJilber <- function(x,y){
    temp <- intersect(x,y)
    print(temp); print(length(x)); print(length(temp)); print(all.equal(x, temp))
    out <- ifelse(length(x)==length(temp), all.equal(x, temp), FALSE)
    return(out)  
}

is.containedJilber(r, s)   ## should return TRUE but does not
[1] "1"                    ## result of intersect
[1] 2                      ## length of r
[1] 1                      ## length of temp

## results from all.equal.. gives weird results because lengths are different
[1] "Modes: numeric, character"    "Lengths: 2, 1"     "target is numeric, current is character"

[1] FALSE                  ## results from the fact that 2 does not equal 1

is.containedJilber(s, s)    ## should return TRUE but does not
[1] "1" "a"                 ## result of intersect
[1] 10                      ## length of s
[1] 2                       ## length of temp

## results from all.equal.. again, gives weird results because lengths are different
[1] "Lengths (10, 2) differ (string compare on first 2)" "1 string mismatch" 

[1] FALSE                  ## results from the fact that 10 does not equal 2

希望这能说明在此设置中应用严格设置操作的缺陷。

让我们测试效率和平等。下面,我们构建了许多测试向量,并检查它们是否包含在向量is.containedSimon <- function(x, y) { print(setdiff(x, y)) z <- x[x %in%setdiff(x, y)] print(z); print(length(x)); print(length(y)); print(length(z)) length(z) == length(x) - length(y) } is.containedSimon(s, r) ## should return TRUE but does not [1] "a" ## result of setdiff [1] "a" "a" "a" "a" "a" ## the elements in s that match the result of setdiff [1] 10 ## length of s [1] 2 ## length of r [1] 5 ## length of z [1] FALSE ## result of 5 not being equal to 10 - 2 (如果它是数字向量)或testContainsNum(如果它是字符向量)中:

testContainsAlpha

让我们仔细研究一下,确定set.seed(123) testContainsNum <- sample(20:40, 145, replace=TRUE) ## generate large vector with random numbers testContainsAlpha <- sample(letters, 175, replace=TRUE) ## generate large vector with random letters tVec <- lapply(1:1000, function(x) { ## generating test data.. if (x%%2==0) { sample(20:40, sample(50:100, 1), replace=TRUE) ## even indices will contain numbers } else { sample(letters, sample(50:90, 1), replace=TRUE) ## odd indices will contain characters } }) tContains <- lapply(1:1000, function(x) if (x%%2==0) {testContainsNum} else {testContainsAlpha}) ## First check equality tJoe <- mapply(is.contained, tVec, tContains) tGennaro <- mapply(fun.contains, tContains, tVec) tSimon <- mapply(is.containedSimon, tContains, tVec) tJilber <- mapply(is.containedJilber, tVec, tContains) all(tJoe==tGennaro) ## Give same results [1] TRUE ## Both Jilber's and Simon's solution don't return any TRUE values any(tJilber) [1] FALSE any(tSimon) [1] FALSE ## There should be 170 TRUEs sum(tJoe) [1] 170 is.contained的行为是否正确。

fun.contains

以下是基准:

table(tVec[[3]])
a  b  c     e  f  g  h  i  j  k  l  m  n  o  p  q  r     t  u  v  w  x  y  z 
3  4  5     2  2  1  5  3  5  3  2  1  7  3  1  2  4     3  5  5  2  4  3  3

table(tContains[[3]])
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
4 11  4  3  7  8 13  4  4  9 13  3 10  7  7  4  8  7  8  6  7  5  9  4  4  6

## Note above that tVec[[3]] has 1 more c and h than tContains[[3]],
## thus tVec[[3]] is not contained in tContains[[3]]
c(tJoe[3], tGennaro[3])  
[1] FALSE FALSE           ## This is correct!!!!


table(tVec[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
 6  4  4  7  6  3  4  6  3  5  4  4  6  4  4  2  2  5  3  1  4 

table(tContains[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
 6  4 10  7  6  4 10  6  8 10  5  5  6  9  8  5  7  5 11  4  9 

## Note above that every element in tVec[[14]] is present in 
## tContains[[14]] and the number of occurences is less than or
## equal to the occurences in tContains[[14]]. Thus, tVec[[14]]
## is contained in tContains[[14]]
c(tJoe[14], tGennaro[14])
[1] TRUE TRUE            ## This is correct!!!!

有关VectorIntersect()的附注
在花了很多时间解决这个问题后,越来越清楚的是将library(microbenchmark) microbenchmark(Joe = mapply(is.contained, tVec, tContains), Gennaro = mapply(fun.contains, tContains, tVec)) Unit: milliseconds expr min lq mean median uq max neval cld Joe 165.0310 172.7817 181.3722 178.7014 187.0826 230.2806 100 a Gennaro 249.8495 265.4022 279.0866 273.5923 288.1159 336.8464 100 b VectorIntersect分开是非常有价值的。我在自己的工作中多次知道,不经常删除重复的交集。通常,实施的方法很混乱,可能不可靠(在此之后很容易理解为什么!)。这就是为is.contained添加VectorIntersect是一个很好的效用函数的原因。


更新

实际上@ Gennaro的解决方案可以通过计算is.contained只有一次而不是3次来改进(类似于s[duplicated(s)]b,我们只计算一次与{2}次)。

length(s)


更新2

如何测试真正大的载体的遏制?我提供的功能不太可能表现良好,因为通过基本上在真实集合交集上循环来构建“交集”(具有重复等)不是非常有效。修改后的@ Gennaro函数也不会很快,因为对于具有许多重复项的非常大的向量,函数调用可能会非常深入嵌套。考虑到这一点,我构建了另一个专门用于处理大型向量的包含函数。它利用向量化的基R函数,特别是音符fun.containsFAST <- function(b, s){ dupS <- s[duplicated(s)]; dupB <- b[duplicated(b)] lenS <- length(dupS) all(s %in% b) && lenS <= length(dupB) && (if(lenS>0) fun.containsFAST(dupB,dupS) else 1) } microbenchmark(Joe = mapply(is.contained, tVec, tContains), GenFAST = mapply(fun.containsFAST, tContains, tVec), Gennaro = mapply(fun.contains, tContains, tVec)) Unit: milliseconds expr min lq mean median uq max neval cld Joe 163.3529 172.1050 182.3634 177.2324 184.9622 293.8185 100 b GenFAST 145.3982 157.7183 169.3290 164.7898 173.4063 277.1561 100 a Gennaro 243.2416 265.8270 281.1472 273.5323 284.8820 403.7249 100 c ,它返回多个向量的平行最小值。内部函数pmin.int直接取自基础R中rle函数的内容(虽然针对此特定用途略有修改)。

myL

请注意,较小的exmaples is.containedBIG <- function(v, z) { ## v and z must be sorted myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))} sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v])))==length(v) } is.contained更快(这主要是由于重复排序所需的时间..正如您将看到的,如果数据已排序{{1} }快得多)。观察(为了彻底,我们还将展示@ Chirayu功能和测试效率的验证):

fun.containsFAST

现在,通过排序数据,结果非常惊人。 is.containedBIG显示速度提高了3倍,而其他功能的时间几乎相同。

## we are using tVec and tContains as defined above in the original test
tChirayu <- mapply(is.containedChirayu, tVec, tContains)
tJoeBIG <- sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]])))
all(tChirayu==tJoe)  ## @Chirayu's returns correct results
[1] TRUE
all(tJoeBIG==tJoe)   ## specialized alogrithm returns correct results
[1] TRUE

microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVec[[x]], tContains[[x]])),
               JoeBIG=sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]]))),
               GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContains[[x]], tVec[[x]])),
               Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVec[[x]], tContains[[x]])))
Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
    Joe  154.6158  165.5861  176.3019  175.4786  180.1299  313.7974   100 a  
 JoeBIG  269.1460  282.9347  294.1568  289.0174  299.4687  445.5222   100  b   ## about 2x as slow as GenFAST
GenFAST  140.8219  150.5530  156.2019  155.8306  162.0420  178.7837   100 a  
Chirayu 1213.8962 1238.5666 1305.5392 1256.7044 1294.5307 2619.5370   100   c  ## about 8x as slow as GenFAST

对于非常大的向量,我们有以下内容(仅显示is.containedBIG## with pre-sorted data tVecSort <- lapply(tVec, sort) tContainsSort <- lapply(tContains, sort) microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVecSort[[x]], tContainsSort[[x]])), JoeBIG=sapply(1:1000, function(x) is.containedBIG(tVecSort[[x]], tContainsSort[[x]])), GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContainsSort[[x]], tVecSort[[x]])), Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVecSort[[x]], tContainsSort[[x]]))) Unit: milliseconds expr min lq mean median uq max neval cld Joe 154.74771 166.46926 173.45399 172.92374 177.09029 297.7758 100 c JoeBIG 83.69259 87.35881 94.48476 92.07183 98.37235 221.6014 100 a ## now it's the fastest GenFAST 139.19631 151.23654 159.18670 157.05911 162.85636 275.7158 100 b Chirayu 1194.15362 1241.38823 1280.10058 1260.09439 1297.44847 1454.9805 100 d ,因为其他函数需要太长时间):

GenFAST

无论您的数据是否已排序,最后一次测试的重点是显示JoeBIG对较大数据的速度要快得多。最后一次测试的一个有趣的事实是set.seed(97) randS <- sample(10^9, 8.5*10^5) testBigNum <- sample(randS, 2*10^7, replace = TRUE) tVecBigNum <- lapply(1:20, function(x) { sample(randS, sample(1500000:2500000, 1), replace=TRUE) }) system.time(tJoeBigNum <- sapply(1:20, function(x) is.containedBIG(sort(tVecBigNum[[x]]), sort(testBigNum)))) user system elapsed 74.016 11.351 85.409 system.time(tGennaroBigNum <- sapply(1:20, function(x) fun.containsFAST(testBigNum, tVecBigNum[[x]]))) user system elapsed 662.875 54.238 720.433 sum(tJoeBigNum) [1] 13 all(tJoeBigNum==tGennaroBigNum) [1] TRUE ## pre-sorted data testBigSort <- sort(testBigNum) tVecBigSort <- lapply(tVecBigNum, sort) system.time(tJoeBigSort <- sapply(1:20, function(x) is.containedBIG(tVecBigSort[[x]], testBigSort))) user system elapsed 33.910 10.302 44.289 system.time(tGennaroBigSort <- sapply(1:20, function(x) fun.containsFAST(testBigSort, tVecBigSort[[x]]))) user system elapsed 196.546 54.923 258.079 identical(tJoeBigSort, tGennaroBigSort, tJoeBigNum) [1] TRUE 在数据排序时显示出非常大的时间改进。我的印象是is.containedBIG(这是fun.containsFAST的主力),并不依赖于矢量是否被排序。早期测试证实了这种情绪(未排序的测试时间几乎与排序的测试时间相同(见上文))。需要更多的研究。