我想检查一个列表(或一个矢量,等价)是否包含在另一个列表中,而不是它是否是它的子集。我们假设我们有
r <- c(1,1)
s <- c(5,2)
t <- c(1,2,5)
该功能的行为如下:
is.contained(r,t)
[1] FALSE
# as (1,1) is not contained in (1,2,5) since the former
# contains two 1 whereas the latter only one.
is.contained(s,t)
[1] TRUE
运算符%in%
检查子集,因此在两种情况下都会返回TRUE
,同样all
或any
。我确信有一个单行,但我只是看不到它。
答案 0 :(得分:2)
如何使用循环。我迭代第一个向量并检查它是否存在于第二个向量中。如果它在那里我从第二个向量中删除它。这个过程还在继续。
is.contained=function(vec1,vec2){
x=vector(length = length(vec1))
for (i in 1:length(vec1)) {
x[i] = vec1[i] %in% vec2
if(length(which(vec1[i] %in% vec2)) == 0) vec2 else
vec2=vec2[-match(vec1[i], vec2)]
}
y=all(x==T)
return(y)
}
答案 1 :(得分:1)
递归方法检查每个列表的重复项长度怎么样?
fun.contains <- function(b, s){
all(s %in% b) && length(s[duplicated(s)]) <= length(b[duplicated(b)]) &&
(if(length(s[duplicated(s)])>0) fun.contains(b[duplicated(b)],s[duplicated(s)]) else 1 )
}
这个想法是列表被包含在另一个列表中,当且仅当相应的重复项列表为时,除非没有重复项(在这种情况下,递归默认为TRUE)。
答案 2 :(得分:0)
另一个自定义函数版本,检查不相等元素(length()
)的元素数(setdiff
)是否等于向量长度的差异:
# Does vector x contain vector y?
is.contained <- function(x, y) {
z <- x[x %in%setdiff(x, y)]
length(z) == length(x) - length(y)
}
r <- c(1,1)
s <- c(1,1,5)
t <- c(1,2,5)
is.contained(r, t)
#> [1] FALSE
is.contained(s, r)
#> [1] TRUE
is.contained(r, s)
#> [1] FALSE
答案 3 :(得分:0)
来自sets
的{{1}}函数(例如,交集,并集等)给出与集合论一致的结果。从技术上讲,集合没有重复元素,因此向量base R
和c(1,1,2)
在集合中被认为是相同的(参见Set (Mathematics))。这是这个问题面临的主要问题,也就是为什么这里发布的一些解决方案失败(包括我以前的尝试)。 OP问题的解决方案位于理解集和sequences之间。虽然序列允许重复,但是顺序很重要,这里我们不关心(顺序在集合中无关紧要)。
下面,我提供了一个向量交叉函数(c(1,2)
),它返回两个向量之间的所有公共元素,而不管重复的顺序或存在。还提供了一个名为VectorIntersect
的包含函数,它调用is.contained
,它将确定一个向量中的所有元素是否都存在于另一个向量中。
VectorIntersect
让我们看一个简单的例子:
VectorIntersect <- function(v,z) {
unlist(lapply(unique(v[v%in%z]), function(x) rep(x,min(sum(v==x),sum(z==x)))))
}
is.contained <- function(v,z) {length(VectorIntersect(v,z))==length(v)}
现在,让我们来看看@Gennaro聪明的递归方法,它给出了正确的结果(许多道歉和许多赞誉...在早期的测试中,我的印象是它正在检查是否包含在s中而不是反过来说):
r <- c(1, 1)
s <- c(rep(1, 5), rep("a", 5))
s
[1] "1" "1" "1" "1" "1" "a" "a" "a" "a" "a"
VectorIntersect(r, s)
[1] 1 1
is.contained(r, s) ## r is contained in s
[1] TRUE
is.contained(s, r) ## s is not contained in r
[1] FALSE
is.contained(s, s) ## s is contained in itself.. more on this later
[1] TRUE
我们现在将逐步介绍其他基于集合的算法,看看它们如何处理上面的fun.contains(s, r) ## s contains r
[1] TRUE
fun.contains(r, s) ## r does not contain s
[1] FALSE
fun.contains(s, s) ## s contains s
[1] TRUE
和r
。为清楚起见,我在每个函数中添加了print语句。首先,@ Jilber的功能。
s
这是@ Simon的:
is.containedJilber <- function(x,y){
temp <- intersect(x,y)
print(temp); print(length(x)); print(length(temp)); print(all.equal(x, temp))
out <- ifelse(length(x)==length(temp), all.equal(x, temp), FALSE)
return(out)
}
is.containedJilber(r, s) ## should return TRUE but does not
[1] "1" ## result of intersect
[1] 2 ## length of r
[1] 1 ## length of temp
## results from all.equal.. gives weird results because lengths are different
[1] "Modes: numeric, character" "Lengths: 2, 1" "target is numeric, current is character"
[1] FALSE ## results from the fact that 2 does not equal 1
is.containedJilber(s, s) ## should return TRUE but does not
[1] "1" "a" ## result of intersect
[1] 10 ## length of s
[1] 2 ## length of temp
## results from all.equal.. again, gives weird results because lengths are different
[1] "Lengths (10, 2) differ (string compare on first 2)" "1 string mismatch"
[1] FALSE ## results from the fact that 10 does not equal 2
希望这能说明在此设置中应用严格设置操作的缺陷。
让我们测试效率和平等。下面,我们构建了许多测试向量,并检查它们是否包含在向量is.containedSimon <- function(x, y) {
print(setdiff(x, y))
z <- x[x %in%setdiff(x, y)]
print(z); print(length(x)); print(length(y)); print(length(z))
length(z) == length(x) - length(y)
}
is.containedSimon(s, r) ## should return TRUE but does not
[1] "a" ## result of setdiff
[1] "a" "a" "a" "a" "a" ## the elements in s that match the result of setdiff
[1] 10 ## length of s
[1] 2 ## length of r
[1] 5 ## length of z
[1] FALSE ## result of 5 not being equal to 10 - 2
(如果它是数字向量)或testContainsNum
(如果它是字符向量)中:
testContainsAlpha
让我们仔细研究一下,确定set.seed(123)
testContainsNum <- sample(20:40, 145, replace=TRUE) ## generate large vector with random numbers
testContainsAlpha <- sample(letters, 175, replace=TRUE) ## generate large vector with random letters
tVec <- lapply(1:1000, function(x) { ## generating test data..
if (x%%2==0) {
sample(20:40, sample(50:100, 1), replace=TRUE) ## even indices will contain numbers
} else {
sample(letters, sample(50:90, 1), replace=TRUE) ## odd indices will contain characters
}
})
tContains <- lapply(1:1000, function(x) if (x%%2==0) {testContainsNum} else {testContainsAlpha})
## First check equality
tJoe <- mapply(is.contained, tVec, tContains)
tGennaro <- mapply(fun.contains, tContains, tVec)
tSimon <- mapply(is.containedSimon, tContains, tVec)
tJilber <- mapply(is.containedJilber, tVec, tContains)
all(tJoe==tGennaro) ## Give same results
[1] TRUE
## Both Jilber's and Simon's solution don't return any TRUE values
any(tJilber)
[1] FALSE
any(tSimon)
[1] FALSE
## There should be 170 TRUEs
sum(tJoe)
[1] 170
和is.contained
的行为是否正确。
fun.contains
以下是基准:
table(tVec[[3]])
a b c e f g h i j k l m n o p q r t u v w x y z
3 4 5 2 2 1 5 3 5 3 2 1 7 3 1 2 4 3 5 5 2 4 3 3
table(tContains[[3]])
a b c d e f g h i j k l m n o p q r s t u v w x y z
4 11 4 3 7 8 13 4 4 9 13 3 10 7 7 4 8 7 8 6 7 5 9 4 4 6
## Note above that tVec[[3]] has 1 more c and h than tContains[[3]],
## thus tVec[[3]] is not contained in tContains[[3]]
c(tJoe[3], tGennaro[3])
[1] FALSE FALSE ## This is correct!!!!
table(tVec[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 4 7 6 3 4 6 3 5 4 4 6 4 4 2 2 5 3 1 4
table(tContains[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 10 7 6 4 10 6 8 10 5 5 6 9 8 5 7 5 11 4 9
## Note above that every element in tVec[[14]] is present in
## tContains[[14]] and the number of occurences is less than or
## equal to the occurences in tContains[[14]]. Thus, tVec[[14]]
## is contained in tContains[[14]]
c(tJoe[14], tGennaro[14])
[1] TRUE TRUE ## This is correct!!!!
有关VectorIntersect()的附注
在花了很多时间解决这个问题后,越来越清楚的是将library(microbenchmark)
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 165.0310 172.7817 181.3722 178.7014 187.0826 230.2806 100 a
Gennaro 249.8495 265.4022 279.0866 273.5923 288.1159 336.8464 100 b
与VectorIntersect
分开是非常有价值的。我在自己的工作中多次知道,不经常删除重复的交集。通常,实施的方法很混乱,可能不可靠(在此之后很容易理解为什么!)。这就是为is.contained
添加VectorIntersect
是一个很好的效用函数的原因。
的更新强>
实际上@ Gennaro的解决方案可以通过计算is.contained
只有一次而不是3次来改进(类似于s[duplicated(s)]
和b
,我们只计算一次与{2}次)。
length(s)
更新2
如何测试真正大的载体的遏制?我提供的功能不太可能表现良好,因为通过基本上在真实集合交集上循环来构建“交集”(具有重复等)不是非常有效。修改后的@ Gennaro函数也不会很快,因为对于具有许多重复项的非常大的向量,函数调用可能会非常深入嵌套。考虑到这一点,我构建了另一个专门用于处理大型向量的包含函数。它利用向量化的基R函数,特别是音符fun.containsFAST <- function(b, s){
dupS <- s[duplicated(s)]; dupB <- b[duplicated(b)]
lenS <- length(dupS)
all(s %in% b) && lenS <= length(dupB) &&
(if(lenS>0) fun.containsFAST(dupB,dupS) else 1)
}
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
GenFAST = mapply(fun.containsFAST, tContains, tVec),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 163.3529 172.1050 182.3634 177.2324 184.9622 293.8185 100 b
GenFAST 145.3982 157.7183 169.3290 164.7898 173.4063 277.1561 100 a
Gennaro 243.2416 265.8270 281.1472 273.5323 284.8820 403.7249 100 c
,它返回多个向量的平行最小值。内部函数pmin.int
直接取自基础R中rle函数的内容(虽然针对此特定用途略有修改)。
myL
请注意,较小的exmaples is.containedBIG <- function(v, z) { ## v and z must be sorted
myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))}
sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v])))==length(v)
}
和is.contained
更快(这主要是由于重复排序所需的时间..正如您将看到的,如果数据已排序{{1} }快得多)。观察(为了彻底,我们还将展示@ Chirayu功能和测试效率的验证):
fun.containsFAST
现在,通过排序数据,结果非常惊人。 is.containedBIG
显示速度提高了3倍,而其他功能的时间几乎相同。
## we are using tVec and tContains as defined above in the original test
tChirayu <- mapply(is.containedChirayu, tVec, tContains)
tJoeBIG <- sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]])))
all(tChirayu==tJoe) ## @Chirayu's returns correct results
[1] TRUE
all(tJoeBIG==tJoe) ## specialized alogrithm returns correct results
[1] TRUE
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVec[[x]], tContains[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]]))),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContains[[x]], tVec[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVec[[x]], tContains[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.6158 165.5861 176.3019 175.4786 180.1299 313.7974 100 a
JoeBIG 269.1460 282.9347 294.1568 289.0174 299.4687 445.5222 100 b ## about 2x as slow as GenFAST
GenFAST 140.8219 150.5530 156.2019 155.8306 162.0420 178.7837 100 a
Chirayu 1213.8962 1238.5666 1305.5392 1256.7044 1294.5307 2619.5370 100 c ## about 8x as slow as GenFAST
对于非常大的向量,我们有以下内容(仅显示is.containedBIG
和## with pre-sorted data
tVecSort <- lapply(tVec, sort)
tContainsSort <- lapply(tContains, sort)
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVecSort[[x]], tContainsSort[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(tVecSort[[x]], tContainsSort[[x]])),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContainsSort[[x]], tVecSort[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVecSort[[x]], tContainsSort[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.74771 166.46926 173.45399 172.92374 177.09029 297.7758 100 c
JoeBIG 83.69259 87.35881 94.48476 92.07183 98.37235 221.6014 100 a ## now it's the fastest
GenFAST 139.19631 151.23654 159.18670 157.05911 162.85636 275.7158 100 b
Chirayu 1194.15362 1241.38823 1280.10058 1260.09439 1297.44847 1454.9805 100 d
,因为其他函数需要太长时间):
GenFAST
无论您的数据是否已排序,最后一次测试的重点是显示JoeBIG
对较大数据的速度要快得多。最后一次测试的一个有趣的事实是set.seed(97)
randS <- sample(10^9, 8.5*10^5)
testBigNum <- sample(randS, 2*10^7, replace = TRUE)
tVecBigNum <- lapply(1:20, function(x) {
sample(randS, sample(1500000:2500000, 1), replace=TRUE)
})
system.time(tJoeBigNum <- sapply(1:20, function(x) is.containedBIG(sort(tVecBigNum[[x]]), sort(testBigNum))))
user system elapsed
74.016 11.351 85.409
system.time(tGennaroBigNum <- sapply(1:20, function(x) fun.containsFAST(testBigNum, tVecBigNum[[x]])))
user system elapsed
662.875 54.238 720.433
sum(tJoeBigNum)
[1] 13
all(tJoeBigNum==tGennaroBigNum)
[1] TRUE
## pre-sorted data
testBigSort <- sort(testBigNum)
tVecBigSort <- lapply(tVecBigNum, sort)
system.time(tJoeBigSort <- sapply(1:20, function(x) is.containedBIG(tVecBigSort[[x]], testBigSort)))
user system elapsed
33.910 10.302 44.289
system.time(tGennaroBigSort <- sapply(1:20, function(x) fun.containsFAST(testBigSort, tVecBigSort[[x]])))
user system elapsed
196.546 54.923 258.079
identical(tJoeBigSort, tGennaroBigSort, tJoeBigNum)
[1] TRUE
在数据排序时显示出非常大的时间改进。我的印象是is.containedBIG
(这是fun.containsFAST
的主力),并不依赖于矢量是否被排序。早期测试证实了这种情绪(未排序的测试时间几乎与排序的测试时间相同(见上文))。需要更多的研究。