我正在尝试测试短数字向量是否是较长数值向量的一部分。例如,如果a = c(2, 3)
和b = c(1, 3, 2, 4, 2, 3, 1)
,那么我试图找到/想出一个能回答问题的函数:a
是b
的一部分?输出应为TRUE
。
或者,如果c = c(1, 3, 2, 4, 1, 3, 1)
,那么"的输出是a
c
的一部分?"应该是FALSE
。
match()
无法完成这项工作:
match(a, b)
返回
3 2
%in%
运营商也没有:
TRUE TRUE
我知道有字符串匹配的选项,但我不想通过转换为字符串解决此问题...
答案 0 :(得分:6)
这是我对它的抨击
valInLong <- function(val, long){
n.long <- length(long)
n.val <- length(val)
# Find where in the longer vector the first
# element of val is. This is so we can vectorize later
first <- which(long == val[1])
# If the first element is too near the end we don't care
# about it
first <- first[first <= n.long - n.val + 1]
# sequence from 0 to n.val - 1 used for grabbing subsequences
se <- seq_along(val)-1
# Look at all subsequences starting at 'first' that go
# the length of val and do an elementwise comparison.
# If any match in all positions then the subsequence is
# the sequence of interest.
any(sapply(first, function(x){all(long[x+se] == val)}))
}
long <- rpois(1000, 5)
a <- c(123421, 232, 23423) # probably not in long
valInLong(a, long)
a <- long[34:100]
valInLong(a, long)
答案 1 :(得分:5)
这是一次尝试。我不认为它超级快,但它也不是超慢:
a = c(2,3)
b1 = c(1, 3, 2, 4, 2, 3, 1)
b2 = c(1, 3, 2, 4, 1, 3, 1)
ainb <- function(a,b) {
any(apply( embed(b,length(a)), 1, function(x) all(rev(a)==x) ))
}
ainb(a,b1)
#[1] TRUE
ainb(a,b2)
#[1] FALSE
答案 2 :(得分:3)
如果你的载体不会太长,你总是可以强制它:
f <- function(a, b) {
if(length(a)==0) return(TRUE)
ix <- seq_along(b)
for(i in seq_along(a)) {
ix <- ix[which(a[i] == b[ix + i - 1])]
}
length(ix) > 0
}
f(a, b)
# [1] TRUE
f(a, c)
# [1] FALSE
答案 3 :(得分:3)
鉴于OP写道“我不想通过转换为字符串来解决这个问题...”,@ thelatemail的评论(“转换为字符串有时可能会比其他解决方案慢很多。但是我绝对会根据人们提出的解决方案来保留我的判断。“)我有点好奇,看看基于字符串的解决方案是如何执行的。看起来并不太糟糕。
我在这里使用base
grepl
和stringi
等效stri_detect_fixed
。它们对于原始(短)向量来说是最快的。 @ Dason的解决方案对于中等大小的向量来说是最快的,for
- 循环对于'长'向量来说是最快的。
h1 <- function(val, long){
grepl(pattern = paste0(val, collapse = ","), x = paste0(long, collapse = ","))
}
library(stringi)
h2 <- function(val, long){
stri_detect_fixed(str = paste0(long, collapse = ","), pattern = paste0(val, collapse = ","))
}
a <- c(2, 3)
b <- c(1, 3, 2, 4, 2, 3, 1)
c <- c(1, 3, 2, 4, 1, 3, 1)
ainb(a, b) # thelatemail
valInLong(a, b) # dason
f(a, b) # pete
h1(a, b)
h2(a, b)
ainb(a, c)
valInLong(a, c)
f(a, c)
h1(a, c)
h2(a, c)
library(microbenchmark)
microbenchmark(ainb(a, b),
valInLong(a, b),
f(a, b),
h1(a, b),
h2(a, b),
times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# ainb(a, b) 201.471 202.611 223.5567 211.7350 223.139 318.932 10 c
# valInLong(a, b) 67.664 76.407 90.2437 89.5215 99.215 129.245 10 b
# f(a, b) 36.873 42.195 54.2833 44.2860 55.879 129.246 10 a
# h1(a, b) 22.809 25.470 32.1595 27.1795 28.510 74.887 10 a
# h2(a, b) 20.147 22.048 31.7794 24.5190 26.609 96.174 10 a
# vectors from @Dason's answer
val <- c(123421, 232, 23423)
long <- rpois(1000, 5)
microbenchmark(ainb(val, long),
valInLong(val, long),
f(val, long),
h1(val, long),
h2(val, long),
times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# ainb(val, long) 24673.332 24872.522 27732.2673 25685.4380 26962.877 45808.000 10 b
# valInLong(val, long) 50.558 55.880 68.5763 66.7135 81.349 91.233 10 a
# f(val, long) 69.945 80.588 89.1036 88.9515 99.215 115.561 10 a
# h1(val, long) 387.737 391.158 432.3644 421.5685 458.062 524.585 10 a
# h2(val, long) 337.559 342.120 378.1190 378.0425 382.035 458.442 10 a
# longer 'val' and 'long' vectors
val <- rpois(100, 5)
long <- rpois(10000, 5)
microbenchmark(ainb(val, long),
valInLong(val, long),
f(val, long),
h1(val, long),
h2(val, long),
times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# ainb(val, long) 298.967481 312.962860 322.350298 322.219875 329.194565 350.080246 10 b
# valInLong(val, long) 5.065280 5.237861 5.533719 5.532845 5.843414 5.921341 10 a
# f(val, long) 1.679050 1.717064 1.763288 1.747284 1.779786 1.907891 10 a
# h1(val, long) 3.648523 3.664869 3.751121 3.707634 3.753820 4.153720 10 a
# h2(val, long) 3.366463 3.444010 3.616591 3.478413 3.758761 4.309955 10 a
答案 4 :(得分:2)
这是@thelatemail作为中缀运营商的聪明答案的变体:
`%w/in%` <- function(a, b)
{
i <- length(a)
x <- 1:(length(b)-(i-1))
y <- x + (i-1)
any(apply(cbind(x, y), 1, function(r) all(a == b[r[1]:r[2]])))
}
它设置一组索引来迭代b
,然后传递这些索引以查看所选子集是否全部相等。因为它在迭代之前创建这些索引,所以在大向量中它可能是低效的。这是在行动。
> a <- c(2, 3)
> b <- c(1, 3, 2, 4, 2, 3, 1)
> c <- c(1, 3, 2, 4, 1, 3, 1)
>
> a %w/in% b
[1] TRUE
> a %w/in% c
[1] FALSE
对于它的价值,这个版本似乎要快得多(经过非常简短的测试):
> a <- c(2, 3, 1)
> b <- sample(1:4, 1000, replace=TRUE)
> a %w/in% b
[1] TRUE
> ainb(a, b)
[1] TRUE
> system.time(replicate(1000, a %w/in% b))
user system elapsed
11.175 0.000 11.187
> system.time(replicate(1000, ainb(a, b)))
user system elapsed
19.930 0.000 19.949
答案 5 :(得分:1)
一种方法是在所有可能的指数上穷举搜索较长的向量,以获得与较短向量长度相等的一系列匹配。我怀疑这种方式对于非常大的问题是有效的,并怀疑字符串转换 - 并且还试图简化我自己的答案! - 值得调查,但是......
compareTuple <- function(v.lng, v.shrt, idx)
{
#idx is starting index of v.lng to begin comparison
len = length(v.shrt)
prod(v.lng[idx:(idx+len-1)] == v.shrt)
}
containsTuple <- function(v.lng, v.shrt)
{
as.logical(sum(sapply(
FUN = function(x){prod(compareTuple(v.lng, v.shrt, x))},
X = 1:(length(v.lng)-length(v.shrt)+1)
)))
}
应该做的伎俩。结果如下:
a = c(2, 3); b = c(1, 3, 2, 4, 2, 3, 1); c = c(1, 3, 2, 4, 1, 3, 1)
> containsTuple(c,a)
[1] FALSE
> containsTuple(b,a)
[1] TRUE