用于测试短数字向量是否是R中长数字向量的一部分的函数

时间:2015-02-03 00:29:38

标签: r

我正在尝试测试短数字向量是否是较长数值向量的一部分。例如,如果a = c(2, 3)b = c(1, 3, 2, 4, 2, 3, 1),那么我试图找到/想出一个能回答问题的函数:ab的一部分?输出应为TRUE

或者,如果c = c(1, 3, 2, 4, 1, 3, 1),那么"的输出是a c的一部分?"应该是FALSE

match()无法完成这项工作:

match(a, b)

返回

3  2

%in%运营商也没有:

TRUE  TRUE

我知道有字符串匹配的选项,但我不想通过转换为字符串解决此问题...

6 个答案:

答案 0 :(得分:6)

这是我对它的抨击

valInLong <- function(val, long){
  n.long <- length(long)
  n.val <- length(val)
  # Find where in the longer vector the first
  # element of val is.  This is so we can vectorize later
  first <- which(long == val[1])
  # If the first element is too near the end we don't care
  # about it
  first <- first[first <= n.long - n.val + 1]
  # sequence from 0 to n.val - 1 used for grabbing subsequences
  se <- seq_along(val)-1
  # Look at all subsequences starting at 'first' that go
  # the length of val and do an elementwise comparison.
  # If any match in all positions then the subsequence is
  # the sequence of interest.
  any(sapply(first, function(x){all(long[x+se] == val)}))
}


long <- rpois(1000, 5)
a <- c(123421, 232, 23423) # probably not in long

valInLong(a, long)
a <- long[34:100]
valInLong(a, long)

答案 1 :(得分:5)

这是一次尝试。我不认为它超级快,但它也不是超慢:

a  = c(2,3)
b1 = c(1, 3, 2, 4, 2, 3, 1)
b2 = c(1, 3, 2, 4, 1, 3, 1)

ainb <- function(a,b) {
  any(apply( embed(b,length(a)), 1, function(x) all(rev(a)==x) ))
}
ainb(a,b1)
#[1] TRUE
ainb(a,b2)
#[1] FALSE

答案 2 :(得分:3)

如果你的载体不会太长,你总是可以强制它:

f <- function(a, b) {

    if(length(a)==0) return(TRUE)

    ix <- seq_along(b)

    for(i in seq_along(a)) {

        ix <- ix[which(a[i] == b[ix + i - 1])]
    }

    length(ix) > 0
}

f(a, b)
# [1] TRUE
f(a, c)
# [1] FALSE

答案 3 :(得分:3)

鉴于OP写道“我不想通过转换为字符串来解决这个问题...”,@ thelatemail的评论(“转换为字符串有时可能会比其他解决方案慢很多。但是我绝对会根据人们提出的解决方案来保留我的判断。“)我有点好奇,看看基于字符串的解决方案是如何执行的。看起来并不太糟糕。

我在这里使用base greplstringi等效stri_detect_fixed。它们对于原始(短)向量来说是最快的。 @ Dason的解决方案对于中等大小的向量来说是最快的,for - 循环对于'长'向量来说是最快的。

h1 <- function(val, long){
  grepl(pattern = paste0(val, collapse = ","), x = paste0(long, collapse = ","))
}

library(stringi)
h2 <- function(val, long){
  stri_detect_fixed(str = paste0(long, collapse = ","), pattern = paste0(val, collapse = ","))
}


a <- c(2, 3)
b <- c(1, 3, 2, 4, 2, 3, 1)
c <- c(1, 3, 2, 4, 1, 3, 1)

ainb(a, b) # thelatemail
valInLong(a, b) # dason
f(a, b) # pete
h1(a, b)
h2(a, b)

ainb(a, c)
valInLong(a, c)
f(a, c)
h1(a, c)
h2(a, c)

library(microbenchmark)
microbenchmark(ainb(a, b),
               valInLong(a, b),
               f(a, b),
               h1(a, b),
               h2(a, b),
               times = 10)
# Unit: microseconds
#            expr     min      lq     mean   median      uq     max neval cld
#      ainb(a, b) 201.471 202.611 223.5567 211.7350 223.139 318.932    10   c
# valInLong(a, b)  67.664  76.407  90.2437  89.5215  99.215 129.245    10  b 
#         f(a, b)  36.873  42.195  54.2833  44.2860  55.879 129.246    10 a  
#        h1(a, b)  22.809  25.470  32.1595  27.1795  28.510  74.887    10 a  
#        h2(a, b)  20.147  22.048  31.7794  24.5190  26.609  96.174    10 a 


# vectors from @Dason's answer
val <- c(123421, 232, 23423)
long <- rpois(1000, 5)
microbenchmark(ainb(val, long),
               valInLong(val, long),
               f(val, long),
               h1(val, long),
               h2(val, long),
               times = 10)
# Unit: microseconds
#                 expr       min        lq       mean     median        uq       max neval cld
#      ainb(val, long) 24673.332 24872.522 27732.2673 25685.4380 26962.877 45808.000    10   b
# valInLong(val, long)    50.558    55.880    68.5763    66.7135    81.349    91.233    10  a 
#         f(val, long)    69.945    80.588    89.1036    88.9515    99.215   115.561    10  a 
#        h1(val, long)   387.737   391.158   432.3644   421.5685   458.062   524.585    10  a 
#        h2(val, long)   337.559   342.120   378.1190   378.0425   382.035   458.442    10  a


# longer 'val' and 'long' vectors
val <- rpois(100, 5)
long <- rpois(10000, 5)
microbenchmark(ainb(val, long),
               valInLong(val, long),
               f(val, long),
               h1(val, long),
               h2(val, long),
               times = 10)
# Unit: milliseconds
#                 expr        min         lq       mean     median         uq        max neval cld
#      ainb(val, long) 298.967481 312.962860 322.350298 322.219875 329.194565 350.080246    10   b
# valInLong(val, long)   5.065280   5.237861   5.533719   5.532845   5.843414   5.921341    10  a 
#         f(val, long)   1.679050   1.717064   1.763288   1.747284   1.779786   1.907891    10  a 
#        h1(val, long)   3.648523   3.664869   3.751121   3.707634   3.753820   4.153720    10  a 
#        h2(val, long)   3.366463   3.444010   3.616591   3.478413   3.758761   4.309955    10  a

答案 4 :(得分:2)

这是@thelatemail作为中缀运营商的聪明答案的变体:

`%w/in%` <- function(a, b)
{
    i <- length(a)
    x <- 1:(length(b)-(i-1))
    y <- x + (i-1)

    any(apply(cbind(x, y), 1, function(r) all(a == b[r[1]:r[2]])))
}

它设置一组索引来迭代b,然后传递这些索引以查看所选子集是否全部相等。因为它在迭代之前创建这些索引,所以在大向量中它可能是低效的。这是在行动。

> a <- c(2, 3)
> b <- c(1, 3, 2, 4, 2, 3, 1)
> c <- c(1, 3, 2, 4, 1, 3, 1)
> 
> a %w/in% b
[1] TRUE
> a %w/in% c
[1] FALSE

对于它的价值,这个版本似乎要快得多(经过非常简短的测试):

> a <- c(2, 3, 1)
> b <- sample(1:4, 1000, replace=TRUE)
> a %w/in% b
[1] TRUE
> ainb(a, b)
[1] TRUE
> system.time(replicate(1000, a %w/in% b))
   user  system elapsed 
 11.175   0.000  11.187 
> system.time(replicate(1000, ainb(a, b)))
   user  system elapsed 
 19.930   0.000  19.949 

答案 5 :(得分:1)

一种方法是在所有可能的指数上穷举搜索较长的向量,以获得与较短向量长度相等的一系列匹配。我怀疑这种方式对于非常大的问题是有效的,并怀疑字符串转换 - 并且还试图简化我自己的答案! - 值得调查,但是......

compareTuple <- function(v.lng, v.shrt, idx)
    {
    #idx is starting index of v.lng to begin comparison
    len = length(v.shrt)
    prod(v.lng[idx:(idx+len-1)] == v.shrt)
    }

containsTuple <- function(v.lng, v.shrt)
    {
    as.logical(sum(sapply(
                        FUN = function(x){prod(compareTuple(v.lng, v.shrt, x))}, 
                        X = 1:(length(v.lng)-length(v.shrt)+1)
                         )))
    }

应该做的伎俩。结果如下:

a = c(2, 3); b = c(1, 3, 2, 4, 2, 3, 1); c = c(1, 3, 2, 4, 1, 3, 1)

> containsTuple(c,a)
[1] FALSE
> containsTuple(b,a)
[1] TRUE