Sum of a subset based on a second vector

Time: 2017-05-10 21:24:42

Tags: r

I have two vectors:

a <- c(1,1,2,3,4,4,4,4,5,6)
b <- c(T,F,T,F,T,T,F,F,F,T)

I would like a vector that tells me, for each unique value in a, how many TRUEs there are in b (the second column below):

     [,1] [,2]
[1,]    1    1
[2,]    2    1
[3,]    3    0
[4,]    4    2
[5,]    5    0
[6,]    6    1

The best I could come up with uses sapply:

sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)

This works, but it is quite slow for larger vectors. (I tried a few subsetting variants.)

a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)

microbenchmark::microbenchmark(
    subset = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
    , iN = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a %in% uniqueA & b), a = a, b = b)
    , equal = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a == uniqueA & b), a = a, b = b)
    , times = 5
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
 subset  389.1995  390.6002  413.6969  393.0396  445.6553  449.9897     5
     iN 2746.8407 2798.0462 2797.3155 2806.9477 2814.6317 2820.1110     5
  equal 1080.3430 1089.2507 1111.0267 1096.8082 1135.1957 1153.5358     5

Does anyone know how to do this faster?

3 answers:

Answer 0: (score: 3)

You can use aggregate:

aggregate(b, list(a), sum)  
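
To make the shape of the result concrete, here is a minimal sketch on the small vectors from the question. The name `res_named` and the `a =` label in the list are only illustrative choices, not part of the original answer.

```r
# Small example vectors from the question
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# aggregate() returns a data frame with one row per unique value of a,
# sorted by group; the default column names are Group.1 and x
res <- aggregate(b, list(a), sum)
print(res)

# Naming the list element gives the grouping column a readable name
# (res_named is a hypothetical name used only for this illustration)
res_named <- aggregate(b, list(a = a), sum)
print(res_named)
```

Unlike the sapply version in the question, the groups come back sorted by value rather than in order of first appearance.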

For the fastest performance, I would suggest data.table. The setup takes a bit longer, but performance should be very good on large data.

library(data.table)
dt <- data.table(a = a, b = b)
dt[,sum(b), by = a]

Speed test comparing (1) aggregate, (2) sapply, (3) data.table, and (4) tapply:

  a <- sample(1:1000, 1e5, replace = TRUE)
  b <- sample(c(T,F), 1e5, replace = TRUE)

  # Builds the data.table from the global a and b so that setup cost is timed too
  summarize_dt <- function() {
    dt <- data.table(a = a, b = b)
    dt[,sum(b), by = a]
  }

  microbenchmark::microbenchmark(
    aggregate = aggregate(b, list(a), sum),
    sapply = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b),
    datatable = summarize_dt(),
    tapply = tapply(b, a, sum)
  )

      #expr        min         lq       mean     median         uq        max neval
 #aggregate 130.995347 133.672041 141.404597 135.301762 137.199151 213.730345   100
    #sapply 335.344866 357.387474 394.432339 411.994214 425.604144 486.548520   100
 #datatable   1.540011   1.914712   2.430220   2.027578   2.239999   5.297593   100
    #tapply   3.075646   3.627395   4.719595   4.089434   5.934675   8.758332   100

It looks like data.table is the fastest.

Answer 1: (score: 1)
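
Before relying on the timings, it may be worth checking that the approaches actually return the same counts. The following is a rough sketch using only base R; the seed is arbitrary and the names `by_tapply`, `by_aggregate` and `by_sapply` are hypothetical. Note that tapply and aggregate return groups sorted by value, so the sapply call iterates over `sort(unique(a))` here to line up with them.

```r
set.seed(42)  # arbitrary seed, only so the check is reproducible
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(TRUE, FALSE), 1e5, replace = TRUE)

# Counts of TRUE per unique value of a, each ordered by the sorted values of a
by_tapply    <- as.vector(tapply(b, a, sum))
by_aggregate <- aggregate(b, list(a), sum)$x
by_sapply    <- sapply(sort(unique(a)), function(u) sum(b[a == u]))

# All three approaches should agree element by element
stopifnot(all(by_tapply == by_aggregate), all(by_tapply == by_sapply))
```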

This can be done in base R using table:

t <- table(a[b])
z <- as.numeric(names(t))
rbind(unname(cbind(z, t)), cbind(setdiff(unique(a),z),0))

#      [,1] [,2]
# [1,]    1    1
# [2,]    2    1
# [3,]    4    2
# [4,]    6    1
# [5,]    3    0
# [6,]    5    0

If you only want the values with a nonzero count of TRUEs, then table(a[b]) is enough.

Answer 2: (score: 1)
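
If the zero counts are needed in sorted order, one variation (not from the original answer) is to convert `a[b]` to a factor whose levels are all unique values of `a`, so that table keeps the empty groups. This is a sketch; `lv`, `counts`, `res` and the column names are hypothetical.

```r
# Small example vectors from the question
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Fixing the factor levels makes table() report a zero for every
# value of a that never coincides with a TRUE in b
lv <- sort(unique(a))
counts <- table(factor(a[b], levels = lv))

# Two-column matrix: the unique value and its number of TRUEs
res <- cbind(value = lv, n_true = as.vector(counts))
print(res)
```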

Or we can use tidyverse:

library(tidyverse)
tibble(a, b) %>% 
       group_by(a) %>%
       summarise(b = sum(b))

A base R option:

rowsum(+b, a)
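
As a brief illustration of what this call does: the unary `+` coerces the logical vector `b` to integers (TRUE becomes 1, FALSE becomes 0), and rowsum then adds them up within each group of `a`. The sketch below runs it on the small vectors from the question; `res` is a hypothetical name.

```r
# Small example vectors from the question
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# +b turns TRUE/FALSE into 1/0; rowsum() sums within each group of a.
# The result is a one-column matrix whose row names are the sorted
# unique values of a.
res <- rowsum(+b, a)
print(res)

# Drop to a plain named vector if a matrix is not needed
counts <- res[, 1]
```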