I have two vectors:
a <- c(1,1,2,3,4,4,4,4,5,6)
b <- c(T,F,T,F,T,T,F,F,F,T)
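I would like a vector that tells me, for each unique value in a, how many of the corresponding elements of b are TRUE (the second column below):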
[,1] [,2]
[1,] 1 1
[2,] 2 1
[3,] 3 0
[4,] 4 2
[5,] 5 0
[6,] 6 1
The best I could come up with uses sapply:
sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
This works, but it is quite slow for larger vectors. (I tried a few subsetting variants.)
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)
microbenchmark::microbenchmark(
subset = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
, iN = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a %in% uniqueA & b), a = a, b = b)
, equal = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a == uniqueA & b), a = a, b = b)
, times = 5
)
Unit: milliseconds
expr min lq mean median uq max neval
subset 389.1995 390.6002 413.6969 393.0396 445.6553 449.9897 5
iN 2746.8407 2798.0462 2797.3155 2806.9477 2814.6317 2820.1110 5
equal 1080.3430 1089.2507 1111.0267 1096.8082 1135.1957 1153.5358 5
Does anyone know a faster way to do this?
Answer 0 (score: 3)
You can use aggregate:
aggregate(b, list(a), sum)
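As a small sketch of what that call returns on the example vectors from the question: aggregate gives back a data frame with one row per unique value of a, sorted by that value. The column names a and n_true below are my own choice for illustration, not part of the original answer.

```r
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Name the grouping column via the list; the summed column defaults to "x"
res <- aggregate(b, by = list(a = a), FUN = sum)
names(res)[2] <- "n_true"  # hypothetical, more descriptive column name

res
```

Summing a logical vector counts its TRUE values, which is why sum can be passed directly as the aggregation function.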
For the best performance, I suggest data.table. Setup takes a little longer, but performance on large data should be very good.
library(data.table)
dt <- data.table(a = a, b = b)
dt[,sum(b), by = a]
Speed test comparing (1) aggregate, (2) sapply, (3) data.table, and (4) tapply:
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)
# Build the data.table inside the function so the setup cost is part of the timing
summarize_dt <- function(a, b) {
  dt <- data.table(a = a, b = b)
  dt[, sum(b), by = a]
}
microbenchmark::microbenchmark(
  aggregate = aggregate(b, list(a), sum),
  sapply = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b),
  datatable = summarize_dt(a, b),
tapply = tapply(b, a, sum)
)
#expr min lq mean median uq max neval
#aggregate 130.995347 133.672041 141.404597 135.301762 137.199151 213.730345 100
#sapply 335.344866 357.387474 394.432339 411.994214 425.604144 486.548520 100
#datatable 1.540011 1.914712 2.430220 2.027578 2.239999 5.297593 100
#tapply 3.075646 3.627395 4.719595 4.089434 5.934675 8.758332 100
data.table appears to be the fastest, with tapply close behind; both are far faster than aggregate and sapply.
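Before trusting a benchmark like this, it can be worth checking that the candidates actually return the same counts. The following is a minimal sketch using only the base R approaches; the seed value and the variable names are arbitrary choices of mine. The keys are sorted first because tapply and aggregate return groups in sorted order, whereas unique(a) (and data.table's by) keep order of first appearance.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(TRUE, FALSE), 1e5, replace = TRUE)

keys <- sort(unique(a))

# One count of TRUE per key, computed three ways
by_sapply    <- sapply(keys, function(k) sum(b[a == k]))
by_tapply    <- as.vector(tapply(b, a, sum))
by_aggregate <- aggregate(b, list(a), sum)$x

# All three should agree element by element
stopifnot(
  all(by_sapply == by_tapply),
  all(by_sapply == by_aggregate)
)
```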
Answer 1 (score: 1)
This can be done in base R using table:
t <- table(a[b])
z <- as.numeric(names(t))
rbind(unname(cbind(z, t)), cbind(setdiff(unique(a),z),0))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 1
# [3,] 4 2
# [4,] 6 1
# [5,] 3 0
# [6,] 5 0
If you only need the values whose TRUE count is nonzero, then table(a[b]) is sufficient.
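If the zero counts are needed and should stay in sorted order, one possible variation (not from the answer above, just a sketch) is to convert a to a factor first. Subsetting a factor keeps all of its levels, so table reports a count of 0 for values of a that never coincide with TRUE. The names a_factor, counts, and result are illustrative.

```r
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Fix the levels up front so that unused values still get a cell in the table
a_factor <- factor(a, levels = sort(unique(a)))
counts <- table(a_factor[b])

# Two-column matrix: unique value, number of TRUE
result <- cbind(sort(unique(a)), as.vector(counts))
result
```

This avoids the rbind/setdiff step at the cost of building the factor.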
Answer 2 (score: 1)
Or we can use the tidyverse:
library(tidyverse)
tibble(a, b) %>%
group_by(a) %>%
summarise(b = sum(b))
A base R option:
rowsum(+b, a)
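A brief sketch of how this one-liner works on the example vectors from the question: the unary plus coerces the logical vector b to integers (TRUE becomes 1, FALSE becomes 0), and rowsum then adds those up within each group of a, returning a one-column matrix whose row names are the sorted group values. The reshaping into the two-column layout from the question is my own addition.

```r
a <- c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6)
b <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# +b turns TRUE/FALSE into 1/0 so that rowsum receives numeric input
counts <- rowsum(+b, a)

# Recover the group values from the row names and drop the dimnames
result <- unname(cbind(as.numeric(rownames(counts)), counts[, 1]))
result
```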