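First off, I'm fairly sure this question has been answered before, but the right search terms seem to elude me, so apologies if this is a duplicate.
Say I have a vector of factors: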
all <- factor(letters)
I go on to use all 5-element combinations of these factor levels as part of a modelling pipeline:
combos <- t(combn(as.character(all), 5))
head(combos)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "a" "b" "c" "d" "e"
# [2,] "a" "b" "c" "d" "f"
# [3,] "a" "b" "c" "d" "g"
# ...
My question is: how can I convert this second matrix into a matrix indicating the presence/absence of every level, like so:
a b c d e f g ...
[1,] 1 1 1 1 1 0 0 ...
[2,] 1 1 1 1 0 1 0 ...
[3,] 1 1 1 1 0 0 1 ...
...
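As for what I've tried, my first thought was a row-wise application of ifelse using apply, but I couldn't put together anything workable. What's a smart way of doing this?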
Answer 0 (score: 3)
Here is my attempt:
combos.out <- t(apply(combos, 1, function(x) table(factor(x, levels = letters))))
head(combos.out)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
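To see why this works, it may help to look at a single row. The following is only a minimal sketch on a smaller, made-up alphabet; the names `lv` and `row1` are illustrative and do not appear in the question.

```r
# Hypothetical small example: 6 levels instead of 26
lv <- letters[1:6]
row1 <- c("a", "c", "f")

# Fixing the levels makes table() report a count for every level,
# including the levels that are absent from row1
counts <- table(factor(row1, levels = lv))
as.vector(counts)
# [1] 1 0 1 0 0 1
```

Because every row of combos holds distinct letters, the counts can only be 0 or 1, so the table doubles as a presence/absence vector.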
W.r.t. @Ananda Mahto's comment, the transposing and factor manipulation certainly slows things down. A quick and dirty benchmark:
#Unit: milliseconds
# expr min lq median uq max neval
# forfun(combos) 416.6027 534.6973 652.7919 718.4231 784.0544 3
# applyfun(combos) 13892.7020 15755.8570 17619.0121 22559.8271 27500.6421 3
Score one for the for loop!
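The benchmark does not show how forfun and applyfun were defined. One plausible reconstruction, assuming they simply wrap the apply approach above and the for-loop approach from Ananda Mahto's answer, might look like the sketch below. The function bodies, the `lv` argument, and the reduced input `small` are all assumptions made for illustration.

```r
# Assumed definitions; the original post does not show them
applyfun <- function(m, lv = letters) {
  t(apply(m, 1, function(x) table(factor(x, levels = lv))))
}

forfun <- function(m, lv = letters) {
  # Preallocate the full 0/1 matrix, then fill it row by row
  x <- matrix(0L, nrow = nrow(m), ncol = length(lv),
              dimnames = list(NULL, lv))
  for (i in seq_len(nrow(m))) {
    x[i, m[i, ]] <- 1L   # index columns by name
  }
  x
}

# Reduced input so the comparison runs quickly
small <- t(combn(letters[1:10], 5))
res_apply <- applyfun(small)
res_for <- forfun(small)

# Both should give the same 0/1 pattern (attributes may differ)
all(res_apply == res_for)
```

With functions like these, a call such as microbenchmark(forfun(combos), applyfun(combos), times = 3) would produce a table in the shape shown above, though the exact timings depend on the machine.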
Answer 1 (score: 3)
You can use matrix indexing for better speed. Here is a much improved solution that does not use a for loop.
all <- factor(letters)
combos <- t(combn(as.character(all), 5))
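# Column index of every entry, read row by row
A <- match(c(t(combos)), letters)
# Row index of every entry: 1 1 1 1 1 2 2 2 2 2 ...
B <- 0:(length(A)-1) %/% 5 + 1
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos),
            dimnames = list(NULL, a))
# Each row of cbind(B, A) is a (row, column) pair to set to 1
x[cbind(B, A)] <- 1L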
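The key step is indexing with a two-column matrix: R reads each row of the index matrix as a (row, column) pair. A minimal sketch on a made-up 2 x 3 input may make this easier to follow; the names `lv`, `m`, `rows`, `cols`, and `out` are illustrative, and `rep(..., each = )` is used here as an equivalent way to build the row index.

```r
# Hypothetical toy input: 2 combinations of 3 out of 5 levels
lv <- letters[1:5]
m <- rbind(c("a", "b", "d"),
           c("b", "c", "e"))

# Column index of each entry, reading m row by row
cols <- match(c(t(m)), lv)
# Row index each entry came from: 1 1 1 2 2 2
rows <- rep(seq_len(nrow(m)), each = ncol(m))

out <- matrix(0L, nrow = nrow(m), ncol = length(lv),
              dimnames = list(NULL, lv))
# Every (row, column) pair in the index matrix is set to 1 in one step
out[cbind(rows, cols)] <- 1L
out
#      a b c d e
# [1,] 1 1 0 1 0
# [2,] 0 1 1 0 1
```

All assignments happen in a single vectorised call, which is where the speed-up over the row-by-row loop comes from.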
orig <- function() {
  a <- unique(as.vector(combos))
  x <- matrix(0, ncol = length(a), nrow = nrow(combos),
              dimnames = list(NULL, a))
  for (i in 1:nrow(combos)) {
    x[i, combos[i, ]] <- 1
  }
  x
}
new <- function() {
  A <- match(c(t(combos)), letters)
  B <- 0:(length(A)-1) %/% 5 + 1
  a <- unique(as.vector(combos))
  x <- matrix(0, ncol = length(a), nrow = nrow(combos),
              dimnames = list(NULL, a))
  x[cbind(B, A)] <- 1L
  x
}
identical(orig(), new())
# [1] TRUE
library(microbenchmark)
microbenchmark(orig(), new(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# orig() 476.85206 486.11091 497.48429 512.4333 579.2695 20
# new() 87.02026 91.17021 96.88463 111.6414 175.6339 20
For a problem like this, a for loop works fine and is easy to preallocate:
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos),
            dimnames = list(NULL, a))
for (i in 1:nrow(combos)) {
  x[i, combos[i, ]] <- 1
}
head(x)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Answer 2 (score: 2)
A simple and efficient solution:
t(apply(combos,1,function(x){all %in% x}))*1
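Broken into steps, the one-liner tests every level for membership in each row and then turns the logical result into numbers. A minimal sketch on a made-up input follows; the names `lv`, `m`, `pres`, and `out` are illustrative, and the column labels are an addition, since the one-liner itself returns a matrix without column names.

```r
# Hypothetical toy input: 2 combinations of 3 out of 5 levels
lv <- factor(letters[1:5])
m <- rbind(c("a", "b", "d"),
           c("b", "c", "e"))

# For each row, test every level for membership; apply() returns one
# column per row of m, hence the transpose
pres <- t(apply(m, 1, function(x) lv %in% x))

# Multiplying by 1 converts TRUE/FALSE to 1/0
out <- pres * 1
# Added for readability: label the columns in the order of lv
colnames(out) <- as.character(lv)
out
```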
Ananda Mahto's for-loop solution is still about twice as fast. Timings for the solution above:
min lq median uq max neval
561.2153 638.4648 643.439 650.7053 1199.857 100
versus the for loop:
min lq median uq max neval
295.8798 305.0586 311.9961 370.6028 406.9336 100