考虑到两个因素(每个因素都有相同的水平),比如说
lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot")
A <- factor(sample(lev, 6000, TRUE))
B <- factor(sample(lev, 6000, TRUE))
我想采用他们的外部产品,定制产品功能,定义如下:
mapping <- matrix(c(
"green", "blue", "blue", "red", "red", "red",
"blue", "green", "blue", "red", "red", "red",
"blue", "blue", "green", "red", "red", "red",
"red", "red", "red", "green", "yellow", "red",
"red", "red", "red", "yellow", "green", "red",
"red", "red", "red", "red", "red", "green"),
nrow=6, ncol=6,
dimnames=list(lev, lev))
mapper <- function (X, Y) mapping[matrix(c(levels(X)[X], levels(Y)[Y]),
ncol=2, byrow=TRUE)]
A.B <- outer(A, B, FUN=mapper)
预期的输出(对于大大减少的测试用例)应该是
> A
[1] alpha foxtrot echo charlie echo foxtrot bravo delta charlie
Levels: alpha bravo charlie delta echo foxtrot
> B
[1] alpha foxtrot delta bravo bravo alpha alpha bravo alpha
Levels: alpha bravo delta foxtrot
> outer(A, B, mapper)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] "red" "red" "red" "red" "red" "green" "green" "green" "green"
[2,] "red" "red" "red" "red" "red" "green" "green" "green" "green"
[3,] "red" "red" "red" "red" "red" "green" "green" "green" "green"
[4,] "red" "red" "red" "red" "red" "green" "green" "green" "green"
[5,] "blue" "blue" "blue" "blue" "blue" "red" "green" "green" "blue"
[6,] "red" "red" "red" "red" "green" "green" "green" "green" "green"
[7,] "red" "red" "red" "red" "green" "green" "green" "green" "green"
[8,] "red" "red" "red" "red" "green" "green" "green" "green" "green"
[9,] "red" "red" "red" "red" "green" "green" "green" "green" "green"
这种方法有效,但在完全规模上却令人不快:
> system.time(outer(A, B, mapper))
user system elapsed
11.381 5.015 17.653
有人可以推荐更快的方式吗?如果它有帮助,映射矩阵保证是三角形的(即mapping[a,b] == mapping[b,a]
∀a,b。)
答案 0 :(得分:4)
编辑:在我回答时,问题似乎发生了巨大变化,但无论如何都要留在这里。
我假设@joran的评论是正确的,你的意思是(并在lev
修正订单)
lev <- c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)
此外,mapping
不是二维数组(矩阵),也不是像你想象的那样的嵌套数据结构(列表列表)
> mapping
alpha.alpha alpha.bravo alpha.charlie alpha.delta alpha.echo
"green" "blue" "blue" "red" "red"
alpha.foxtrot bravo.alpha bravo.bravo bravo.charlie bravo.delta
"red" "blue" "green" "blue" "red"
bravo.echo bravo.foxtrot charlie.alpha charlie.bravo charlie.charlie
"red" "red" "blue" "blue" "green"
charlie.delta charlie.echo charlie.foxtrot delta.alpha delta.bravo
"red" "red" "red" "red" "red"
delta.charlie delta.delta delta.echo delta.foxtrot echo.alpha
"red" "green" "yellow" "red" "red"
echo.bravo echo.charlie echo.delta echo.echo echo.foxtrot
"red" "red" "yellow" "red" "red"
foxtrot.alpha foxtrot.bravo foxtrot.charlie foxtrot.delta foxtrot.echo
"red" "red" "red" "red" "red"
foxtrot.foxtrot
"green"
现在,如果您想将其存储为列表列表:
mapping <- list(
"alpha" = list("alpha"="green", "bravo"="blue", "charlie"="blue",
"delta"="red", "echo"="red", "foxtrot"="red"),
"bravo" = list("alpha"="blue", "bravo"="green", "charlie"="blue",
"delta"="red", "echo"="red", "foxtrot"="red"),
"charlie" = list("alpha"="blue", "bravo"="blue", "charlie"="green",
"delta"="red", "echo"="red", "foxtrot"="red"),
"delta" = list("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="green", "echo"="yellow", "foxtrot"="red"),
"echo" = list("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="yellow", "echo"="red", "foxtrot"="red"),
"foxtrot" = list("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="red", "echo"="red", "foxtrot"="green")
)
mapper = function(X, Y) mapping[[levels(X)[X]]][[levels(Y)[Y]]]
请注意,我在创建list
时使用c
而不是mapping
,mapper
使用提取器([[
)而不是子集( [
)符号。
选中此选项适用于单个值:
> mapper(A[1], B[1])
[1] "red"
只有少数几个值:
> mapper(A[1:2], B[1:2])
Error in mapping[[levels(X)[X]]][[levels(Y)[Y]]] :
attempt to select more than one element
所以我们看到mapper
没有矢量化(因为它必须是)。来自outer
的帮助页面:
使用这两个扩展向量作为参数调用
FUN
。因此,它必须是一个矢量化函数(或一个名称),期望至少有两个参数。
向量化它的简单但不一定有效的方法:
> Vectorize(mapper)(A[1:2], B[1:2])
[1] "red" "green"
这现在适用于一个子集:
> outer(A[1:6], B[1:6], FUN=Vectorize(mapper))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "red" "yellow" "red" "red" "red" "red"
[2,] "red" "green" "red" "red" "red" "yellow"
[3,] "red" "green" "red" "red" "red" "yellow"
[4,] "blue" "red" "blue" "red" "blue" "red"
[5,] "green" "red" "green" "red" "green" "red"
[6,] "red" "red" "red" "green" "red" "red"
让我们检查一下时间:
> system.time(outer(A[1:6], B[1:6], FUN=Vectorize(mapper)))
user system elapsed
0 0 0
> system.time(outer(A[1:60], B[1:60], FUN=Vectorize(mapper)))
user system elapsed
0.22 0.00 0.22
> system.time(outer(A[1:600], B[1:600], FUN=Vectorize(mapper)))
user system elapsed
23.97 0.01 24.01
看外部产品的长度是线性的,或A或B的长度是二次的。我没有等待40分钟,看看6000x6000是否能正常工作。
我们可以提高效率吗?双重索引到递归结构(然后必须在其上使用Vectorize
)效率不高。让我们使用不同的数据结构:二维数组(矩阵)并使用基于矩阵的索引。
mapping <- matrix(c("green", "blue", "blue", "red", "red", "red",
"blue", "green", "blue", "red", "red", "red",
"blue", "blue", "green", "red", "red", "red",
"red", "red", "red", "green", "yellow", "red",
"red", "red", "red", "yellow", "red", "red",
"red", "red", "red", "red", "red", "green"),
nrow = 6, ncol = 6,
dimnames = list(lev, lev))
mapper <- function(X, Y) mapping[cbind(as.character(X), as.character(Y))]
并测试此
> A[1:6]
[1] echo delta delta charlie alpha foxtrot
Levels: alpha bravo charlie echo delta foxtrot
> B[1:6]
[1] alpha delta alpha foxtrot alpha echo
Levels: alpha bravo charlie echo delta foxtrot
> mapper(A[1], B[1])
[1] "red"
> mapper(A[1:2], B[1:2])
[1] "red" "green"
> outer(A[1:6], B[1:6], FUN=mapper)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "red" "yellow" "red" "red" "red" "red"
[2,] "red" "green" "red" "red" "red" "yellow"
[3,] "red" "green" "red" "red" "red" "yellow"
[4,] "blue" "red" "blue" "red" "blue" "red"
[5,] "green" "red" "green" "red" "green" "red"
[6,] "red" "red" "red" "green" "red" "red"
看起来不错。检查时间:
> system.time(outer(A[1:6], B[1:6], FUN=mapper))
user system elapsed
0 0 0
> system.time(outer(A[1:60], B[1:60], FUN=mapper))
user system elapsed
0 0 0
> system.time(outer(A[1:600], B[1:600], FUN=mapper))
user system elapsed
0.22 0.00 0.22
> system.time(outer(A, B, FUN=mapper))
user system elapsed
7.80 1.48 9.30
超过9秒,而不是~40分钟,大约250倍的加速。
答案 1 :(得分:2)
所以,你的mapping
变量并不完全正确。如果你看看
str(mapping)
# Named chr [1:36] "green" "blue" "blue" "red" "red" "red" ...
# - attr(*, "names")= chr [1:36] "alpha.alpha" "alpha.bravo" "alpha.charlie" ...
你会看到它是一维的字符向量。在那里,元素的名称用“。”粘贴在一起。我假设这不是你想要的。也许你曾使用list()
而不是c()
?但是如果你可以控制格式,为什么不使用简单的矩阵
mapping <- structure(c("green", "blue", "blue", "red", "red", "red", "blue",
"green", "blue", "red", "red", "red", "blue", "blue", "green",
"red", "red", "red", "red", "red", "red", "green", "yellow",
"red", "red", "red", "red", "yellow", "red", "red", "red", "red",
"red", "red", "red", "green"), .Dim = c(6L, 6L), .Dimnames = list(
c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot"
), c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot"
)))
因此lev
的每个值都有一行和一列,并且单元格的颜色是组合的颜色。
然后,如果你这样做
#sample data
lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)
#run mapping
out <- outer(A, B, FUN=function(a,b) mapping[cbind(a,b)])
现在out将沿着cols的行和B值具有A值,并且将两者之间的相互作用的正确颜色作为单元格值。这很快就会运行
system.time(outer(A, B, FUN=function(a,b) mapping[cbind(a,b)]))
# user system elapsed
# 0.90 0.25 1.15
答案 2 :(得分:0)
我相信这可以做你想要的,大约4秒钟(如果你不添加名字,运行速度大约快4倍,但结果看起来并不好)。请注意,非常重要:仅当A
和B
的级别相同且级别与mapping.mx
的名称相同时才有效。这是因为cbind
强制数字因素,因此映射是位置的。如果不是这种情况,您可以强制A
和B
加入角色,它会起作用,但速度会更慢。
names(A) <- A
names(B) <- B
mapping.mx <- do.call(rbind, mapping.lst) # see below for mapping.lst
system.time(res <- outer(A, B, function(x, y) mapping.mx[cbind(x, y)]))
# user system elapsed
# 3.33 0.62 3.95
str(res)
# chr [1:6000, 1:6000] "red" "green" "green" "blue" "green" "blue" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:6000] "delta" "alpha" "alpha" "bravo" ...
# ..$ : chr [1:6000] "alpha" "alpha" "echo" "delta" ...
res[1:5, 1:5]
# alpha alpha echo delta charlie
# delta "red" "red" "yellow" "red" "red"
# alpha "green" "green" "red" "red" "blue"
# alpha "green" "green" "red" "red" "blue"
# bravo "blue" "blue" "red" "red" "blue"
# alpha "green" "green" "red" "red" "blue"
mapping.lst
(基本上与您的相同,但将第一个c
更改为list
):
mapping.lst <- list(
"alpha" = c("alpha"="green", "bravo"="blue", "charlie"="blue",
"delta"="red", "echo"="red", "foxtrot"="red"),
"bravo" = c("alpha"="blue", "bravo"="green", "charlie"="blue",
"delta"="red", "echo"="red", "foxtrot"="red"),
"charlie" = c("alpha"="blue", "bravo"="blue", "charlie"="green",
"delta"="red", "echo"="red", "foxtrot"="red"),
"delta" = c("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="green", "echo"="yellow", "foxtrot"="red"),
"echo" = c("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="yellow", "echo"="red", "foxtrot"="red"),
"foxtrot" = c("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="red", "echo"="red", "foxtrot"="green")
)