R因素的外部产品 - 自定义映射 - 方式太慢

时间:2014-07-03 20:55:55

标签: r performance categorical-data

考虑到两个因素(每个因素都有相同的水平),比如说

lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot")
A <- factor(sample(lev, 6000, TRUE))
B <- factor(sample(lev, 6000, TRUE))

我想采用他们的外部产品,定制产品功能,定义如下:

mapping <- matrix(c(
    "green", "blue",  "blue",  "red",    "red",    "red",
    "blue",  "green", "blue",  "red",    "red",    "red",
    "blue",  "blue",  "green", "red",    "red",    "red",
    "red",   "red",   "red",   "green",  "yellow", "red",
    "red",   "red",   "red",   "yellow", "green",  "red",
    "red",   "red",   "red",   "red",    "red",    "green"),
    nrow=6, ncol=6,
    dimnames=list(lev, lev))
mapper <- function (X, Y) mapping[matrix(c(levels(X)[X], levels(Y)[Y]),
                                         ncol=2, byrow=TRUE)]
A.B <- outer(A, B, FUN=mapper)

预期的输出(对于大大减少的测试用例)应该是

> A
[1] alpha   foxtrot echo    charlie echo    foxtrot bravo   delta   charlie
Levels: alpha bravo charlie delta echo foxtrot
> B
[1] alpha   foxtrot delta   bravo   bravo   alpha   alpha   bravo   alpha  
Levels: alpha bravo delta foxtrot
> outer(A, B, mapper)
      [,1]   [,2]   [,3]   [,4]   [,5]    [,6]    [,7]    [,8]    [,9]   
 [1,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [2,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [3,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [4,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [5,] "blue" "blue" "blue" "blue" "blue"  "red"   "green" "green" "blue" 
 [6,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"
 [7,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"
 [8,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"
 [9,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"

这种方法有效,但在完全规模上却令人不快:

> system.time(outer(A, B, mapper))
   user  system elapsed 
 11.381   5.015  17.653 

有人可以推荐更快的方式吗?如果它有帮助,映射矩阵保证是三角形的(即mapping[a,b] == mapping[b,a]∀a,b。)

3 个答案:

答案 0 :(得分:4)

编辑:在我回答时,问题似乎发生了巨大变化,但无论如何都要留在这里。


我假设@joran的评论是正确的,你的意思是(并在lev修正订单)

lev <- c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)

此外,mapping不是二维数组(矩阵),也不是像你想象的那样的嵌套数据结构(列表列表)

> mapping
    alpha.alpha     alpha.bravo   alpha.charlie     alpha.delta      alpha.echo 
        "green"          "blue"          "blue"           "red"           "red" 
  alpha.foxtrot     bravo.alpha     bravo.bravo   bravo.charlie     bravo.delta 
          "red"          "blue"         "green"          "blue"           "red" 
     bravo.echo   bravo.foxtrot   charlie.alpha   charlie.bravo charlie.charlie 
          "red"           "red"          "blue"          "blue"         "green" 
  charlie.delta    charlie.echo charlie.foxtrot     delta.alpha     delta.bravo 
          "red"           "red"           "red"           "red"           "red" 
  delta.charlie     delta.delta      delta.echo   delta.foxtrot      echo.alpha 
          "red"         "green"        "yellow"           "red"           "red" 
     echo.bravo    echo.charlie      echo.delta       echo.echo    echo.foxtrot 
          "red"           "red"        "yellow"           "red"           "red" 
  foxtrot.alpha   foxtrot.bravo foxtrot.charlie   foxtrot.delta    foxtrot.echo 
          "red"           "red"           "red"           "red"           "red" 
foxtrot.foxtrot 
        "green" 

现在,如果您想将其存储为列表列表:

mapping <- list(
    "alpha"   = list("alpha"="green", "bravo"="blue", "charlie"="blue",
                     "delta"="red", "echo"="red", "foxtrot"="red"),
    "bravo"   = list("alpha"="blue", "bravo"="green", "charlie"="blue",
                     "delta"="red", "echo"="red", "foxtrot"="red"),
    "charlie" = list("alpha"="blue", "bravo"="blue", "charlie"="green",
                     "delta"="red", "echo"="red", "foxtrot"="red"),
    "delta"   = list("alpha"="red", "bravo"="red", "charlie"="red",
                     "delta"="green", "echo"="yellow", "foxtrot"="red"),
    "echo"    = list("alpha"="red", "bravo"="red", "charlie"="red",
                     "delta"="yellow", "echo"="red", "foxtrot"="red"),
    "foxtrot" = list("alpha"="red", "bravo"="red", "charlie"="red",
                     "delta"="red", "echo"="red", "foxtrot"="green")
)
mapper = function(X, Y) mapping[[levels(X)[X]]][[levels(Y)[Y]]]

请注意,我在创建list时使用c而不是mappingmapper使用提取器([[)而不是子集( [)符号。

选中此选项适用于单个值:

> mapper(A[1], B[1])
[1] "red"

只有少数几个值:

> mapper(A[1:2], B[1:2])
Error in mapping[[levels(X)[X]]][[levels(Y)[Y]]] : 
  attempt to select more than one element

所以我们看到mapper没有矢量化(因为它必须是)。来自outer的帮助页面:

  使用这两个扩展向量作为参数调用

FUN。因此,它必须是一个矢量化函数(或一个名称),期望至少有两个参数。

向量化它的简单但不一定有效的方法:

> Vectorize(mapper)(A[1:2], B[1:2])
[1] "red"   "green"

这现在适用于一个子集:

> outer(A[1:6], B[1:6], FUN=Vectorize(mapper))
     [,1]    [,2]     [,3]    [,4]    [,5]    [,6]    
[1,] "red"   "yellow" "red"   "red"   "red"   "red"   
[2,] "red"   "green"  "red"   "red"   "red"   "yellow"
[3,] "red"   "green"  "red"   "red"   "red"   "yellow"
[4,] "blue"  "red"    "blue"  "red"   "blue"  "red"   
[5,] "green" "red"    "green" "red"   "green" "red"   
[6,] "red"   "red"    "red"   "green" "red"   "red"   

让我们检查一下时间:

> system.time(outer(A[1:6], B[1:6], FUN=Vectorize(mapper)))
   user  system elapsed 
      0       0       0 
> system.time(outer(A[1:60], B[1:60], FUN=Vectorize(mapper)))
   user  system elapsed 
   0.22    0.00    0.22 
> system.time(outer(A[1:600], B[1:600], FUN=Vectorize(mapper)))
   user  system elapsed 
  23.97    0.01   24.01 

看外部产品的长度是线性的,或A或B的长度是二次的。我没有等待40分钟,看看6000x6000是否能正常工作。

我们可以提高效率吗?双重索引到递归结构(然后必须在其上使用Vectorize)效率不高。让我们使用不同的数据结构:二维数组(矩阵)并使用基于矩阵的索引。

mapping <- matrix(c("green", "blue", "blue", "red", "red", "red", 
                    "blue", "green", "blue", "red", "red", "red", 
                    "blue", "blue", "green", "red", "red", "red",
                    "red", "red", "red", "green", "yellow", "red", 
                    "red", "red", "red", "yellow", "red", "red", 
                    "red", "red", "red", "red", "red", "green"),
                  nrow = 6, ncol = 6,
                  dimnames = list(lev, lev))
mapper <- function(X, Y) mapping[cbind(as.character(X), as.character(Y))]

并测试此

> A[1:6]
[1] echo    delta   delta   charlie alpha   foxtrot
Levels: alpha bravo charlie echo delta foxtrot
> B[1:6]
[1] alpha   delta   alpha   foxtrot alpha   echo   
Levels: alpha bravo charlie echo delta foxtrot
> mapper(A[1], B[1])
[1] "red"
> mapper(A[1:2], B[1:2])
[1] "red"   "green"
> outer(A[1:6], B[1:6], FUN=mapper)
     [,1]    [,2]     [,3]    [,4]    [,5]    [,6]    
[1,] "red"   "yellow" "red"   "red"   "red"   "red"   
[2,] "red"   "green"  "red"   "red"   "red"   "yellow"
[3,] "red"   "green"  "red"   "red"   "red"   "yellow"
[4,] "blue"  "red"    "blue"  "red"   "blue"  "red"   
[5,] "green" "red"    "green" "red"   "green" "red"   
[6,] "red"   "red"    "red"   "green" "red"   "red"   

看起来不错。检查时间:

> system.time(outer(A[1:6], B[1:6], FUN=mapper))
   user  system elapsed 
      0       0       0 
> system.time(outer(A[1:60], B[1:60], FUN=mapper))
   user  system elapsed 
      0       0       0 
> system.time(outer(A[1:600], B[1:600], FUN=mapper))
   user  system elapsed 
   0.22    0.00    0.22 
> system.time(outer(A, B, FUN=mapper))
   user  system elapsed 
   7.80    1.48    9.30 

超过9秒,而不是~40分钟,大约250倍的加速。

答案 1 :(得分:2)

所以,你的mapping变量并不完全正确。如果你看看

str(mapping)
#  Named chr [1:36] "green" "blue" "blue" "red" "red" "red" ...
#  - attr(*, "names")= chr [1:36] "alpha.alpha" "alpha.bravo" "alpha.charlie" ...

你会看到它是一维的字符向量。在那里,元素的名称用“。”粘贴在一起。我假设这不是你想要的。也许你曾使用list()而不是c()?但是如果你可以控制格式,为什么不使用简单的矩阵

mapping <- structure(c("green", "blue", "blue", "red", "red", "red", "blue", 
"green", "blue", "red", "red", "red", "blue", "blue", "green", 
"red", "red", "red", "red", "red", "red", "green", "yellow", 
"red", "red", "red", "red", "yellow", "red", "red", "red", "red", 
"red", "red", "red", "green"), .Dim = c(6L, 6L), .Dimnames = list(
    c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot"
    ), c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot"
    )))

因此lev的每个值都有一行和一列,并且单元格的颜色是组合的颜色。

然后,如果你这样做

#sample data
lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)

#run mapping
out <- outer(A, B, FUN=function(a,b) mapping[cbind(a,b)])

现在out将沿着cols的行和B值具有A值,并且将两者之间的相互作用的正确颜色作为单元格值。这很快就会运行

system.time(outer(A, B, FUN=function(a,b) mapping[cbind(a,b)]))

#   user  system elapsed 
#   0.90    0.25    1.15 

答案 2 :(得分:0)

我相信这可以做你想要的,大约4秒钟(如果你不添加名字,运行速度大约快4倍,但结果看起来并不好)。请注意,非常重要:仅当AB的级别相同且级别与mapping.mx 的名称相同时才有效。这是因为cbind强制数字因素,因此映射是位置的。如果不是这种情况,您可以强制AB加入角色,它会起作用,但速度会更慢。

names(A) <- A
names(B) <- B

mapping.mx <- do.call(rbind, mapping.lst)  # see below for mapping.lst
system.time(res <- outer(A, B, function(x, y) mapping.mx[cbind(x, y)]))

# user  system elapsed 
# 3.33    0.62    3.95 

str(res)

# chr [1:6000, 1:6000] "red" "green" "green" "blue" "green" "blue" ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:6000] "delta" "alpha" "alpha" "bravo" ...
#  ..$ : chr [1:6000] "alpha" "alpha" "echo" "delta" ...

res[1:5, 1:5]

#       alpha   alpha   echo     delta charlie
# delta "red"   "red"   "yellow" "red" "red"  
# alpha "green" "green" "red"    "red" "blue" 
# alpha "green" "green" "red"    "red" "blue" 
# bravo "blue"  "blue"  "red"    "red" "blue" 
# alpha "green" "green" "red"    "red" "blue"     

mapping.lst(基本上与您的相同,但将第一个c更改为list):

mapping.lst <- list(
  "alpha"   = c("alpha"="green", "bravo"="blue", "charlie"="blue",
                "delta"="red", "echo"="red", "foxtrot"="red"),
  "bravo"   = c("alpha"="blue", "bravo"="green", "charlie"="blue",
                "delta"="red", "echo"="red", "foxtrot"="red"),
  "charlie" = c("alpha"="blue", "bravo"="blue", "charlie"="green",
                "delta"="red", "echo"="red", "foxtrot"="red"),
  "delta"   = c("alpha"="red", "bravo"="red", "charlie"="red",
                "delta"="green", "echo"="yellow", "foxtrot"="red"),
  "echo"    = c("alpha"="red", "bravo"="red", "charlie"="red",
                "delta"="yellow", "echo"="red", "foxtrot"="red"),
  "foxtrot" = c("alpha"="red", "bravo"="red", "charlie"="red",
                "delta"="red", "echo"="red", "foxtrot"="green")
)