Question

I have the following problem to solve. I have two matrices:

a b c
b d d
a d b

and

1 2 3
4 5 6
7 8 9

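I need to compute means of the values in the second matrix, grouped by the letter at the same position in the first matrix, like this:

a matches 1 and 7 (giving a mean of 4)
b matches 2, 4 and 9 (giving a mean of 5)
c matches 3 (giving a mean of 3)
d matches 5, 6 and 8 (giving a mean of 6.33)

These two matrices are fairly simple; the ones I actually have to work with are around 100 x 100.

Any ideas are welcome.

Thanks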

Answer 1

You can use a simple tapply here. For example:

# sample input: m1 holds the group labels, m2 the values
m1 <- matrix(letters[c(1,2,1,2,4,4,3,4,2)], ncol=3)
m2 <- matrix(1:9, byrow=TRUE, ncol=3)

tapply(m2, m1, mean)
#        a        b        c        d 
# 4.000000 5.000000 3.000000 6.333333

As long as the dimensions match exactly, the fact that the data are stored in matrices doesn't matter.
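To illustrate that last point, here is a minimal sketch that flattens both matrices to plain vectors first and compares the result with the matrix call. The names v1, v2, res_matrix and res_vector are mine and are not part of the answer above.

```r
m1 <- matrix(letters[c(1,2,1,2,4,4,3,4,2)], ncol = 3)
m2 <- matrix(1:9, byrow = TRUE, ncol = 3)

# Flatten in column-major order; element i of v1 labels element i of v2
v1 <- as.vector(m1)
v2 <- as.vector(m2)

res_matrix <- tapply(m2, m1, mean)  # matrices passed directly
res_vector <- tapply(v2, v1, mean)  # plain vectors

# Should report TRUE: only the element-by-element pairing matters
all.equal(res_matrix, res_vector)
```

If the two inputs differ in length, tapply stops with an error instead of recycling, so a shape mismatch is caught early.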

Answer 2

You can use data.table for larger datasets:

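library(data.table)
dt1 <- data.table(c(m1), c(m2))       # m1 and m2 from @MrFlick's answer
dt1[, mean(V2), by=V1]                # group means, result column gets a default name
dt1[, list(V2=mean(V2)), by=V1]       # same, with the mean column named V2
#   V1       V2
#1:  a 4.000000
#2:  b 5.000000
#3:  d 6.333333
#4:  c 3.000000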

Speed comparison (1000 x 1000 matrices):

set.seed(45)
m1N <- matrix(sample(letters[1:20], 1e3*1e3, replace=TRUE), ncol=1e3)
m2N <- matrix(sample(0:40, 1e3*1e3, replace=TRUE), ncol=1e3)

system.time(res1 <- tapply(m2N, m1N, mean))
#user  system elapsed 
# 7.605   0.004   7.618 

system.time({
  dt <- data.table(c(m1N), c(m2N))
  setkey(dt, V1)
  res2 <- dt[, mean(V2), by=V1]
})
#  user  system elapsed 
# 0.043   0.000   0.043 

system.time(res3 <- unlist(lapply(split(m2N, m1N),mean)))
#  user  system elapsed 
# 7.864   0.016   7.891 

system.time(res4 <- sapply(sort(unique.default(m1N)), function(x) mean(m2N[m1N == x])))
# user  system elapsed 
# 1.007   0.012   1.021
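Timings only mean something if the approaches return the same numbers, so here is a rough sanity check for the three base R variants on a 100 x 100 input (the size mentioned in the question). The names g, x and r_* are hypothetical, and data.table is left out so the sketch runs without extra packages.

```r
set.seed(45)
g <- matrix(sample(letters[1:20], 100 * 100, replace = TRUE), ncol = 100)
x <- matrix(sample(0:40, 100 * 100, replace = TRUE), ncol = 100)

r_tapply <- tapply(x, g, mean)
r_split  <- unlist(lapply(split(x, g), mean))
r_sapply <- sapply(sort(unique.default(g)), function(k) mean(x[g == k]))

# tapply returns a 1-d array; convert it to a plain named vector first
r_tapply_vec <- setNames(as.vector(r_tapply), names(r_tapply))

all.equal(r_tapply_vec, r_split)
all.equal(r_split, r_sapply)
```

Both comparisons should report TRUE. The data.table result lists groups in order of first appearance (or in key order after setkey), so it would need to be matched by V1 before comparing.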

Answer 3

Since tapply calls split internally, on large matrices you may find it more efficient to use split directly:

> unlist(lapply(split(m, m2), mean)) 
### or slightly slower: sapply(split(m, m2), mean)
#        a        b        c        d 
# 4.000000 5.000000 3.000000 6.333333

where

> m <- structure(c(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 9L), .Dim = c(3L,3L))
> m2 <- structure(c("a","b","a","b","d","d","c","d","b"), .Dim = c(3L, 3L))
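To make the intermediate step visible, this small sketch shows what split returns for the same data before the means are taken. The matrices are rebuilt with matrix() instead of structure(), and vapply is my substitution for sapply (it checks that each group yields a single number); groups and means are names I made up.

```r
m  <- matrix(1:9, nrow = 3, byrow = TRUE)   # values
m2 <- matrix(c("a","b","a","b","d","d","c","d","b"), nrow = 3)  # labels

# split() ignores the matrix shape and walks both inputs in column-major order
groups <- split(m, m2)
groups$a   # 1 7
groups$b   # 4 2 9
groups$c   # 3
groups$d   # 5 8 6

# One mean per group, returned as a named numeric vector
means <- vapply(groups, mean, numeric(1))
means
```

Within each group the values appear in column-major order, which is why b comes out as 4, 2, 9 and not 2, 4, 9; the mean is unaffected.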

A quick benchmark:

> f <- function() tapply(m, m2, mean)
> g <- function() unlist(lapply(split(m, m2), mean))
> library(microbenchmark)
> microbenchmark(f(), g(), times = 1e4L)
# Unit: microseconds
#  expr     min       lq   median      uq       max neval
#   f() 421.083 432.2575 436.3975 440.503  3633.401 10000
#   g() 267.119 277.1495 280.2180 283.982 69714.687 10000

Answer 4

Using aggregate, though it is probably the slowest option:

aggregate(m2~m1,mean,data=data.frame(m1=c(m1),m2=c(m2)))
# m1       m2
# 1  a 4.000000
# 2  b 5.000000
# 3  c 3.000000
# 4  d 6.333333
