Find the row-wise minimum and return the value together with its column name

Asked: 2015-08-03 22:34:33

Tags: r

Consider the following matrix m:

    ca bsa rd zaa
ada  3   4  3   2
aca  1   4  5   2
ara  3   4  3   2
ava  3   4  5   2

I am trying to find the minimum of each row and return the result as a data.frame of the following form:

    q   s d
1 ada zaa 2
2 aca  ca 1
3 ara zaa 2
4 ava zaa 2

Currently I am doing:

res <- t(sapply(seq(nrow(m)), function(i) {
  j <- which.min(m[i,])
  c(q = rownames(m)[i],
    s = colnames(m)[j],
    d = m[i,j])}))

res <- data.frame(res)
res$d <- as.numeric(res$d)

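I am looking for a better way to build it.

Building res with c() (which coerces all components to the same type), then converting it to a data.frame, and finally changing d back to numeric feels rather inefficient just to get the following structure: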

'data.frame':   4 obs. of  3 variables:
 $ q: Factor w/ 4 levels "aca","ada","ara",..: 2 1 3 4
 $ s: Factor w/ 2 levels "ca","zaa": 2 1 2 2
 $ d: num  2 1 2 2 
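A minimal sketch of the two coercion effects involved here, using made-up values rather than the matrix above. It also shows why, on R versions where data.frame() turns strings into factors (as in the str() output above), calling as.numeric() directly on a factor column returns level codes, so going through as.character() first is the safer conversion:

```r
# c() coerces mixed inputs to one common type, here character
x <- c(q = "ada", s = "zaa", d = 2)
class(x)

# as.numeric() on a factor yields the integer level codes, not the labels
f <- factor(c("10", "5", "10"))   # illustrative values only
codes  <- as.numeric(f)                 # level codes
values <- as.numeric(as.character(f))   # the numbers the labels spell out
```

In the question's data the level codes of d happen to coincide with the actual values (levels "1" and "2"), which hides the problem.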

I also need to handle the case where a row may have more than one minimum.

Data

m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2), .Dim = c(4L, 
4L), .Dimnames = list(c("ada", "aca", "ara", "ava"), c("ca", "bsa", "rd", "zaa")))

4 answers:

Answer 0 (score: 5)

You can find the column index of each row's minimum by using max.col on the negated matrix.

col_id <- max.col(-m)
data.frame(q = rownames(m), s = colnames(m)[col_id],
           d = m[cbind(1:length(col_id), col_id)])
#     q   s d
# 1 ada zaa 2
# 2 aca  ca 1
# 3 ara zaa 2
# 4 ava zaa 2
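One caveat worth sketching: max.col breaks ties at random by default, so rows with several minima can give different columns from run to run. Assuming the first minimal column is the one wanted (the same choice which.min makes), ties.method = "first" makes the result deterministic. The one-row matrix `tie` below is a hypothetical example added for illustration:

```r
m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2),
               .Dim = c(4L, 4L),
               .Dimnames = list(c("ada", "aca", "ara", "ava"),
                                c("ca", "bsa", "rd", "zaa")))

# Column of the first minimum in each row (ties resolved deterministically)
col_id <- max.col(-m, ties.method = "first")
res <- data.frame(q = rownames(m),
                  s = colnames(m)[col_id],
                  d = m[cbind(seq_len(nrow(m)), col_id)])

# Hypothetical row with a tie: columns 1 and 2 both hold the minimum
tie <- matrix(c(1, 1, 2), nrow = 1)
first_tie <- max.col(-tie, ties.method = "first")
```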

Answer 1 (score: 3)

I would probably do it like this:

cn <- colnames(m)
data.frame(t(apply(m, 1, function(x) {
  y <- which.min(x)
  c(cn[y], x[y])
})))
#      X1 X2
# ada zaa  2
# aca  ca  1
# ara zaa  2
# ava zaa  2

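Key things that save time: look up the names only once, outside the loop (here colnames, stored in cn), and compute the minimum only once per iteration.

Here is a benchmark of the options so far: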

set.seed(10151)
NN<-1000
m<-matrix(sample(NN,NN^2,T),ncol=NN,
          dimnames=list(replicate(NN,paste0(sample(letters,3,T),collapse="")),
                        replicate(NN,paste0(sample(letters,3,T),collapse=""))))

library(microbenchmark)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
  steven  26.35880  28.46085  30.84153  29.51562  30.82366  80.18648    50
  henrik  16.83619  17.47235  19.14324  18.73855  20.01433  26.63396    50
   mikec  25.25390  27.06398  28.69647  28.25848  29.46897  37.15669    50
    rawr 110.00786 118.80357 150.76281 128.73180 176.67976 292.00051    50
 richard  23.44658  25.49981  27.86844  26.72024  27.62607  78.14996    50
   akrun  26.65337  28.12803  35.52941  29.33611  33.54546  83.34182    50
  andres  53.05468  59.54172  71.10404  61.85948  71.27818 149.76439    50

*Note: henrik's answer had a typo, which I had fixed incorrectly in an earlier benchmark. For akrun I am using the optimized {res<-setDT(melt(m)); res[res[, .I[which.min(value)] ,.(Var1)]$V1]}
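The timing call itself is not shown above. As a rough sketch of how such a comparison might be set up: the wrapper names f_maxcol and f_apply are hypothetical and cover only two of the approaches, NN is reduced to keep the run short, and the microbenchmark package is assumed to be installed (the timing step is skipped otherwise).

```r
set.seed(10151)
NN <- 200  # smaller than in the benchmark above, to keep the sketch quick
m <- matrix(sample(NN, NN^2, TRUE), ncol = NN)

# Hypothetical wrappers; these names do not come from the original answers
f_maxcol <- function(m) max.col(-m, ties.method = "first")
f_apply  <- function(m) unname(apply(m, 1, which.min))

# Both should return the same column indices before timings are compared
same_result <- identical(f_maxcol(m), f_apply(m))

# Time each approach 50 times, if the package is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    maxcol = f_maxcol(m),
    apply  = f_apply(m),
    times  = 50L
  ))
}
```

Checking that the candidates agree first avoids comparing timings of functions that compute different things, for example because of random tie-breaking.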

Upping the difficulty, with NN<-2000 (the two slowest dropped to save time):

Unit: milliseconds
    expr       min        lq      mean    median        uq      max neval
  steven 112.80108 114.06360 115.73233 115.29611 116.97757 122.3215    50
  henrik  67.16095  70.17341  93.84658  98.30461  99.96561 162.4522    50
   mikec 107.81738 110.24776 117.01182 111.64840 114.39962 166.1335    50
 richard 101.08277 104.76309 115.75823 105.96692 107.78915 206.8925    50
   akrun 101.65822 131.51744 159.14601 165.14284 183.04740 236.5955    50

If you are still not convinced, NN<-5000:

Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
  henrik 413.3938 422.7162 450.3574 432.1532 465.9551 707.6048    50
   mikec 705.4221 725.0111 764.4510 742.2715 801.8704 901.3484    50
 richard 695.7005 716.7905 754.1729 732.5105 778.5526 902.7917    50

Answer 2 (score: 2)

Maybe something like

w <- apply(m, 1, which.min)
data.frame(
    q = rownames(m),
    s = colnames(m)[w],
    d = m[cbind(seq_along(w), w)]
) 
#     q   s d
# 1 ada zaa 2
# 2 aca  ca 1
# 3 ara zaa 2
# 4 ava zaa 2

Answer 3 (score: 1)

Here is another one. Maybe someone smarter can come up with a one-liner, but this is the best I can do, ma

m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2),.Dim = c(4L, 4L),.Dimnames = list(c("ada", "aca", "ara", "ava"), c("ca", "bsa", "rd", "zaa")))

(m2 <- data.frame(as.table(m * (apply(m, 1, min) == m))))
m2[m2$Freq > 0, ]

#    Var1 Var2 Freq
# 2   aca   ca    1
# 13  ada  zaa    2
# 15  ara  zaa    2
# 16  ava  zaa    2
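The masking approach above keeps every tied minimum, but the Freq > 0 filter relies on all row minima being positive. A possible variant, sketched here with which(..., arr.ind = TRUE), returns all minima per row without that assumption. The tie placed in row "ada" is a hypothetical modification of the data, added only to show the multi-minimum case:

```r
m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2),
               .Dim = c(4L, 4L),
               .Dimnames = list(c("ada", "aca", "ara", "ava"),
                                c("ca", "bsa", "rd", "zaa")))
m["ada", "ca"] <- 2  # hypothetical tie: row "ada" now has two minima

# Row minima recycle down the columns, so each cell is compared with its own row's minimum
hits <- which(m == apply(m, 1, min), arr.ind = TRUE)

# Order by row, then column, so ties appear next to each other
hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

res <- data.frame(q = rownames(m)[hits[, "row"]],
                  s = colnames(m)[hits[, "col"]],
                  d = m[hits],
                  row.names = NULL)
```

With this data, res has one line per minimum, so "ada" appears twice (once for ca and once for zaa).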