Consider the following matrix m:
    ca bsa rd zaa
ada  3   4  3   2
aca  1   4  5   2
ara  3   4  3   2
ava  3   4  5   2
I am trying to find the minimum of each row and return a data.frame of the following form:
    q   s d
1 ada zaa 2
2 aca  ca 1
3 ara zaa 2
4 ava zaa 2
Right now I am doing:
res <- t(sapply(seq(nrow(m)), function(i) {
j <- which.min(m[i,])
c(q = rownames(m)[i],
s = colnames(m)[j],
d = m[i,j])}))
res <- data.frame(res)
res$d <- as.numeric(as.character(res$d))  # as.character() first: if d is a factor, as.numeric() alone returns level codes
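I am looking for a better way to construct this. Building res with c() (which coerces all components to the same type), then converting it to a data.frame, and finally changing d back to numeric feels rather inefficient just to obtain the following structure: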
'data.frame': 4 obs. of 3 variables:
$ q: Factor w/ 4 levels "aca","ada","ara",..: 2 1 3 4
$ s: Factor w/ 2 levels "ca","zaa": 2 1 2 2
$ d: num 2 1 2 2
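The coercion described above can be seen in isolation with a small sketch. The vectors x and f are illustrative names, not part of the original code; the second half shows why converting a factor of digits straight to numeric can silently return level codes instead of the values.

```r
# Mixing character and numeric values in c() coerces everything to character
x <- c(q = "ada", s = "zaa", d = 2)
class(x)    # "character"
x[["d"]]    # "2" (a string, no longer a number)

# If such strings end up as a factor, as.numeric() returns the integer
# level codes; going through as.character() recovers the actual values
f <- factor(c("10", "5", "10"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))   # 10 5 10
```

With the example matrix the codes happen to coincide with the values, which is why the problem is easy to miss.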
I also need to handle the case where a row may have more than one minimum.
Data
m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2), .Dim = c(4L,
4L), .Dimnames = list(c("ada", "aca", "ara", "ava"), c("ca", "bsa", "rd", "zaa")))
Answer 0 (score: 5)
You can use max.col on the negated matrix to find the column index of the minimum in each row.
col_id <- max.col(-m)
data.frame(q = rownames(m), s = colnames(m)[col_id],
d = m[cbind(1:length(col_id), col_id)])
# q s d
# 1 ada zaa 2
# 2 aca ca 1
# 3 ara zaa 2
# 4 ava zaa 2
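One caveat regarding tied minima: by default max.col breaks ties at random (ties.method = "random"). A minimal sketch, assuming you want the leftmost minimum in each row (the same choice which.min makes), passes ties.method = "first" explicitly. The name res is only used here for illustration.

```r
m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2), .Dim = c(4L, 4L),
               .Dimnames = list(c("ada", "aca", "ara", "ava"), c("ca", "bsa", "rd", "zaa")))

# ties.method = "first" makes the result deterministic: leftmost minimum wins
col_id <- max.col(-m, ties.method = "first")

# Matrix indexing with a two-column (row, col) matrix extracts one cell per row
res <- data.frame(q = rownames(m),
                  s = colnames(m)[col_id],
                  d = m[cbind(seq_len(nrow(m)), col_id)])
```

Because d is extracted directly from the numeric matrix, it stays numeric and no later type conversion is needed.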
Answer 1 (score: 3)
I would probably do it like this:
cn<-colnames(m)
data.frame(t(apply(m,1,function(x){y<-which.min(x);
c(cn[y],x[y])})))
X1 X2
ada zaa 2
aca ca 1
ara zaa 2
ava zaa 2
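Key things to save time: look up colnames only once (outside the loop), and locate the minimum only once per iteration (via which.min), reusing that position for both the column name and the value.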
Here is a benchmark of the options so far:
set.seed(10151)
NN<-1000
m<-matrix(sample(NN,NN^2,T),ncol=NN,
dimnames=list(replicate(NN,paste0(sample(letters,3,T),collapse="")),
replicate(NN,paste0(sample(letters,3,T),collapse=""))))
library(microbenchmark)
Unit: milliseconds
expr min lq mean median uq max neval
steven 26.35880 28.46085 30.84153 29.51562 30.82366 80.18648 50
henrik 16.83619 17.47235 19.14324 18.73855 20.01433 26.63396 50
mikec 25.25390 27.06398 28.69647 28.25848 29.46897 37.15669 50
rawr 110.00786 118.80357 150.76281 128.73180 176.67976 292.00051 50
richard 23.44658 25.49981 27.86844 26.72024 27.62607 78.14996 50
akrun 26.65337 28.12803 35.52941 29.33611 33.54546 83.34182 50
andres 53.05468 59.54172 71.10404 61.85948 71.27818 149.76439 50
*Note: henrik's answer had a typo, which I had wrongly corrected in the earlier benchmark. For akrun I am using the optimized {res<-setDT(melt(m)); res[res[, .I[which.min(value)] ,.(Var1)]$V1]}
Upping the difficulty by setting NN<-2000 (the two slowest removed to save time):
Unit: milliseconds
expr min lq mean median uq max neval
steven 112.80108 114.06360 115.73233 115.29611 116.97757 122.3215 50
henrik 67.16095 70.17341 93.84658 98.30461 99.96561 162.4522 50
mikec 107.81738 110.24776 117.01182 111.64840 114.39962 166.1335 50
richard 101.08277 104.76309 115.75823 105.96692 107.78915 206.8925 50
akrun 101.65822 131.51744 159.14601 165.14284 183.04740 236.5955 50
If you're not convinced, NN<-5000:
Unit: milliseconds
expr min lq mean median uq max neval
henrik 413.3938 422.7162 450.3574 432.1532 465.9551 707.6048 50
mikec 705.4221 725.0111 764.4510 742.2715 801.8704 901.3484 50
richard 695.7005 716.7905 754.1729 732.5105 778.5526 902.7917 50
Answer 2 (score: 2)
Maybe something like
w <- apply(m, 1, which.min)
data.frame(
q = rownames(m),
s = colnames(m)[w],
d = m[cbind(seq_along(w), w)]
)
# q s d
# 1 ada zaa 2
# 2 aca ca 1
# 3 ara zaa 2
# 4 ava zaa 2
Answer 3 (score: 1)
Here's another one. Maybe someone smarter can come up with a one-liner, but this is the best I could do, ma
m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2),.Dim = c(4L, 4L),.Dimnames = list(c("ada", "aca", "ara", "ava"), c("ca", "bsa", "rd", "zaa")))
(m2 <- data.frame(as.table(m * (apply(m, 1, min) == m))))
m2[m2$Freq > 0, ]
# Var1 Var2 Freq
# 2 aca ca 1
# 13 ada zaa 2
# 15 ara zaa 2
# 16 ava zaa 2
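Note that the Freq > 0 filter above relies on every entry of m being strictly positive. For the multiple-minima case from the question, a sketch that does not depend on the sign of the values could use which(..., arr.ind = TRUE) instead. The tie in row "ada" below is added purely for illustration and is not part of the original data; is_min, idx and res are illustrative names.

```r
m <- structure(c(3, 1, 3, 3, 4, 4, 4, 4, 3, 5, 3, 5, 2, 2, 2, 2), .Dim = c(4L, 4L),
               .Dimnames = list(c("ada", "aca", "ara", "ava"), c("ca", "bsa", "rd", "zaa")))
m["ada", "ca"] <- 2   # hypothetical tie: row "ada" now has two minima (ca and zaa)

# Logical matrix marking every cell equal to its row minimum;
# the vector of row minima is recycled down the columns, so it lines up with rows
is_min <- m == apply(m, 1, min)

# (row, col) positions of all minima, sorted by row and then by column
idx <- which(is_min, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

res <- data.frame(q = rownames(m)[idx[, "row"]],
                  s = colnames(m)[idx[, "col"]],
                  d = m[idx])
```

Rows with several minima simply appear several times in res, once per tied column.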