Question

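Suppose I have a numeric matrix of scores for a given number of samples: ID1, ID2, and so on. For each sample and each observation there are three scores, labelled A, B and C.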

# Mock data
foo <- data.frame(matrix(rexp(150, rate=.1), ncol=15))
foo <- rbind(c("A","B","C"), foo)
colnames(foo) <- rep(paste("ID", c(1:5), sep=""), rep(3, 5))
foo[1:5,1:6]

               ID1            ID1.1            ID1.2              ID2            ID2.1             ID2.2
1                A                B                C                A                B                 C
2 5.56521375011492 38.8443598688996 8.40060065458429 3.04018635302782 15.7668948165121  33.2867358608131
3 1.15913633337383 1.77594455424696  7.8426102303155 10.2102611125281 1.37656751088798  10.8752515162797
4 19.2305917284431 1.17383518721908 12.1561537865074 13.8317152201654 7.51959749455464  29.5795920474415
5    6.26116017811 1.45891858730465 26.5209942103679 1.67936608195305  18.366959299052 0.121995760127902

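For each observation (row) in the matrix, I need to check the three scores listed for each sample and find the maximum. Once the maximum is found, I write the label of that score (A, B or C) into a new matrix, which has one third as many columns as the original.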

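I'm currently doing this with a nested for-loop, which is obviously very inefficient given the amount of indexing required. Still, here is an outline of my current implementation: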

# Result matrix
res <- matrix(nrow=(nrow(foo) - 1), ncol=(ncol(foo) / 3))

# Iterate over observations
for (i in 2:nrow(foo)) {
    # Count columns in the row to track sample ID
    col = 1
    for (j in 1:ncol(res)) {
        index <- which.max(foo[i,col:(col + 2)])
        if (index == 1) {
            label <- "A"
        } else if (index == 2) {
            label <- "B"
        } else {
            label <- "C"
        }
        # Store labels of the maximum score for that observation and sample
        res[i - 1,j] <- label
        # Move to the next sample ID
        col <- col + 3
    }
}

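So I'm trying to vectorize at least part of this process to improve performance. So far my attempts have revolved around getting the first column index of each sample, so that the total work is divided by 3: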

# Get first index of each sample
ind <- seq(from=1, to=ncol(foo), by=3)
# Get index range of each sample as a list
ind <- lapply(ind, function(x) {
    seq(from=x, to=(x + 2), by=1)
})

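This gives a list of the column indices at which each sample occurs, but I'm not sure how to proceed from here. Any function I write using which.max always ends up iterating over the rows and then over the members of the ind list.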

Any suggestions on how to proceed? Is there a vectorized function I'm overlooking, or would lapply be a better fit than the nested for-loop?

Answer 1

Suggested data structure

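First, the header-plus-sub-header layout isn't ideal for automation and is prone to problems. I'd break it into a list with one element per ID, each a data frame of three runs. (We'll add the ID1 naming convention at the end, if necessary.)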

set.seed(1234)
foo1 <- lapply(1:5,
               function(id) data.frame(matrix(rexp(30, rate=.1), ncol=3)))
head(foo1[[1]], n=3)
##           X1       X2        X3
## 1 25.01758605 18.80077 19.962787
## 2  2.46758883 15.96105  7.283865
## 3  0.06581957 16.58662  3.835416

This makes the *apply family of functions much easier to use. This first batch operates on each list element and reduces it to a single column:

foo2 <- lapply(foo1, function(ff) apply(ff, 1, which.max))
head(foo2, n=2)
## [[1]]
##  [1] 1 2 2 2 2 3 2 3 2 3
## 
## [[2]]
##  [1] 3 2 2 3 3 2 1 1 3 3
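As a possible alternative to the row-wise apply call, base R's max.col finds the column of the maximum for every row in a single call. The sketch below uses a small made-up data frame (the values are illustrative, not from the answer). Note that max.col treats values within a small relative tolerance of the maximum as ties, so it can differ from which.max when two scores are nearly equal.

```r
# Small fixed example: three scores (columns) for four observations (rows).
# The numbers are made up for illustration.
ff <- data.frame(X1 = c(5, 1, 19, 6),
                 X2 = c(38, 2, 1, 1),
                 X3 = c(8, 7, 12, 26))

by_apply  <- apply(ff, 1, which.max)                        # one which.max call per row
by_maxcol <- max.col(as.matrix(ff), ties.method = "first")  # one call for all rows

by_maxcol
# 2 3 1 3
```

ties.method = "first" is chosen here so that ties resolve the same way which.max does, by taking the first maximum.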

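Now it's easy to combine these into a single matrix (the stray init column name in the output is left over from Reduce and can be ignored):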

foo3 <- Reduce(cbind, foo2)
head(foo3, n=3)
##      init        
## [1,]    1 3 1 2 3
## [2,]    2 2 3 2 3
## [3,]    2 2 2 2 2

Finally, let's put it back into letter form (if you must), then add column names (again, if you must):

foo4 <- apply(foo3, c(1,2), function(x) c('A','B','C')[x])
colnames(foo4) <- paste0('ID', seq.int(ncol(foo4)))
head(foo4, n=3)
##      ID1 ID2 ID3 ID4 ID5
## [1,] "A" "C" "A" "B" "C"
## [2,] "B" "B" "C" "B" "C"
## [3,] "B" "B" "B" "B" "B"
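The index-to-letter step does not strictly need an element-by-element apply, because indexing a character vector with a whole matrix of indices is already vectorized. A minimal sketch, using a small hand-written index matrix (idx and lab are illustrative names, not from the answer):

```r
# 3 observations x 2 samples of which.max results (made-up values)
idx <- matrix(c(1, 2, 2,
                3, 2, 2), nrow = 3)
labels <- c("A", "B", "C")

# labels[idx] returns a plain vector in column-major order,
# so matrix() restores the original shape.
lab <- matrix(labels[idx], nrow = nrow(idx),
              dimnames = list(NULL, paste0("ID", seq_len(ncol(idx)))))
lab
```

The result should match what the apply(..., c(1,2), ...) call produces, with the column names set in the same step.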

Your data structure

Assuming we have to work with it, I'd still break it into manageable pieces and carry on with the *apply approach:

# Mock data
set.seed(1234)
foo5 <- data.frame(matrix(rexp(150, rate=.1), ncol=15))
head(foo5[,1:5], n=3)
##            X1       X2        X3          X4        X5
## 1 25.01758605 18.80077 19.962787  4.34543487  1.291397
## 2  2.46758883 15.96105  7.283865  0.09091824 20.895804
## 3  0.06581957 16.58662  3.835416 16.10286033 25.188229

Rather than trying to iterate over everything, how about splitting the data.frame into smaller chunks:

foo6 <- lapply(seq(1, ncol(foo5), by=3),
               function(ii) foo5[,ii:(ii+2)])

...and then use the rest of the code above to finish the job.

foo7 <- Reduce(cbind,
               lapply(foo6, function(ff) apply(ff, 1, which.max)))
foo8 <- apply(foo7, c(1,2), function(x) c('A','B','C')[x])
colnames(foo8) <- paste0('ID', seq.int(ncol(foo8)))
head(foo8, n=3)
##      ID1 ID2 ID3 ID4 ID5
## [1,] "A" "C" "A" "B" "C"
## [2,] "B" "B" "C" "B" "C"
## [3,] "B" "B" "B" "B" "B"
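The chunking, which.max and labelling steps above could be wrapped into one function. The following is only a sketch: label_max is a hypothetical helper name, and it assumes the columns are ordered so that each sample occupies a consecutive block of columns.

```r
# Hypothetical helper, not part of the answer above: returns the label of the
# largest score within each consecutive block of length(labels) columns.
label_max <- function(df, labels = c("A", "B", "C")) {
  k <- length(labels)
  stopifnot(ncol(df) %% k == 0)        # columns must split evenly into blocks
  starts <- seq(1, ncol(df), by = k)   # first column of each sample
  idx <- vapply(starts,
                function(ii) apply(df[, ii:(ii + k - 1), drop = FALSE], 1, which.max),
                integer(nrow(df)))     # one integer per row for every block
  matrix(labels[idx], nrow = nrow(df),
         dimnames = list(NULL, paste0("ID", seq_along(starts))))
}

set.seed(1234)
foo5 <- data.frame(matrix(rexp(150, rate = .1), ncol = 15))
head(label_max(foo5), n = 3)
```

Using vapply instead of Reduce(cbind, lapply(...)) fixes the shape of each block's result up front, so a malformed block fails loudly rather than silently producing a ragged result.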

(Given the way I work through some of these questions, I'd really like it if the site allowed Rmd files, or at least full markdown.)

Answer 2

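First, you shouldn't rbind the c('A', 'B', 'C') labels onto your data frame, because that turns all the numbers in foo into strings rather than numbers! Keep them separate (you never use the first row of foo in your code anyway).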
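The coercion is easy to check on a tiny example. This sketch uses a made-up two-column data frame (scores and with_labels are illustrative names):

```r
# Two numeric score columns (made-up values)
scores <- data.frame(ID1 = c(5.5, 1.2), ID1.1 = c(38.8, 1.8))

# Adding a row of labels forces every column to hold character data
with_labels <- rbind(c("A", "B"), scores)

sapply(scores, class)       # both columns are "numeric"
sapply(with_labels, class)  # both columns are now "character"
```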

I can think of a couple of ways to do this, and I'm sure there are others I haven't thought of.

First, I'll create a matrix like yours, just without the c('A', 'B', 'C') row, so that my numbers really are numbers and not characters.

foo <- data.frame(matrix(rexp(150, rate=.1), ncol=15))
labels <- c('A', 'B', 'C')
colnames(foo) <- make.unique(rep(paste("ID", c(1:5), sep=""), rep(3, 5)))

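The first way I can think of (fairly direct): flatten the data frame into a vector, find the index of the maximum within each run of 3 values, then reshape the result into the res you want.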

foo.flat <- as.vector(t(foo)) # transpose: R stores column-wise and I want row-wise
# split(foo.flat, ceiling(seq_along(foo.flat)/3)) splits into chunks of 3, so:
ms <- vapply(split(foo.flat, ceiling(seq_along(foo.flat)/3)),
             which.max,  # function to apply to each chunk of 3
             integer(1), # template value for vapply (which.max returns an integer)
             USE.NAMES=FALSE)

Now just convert 1 to A, 2 to B and 3 to C, and reshape back into a matrix (res):

res <- matrix(labels[ms], byrow=T, ncol=ncol(foo)/3)
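A related way to express the same "groups of 3" idea, offered here as a sketch rather than as the method above, is to reshape the transposed data into a 3-dimensional array (score x sample x observation) and apply which.max over the last two dimensions. The small matrix m below is made up for illustration.

```r
# 2 observations (rows), 2 samples of 3 scores each (made-up values)
m <- rbind(c(1, 5, 2,  9, 3, 4),
           c(7, 0, 6,  1, 2, 8))
labels <- c("A", "B", "C")
k <- length(labels)

# After t(), each column holds one observation; filling an array in
# column-major order gives arr[score, sample, observation].
arr <- array(t(m), dim = c(k, ncol(m) / k, nrow(m)))

# MARGIN = c(3, 2) yields an observation x sample matrix of indices
idx <- apply(arr, c(3, 2), which.max)
res <- matrix(labels[idx], nrow = nrow(m))
res
```

For this input, res has "B" "A" in the first row and "A" "C" in the second.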

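The second way I can think of: reshape the matrix into long form (reshape2) and use plyr to do the calculation for each (row, ID) pair. (Possibly more elegant, but also more confusing? Up to you.)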

foo$observation <- 1:nrow(foo)
library(reshape2)
foo.long <- melt(foo, id='observation', variable.name='ID')
# fix IDs, i.e. ID1.2 --> ID1 etc
foo.long$ID <- gsub('\\.[1-9]+$', '', foo.long$ID)
# > head(foo.long[order(foo.long$observation, foo.long$ID),])
#    observation  ID     value
# 1            1 ID1 15.751959
# 11           1 ID1 20.386724
# 21           1 ID1  9.423799
# 31           1 ID2  4.560623
# 41           1 ID2  1.140642
# 51           1 ID2 37.009728

observation is just the row each number came from, and ID is the sample ID. Now find the index of the maximum for each (observation, ID) pair.

library(plyr)
intermediate <- ddply(foo.long, .(observation, ID), function (x) which.max(x$value))
# > head(intermediate)
#  observation  ID V1
# 1           1 ID1  2
# 2           1 ID2  3
# 3           1 ID3  3
# 4           1 ID4  2
# 5           1 ID5  3
# 6           2 ID1  1

Now just reshape the V1 column into a matrix (converting the indices to labels):

# foo now also contains the observation column, hence floor()
res <- matrix(labels[intermediate$V1], byrow=T, ncol=floor(ncol(foo)/3))

You could also do something similar with data.table, which might be faster depending on the size of your matrix.
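If adding packages is not desirable, the same group-wise idea can be sketched in base R with tapply. This is not the data.table version mentioned above, just a dependency-free illustration on a small made-up matrix; obs and id are illustrative names for the grouping vectors.

```r
# 2 observations (rows), 2 samples of 3 scores each (made-up values)
m <- rbind(c(1, 5, 2,  9, 3, 4),
           c(7, 0, 6,  1, 2, 8))
labels <- c("A", "B", "C")

obs <- as.vector(row(m))                  # observation (row) of every cell
id  <- as.vector((col(m) - 1) %/% 3 + 1)  # sample number of every cell

# One which.max per (observation, sample) group; within a group the values
# keep their column order, so the index maps directly onto A, B, C.
idx <- tapply(as.vector(m), list(observation = obs, ID = id), which.max)
res <- matrix(labels[idx], nrow = nrow(m))
res
```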

Answer 3

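I think this problem looks hard because your data is in wide format. I would reshape it to long format with reshape2 first; then it doesn't look so hard, and which.max can do the work: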

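foo <- data.frame(matrix(rexp(150, rate=.1), ncol=15))
# No label row is rbind-ed here: it would coerce every score to character,
# and the suffix of each column name (ID1A, ID1B, ...) already carries the label.
colnames(foo) <- paste0("ID", rep(1:5, each=3), rep(LETTERS[1:3], times=5))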

require(reshape2)

#make an id variable
foo$id <- 1:nrow(foo)

foo.melt <- melt(foo, "id")

#take apart ID1A into two separate variables
foo.melt$num <- rep(1:5, each=3)[foo.melt$variable]
foo.melt$rep <- rep(1:3, times=5)[foo.melt$variable]

res <- do.call(rbind, by(foo.melt, interaction(foo.melt$id, foo.melt$num),
       function(x) {
           id <- x[1,"id"]
           num <- x[1,"num"]
           #which.max gets us the index of the max, look it up and get a letter.
           type <- LETTERS[x[which.max(x$value), "rep"]]
           data.frame(id=id, num=num, type=type);
       }
       )
)
dcast(res, id~num)

Which gives us:

R>dcast(res, id~num)
Using type as value column: use value.var to override.
   id 1 2 3 4 5
1   1 A C A A B
2   2 C A B C C
3   3 C B A A B
4   4 B C C A C
5   5 A C B B C
6   6 A B A C B
7   7 B B B A A
8   8 A C A A B
9   9 A B C C B
10 10 A B C A B
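The lookups rep(1:5, each=3)[foo.melt$variable] and rep(1:3, times=5)[foo.melt$variable] in the code above work because indexing with a factor uses the factor's integer codes. The sketch below assumes, as in the melted data, that the factor levels follow the original column order; the factor here is hand-built with two samples only, for illustration.

```r
# Column names in the same ID<number><letter> pattern, two samples only
variable <- factor(c("ID1A", "ID2C", "ID1B", "ID2A"),
                   levels = c("ID1A", "ID1B", "ID1C", "ID2A", "ID2B", "ID2C"))

as.integer(variable)                      # integer codes: 1 6 2 4
num     <- rep(1:2, each = 3)[variable]   # sample number: 1 2 1 2
rep_idx <- rep(1:3, times = 2)[variable]  # position within sample: 1 3 2 1
LETTERS[rep_idx]                          # "A" "C" "B" "A"
```

If the levels were in a different order (for example, sorted alphabetically with ten or more IDs), the lookup vectors would need to be built in that same order.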

Vectorizing a for-loop for matrix subset comparison
