Question

First, here is the sample data I am working with:

ID BaselineScore MidScore FinalScore
1  x             NA       NA 
1  NA            y        NA
1  NA            NA       z 
2  a             NA       NA 
2  NA            b        NA
2  NA            NA       c

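What I want to do is, for a given ID (ID == 1, ID == 2, etc.), determine which of the three scores (baseline, mid, or final) is the largest (i.e. max(x, y, z), max(a, b, c), and so on). The NAs are there because I used the spread function from tidyr (the score at each time point was originally a row under a more general score variable).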

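I tried the base R pmax function, but it only works when the values are aligned "horizontally" across the columns, i.e. when they sit in the same row.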
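
A minimal sketch of that pmax behaviour, assuming made-up numeric values in place of x, y, z, a, b, c (the numbers and the name row_max are illustrative only):

```r
# Numeric stand-in for the sample data (values are made up for illustration).
df <- data.frame(
  ID            = c(1, 1, 1, 2, 2, 2),
  BaselineScore = c(1, NA, NA, 7, NA, NA),
  MidScore      = c(NA, 2, NA, NA, 6, NA),
  FinalScore    = c(NA, NA, 3, NA, NA, 5)
)

# pmax() compares values at the same position, so it yields one maximum
# per row, not one per ID.
row_max <- pmax(df$BaselineScore, df$MidScore, df$FinalScore, na.rm = TRUE)
print(row_max)

# Without na.rm = TRUE, every row that contains an NA yields NA.
print(pmax(df$BaselineScore, df$MidScore, df$FinalScore))
```

Because each row holds only one non-NA score, the row-wise maximum is just that score, so a per-ID step is still needed afterwards.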

Any tips?

Thanks,

Answer 1

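Here is a base R solution that uses apply and max and then finds the index of the maximum. A dplyr version is included for a timing comparison.

library(dplyr)           # %>%, group_by, filter
library(tidyr)           # gather
library(microbenchmark)  # microbenchmark

df <- read.csv(text="ID,BaselineScore,MidScore,FinalScore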
1,1,NA,NA
1,NA,2,NA
1,NA,NA,3
2,7,NA,NA
2,NA,6,NA
2,NA,NA,5")

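# Base R: for each ID, take the per-column maximum (ignoring NAs),
# then keep the largest one; its name identifies the time point.
fun_base <- function() {
    lapply(split(df, df$ID), function(x) {
        tmp <- apply(x[-1], 2, max, na.rm=TRUE)
        tmp[which.max(tmp)]
    })
}

# dplyr/tidyr: reshape to long format, then keep the row(s) with the
# highest score within each ID.
fun_dplyr <- function() {
    df %>% 
        gather(Score_type, Score, -ID) %>% 
        group_by(ID) %>% 
        filter(Score==max(Score, na.rm=TRUE))
}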

microbenchmark(
    fun_base(),
    fun_dplyr(),
    times=50L)

#Unit: microseconds
#        expr    min     lq     mean  median     uq    max neval
#  fun_base()  590.6  666.6  728.842  709.85  789.1 1060.1    50
# fun_dplyr() 2110.3 2318.3 2533.324 2442.75 2639.5 3663.4    50
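
For reference, here is a hedged sketch of what the base R approach returns and one way to flatten it into a data frame. It uses the same made-up numeric scores as the answer; the names best and result are hypothetical and chosen only for this illustration.

```r
# Sample data with made-up numeric scores, as in the answer.
df <- read.csv(text = "ID,BaselineScore,MidScore,FinalScore
1,1,NA,NA
1,NA,2,NA
1,NA,NA,3
2,7,NA,NA
2,NA,6,NA
2,NA,NA,5")

# Same logic as fun_base(): per ID, the largest of the per-column maxima.
# Each list element is a length-one vector named after the winning column.
best <- lapply(split(df, df$ID), function(x) {
  col_max <- apply(x[-1], 2, max, na.rm = TRUE)
  col_max[which.max(col_max)]
})

# Flatten the list into one row per ID.
result <- data.frame(
  ID         = names(best),
  Score_type = unname(vapply(best, names, character(1))),
  Score      = unname(vapply(best, as.numeric, numeric(1)))
)
print(result)
```

With this data the winning column is FinalScore for ID 1 and BaselineScore for ID 2. Note that which.max keeps only the first maximum, so ties between time points are not reported.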

Answer 2

We can coalesce the columns into one and then get the max by 'ID'

library(tidyverse)
df %>%
   transmute(ID, newCol = coalesce(BaselineScore, MidScore, FinalScore)) %>% 
   group_by(ID) %>%
   summarise(newCol = max(newCol))
# A tibble: 2 × 2
#      ID newCol
#   <int>  <chr>
#1     1      z
#2     2      c
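
To make the coalesce step concrete, here is a rough base R sketch of the same idea. coalesce_base is a hypothetical stand-in written for this example, not dplyr's implementation, and the numeric scores are made up.

```r
# Made-up numeric scores in the same shape as the sample data.
df <- data.frame(
  ID            = c(1, 1, 1, 2, 2, 2),
  BaselineScore = c(1, NA, NA, 7, NA, NA),
  MidScore      = c(NA, 2, NA, NA, 6, NA),
  FinalScore    = c(NA, NA, 3, NA, NA, 5)
)

# Hypothetical helper: at each position, keep the first non-NA value
# found when scanning the vectors from left to right.
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

# Each row has exactly one non-NA score, so this collapses the three
# columns into a single score column without losing anything.
collapsed <- coalesce_base(df$BaselineScore, df$MidScore, df$FinalScore)
print(collapsed)

# Maximum of the collapsed scores within each ID.
by_id <- tapply(collapsed, df$ID, max)
print(by_id)
```

This gives the largest value per ID but, like the pipeline above, it does not say which time point that value came from.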

Another option is to use pmax and max

df %>% 
 transmute(ID, newCol = pmax(BaselineScore, MidScore, FinalScore, na.rm =TRUE)) %>% 
 group_by(ID) %>% 
 summarise(newCol = max(newCol))

Finding the maximum score for a given subject across multiple time points
