Question

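I am using R to analyse blood results for a group of patients. The electronic health record exports every value for every patient, along with the date each one was recorded.

I would like to select only the most recent value for each patient. I have already cleaned the data with dplyr, so I'd be very grateful if anyone knows a way to do this with dplyr.

At the moment the data looks like this: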

date, patient_id, value
13-01-2012, 345678,  13.2
23-06-2013, 345678,  10.3
12-02-2014, 345678,  9.6
1-03-2010, 789012,  22.3
28-02-2011, 789012,  10.3
6-04-2012, 789012,  8.2

What I would like to select is:

date, patient_id, value
12-02-2014, 345678,  9.6
6-04-2012, 789012,  8.2

Answer 1

As @Gregor says, this is easy if your date variable is actually a Date-class object.

x <- read.csv(text="
date, patient_id, value
13-01-2012, 345678, 13.2
23-06-2013, 345678, 10.3
12-02-2014, 345678, 9.6
1-03-2010, 789012, 22.3
28-02-2011, 789012, 10.3
6-04-2012, 789012, 8.2",
colClasses=c("character","character","numeric"))

library("dplyr")
x %>% 
   ## convert to date
   mutate(date=as.Date(date,format="%d-%m-%Y")) %>%
   ## group by patient and take only most recent
   group_by(patient_id) %>% filter(date==max(date))

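@Gregor pointed out (in an answer he has since deleted) that

   arrange(desc(date)) %>% slice(1)

could be used in place of filter(date==max(date)) (I'm not sure whether there is a significant difference in efficiency).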
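
Apart from speed, the two approaches can differ when a patient has more than one row on their latest date. The sketch below is not from the original answers. It reuses the sample data and adds one hypothetical duplicate-date row (patient 345678 measured twice on 12-02-2014) to show the difference. It also assumes dplyr 1.0.0 or later for slice_max(), and adds strip.white=TRUE purely to tidy the character columns.

```r
library(dplyr)

# Sample data from the question, plus ONE HYPOTHETICAL extra row
# (345678 on 12-02-2014, value 9.9) to create a tie on the latest date.
x <- read.csv(text = "
date, patient_id, value
13-01-2012, 345678, 13.2
23-06-2013, 345678, 10.3
12-02-2014, 345678, 9.6
12-02-2014, 345678, 9.9
1-03-2010, 789012, 22.3
28-02-2011, 789012, 10.3
6-04-2012, 789012, 8.2",
  colClasses = c("character", "character", "numeric"),
  strip.white = TRUE)

# Convert to Date and group once, then compare the selection strategies
by_patient <- x %>%
  mutate(date = as.Date(date, format = "%d-%m-%Y")) %>%
  group_by(patient_id)

# filter() keeps every row whose date equals the group maximum (ties kept)
latest_with_ties <- by_patient %>%
  filter(date == max(date)) %>%
  ungroup()

# arrange() + slice(1) keeps exactly one row per patient
latest_one <- by_patient %>%
  arrange(desc(date)) %>%
  slice(1) %>%
  ungroup()

# slice_max() (assumed: dplyr >= 1.0.0) states the tie handling explicitly
latest_slice_max <- by_patient %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

With this data, the filter() version returns both 12-02-2014 rows for patient 345678, while the other two return a single row per patient. Which behaviour is wanted depends on whether duplicate dates are meaningful in your records.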
