Question

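I want to subset a data.table so that it includes records based on conditions on a date and on two other columns (an id and a type variable). However, if only one record exists for an id, that record should be kept regardless of the date or the values of the other condition columns.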

A sample of my data is below:

dt <- data.table(badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"), location = c("training", "test", "training", "training", "test", "test", "training", "training"), date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15", "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")), score = as.numeric(c(3,5,-1,0,1,3,-2,1)))

> dt
   badge location       date score
1:  1001 training 2014-09-21     3
2:  1001     test 2014-10-01     5
3:  1002 training 2014-09-20    -1
4:  1003 training 2014-09-15     0
5:  1003     test 2014-11-01     1
6:  1003     test 2014-12-10     3
7:  1004 training 2014-09-09    -2
8:  1004 training 2014-09-10     1

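For each badge, I am more interested in the test score than in the training score (row 2). However, if a badge has no test score, I want to keep the training score (row 3). If a badge has multiple test scores, I want the score from the earlier date (row 5). If a badge has multiple training scores but no test score, I want the score from the later date (row 8).

The result should look like this: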

> dt
   badge location       date score
2:  1001     test 2014-10-01     5
3:  1002 training 2014-09-20    -1
5:  1003     test 2014-11-01     1
8:  1004 training 2014-09-10     1

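I have tried different variations of dplyr chains and subsetting. dt <- dt %>% group_by(badge) %>% filter(location=="test") %>% filter(date == min(date)) is the closest I have come: it gives me the earliest test score per badge, but it drops all training records whether or not the badge has a test score. I can see why this code does not work, since the first filter discards every training row, but I don't know how to make it more nuanced so that it produces the result I want.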

Answer 1

I think this is the logic you want:

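library(data.table)
myfunc <- function(x) {
 # order() inside [ ] returns a sorted copy, so the group's .SD is not
 # reordered in place (setorder() would modify it by reference)
 if (!'test' %in% x$location) {
  # no test record: latest date first
  out <- x[order(date, decreasing = TRUE)]
 } else {
  # "test" sorts before "training", then earliest date first
  out <- x[order(location, date)]
 }
 out[1, ]
}

dt[, myfunc(.SD), by = 'badge']
#   badge location       date score
#1:  1001     test 2014-10-01     5
#2:  1002 training 2014-09-20    -1
#3:  1003     test 2014-11-01     1
#4:  1004 training 2014-09-10     1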

I wrote a user-defined function based on your logic (order the data.table and return the first row) and applied it to each badge group.
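
For comparison, here is a minimal sketch of the same per-badge rule without sorting at all, using which.min() and which.max() on the date column. This is a variant, not the code from the answer; the helper name pick_row and the tz = "UTC" argument are assumptions added so the example is self-contained and reproducible.

```r
library(data.table)

dt <- data.table(
  badge    = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date     = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                          "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10"),
                        tz = "UTC"),
  score    = c(3, 5, -1, 0, 1, 3, -2, 1)
)

# Hypothetical helper: return exactly one row for a single badge's records
pick_row <- function(x) {
  if ("test" %in% x$location) {
    # at least one test record: keep the earliest test row
    x[location == "test"][which.min(date)]
  } else {
    # training records only: keep the latest row
    x[which.max(date)]
  }
}

res <- dt[, pick_row(.SD), by = badge]
print(res)
```

Subsetting .SD this way creates copies, so nothing in dt is reordered.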

Answer 2

Here is an alternative solution that orders only once, which avoids re-ordering repeatedly within each group:

library(data.table)
tmp <- dt[order(date), if (any(location == "test")) 
  first(.I[location == "test"]) else last(.I), keyby = badge]
dt[tmp$V1]

   badge location       date score
1:  1001     test 2014-10-01     5
2:  1002 training 2014-09-20    -1
3:  1003     test 2014-11-01     1
4:  1004 training 2014-09-10     1

I introduced tmp only to make the explanation clearer; it is not strictly needed. tmp holds the row indices of the selected records in column V1:

   badge V1
1:  1001  2
2:  1002  3
3:  1003  5
4:  1004  8
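
Since tmp is only there for explanation, the two steps can be folded into one expression. The sketch below assumes the same sample data as in the question (rebuilt here so it runs on its own, with tz = "UTC" added as an assumption for reproducibility) and simply nests the index lookup inside the outer subset.

```r
library(data.table)

dt <- data.table(
  badge    = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date     = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                          "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10"),
                        tz = "UTC"),
  score    = c(3, 5, -1, 0, 1, 3, -2, 1)
)

# Inner call: one row index per badge (earliest test row, else latest row).
# Outer call: use those indices to pull the rows from dt.
res <- dt[dt[order(date),
             if (any(location == "test")) first(.I[location == "test"]) else last(.I),
             keyby = badge]$V1]
print(res)
```

Here .I holds row positions in the original dt, which is why the indices can be used directly in the outer subset.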

Answer 3

Another possible solution with dplyr is to use filter, joins and union_all.

library(data.table)
library(dplyr)

dt <- data.table(
  badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test", "training", "training"),
  date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15", "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")),
  score = as.numeric(c(3,5,-1,0,1,3,-2,1))
)

# Rows for badges that have both "test" and "training" records; the "test" rows are preferred
df_test <- dt %>%
  filter(location == "test") %>%
  inner_join(filter(dt, location == "training"), by = "badge") %>%
  select(badge, location = location.x, date = date.x, score = score.x)

# Rows for badges that have only "training" records
df_training <- dt %>%
  filter(location == "training") %>%
  anti_join(filter(dt, location == "test"), by = "badge")

# Combine both
union_all(df_test, df_training)

# The result will look like:
> union_all(df_test, df_training)
  badge location       date score
1  1001     test 2014-10-01     5
2  1003     test 2014-11-01     1
3  1003     test 2014-12-10     3
4  1002 training 2014-09-20    -1
5  1004 training 2014-09-09    -2
6  1004 training 2014-09-10     1

I am not sure whether the OP wants to keep multiple records for the same badge and location. If those duplicates are not wanted, they can be filtered out with distinct.
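
A rough sketch of how distinct could be used for this, assuming the rule from the question (earliest test row, otherwise latest training row): sort each piece so the wanted row comes first, then keep the first row per badge with distinct(badge, .keep_all = TRUE). The sketch uses a plain data.frame and skips the inner_join step, so it is a simplification rather than the exact code above; the variable names are illustrative.

```r
library(dplyr)

df <- data.frame(
  badge    = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date     = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                          "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10"),
                        tz = "UTC"),
  score    = c(3, 5, -1, 0, 1, 3, -2, 1),
  stringsAsFactors = FALSE
)

# Test rows, earliest date first
test_rows <- df %>%
  filter(location == "test") %>%
  arrange(date)

# Training rows for badges without any test row, latest date first
training_rows <- df %>%
  filter(location == "training") %>%
  anti_join(test_rows, by = "badge") %>%
  arrange(desc(date))

# distinct() keeps the first row it sees for each badge
result <- union_all(test_rows, training_rows) %>%
  distinct(badge, .keep_all = TRUE) %>%
  arrange(badge)
print(result)
```

Because distinct keeps the first occurrence, the sort order before it decides which duplicate survives.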
