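I want to subset a data.table so that records are kept based on the date and on the values of two other columns (an id and a type variable). However, if only one record exists for an id, that record should be kept regardless of the date or the values in the other columns.
My sample data looks like this: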
dt <- data.table(badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"), location = c("training", "test", "training", "training", "test", "test", "training", "training"), date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15", "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")), score = as.numeric(c(3,5,-1,0,1,3,-2,1)))
> dt
badge location date score
1: 1001 training 2014-09-21 3
2: 1001 test 2014-10-01 5
3: 1002 training 2014-09-20 -1
4: 1003 training 2014-09-15 0
5: 1003 test 2014-11-01 1
6: 1003 test 2014-12-10 3
7: 1004 training 2014-09-09 -2
8: 1004 training 2014-09-10 1
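For each badge, I prefer the test score over the training score (row 2). However, if a badge has no test score, I want to keep the training score (row 3). If a badge has several test scores, I want the one with the earliest date (row 5). If a badge has several training scores but no test score, I want the one with the latest date (row 8).
The result should look like this: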
> dt
badge location date score
2: 1001 test 2014-10-01 5
3: 1002 training 2014-09-20 -1
5: 1003 test 2014-11-01 1
8: 1004 training 2014-09-10 1
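I have tried different variations of dplyr chains and subsetting. This is the closest I have come:
dt <- dt %>% group_by(badge) %>% filter(location=="test") %>% filter(date == min(date))
It gives me the earliest test score per badge, but it drops all training records, whether or not the badge has a test score. I can see why this code does not work, since the first filter keeps only test rows, but I don't know how to make it more nuanced so that it produces the result I want.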
Answer 0 (score: 3)
I think this is the logic you want:
library(data.table)
myfunc <- function(x) {
if (!'test' %in% x$location) {
out <- setorder(x, -date)
} else {
out <- setorder(x, location, date)
}
out[1, ]
}
dt[, myfunc(.SD), by = 'badge']
# badge location date score
#1: 1003 test 2014-11-01 1
#2: 1001 test 2014-10-01 5
#3: 1002 training 2014-09-20 -1
#4: 1004 training 2014-09-10 1
I created a user-defined function based on your logic (order the data.table and return the first row) and applied it to each badge group.
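The per-group logic can also be sketched without reordering `.SD` by reference. This is only an illustrative variant, not the answerer's code: the helper name `pick_row` is hypothetical, and it selects rows with `which.min()` / `which.max()` on the date instead of sorting.

```r
library(data.table)

dt <- data.table(
  badge    = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date     = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                          "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")),
  score    = as.numeric(c(3, 5, -1, 0, 1, 3, -2, 1))
)

# Hypothetical helper: returns one row per group.
# - If the group has test rows, take the earliest test row.
# - Otherwise take the latest row (all rows are training rows in that case).
pick_row <- function(x) {
  tests <- x[location == "test"]
  if (nrow(tests) > 0L) tests[which.min(date)] else x[which.max(date)]
}

res <- dt[, pick_row(.SD), by = badge]
res
```

Because nothing is modified by reference, the groups come back in their order of first appearance in `dt`.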
Answer 1 (score: 2)
Here is an alternative solution that orders only once, which avoids re-ordering repeatedly inside each group:
library(data.table)
tmp <- dt[order(date), if (any(location == "test"))
first(.I[location == "test"]) else last(.I), keyby = badge]
dt[tmp$V1]
badge location date score
1: 1001 test 2014-10-01 5
2: 1002 training 2014-09-20 -1
3: 1003 test 2014-11-01 1
4: 1004 training 2014-09-10 1
I introduced tmp to make the explanation clearer, although it is not strictly needed. tmp holds the indices of the selected records in column V1:
badge V1
1: 1001 2
2: 1002 3
3: 1003 5
4: 1004 8
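To illustrate that the intermediate table is optional, here is a rough sketch of the same index-based idea written as a single expression. It is a variant rather than the code above: it skips the up-front ordering and instead picks the row index per badge with `which.min()` / `which.max()`, and the name `res` is only for illustration.

```r
library(data.table)

dt <- data.table(
  badge    = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date     = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                          "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")),
  score    = as.numeric(c(3, 5, -1, 0, 1, 3, -2, 1))
)

# The inner call returns one row index (column V1) per badge;
# the outer call uses those indices to subset dt.
res <- dt[dt[, {
  is_test <- location == "test"
  if (any(is_test)) {
    # earliest test row of this badge
    .I[is_test][which.min(date[is_test])]
  } else {
    # no test rows: latest (training) row of this badge
    .I[which.max(date)]
  }
}, by = badge]$V1]
res
```

Here `.I` holds the row numbers of each group's rows in `dt`, so the selected indices can be used directly to subset the original table.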
Answer 2 (score: 1)
Another possible solution with dplyr is to use filter, join, and union_all.
library(data.table)
library(dplyr)
dt <- data.table(badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
location = c("training", "test", "training", "training", "test", "test", "training", "training"),
date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15", "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")),
score = as.numeric(c(3,5,-1,0,1,3,-2,1)))
# Rows with badge having both "test" and "training". Data with "test" is preferred
df_test <- dt %>% filter(location == "test") %>%
inner_join(filter(dt, location == "training"), by="badge") %>%
select(badge, location = location.x, date = date.x, score = score.x)
# Data for badge with only "training" records
df_training <- dt %>% filter(location == "training") %>%
anti_join(filter(dt, location == "test"), by="badge")
# combine both
union_all(df_test, df_training)
# The result will look like:
> union_all(df_test, df_training)
badge location date score
1 1001 test 2014-10-01 5
2 1003 test 2014-11-01 1
3 1003 test 2014-12-10 3
4 1002 training 2014-09-20 -1
5 1004 training 2014-09-09 -2
6 1004 training 2014-09-10 1
I am not sure whether the OP wants to keep duplicate records for the same location. If duplicate records are not wanted, they can be filtered out with distinct.
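Note that distinct() without arguments only drops rows that are identical in every column, so the two test rows for badge 1003 (which differ in date and score) would both remain. If one row per badge is wanted, as in the question, one possible approach is a grouped filter. The following is a sketch under that assumption, not part of the answer above, and the name `result` is only for illustration:

```r
library(data.table)
library(dplyr)

dt <- data.table(
  badge    = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date     = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                          "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")),
  score    = as.numeric(c(3, 5, -1, 0, 1, 3, -2, 1))
)

result <- dt %>%
  as_tibble() %>%
  group_by(badge) %>%
  # The condition is evaluated once per badge:
  # with test rows, keep the earliest test row; otherwise keep the latest row.
  filter(
    if (any(location == "test")) {
      location == "test" & date == min(date[location == "test"])
    } else {
      date == max(date)
    }
  ) %>%
  ungroup()
result
```

If two rows of a badge share the selected date, both would be kept; that case does not occur in the sample data.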