Ciao
这是我的复制示例。
HAVE <- data.frame(ID=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
ABSENCE=c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
TIME=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
WANT <- data.frame(ID=c(1,2,3,4,5,6),
ABSENCE=c(NA,0,1,0,1,0),
TIME=c(NA,3,3,2,2,3))
高数据文件HAVE是我需要转换为WANT的文件。因此,基本上,对于每个ID,我需要标识第一个非零值,并且该值将进入数据文件WANT。如果所有缺勤值为NA,则TIME为NA。如果ABSENCE的所有值均为0,则我报告WANT中最后一个可能的行(反映在TIME变量中)
这是我的尝试:
WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))
但是如果只有0,我不知道如何取0行中的最后一行。
答案 0 :(得分:4)
library(data.table)
setDT(HAVE)
res = unique(HAVE[, .(ID)])
# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]
# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]
# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE
ID ABSENCE TIME
1: 1 NA NA
2: 2 0 3
3: 3 1 3
4: 4 0 2
5: 5 1 2
6: 6 0 3
我使用的是data.table,因为tidyverse不and never will不支持子分配/仅修改条件选择的行(如此处的is.na(ABSENCE)
)。
如果可以使两个规则彼此更一致,则应该在左连接或OP尝试的单个group_by + slice中实现。好的,这是一种方法,尽管看起来无法调试:
HAVE %>%
arrange(ID, -(ABSENCE > 0), TIME*(ABSENCE > 0), -TIME) %>%
distinct(ID, .keep_all = TRUE)
ID ABSENCE TIME
1 1 NA 3
2 2 0 3
3 3 1 3
4 4 0 2
5 5 1 2
6 6 0 3
答案 1 :(得分:3)
基于子集.RLock()/Unlock()
行计数器,也使用data.table
:
.I
答案 2 :(得分:2)
这里有两种使用dplyr
和自定义函数的方法。两者都依赖于按TIME
排序的数据。
# We'll use this function inside filter() to keep only the desired rows
flag_wanted <- function(absence){
flags <- rep(FALSE, length(absence))
if (any(absence > 0, na.rm = TRUE)) {
# There's a nonzero value somewhere in x; we want the first one.
flags[which.max(absence > 0)] <- TRUE
} else if (any(absence == 0, na.rm = TRUE)) {
# There's a zero value somewhere in x; we want the last one.
flags[max(which(absence == 0))] <- TRUE
} else {
# All values are NA; we want the last row
flags[length(absence)] <- TRUE
}
return(flags)
}
# After filtering, we have to flip TIME to NA if ABSENCE is NA
HAVE %>%
arrange(ID, TIME) %>%
group_by(ID) %>%
filter(flag_wanted(ABSENCE)) %>%
mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
ungroup()
# A tibble: 6 x 3
ID ABSENCE TIME
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 0. 3.
3 3. 1. 3.
4 4. 0. 2.
5 5. 1. 2.
6 6. 0. 3.
filter()
步骤将数据框缩小为所需的行。由于它不会修改TIME值,因此我们也需要mutate()
。
# This function captures the general logic of getting the value of one variable
# based on the value of another
get_wanted <- function(of_this, by_this){
# If there are any positive values of `by_this`, use the first
if (any(by_this > 0, na.rm = TRUE)) {
return( of_this[ which.max(by_this > 0) ] )
}
# If there are any zero values of `by_this`, use the last
if (any(by_this == 0, na.rm = TRUE)) {
return( of_this[ max(which(by_this == 0)) ] )
}
# Otherwise, use NA
return(NA)
}
HAVE %>%
arrange(ID, TIME) %>%
group_by(ID) %>%
summarize(TIME = get_first_nz(of_this = TIME, by_this = ABSENCE),
ABSENCE = get_first_nz(of_this = ABSENCE, by_this = ABSENCE))
# A tibble: 6 x 3
ID TIME ABSENCE
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 3. 0.
3 3. 3. 1.
4 4. 2. 0.
5 5. 2. 1.
6 6. 3. 0.
汇总的顺序很重要,因为我们要覆盖变量,因此这种方法存在风险。如果先总结WANT
,然后总结TIME
,它只会产生输出ABSENCE
。