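I have a data frame with observations recorded over time. Once an ID has a positive Match value, all later rows with that ID must be removed. Here is an example data frame: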
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 5 1
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 5 0
2018-06-08 6 1
2018-06-08 7 1
2018-06-08 8 1
Desired output:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 6 1
2018-06-08 8 1
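In other words, since ID = 5 has a positive match on 2018-06-06, the rows with ID = 5 on the following days are removed, but the row with the first positive match for this ID is kept.
Reproducible example: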
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)
Thanks in advance.
Answer 0 (score: 4)
One approach:
library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format
df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]
ID Date Match
1: 5 2018-06-06 1
2: 6 2018-06-06 0
3: 6 2018-06-07 0
4: 6 2018-06-08 1
5: 7 2018-06-07 1
6: 8 2018-06-08 1
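We want to drop the rows that come after the first Match == 1. cumsum takes the cumulative sum of Match, which is zero until the first Match == 1. We want to keep that first matching row as well, so we check cumsum on the previous row using shift.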
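To make the lagged cumulative sum concrete, here is a minimal base R sketch on the Match values of a single ID. It assumes Match only takes the values 0 and 1, and it uses c(0, head(x, -1)) as a stand-in for data.table's shift(x, fill = 0); the variable names are illustrative only.

```r
# Match values for ID 5 in date order (taken from the example data).
m <- c(1, 1, 0)

running <- cumsum(m)                # matches seen so far, current row included
before  <- c(0, head(running, -1))  # matches seen strictly before each row
keep    <- before == 0              # keep a row only if no earlier row matched

print(data.frame(Match = m, running, before, keep))
```

Only the first row has no earlier match, so it is the only row kept for this ID. Checking the current row's cumsum instead of the previous row's would also drop the first matching row.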
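Answer 1 (score: 4)
Here is another approach: for each ID we find the smallest row number where Match = 1 (i.e. the first row with a positive match) and then filter on it: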
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match))
library(dplyr)
df %>%
  group_by(ID) %>%                                        # for each ID
  mutate(min_row = min(row_number()[Match == 1])) %>%     # get the first row where you have 1
  filter(row_number() <= min_row) %>%                     # keep previous rows and that row
  ungroup() %>%                                           # forget the grouping
  select(-min_row)                                        # remove unnecessary column
# # A tibble: 6 x 3
# Date ID Match
# <fct> <fct> <fct>
# 1 2018-06-06 5 1
# 2 2018-06-06 6 0
# 3 2018-06-07 6 0
# 4 2018-06-07 7 1
# 5 2018-06-08 6 1
# 6 2018-06-08 8 1
You can run the code step by step to see how it works. I created the min_row column to help you follow along. You can rewrite the above as:
df %>%
  group_by(ID) %>%
  filter(row_number() <= min(row_number()[Match == 1])) %>%
  ungroup()
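The same "first matching row per ID" idea can be sketched in base R. This is only an illustration under stated assumptions: rows are already in date order within each ID, and the helper keep_through_first_match is a hypothetical name, not part of any package. It also makes one edge case explicit that the sample data does not cover: for an ID that never matches, min() over an empty selection returns Inf with a warning, whereas here all rows of such an ID are kept deliberately.

```r
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07",
          "2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date, ID, Match)

# Hypothetical helper: TRUE for rows up to and including the first Match == 1.
keep_through_first_match <- function(m) {
  first <- which(m == 1)[1]                        # NA when the ID never matches
  if (is.na(first)) return(rep(TRUE, length(m)))   # no match: keep every row
  seq_along(m) <= first
}

# ave() applies the helper within each ID and returns 0/1 in the original row order.
keep <- as.logical(ave(df$Match, df$ID, FUN = keep_through_first_match))
result <- df[keep, ]
print(result)
```

The rows that remain are rows 1, 2, 4, 5, 7 and 9 of df, which is the desired output from the question.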
Answer 2 (score: 2)
Inspired by @Frank's answer:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Flag = cumsum(as.numeric(Match))) %>%
  filter(Match==0 & Flag==0 | Match==1 & Flag==1)
# A tibble: 6 x 4
# Groups: ID [4]
Date ID Match Flag
<chr> <chr> <chr> <dbl>
1 2018-06-06 5 1 1
2 2018-06-06 6 0 0
3 2018-06-07 6 0 0
4 2018-06-07 7 1 1
5 2018-06-08 6 1 1
6 2018-06-08 8 1 1
Data:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match),stringsAsFactors = F)
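To see why the Flag condition keeps the right rows, here is a small base R sketch. It uses ave() in place of group_by() and mutate(), which is a substitution made for illustration, and it assumes Match only takes the values 0 and 1. Under that assumption, "Match and Flag are both 0, or both 1" selects the same rows as "no match on any earlier row of this ID".

```r
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)

# Per-ID running count of matches, current row included (the Flag column).
flag <- ave(Match, ID, FUN = cumsum)
keep_flag <- (Match == 0 & flag == 0) | (Match == 1 & flag == 1)

# Per-ID count of matches on earlier rows only (the lagged cumulative sum).
lagged <- ave(Match, ID, FUN = function(m) c(0, head(cumsum(m), -1)))
keep_lag <- lagged == 0

print(data.frame(ID, Match, flag, lagged, keep_flag, keep_lag))
```

Both conditions agree on every row of the example data.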
Answer 3 (score: 2)
Here is another approach using dplyr:
library(dplyr)
df %>%
  group_by(ID) %>%
  # rank() works on ISO-formatted date strings, so Date need not be coerced to a Date object.
  # rank() (rather than order()) gives each row's position by date, so unsorted rows are handled too.
  mutate(ord = rank(Date, ties.method = "first"), first_match = min(ord[Match > 0])) %>%
  filter(ord <= first_match) %>%
  select(Date:Match)
# A tibble: 6 x 3
# Groups: ID [4]
Date ID Match
<chr> <dbl> <dbl>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1
Answer 4 (score: 1)
Here is another dplyr option:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(ID) %>%
  mutate(first_match = min(Date[Match == 1])) %>%
  filter((Match == 1 & Date == first_match) | (Match == 0 & Date < first_match)) %>%
  ungroup() %>%
  select(-first_match)
# A tibble: 6 x 3
Date ID Match
<date> <fct> <fct>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1