In R

Time: 2018-08-07 16:05:15

Tags: r

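I have a data frame with observations recorded over time. Once an ID has a positive value of Match, every later row for that ID must be removed. Here is a sample data frame: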

      Date  ID  Match
2018-06-06  5    1
2018-06-06  6    0
2018-06-07  5    1
2018-06-07  6    0
2018-06-07  7    1
2018-06-08  5    0
2018-06-08  6    1
2018-06-08  7    1
2018-06-08  8    1

Desired output:

      Date  ID  Match
2018-06-06  5    1
2018-06-06  6    0
2018-06-07  6    0
2018-06-07  7    1
2018-06-08  6    1
2018-06-08  8    1

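In other words, because ID = 5 has a positive match on 2018-06-06, the rows for ID = 5 on the following days are removed, while the row holding that ID's first positive match is kept.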

Reproducible example:

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)

Thanks in advance.

5 answers:

Answer 0: (score: 4)

One approach:

library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format

df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]

   ID       Date Match
1:  5 2018-06-06     1
2:  6 2018-06-06     0
3:  6 2018-06-07     0
4:  6 2018-06-08     1
5:  7 2018-06-07     1
6:  8 2018-06-08     1

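We want to drop the rows that come after the first Match == 1.

cumsum takes the cumulative sum of Match, which stays at zero until the first Match == 1. That first matching row must be kept too, so instead of testing the current row's cumsum we use shift to test the previous row's cumsum.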
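To make the shift step concrete, here is a minimal base R sketch for a single ID's Match vector (ID = 5 from the question). The `prev` vector is a hand-rolled stand-in for `shift(cumsum(Match), fill = 0)`, and the variable names are for illustration only.

```r
# Match values for ID = 5, in date order (taken from the question's data)
match_5 <- c(1, 1, 0)

# Running count of positive matches seen so far, including the current row
running <- cumsum(match_5)

# The same count lagged by one row: matches seen *before* the current row.
# This mimics a lag-1 shift with fill = 0.
prev <- c(0, head(running, -1))

# Keep only rows that have no earlier positive match
keep <- prev == 0
print(data.frame(match_5, running, prev, keep))
```

Only the first row is kept for ID = 5, because every later row already has a positive match before it.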

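Answer 1: (score: 4)

Here is another approach: for each ID we find the smallest row number where Match = 1 (i.e. the first row with a positive match) and then filter on it: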

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match))

library(dplyr)

df %>%
  group_by(ID) %>%                                     # for each ID
  mutate(min_row = min(row_number()[Match == 1])) %>%  # get the first row where you have 1
  filter(row_number() <= min_row) %>%                  # keep previous rows and that row
  ungroup() %>%                                        # forget the grouping
  select(-min_row)                                     # remove unnecessary column

# # A tibble: 6 x 3
#   Date       ID    Match
#   <fct>      <fct> <fct>
# 1 2018-06-06 5     1    
# 2 2018-06-06 6     0    
# 3 2018-06-07 6     0    
# 4 2018-06-07 7     1    
# 5 2018-06-08 6     1    
# 6 2018-06-08 8     1  

You can run the code step by step to see how it works. I created the min_row column to make it easier to follow. You can rewrite the above as

df %>%
  group_by(ID) %>%                                    
  filter(row_number() <= min(row_number()[Match == 1])) %>%                
  ungroup()
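One edge case worth noting: for an ID that never has a positive match, `min()` is called on an empty vector, which returns `Inf` with a warning, so all of that ID's rows are still kept. The following is a base R sketch of the same "keep up to the first match" idea that avoids the warning. The helper name `keep_until_first_match` is hypothetical, and the sketch assumes the rows are already in date order within each ID, as they are in the question's data.

```r
df <- data.frame(
  Date  = c("2018-06-06", "2018-06-06", "2018-06-07", "2018-06-07", "2018-06-07",
            "2018-06-08", "2018-06-08", "2018-06-08", "2018-06-08"),
  ID    = c(5, 6, 5, 6, 7, 5, 6, 7, 8),
  Match = c(1, 0, 1, 0, 1, 0, 1, 1, 1)
)

# Hypothetical helper: TRUE up to and including the first positive match
keep_until_first_match <- function(m) {
  first <- which(m == 1)[1]                  # NA when the ID never matches
  if (is.na(first)) rep(TRUE, length(m)) else seq_along(m) <= first
}

# ave() applies the helper within each ID and returns values in row order;
# the logical result is stored as 0/1, so convert it back with as.logical()
keep <- as.logical(ave(df$Match, df$ID, FUN = keep_until_first_match))
result <- df[keep, ]
print(result)
```

On the question's data this keeps the same six rows as the dplyr pipeline above.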

Answer 2: (score: 2)

Inspired by @Frank's answer:

 library(dplyr)
 df %>% group_by(ID) %>% mutate(Flag = cumsum(as.numeric(Match))) %>%
        filter(Match==0 & Flag==0 | Match==1 & Flag==1)

 # A tibble: 6 x 4
 # Groups:   ID [4]
  Date       ID    Match  Flag
  <chr>      <chr> <chr> <dbl>
1 2018-06-06 5     1         1
2 2018-06-06 6     0         0
3 2018-06-07 6     0         0
4 2018-06-07 7     1         1
5 2018-06-08 6     1         1
6 2018-06-08 8     1         1

Data

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match),stringsAsFactors = F)

Answer 3: (score: 2)

Here is yet another approach using dplyr:

library(dplyr)
df %>% 
  group_by(ID) %>% 
  # rank() works on the ISO-formatted character dates directly, so Date need not be coerced to a Date object
  mutate(ord = rank(Date, ties.method = "first"), first_match = min(ord[Match > 0])) %>% 
  filter(ord <= first_match) %>%
  select(Date:Match)
# A tibble: 6 x 3
# Groups:   ID [4]
  Date          ID Match
  <chr>      <dbl> <dbl>
1 2018-06-06     5     1
2 2018-06-06     6     0
3 2018-06-07     6     0
4 2018-06-07     7     1
5 2018-06-08     6     1
6 2018-06-08     8     1

Answer 4: (score: 1)

Here is another dplyr option:

library(dplyr)  
df %>%
  mutate(Date = as.Date(Date)) %>% 
  group_by(ID) %>%
  mutate(first_match = min(Date[Match == 1])) %>% 
  filter((Match == 1 & Date == first_match) | (Match == 0 & Date < first_match)) %>% 
  ungroup() %>% 
  select(-first_match)

# A tibble: 6 x 3
  Date       ID    Match
  <date>     <fct> <fct>
1 2018-06-06 5     1    
2 2018-06-06 6     0    
3 2018-06-07 6     0    
4 2018-06-07 7     1    
5 2018-06-08 6     1    
6 2018-06-08 8     1