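Grouping events in a column to create start and end dates for each event

Time: 2017-12-18 15:48:53

Tags: r

I'm new to programming and to R, and I'm a bit stuck. I have the following data table.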

Date        |ONIstatus  
01/10/1993  |Average  
01/11/1993  |Average  
01/12/1993  |Average    
01/01/1994  |Average    
01/02/1994  |High    
01/03/1994  |High  
01/04/1994  |High  
01/05/1994  |High  
01/06/1994  |Low  
01/07/1994  |Low 
01/08/1994  |Average  
01/09/1994  |Average  
01/10/1994  |Average    
01/11/1994  |Average    
01/12/1994  |High    
01/01/1995  |High  
01/02/1995  |Low  
01/03/1995  |Low  
01/04/1995  |Low  
01/05/1995  |Low   

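I want to extract start and end dates based on the sequence of events in the 'ONIstatus' column. The start date is the date of the first entry in a run of identical ONIstatus values, and the end date is the date on which the next run begins. For example, the desired output for the first few runs would be: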

Start Date  | End Date   | ONIstatus  
01/10/1993  | 01/02/1994 | Average  
01/02/1994  | 01/06/1994 | High
01/06/1994  | 01/08/1994 | Low  
01/08/1994  | 01/12/1994 | Average
01/12/1994  | 01/02/1995 | High

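And so on. I want to do this across the whole dataset, which has a few hundred entries.

I have been trying to do this with dplyr and rle, but without much luck.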

2 answers:

Answer 0: (score: 0)

We can use the tidyverse.

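library(dplyr)
library(lubridate)

df1 %>%
    mutate(Date = dmy(as.character(Date)),
           ONIstatus = as.character(ONIstatus),
           # run id: increases each time ONIstatus differs from the previous row
           run = cumsum(ONIstatus != lag(ONIstatus, default = first(ONIstatus)))) %>%
    # group by run, not by ONIstatus alone, so repeated statuses stay separate
    group_by(run, ONIstatus) %>%
    summarise(StartDate = min(Date), .groups = "drop") %>%
    # each run ends where the next one starts; the last run has no end and is dropped
    mutate(EndDate = lead(StartDate)) %>%
    na.omit() %>%
    mutate(across(c(StartDate, EndDate), ~ format(.x, "%d/%m/%Y"))) %>%
    select(StartDate, EndDate, ONIstatus)
# A tibble: 5 x 3
#   StartDate    EndDate ONIstatus
#       <chr>      <chr>     <chr>
#1 01/10/1993 01/02/1994   Average
#2 01/02/1994 01/06/1994      High
#3 01/06/1994 01/08/1994       Low
#4 01/08/1994 01/12/1994   Average
#5 01/12/1994 01/02/1995      High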
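
One point worth checking with this kind of pipeline is the grouping key: grouping by ONIstatus alone pools every row that shares a status, even when those rows belong to separate, non-adjacent runs. A minimal base R sketch of a per-run identifier follows; the vector `status` is made-up toy data, not the question's data set.

```r
# Hypothetical toy vector, for illustration only
status <- c("Average", "Average", "High", "High", "Low", "Average")

# Run id: starts at 1 and increases whenever a value differs from the one before it
run_id <- cumsum(c(TRUE, status[-1] != status[-length(status)]))
run_id
# 1 1 2 2 3 4

# Grouping by status gives 3 groups, grouping by run gives 4,
# because "Average" occurs in two separate runs
n_status_groups <- length(unique(status))
n_run_groups    <- length(unique(run_id))
```

Using `run_id` (or an equivalent) as the grouping variable keeps the second "Average" run apart from the first.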

Answer 1: (score: 0)

Hope this helps!

s <- rle(as.character(df$ONIstatus))
df_final <- data.frame(ONIstatus = s$values, length = s$lengths)

#end index
df_final$end <- cumsum(df_final$length)
df_final$desired_end <- df_final$end +1
#start index
df_final$start <- df_final$end - df_final$length + 1

#start_date & end_date calculation based on start & end index
df_final$start_date <- df$Date[df_final$start]
df_final$end_date <- df$Date[df_final$desired_end]

#final output
df_final <- na.omit(df_final[,c('ONIstatus','start_date','end_date')])
df_final

The output is:

  ONIstatus start_date   end_date
1   Average 01/10/1993 01/02/1994
2      High 01/02/1994 01/06/1994
3       Low 01/06/1994 01/08/1994
4   Average 01/08/1994 01/12/1994
5      High 01/12/1994 01/02/1995
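
If the result is needed with real Date columns rather than factors, the same rle idea can be wrapped in a small function. This is only a sketch: the name `run_periods` is hypothetical, it assumes the dates are dd/mm/yyyy strings (or factors) already sorted in time order, and the toy data frame below is made up for illustration.

```r
# Hypothetical helper, not part of the original answer
run_periods <- function(dates, status) {
  dates <- as.Date(as.character(dates), format = "%d/%m/%Y")
  runs  <- rle(as.character(status))
  # index of the first row of each run
  start <- cumsum(runs$lengths) - runs$lengths + 1
  out <- data.frame(
    ONIstatus  = runs$values,
    start_date = dates[start],
    # a run ends where the next run starts; the last run has no end date
    end_date   = c(dates[start[-1]], NA),
    stringsAsFactors = FALSE
  )
  # drop the final, open-ended run (same effect as na.omit in the code above)
  out[!is.na(out$end_date), ]
}

# Made-up toy input
toy <- data.frame(
  Date      = c("01/10/1993", "01/11/1993", "01/12/1993", "01/01/1994"),
  ONIstatus = c("Average", "Average", "High", "Low")
)
res <- run_periods(toy$Date, toy$ONIstatus)
res
```

Keeping the columns as Date objects makes later steps such as computing run durations with `res$end_date - res$start_date` straightforward; `format()` can be applied at the end if dd/mm/yyyy strings are required.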


#sample data
> dput(df)
structure(list(Date = structure(c(15L, 17L, 19L, 1L, 3L, 5L, 
7L, 9L, 11L, 12L, 13L, 14L, 16L, 18L, 20L, 2L, 4L, 6L, 8L, 10L
), .Label = c("01/01/1994", "01/01/1995", "01/02/1994", "01/02/1995", 
"01/03/1994", "01/03/1995", "01/04/1994", "01/04/1995", "01/05/1994", 
"01/05/1995", "01/06/1994", "01/07/1994", "01/08/1994", "01/09/1994", 
"01/10/1993", "01/10/1994", "01/11/1993", "01/11/1994", "01/12/1993", 
"01/12/1994"), class = "factor"), ONIstatus = structure(c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L), .Label = c("Average", "High", "Low"), class = "factor")), .Names = c("Date", 
"ONIstatus"), class = "data.frame", row.names = c(NA, -20L))