Identify the start date, end date, and run length of consecutive values, and convert to a new data frame

Time: 2016-02-02 19:31:35

Tags: r date

I have a set of data that looks like this:

          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1

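From this, I am trying to create a new data frame that shows the dates of my runs of 1s and how long each run lasts, with the column headers: 1) Start Date, 2) End Date, 3) Length of Run.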

Ultimately, I would like to build a data frame from the data above that looks like this:

  Start Date   End Date  Length of Run
1 2006-06-01 2006-06-03              3
2 2006-06-06 2006-06-07              2  

There are some NAs in my data, which I need to ignore throughout.

3 Answers:

Answer 0: (score: 4)

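You can do this with dplyr: use mutate to convert missing boolean values to 0, group_by to build groups over which boolean stays constant, filter to keep only the groups where boolean is 1 and that have more than one member, and then summarize to get the relevant summary information. (I took a few extra steps to drop the grouping variable at the end.)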

library(dplyr)
dat %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date"=min(as.character(Date)),
            "End Date"=max(as.character(Date)),
            "Length of Run"=n()) %>%
  ungroup() %>%
  select(-matches("group"))
#   Start Date   End Date Length of Run
#        (chr)      (chr)         (int)
# 1 2006-06-01 2006-06-03             3
# 2 2006-06-06 2006-06-07             2

Data:

dat <- read.table(text="          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1", header=TRUE)
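
To see why cumsum(c(0, diff(boolean) != 0)) gives every constant stretch its own label, here is a minimal base R sketch on a toy vector. The vector b and the names changed and group are made up for illustration, and it assumes the NA has already been recoded to 0, as the mutate step above does.

```r
# Toy stand-in for the boolean column, with NA already replaced by 0
b <- c(1, 1, 1, 0, 0, 1, 1, 0, 1)

# TRUE wherever an element differs from the one before it
changed <- diff(b) != 0

# Start at 0 and add 1 at every change, so each constant stretch shares one id
group <- cumsum(c(0, changed))

print(group)
# [1] 0 0 0 1 1 2 2 3 4
```

Grouping by this id and keeping the groups where the value is 1 and the size is above 1 leaves ids 0 and 2, which are the two runs in the expected output.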

Answer 1: (score: 2)

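We can also use data.table to subset and transform the data as needed. First, we create an id column using rleid(boolean). Next, we subset the data based on the required conditions. Finally, we create start, end, and run from the subsetted data.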

library(data.table)
setDT(dat)[,id := rleid(boolean)][
  ,.SD[.N > 1 & boolean == 1],id][
  ,.(start=Date[1],end=Date[.N], run=.N),id]
#   id      start        end run
#1:  1 2006-06-01 2006-06-03   3
#2:  4 2006-06-06 2006-06-07   2
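
For readers without data.table, roughly the same idea can be sketched with base R's rle(). This is a different technique from the rleid() call above, not a drop-in replacement, and the names dates, b, r, keep and runs are assumptions made for this example.

```r
# Toy copy of the question's columns (hypothetical names)
dates <- as.Date("2006-06-01") + 0:8
b     <- c(1, 1, 1, NA, 0, 1, 1, 0, 1)

# rle() encodes the vector as run lengths and run values;
# an NA always starts a new run of its own
r <- rle(b)

# Index of the last and first element of every run
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep runs of 1s longer than one element; NA runs are dropped explicitly
keep <- !is.na(r$values) & r$values == 1 & r$lengths > 1

runs <- data.frame(start = dates[starts[keep]],
                   end   = dates[ends[keep]],
                   run   = r$lengths[keep])
print(runs)
```

On this toy input the result has two rows, 2006-06-01 to 2006-06-03 with run 3 and 2006-06-06 to 2006-06-07 with run 2, matching the data.table output above.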

Answer 2: (score: 1)

Another answer using base R, adapting this answer to use cumsum and diff:

#x is assumed to be the question's data frame (columns Date and boolean)
#Remove ineligible dates (defined by 0 or NA)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]

x1$Date <- as.Date(x1$Date)  #Convert date from factor to Date class

#Starting at 0, if the difference between eligible dates is >1 day, 
#   add 1 (TRUE) to the previous value, else add 0 (FALSE) to previous value
#This consecutively numbers each series
x1$SeriesNo <-  cumsum(c(0, diff(x1$Date) > 1))

#          Date boolean SeriesNo
#407 2006-06-01       1        0
#408 2006-06-02       1        0
#409 2006-06-03       1        0
#412 2006-06-06       1        1
#413 2006-06-07       1        1
#415 2006-06-09       1        2

# Aggregate: Perform the function FUN on variable Date by each SeriesNo group
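# as.numeric() keeps Start and End as plain day counts, so combining them
#   with the integer Run does not depend on how c() treats Date objects
x2 <-  as.data.frame(as.list(
         aggregate(Date ~ SeriesNo, data= x1, FUN=function(zz) 
         c(Start = as.numeric(min(zz)), End= as.numeric(max(zz)), Run = length(zz) ))
       )) #see note after this code block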

#Output is in days since origin.  Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End   <- as.Date(x2$Date.End,   origin = "1970-01-01")

#  SeriesNo Date.Start   Date.End Date.Run
#1        0 2006-06-01 2006-06-03        3
#2        1 2006-06-06 2006-06-07        2
#3        2 2006-06-09 2006-06-09        1
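
The table above still contains the single-day run on 2006-06-09, which the question's expected output leaves out. A minimal sketch of the remaining step follows, assuming x2 has the columns shown above; here x2 is rebuilt by hand so the snippet runs on its own, and result is a name made up for this example.

```r
# Hand-built stand-in for the x2 shown above
x2 <- data.frame(
  SeriesNo   = 0:2,
  Date.Start = as.Date(c("2006-06-01", "2006-06-06", "2006-06-09")),
  Date.End   = as.Date(c("2006-06-03", "2006-06-07", "2006-06-09")),
  Date.Run   = c(3, 2, 1)
)

# Keep only runs longer than one day and drop the helper column
result <- x2[x2$Date.Run > 1, c("Date.Start", "Date.End", "Date.Run")]

# Rename to the headers requested in the question
names(result) <- c("Start Date", "End Date", "Length of Run")
print(result)
```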

Regarding the "buggy" output from aggregate, see: Using aggregate to apply several functions on several variables in one call
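
The odd-looking output comes from aggregate storing a multi-value FUN result as a single matrix column. A small sketch of that behaviour and of the as.data.frame(as.list(...)) workaround used above; the data frame d and its columns g and v are invented for illustration.

```r
# Tiny example data (hypothetical)
d <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 5))

# FUN returns two named values per group
agg <- aggregate(v ~ g, data = d, FUN = function(z) c(lo = min(z), hi = max(z)))

# agg has only two columns: g, and v holding a two-column matrix
print(ncol(agg))
print(is.matrix(agg$v))

# Going through a list splits the matrix into ordinary columns
flat <- as.data.frame(as.list(agg))
print(names(flat))
```

After flattening, the columns are named by joining the variable name and the names returned by FUN, which is where Date.Start, Date.End and Date.Run come from in the answer above.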