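Expanding start and end dates into an unbalanced monthly panel with dplyr

Asked: 2018-05-21 02:25:11

Tags: r dplyr plyr tidyverse

I have start and end dates for events that I want to expand into a monthly panel, and I'm wondering whether dplyr has any tool for this. The code below does what I want using ddply(). It first creates an example tibble (called "wide"), where "id" identifies an individual and "HomeNum" indexes that individual's events. The next statement creates a "date" variable holding the monthly series from "StartDate" to "FinishDate" for each "id"-by-"HomeNum" combination.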

library(plyr)
library(dplyr)
library(tibble)
wide = 
    tibble(
        id = c(1, 1, 2, 2, 2),
        HomeNum = c(0,1,0,1,2),
        StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
        FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
    )
panel = 
    ddply(wide, 
          ~id+HomeNum, 
          transform, 
          date = seq.Date(StartDate, FinishDate, by = "month")
    )

I assumed that dplyr, as the next iteration of plyr, must have some way to implement a similar solution (and output a tibble), but the following does not work:

panel = 
    wide %>% 
    group_by(id, HomeNum) %>% 
    mutate(date = seq.Date(StartDate, FinishDate, by = "month")) 

and it returns

Error in mutate_impl(.data, dots) :
    Column `date` must be length 1 (the group size), not 2

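Frankly, I'm surprised that the ddply() solution works and doesn't raise a similar error.

My use of ddply() is similar to the answer to this question.

2 answers:

Answer 0 (score: 3)

You can wrap the elements of date in a list and then unnest: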

library(tidyverse)
wide %>%
  group_by(id, HomeNum) %>%
  mutate(date = list(seq.Date(StartDate, FinishDate, by = "month"))) %>%
  unnest(date)
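
To make the list-then-unnest step more concrete, here is a minimal base R sketch of the same idea. It is not part of the answer above; the names `months` and `panel` are only illustrative, and `wide` is rebuilt as a plain data.frame so the block runs on its own. Each row gets its own monthly sequence (of varying length, which is why a plain mutate() fails), and the row is then repeated once per month.

```r
# Example data from the question, as a plain data.frame
wide <- data.frame(
  id = c(1, 1, 2, 2, 2),
  HomeNum = c(0, 1, 0, 1, 2),
  StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
  FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
)

# One monthly sequence per row; the sequences have different lengths
months <- lapply(seq_len(nrow(wide)), function(i) {
  seq(wide$StartDate[i], wide$FinishDate[i], by = "month")
})

# Repeat each row once per month, then attach the flattened dates
panel <- wide[rep(seq_len(nrow(wide)), lengths(months)), ]
panel$date <- do.call(c, months)
rownames(panel) <- NULL
```

The `rep()` call plays the role of unnest(): it duplicates the id, HomeNum, StartDate and FinishDate values so that they line up with the flattened dates.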

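Answer 1 (score: 1)

In earlier versions of tidyr, using unnest on a list of dates was a problem. I was getting the same error and found a workaround, but once I updated to tidyr 0.8.1 the workaround was no longer needed. The problem is documented in a few issues on GitHub; #407 and #450 are the ones I looked at.

If your version can't unnest dates, you can build on @hpesoj626's answer by converting the dates to strings, unnesting, and then converting the strings back to dates.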

library(tidyverse)

wide <- tibble(
    id = c(1, 1, 2, 2, 2),
    HomeNum = c(0,1,0,1,2),
    StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
    FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
  )

# with previous versions of tidyr
wide %>%
  group_by(id, HomeNum) %>%
  mutate(date = list(seq.Date(StartDate, FinishDate, by = "month") %>% as.character())) %>%
  tidyr::unnest() %>%
  mutate(date = as.Date(date))
#> # A tibble: 50 x 5
#> # Groups:   id, HomeNum [5]
#>       id HomeNum StartDate  FinishDate date      
#>    <dbl>   <dbl> <date>     <date>     <date>    
#>  1     1       0 2001-01-01 2001-02-01 2001-01-01
#>  2     1       0 2001-01-01 2001-02-01 2001-02-01
#>  3     1       1 2001-03-01 2002-05-01 2001-03-01
#>  4     1       1 2001-03-01 2002-05-01 2001-04-01
#>  5     1       1 2001-03-01 2002-05-01 2001-05-01
#>  6     1       1 2001-03-01 2002-05-01 2001-06-01
#>  7     1       1 2001-03-01 2002-05-01 2001-07-01
#>  8     1       1 2001-03-01 2002-05-01 2001-08-01
#>  9     1       1 2001-03-01 2002-05-01 2001-09-01
#> 10     1       1 2001-03-01 2002-05-01 2001-10-01
#> # ... with 40 more rows

Otherwise, the solution they posted should work:

# with tidyr 0.8.1
wide %>%
  group_by(id, HomeNum) %>%
  mutate(date = list(seq.Date(StartDate, FinishDate, by = "month"))) %>%
  tidyr::unnest()
#> # A tibble: 50 x 5
#> # Groups:   id, HomeNum [5]
#>       id HomeNum StartDate  FinishDate date      
#>    <dbl>   <dbl> <date>     <date>     <date>    
#>  1     1       0 2001-01-01 2001-02-01 2001-01-01
#>  2     1       0 2001-01-01 2001-02-01 2001-02-01
#>  3     1       1 2001-03-01 2002-05-01 2001-03-01
#>  4     1       1 2001-03-01 2002-05-01 2001-04-01
#>  5     1       1 2001-03-01 2002-05-01 2001-05-01
#>  6     1       1 2001-03-01 2002-05-01 2001-06-01
#>  7     1       1 2001-03-01 2002-05-01 2001-07-01
#>  8     1       1 2001-03-01 2002-05-01 2001-08-01
#>  9     1       1 2001-03-01 2002-05-01 2001-09-01
#> 10     1       1 2001-03-01 2002-05-01 2001-10-01
#> # ... with 40 more rows
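
For readers on a newer tidyr than the one used in this answer, the following is a hedged sketch of the same pipeline. It assumes tidyr 1.0.0 or later, where calling unnest() without naming the columns emits a warning, so the column is passed explicitly through `cols`. The ungroup() call is an addition for convenience and is not in the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

wide <- tibble(
  id = c(1, 1, 2, 2, 2),
  HomeNum = c(0, 1, 0, 1, 2),
  StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
  FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
)

panel <- wide %>%
  group_by(id, HomeNum) %>%
  # each group has one row, so the list column holds one sequence per row
  mutate(date = list(seq.Date(StartDate, FinishDate, by = "month"))) %>%
  ungroup() %>%
  # name the list column explicitly (assumes tidyr >= 1.0.0)
  unnest(cols = date)
```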

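Another option is to gather the data into long format, where each observation has a type column showing whether it is a start date or an end date. Then use complete to fill in the missing dates between each group's minimum and maximum date. Gathering keeps the type column, which is filled with NA for the added dates. You can then drop the type column if it is no longer useful.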

wide %>%
  gather(key = type, value = date, StartDate, FinishDate) %>%
  group_by(id, HomeNum) %>%
  complete(date = seq.Date(min(date), max(date), by = "month"))
#> # A tibble: 50 x 4
#> # Groups:   id, HomeNum [5]
#>       id HomeNum date       type      
#>    <dbl>   <dbl> <date>     <chr>     
#>  1     1       0 2001-01-01 StartDate 
#>  2     1       0 2001-02-01 FinishDate
#>  3     1       1 2001-03-01 StartDate 
#>  4     1       1 2001-04-01 <NA>      
#>  5     1       1 2001-05-01 <NA>      
#>  6     1       1 2001-06-01 <NA>      
#>  7     1       1 2001-07-01 <NA>      
#>  8     1       1 2001-08-01 <NA>      
#>  9     1       1 2001-09-01 <NA>      
#> 10     1       1 2001-10-01 <NA>      
#> # ... with 40 more rows

Created on 2018-05-22 by the reprex package (v0.2.0).
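
As a further sketch of the long-format approach, the version below swaps gather() for pivot_longer(), which superseded it in tidyr 1.0.0, and drops the type column at the end as the answer suggests. This is an assumption about what a current-tidyr equivalent might look like, not code from the original answer; it assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(tibble)

wide <- tibble(
  id = c(1, 1, 2, 2, 2),
  HomeNum = c(0, 1, 0, 1, 2),
  StartDate = as.Date(c("2001-01-01", "2001-03-01", "2000-04-01", "2001-02-01", "2002-08-01")),
  FinishDate = as.Date(c("2001-02-01", "2002-05-01", "2001-01-01", "2002-07-01", "2002-12-01"))
)

panel <- wide %>%
  # long format: one row per (id, HomeNum, type), with the date as the value
  pivot_longer(c(StartDate, FinishDate), names_to = "type", values_to = "date") %>%
  group_by(id, HomeNum) %>%
  # fill in every month between the group's first and last date
  complete(date = seq.Date(min(date), max(date), by = "month")) %>%
  ungroup() %>%
  # type is NA for the filled-in months, so drop it once it is not needed
  select(-type)
```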