Populating dates based on the count of consecutive occurrences

Time: 2017-05-11 06:42:09

Tags: r

ID     Date
1    1-1-2016
1    2-1-2016
1    3-1-2016
2    5-1-2016
3    6-1-2016
3    11-1-2016
3    12-1-2016
4    7-1-2016
5    9-1-2016
5    19-1-2016
5    20-1-2016
6    11-04-2016
6    12-04-2016
6    16-04-2016
6    04-08-2016
6    05-08-2016
6    06-08-2016

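The expected data frame is based on runs of consecutive dates.

1st_Date is the date of his first visit. 2nd_Date is the start date of the first time he visited on 2 consecutive days. 3rd_Date is the start date of the first time he visited on 3 consecutive days.

For example: for ID = 1, his first visit was on 1-1-2016, and his first run of two consecutive visits also starts on 1-1-2016, as does his first run of three. Similarly, for ID = 2, he visited only once, so the remaining columns stay blank. For ID = 3, he first visited on 6-1-2016, but his first visits on two consecutive days start on 11-1-2016.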

NOTE: This only has to be done up to the earliest 3rd_Date.

Expected output

ID     1st_Date      2nd_Date      3rd_Date            
1      1-1-2016      1-1-2016      1-1-2016
2      5-1-2016         NA            NA
3      6-1-2016      11-1-2016        NA
4      7-1-2016         NA            NA
5      9-1-2016      19-1-2016        NA
6      11-04-2016    11-04-2016     04-08-2016
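
To make the rule above concrete, here is a minimal base R sketch of one possible reading of it: for a single ID, the k-th date is the start of the first run of at least k consecutive days. The helper name first_run_starts and its max_k argument are hypothetical and do not come from the question or the answers.

```r
# Hypothetical helper (an assumption, not from the original post):
# for one ID's dates, return the start of the first run of at least
# k consecutive days, for k = 1, ..., max_k.
first_run_starts <- function(dates, max_k = 3) {
  dates <- sort(dates)
  # A gap of more than one day starts a new run; runs are numbered 1, 2, ...
  run_id <- cumsum(c(TRUE, as.numeric(diff(dates)) > 1))
  run_start <- dates[!duplicated(run_id)]  # first date of each run
  run_len <- tabulate(run_id)              # number of days in each run
  # Index of the first run that is long enough, or NA if there is none.
  idx <- sapply(seq_len(max_k), function(k) which(run_len >= k)[1])
  run_start[idx]
}

# The rows for ID = 6 from the question.
id6 <- as.Date(c("11-04-2016", "12-04-2016", "16-04-2016",
                 "04-08-2016", "05-08-2016", "06-08-2016"),
               format = "%d-%m-%Y")
format(first_run_starts(id6), "%d-%m-%Y")
```

Under this reading, the ID = 6 rows give 11-04-2016, 11-04-2016 and 04-08-2016, which agrees with the expected output shown above. Applying the helper per ID (for example with split and lapply) would be one way to build the full table.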

2 answers:

Answer 0: (score: 0)

Use reshape. The code below assumes z is your data frame, where date is a numeric date/time variable sorted in increasing order.

# "set" labels each set of consecutive dates within an ID:
# the counter goes up whenever the gap to the previous date exceeds 1
z$set <- unsplit(tapply(z$date, z$ID, function(x) cumsum(diff(c(x[1], x)) > 1)), z$ID)

# "first.date" is the first date in the set (of consecutive dates)
z$first.date <- unsplit(lapply(split(z$date, z[, c("ID", "set")]), min), z[, c("ID", "set")])

# "occurrence" is the running occurrence number of each row within its ID
z$occurrence <- unsplit(lapply(split(seq(nrow(z)), z$ID), seq_along), z$ID)

reshape(z[, c("ID", "first.date", "occurrence")], direction = "wide",
    idvar = "ID", v.names = "first.date", timevar = "occurrence")

Result:

    ID first.date.1 first.date.2 first.date.3
  1  1   2016-01-01   2016-01-01   2016-01-01
  4  2   2016-01-05         <NA>         <NA>
  5  3   2016-01-06   2016-01-11   2016-01-11
  8  4   2016-01-07         <NA>         <NA>
  9  5   2016-01-09   2016-01-09   2016-01-09
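
The set variable is the core of this answer, so here is a small sketch of how the cumsum over gaps labels runs. The three dates are just an illustrative vector (the ID = 3 rows from the question), and the names gaps, run_id and first_in_run are assumptions made for this example.

```r
# Illustrative input: the three visits of ID = 3 from the question.
d <- as.Date(c("2016-01-06", "2016-01-11", "2016-01-12"))

# Gap in days between each date and the previous one; the first gap is 0
# because the first date is compared with itself.
gaps <- as.numeric(diff(c(d[1], d)))

# A gap larger than one day starts a new run, so the running count of
# such gaps gives every run of consecutive dates its own label.
run_id <- cumsum(gaps > 1)

# First date of the run that each row belongs to.
first_in_run <- as.Date(ave(as.numeric(d), run_id, FUN = min),
                        origin = "1970-01-01")

data.frame(d, gaps, run_id, first_in_run)
```

Here 2016-01-06 forms a run on its own, while 2016-01-11 and 2016-01-12 share a run whose first date is 2016-01-11.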

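Answer 1: (score: 0)

Here is an attempt using dplyr and tidyr. The first thing to do is convert your Date with as.Date and group_by ID. Next we create a few new variables. The first one, new, checks which dates are consecutive. Date is then updated to give NA for those consecutive dates. However, if not all dates are consecutive, we filter out the ones that were converted to NA. We then fill (replace each NA with the latest non-NA date for that ID), remove the unwanted columns, and spread.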

library(dplyr)
library(tidyr)

df %>% 
  mutate(Date = as.Date(Date, format = '%d-%m-%Y')) %>% 
  group_by(ID) %>% 
  mutate(new = cumsum(c(1, diff.difftime(Date, units = 'days'))), 
         Date = replace(Date, c(0, diff(new)) == 1, NA), 
         new1 = sum(is.na(Date)), 
         new2 = seq(n())) %>% 
  filter(!is.na(Date)|new1 != 1) %>% 
  fill(Date) %>%  
  select(-c(new, new1)) %>% 
  spread(new2, Date) %>%
  select(ID:`3`)

#  ID        `1`        `2`        `3`
#* <int>     <date>     <date>     <date>
#1     1 2016-01-01 2016-01-01 2016-01-01
#2     2 2016-01-05       <NA>       <NA>
#3     3 2016-01-06 2016-01-11       <NA>
#4     4 2016-01-07       <NA>       <NA>
#5     5 2016-01-09 2016-01-09 2016-01-09

With your updated data set, it gives

#     ID        `1`        `2`        `3`
#* <int>     <date>     <date>     <date>
#1     1 2016-01-01 2016-01-01 2016-01-01
#2     2 2016-01-05       <NA>       <NA>
#3     3 2016-01-06 2016-01-11       <NA>
#4     4 2016-01-07       <NA>       <NA>
#5     5 2016-01-09 2016-01-19       <NA>
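
Both approaches depend on Date being a real date sorted within each ID, while the posted data stores it as a factor of day-month-year strings. A minimal base R sketch of that preparation step follows; the data frame df_small and its shuffled row order are made up for illustration.

```r
# Hypothetical small input: Date is a factor, as in the posted dput output,
# and the rows are deliberately out of order.
df_small <- data.frame(
  ID = c(3L, 3L, 3L),
  Date = factor(c("12-1-2016", "6-1-2016", "11-1-2016"))
)

# Parse the factor labels as day-month-year dates.
df_small$Date <- as.Date(as.character(df_small$Date), format = "%d-%m-%Y")

# Sort by ID, then by date, so that consecutive rows can be compared.
df_small <- df_small[order(df_small$ID, df_small$Date), ]
df_small
```

Going through as.character first makes it explicit that the factor labels, not the underlying integer codes, are what get parsed.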

Data used

dput(df)
structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L, 5L, 
5L), Date = structure(c(1L, 5L, 7L, 8L, 9L, 2L, 3L, 10L, 11L, 
4L, 6L), .Label = c("1-1-2016", "11-1-2016", "12-1-2016", "19-1-2016", 
"2-1-2016", "20-1-2016", "3-1-2016", "5-1-2016", "6-1-2016", 
"7-1-2016", "9-1-2016"), class = "factor")), .Names = c("ID", 
"Date"), class = "data.frame", row.names = c(NA, -11L))