ID Date
1 1-1-2016
1 2-1-2016
1 3-1-2016
2 5-1-2016
3 6-1-2016
3 11-1-2016
3 12-1-2016
4 7-1-2016
5 9-1-2016
5 19-1-2016
5 20-1-2016
6 11-04-2016
6 12-04-2016
6 16-04-2016
6 04-08-2016
6 05-08-2016
6 06-08-2016
预期数据框基于成对的连续日期
1st_Date
是他第一次来访的时候
2nd_Date
是他连续2天访问的日期
3rd_Date
是他连续3天访问的日期
例如:
对于ID = 1
,他第一次访问1-1-2016
,他的连续两次访问也开始于1-1-2016
以及他的第3次访问。
同样地,对于ID = 2
,他只访问了一次,所以休息将保持空白
对于ID = 3
,他在6-1-2016
访问了第一次,但从11-1-2016
开始连续访问了两天。
NOTE : This has to be done till earliest 3rd Date only
预期产出
ID 1st_Date 2nd_Date 3rd_Date
1 1-1-2016 1-1-2016 1-1-2016
2 5-1-2016 NA NA
3 6-1-2016 11-1-2016 NA
4 7-1-2016 NA NA
5 9-1-2016 19-1-2016 NA
6 11-04-2016 11-04-2016 04-08-2016
答案 0 :(得分:0)
使用reshape
。以下代码假定z
是您的数据框,其中date
是数字日期/时间变量,越来越多地排序。
# a "set" variable represents a set of consecutive dates
z$set <- unsplit(tapply(z$date, z$ID, function(x) cumsum(diff(c(x[1], x)) > 1)), z$ID)
# "first.date" represents the first date in the set (of consecutive dates)
z$first.date <- unsplit(lapply(split(z$date, z[, c("ID", "set")]), min), z[, c("ID", "set")])
# "occurence" is a consecutive occurence #
z$occurrence <- unsplit(lapply(split(seq(nrow(z)), z$ID), seq_along), z$ID)
reshape(z[, c("ID", "first.date", "occurrence")], direction = "wide",
idvar = "ID", v.names = "first.date", timevar = "occurrence")
结果:
ID first.date.1 first.date.2 first.date.3
1 1 2016-01-01 2016-01-01 2016-01-01
4 2 2016-01-05 <NA> <NA>
5 3 2016-01-06 2016-01-11 2016-01-11
8 4 2016-01-07 <NA> <NA>
9 5 2016-01-09 2016-01-09 2016-01-09
答案 1 :(得分:0)
以下是使用dplyr
和tidyr
的尝试。首先要做的是将您的Date
转换为as.Date
和group_by
ID。接下来我们创建一些新变量。第一个new
检查哪些日期是连续的。然后更新Date
以为这些连续日期提供NA。但是,如果并非所有日期都是连续的,那么我们会过滤掉那些转换为NA的日期。然后我们fill
(用每个ID的最新非日期替换NA),删除不需要的列和spread
。
library(dplyr)
library(tidyr)
df %>%
mutate(Date = as.Date(Date, format = '%d-%m-%Y')) %>%
group_by(ID) %>%
mutate(new = cumsum(c(1, diff.difftime(Date, units = 'days'))),
Date = replace(Date, c(0, diff(new)) == 1, NA),
new1 = sum(is.na(Date)),
new2 = seq(n())) %>%
filter(!is.na(Date)|new1 != 1) %>%
fill(Date) %>%
select(-c(new, new1)) %>%
spread(new2, Date) %>%
select(ID:`3`)
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-09 2016-01-09
使用您更新的数据集,它提供
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-19 <NA>
使用数据
dput(df)
structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L, 5L,
5L), Date = structure(c(1L, 5L, 7L, 8L, 9L, 2L, 3L, 10L, 11L,
4L, 6L), .Label = c("1-1-2016", "11-1-2016", "12-1-2016", "19-1-2016",
"2-1-2016", "20-1-2016", "3-1-2016", "5-1-2016", "6-1-2016",
"7-1-2016", "9-1-2016"), class = "factor")), .Names = c("ID",
"Date"), class = "data.frame", row.names = c(NA, -11L))