How to find the difference between two dates in different rows in R?

Time: 2018-05-29 06:26:33

Tags: r

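I have a data frame like the one below, containing dates from which I need to count the number of visits. For each unique id, however, there is a condition: after sorting the rows by date, if the difference between the enddt of one row and the strdt of the next row is less than 2 days, the two rows should be treated as 1 visit.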

Data

 id      strdt         enddt    
 ep01    2017-06-23    2017-06-24  
 ep01    2017-06-28    2017-06-30
 ep01    2017-06-25    2017-06-26
 ep02    2017-05-06    2017-05-10
 ep02    2017-05-12    2017-05-14
 ep02    2017-05-15    2017-05-16  
 ep03    2017-05-15    2017-05-16
 ep04    2017-05-15    2017-05-17 

Expected output:

id     strdt         enddt  
ep01   2017-06-23    2017-06-26
ep01   2017-06-28    2017-06-30
ep02   2017-05-06    2017-05-10
ep02   2017-05-12    2017-05-16 
ep03   2017-05-15    2017-05-16
ep04   2017-05-15    2017-05-17

Attempt

data = read.csv("data.csv",header = T,stringsAsFactors = F)
unique_id = unique(data$id)
id_data = NULL
for (i in 1: length(unique_id)){
id_data = data[data$id == unique_id[i],]  
id_data = id_data[ order(id_data$strdt , decreasing = F ),]
id_data = ifelse(id_data$enddt - id_data$str_dt < 1, id_data$enddt[2,3],id_data$enddt)   
 }

I tried the code above, but I could not get it to work. Thanks in advance.

2 answers:

Answer 0: (score: 1)

The lead function from dplyr may help with your problem. https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/lead-lag

I have not built a fully working solution, but the logic can be inferred from the following code

library("dplyr")
# use `=` (not `<-`) inside data.frame() so the columns get the intended names
dat <- data.frame(
  id = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  startdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06", "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10", "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)


# get next start date, you can use dplyr::group_by() to get next start date for each id
dat$start_lead <- lead(dat$startdt)

# calculate difference between next start date and current end date, if diff < 2, then reject otherwise accept
dat$is_less_thn_2 <- ifelse(dat$start_lead - dat$enddt < 2, 0, 1)

# get next diff value
dat$take_enddt_value <- lead(dat$is_less_thn_2)

# This part is unfinished: it stops with an error once take_enddt_value is NA
for(i in 1:nrow(dat)) {
  # if take_enddt_value is 0, iterate until take_enddt_value is 1, set current enddt value to enddt with take_enddt_value = 1
  if (dat[i, "take_enddt_value"] == 0){
    k = i
    while(dat[k, "take_enddt_value"] == 0){
      k = k + 1
    }
    dat[i, "enddt"] <- dat[k, "enddt"]
  }
}
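
The loop idea above can be completed in base R. The following is only a minimal sketch, not the answerer's code: the helper name merge_visits and its max_gap parameter are made up for illustration, and it assumes the column names id, startdt and enddt used in this answer. It sorts each id by start date and extends the current visit while the next start is less than max_gap days after the current end.

```r
dat <- data.frame(
  id = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  startdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06",
                      "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10",
                    "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from dplyr or any package): merge rows of the same
# id whose start is less than `max_gap` days after the previous visit's end.
merge_visits <- function(dat, max_gap = 2) {
  pieces <- lapply(split(dat, dat$id), function(d) {
    d <- d[order(d$startdt), ]           # process each id in date order
    out <- d[1, , drop = FALSE]          # visits collected so far
    keep <- 1L                           # row of `out` currently being extended
    for (i in seq_len(nrow(d))[-1]) {
      gap <- as.numeric(d$startdt[i] - out$enddt[keep])
      if (gap < max_gap) {
        # same visit: push the end date forward
        out$enddt[keep] <- max(out$enddt[keep], d$enddt[i])
      } else {
        # gap is large enough: start a new visit
        out <- rbind(out, d[i, , drop = FALSE])
        keep <- keep + 1L
      }
    }
    out
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

result <- merge_visits(dat)
print(result)
```

On the sample data this should give six rows, with the ep01 visit starting 2017-06-23 extended to 2017-06-26 and the ep02 visit starting 2017-05-12 extended to 2017-05-16.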

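Answer 1: (score: 1)

Another approach is to group the rows that should be combined and then compute the start & end dates of each group. Note that the final grouping is built in the second mutate on flag, just before the group_by(id, flag) statement.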

library(dplyr)
library(data.table)

df %>%
  arrange(id, strdt) %>%
  group_by(id) %>%
  mutate(flag = as.numeric(strdt - lag(enddt, order_by = id, default = first(strdt)))) %>%
  mutate(flag = rleid(ifelse((flag < 2 & row_number() != 1) | lead(flag, order_by = id, default = 9999) < 2, 
                             9999, 
                             row_number()))) %>%  #final grouping happened here
  group_by(id, flag) %>%
  summarise(strdt = first(strdt),
            enddt = last(enddt)) %>%
  select(-flag)

Output:

  id    strdt      enddt     
1 ep01  2017-06-23 2017-06-26
2 ep01  2017-06-28 2017-06-30
3 ep02  2017-05-06 2017-05-10
4 ep02  2017-05-12 2017-05-16
5 ep03  2017-05-15 2017-05-16
6 ep04  2017-05-15 2017-05-17
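
One caveat worth checking with the pipeline above: every mergeable row receives the same placeholder 9999, so two separate runs of mergeable rows that sit back to back within one id appear to end up in the same flag group. A possible variant, shown here only as a sketch and not as the answerer's method, numbers the visits with cumsum instead: a row opens a new visit when it is the first row of its id or when its gap to the previous enddt is 2 days or more. The helper column names gap, new_visit and visit are assumptions for illustration, and the sketch assumes visits within an id do not nest inside one another.

```r
library(dplyr)

visits <- data.frame(
  id    = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  strdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06",
                    "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10",
                    "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)

merged <- visits %>%
  arrange(id, strdt) %>%
  group_by(id) %>%
  mutate(
    gap       = as.numeric(strdt - lag(enddt)),  # days since previous end; NA on the first row of an id
    new_visit = is.na(gap) | gap >= 2,           # TRUE where a new visit starts
    visit     = cumsum(new_visit)                # running visit number within the id
  ) %>%
  group_by(id, visit) %>%
  summarise(strdt = min(strdt), enddt = max(enddt)) %>%
  ungroup() %>%
  select(-visit)

print(merged)
```

Because each new visit increments the counter, adjacent runs of mergeable rows get different visit numbers and stay separate.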

Sample data:

df <- structure(list(id = c("ep01", "ep01", "ep01", "ep02", "ep02", 
"ep02", "ep03", "ep04"), strdt = structure(c(17340, 17345, 17342, 
17292, 17298, 17301, 17301, 17301), class = "Date"), enddt = structure(c(17341, 
17347, 17343, 17296, 17300, 17302, 17302, 17303), class = "Date")), .Names = c("id", 
"strdt", "enddt"), row.names = c(NA, -8L), class = "data.frame")