Extract the nearest date prior to a fixed date in R

Time: 2017-10-13 05:50:08

Tags: r dplyr data-manipulation lubridate

I have a data set fragment, df, as follows:

    df <- read.table(text = "
                     ID INDEX_DATE  DATE    VALUE
                     1  14/06/2017  16/02/2015  7
                     1  14/06/2017  16/02/2015  6.5
                     1  14/06/2017  21/07/2015  7
                     1  14/06/2017  08/09/2015  9.5
                     1  14/06/2017  04/12/2015  8.9
                     1  14/06/2017  28/01/2016  8.4
                     1  14/06/2017  30/03/2016  8.2
                     1  14/06/2017  25/05/2016  7.6
                     1  14/06/2017  17/08/2016  8.2
                     1  14/06/2017  16/11/2016  8
                     1  14/06/2017  18/01/2017  8.8
                     1  14/06/2017  24/03/2017  9.1
                     1  14/06/2017  14/06/2017  7.9
                     1  14/06/2017  14/06/2017  8.0
                     2  10/01/2017  26/03/2015  7.4
                     2  10/01/2017  26/03/2015  7.7
                     2  10/01/2017  14/04/2016  7
                     2  10/01/2017  03/04/2017  7.6
                     3  14/01/2015  06/05/2015  7.2
                     3  14/01/2015  29/07/2015  7.1
                     3  14/01/2015  18/11/2015  7.2
                     3  14/01/2015  17/02/2016  7.3
                     3  14/01/2015  03/08/2016  7.3
                     4  08/12/2016  NA          NA
                     ", header = TRUE, stringsAsFactors = FALSE)

For each ID, I want to extract the VALUE corresponding to the closest DATE prior to (or on) INDEX_DATE. If an ID has no DATE prior to its INDEX_DATE, the closest DATE after INDEX_DATE should be used instead.

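There are 3 cases:

  1. ID == 1: only DATE values prior to (or on) INDEX_DATE
  2. ID == 2: DATE values both prior to and after INDEX_DATE
  3. ID == 3: only DATE values after INDEX_DATE

If the chosen DATE has more than one VALUE, I would pick any one of them at random.

My desired output: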

    df2 <- read.table(text = "
                     ID INDEX_DATE  DATE    VALUE
                     1  14/06/2017  14/06/2017  7.9
                     2  10/01/2017  14/04/2016  7
                     3  14/01/2015  06/05/2015  7.2
                     4  08/12/2016          NA  NA
                     ", header = TRUE, stringsAsFactors = FALSE)
    

I tried writing my own code, but it does not work for the second case (i.e. ID == 2):

    library(lubridate); library(dplyr)
    df2 <- df %>%
      mutate_at(vars(INDEX_DATE, DATE), funs(dmy)) %>%
      mutate(DATEDIFF = (INDEX_DATE - DATE)) %>%
      group_by(ID) %>%
      mutate(PRIORPOST = if_else(any(DATEDIFF >= 0), "PRIOR", "POST_ONLY"),
             CHOSEN_VALUE = if_else(PRIORPOST == "PRIOR" & DATEDIFF >= 0 & DATEDIFF == min(DATEDIFF), VALUE, 
                            if_else(PRIORPOST == "POST_ONLY" & DATEDIFF < 0 & DATEDIFF == max(DATEDIFF), VALUE, NA_real_))) %>%
      filter(!is.na(CHOSEN_VALUE))
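
A likely reason the attempt fails for ID == 2: min(DATEDIFF) is taken over every row of the group, including the negative differences that belong to dates after INDEX_DATE, so no row with DATEDIFF >= 0 can ever equal it. The check below is only a sketch that uses the two ID == 2 dates nearest to INDEX_DATE; the names index_date, dates and datediff are made up for illustration.

```r
# INDEX_DATE for ID == 2 and its two nearest DATE values, taken from the question.
index_date <- as.Date("10/01/2017", format = "%d/%m/%Y")
dates <- as.Date(c("14/04/2016", "03/04/2017"), format = "%d/%m/%Y")

# Positive means DATE is before INDEX_DATE, negative means after.
datediff <- as.numeric(index_date - dates)

datediff                                     # 271 -83
min(datediff)                                # -83, i.e. the date AFTER INDEX_DATE
datediff >= 0 & datediff == min(datediff)    # FALSE FALSE: no row is selected
min(datediff[datediff >= 0])                 # 271: minimum over prior dates only
```

Restricting the minimum to the non-negative differences, as in the last line, is one possible way to repair the PRIOR branch.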
    

1 Answer:

Answer 0: (score: 0)

Try this; I hope it helps:

    df$INDEX_DATE <- as.Date(df$INDEX_DATE, format="%d/%m/%Y")
    df$DATE <- as.Date(df$DATE, format="%d/%m/%Y")
    # Signed gap in days: >= 0 means DATE is on or before INDEX_DATE.
    df$dayGap <- as.numeric(difftime(df$INDEX_DATE, df$DATE, units="days"))

    output <- NULL
    for (id in unique(df$ID)) {
      temp <- subset(df, ID == id)
      prior <- subset(temp, dayGap >= 0) # subset() drops rows where dayGap is NA.
      post <- subset(temp, dayGap < 0)
      if (nrow(prior) > 0) {
        temp <- prior[prior$dayGap == min(prior$dayGap), ] # Latest DATE on or before INDEX_DATE.
      } else if (nrow(post) > 0) {
        temp <- post[post$dayGap == max(post$dayGap), ] # Otherwise the earliest DATE after it.
      } # An ID with no DATE at all keeps its NA row.
      temp <- temp[sample(nrow(temp), 1), ] # Randomly pick 1 row among ties.
      output <- rbind(output, temp[, names(temp) != "dayGap"]) # Drop the helper column and append.
    }
    output
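
The rule stated in the question (the latest DATE on or before INDEX_DATE, otherwise the earliest DATE after it, ties broken at random) can also be written as a grouped dplyr pipeline. Ranking by absolute distance alone would not be enough, because for ID == 2 it would select 03/04/2017 (83 days after) over 14/04/2016 (271 days before). The following is only a sketch under a few assumptions: dplyr 1.0.0 or later for slice_sample(), a small subset of the question's rows as input, and helper columns gap and tier that are invented here.

```r
library(dplyr)

# Illustrative subset of the rows from the question.
df <- read.table(text = "
ID INDEX_DATE DATE VALUE
1 14/06/2017 24/03/2017 9.1
1 14/06/2017 14/06/2017 7.9
1 14/06/2017 14/06/2017 8.0
2 10/01/2017 14/04/2016 7
2 10/01/2017 03/04/2017 7.6
3 14/01/2015 06/05/2015 7.2
3 14/01/2015 29/07/2015 7.1
4 08/12/2016 NA NA
", header = TRUE, stringsAsFactors = FALSE)

set.seed(1)  # only so that the random tie-break is reproducible

result <- df %>%
  mutate(INDEX_DATE = as.Date(INDEX_DATE, format = "%d/%m/%Y"),
         DATE = as.Date(DATE, format = "%d/%m/%Y"),
         # Signed gap in days: >= 0 means DATE is on or before INDEX_DATE.
         gap = as.numeric(INDEX_DATE - DATE),
         # Preference tier: 1 = prior (or same day), 2 = after, 3 = no DATE.
         tier = case_when(is.na(gap) ~ 3L, gap >= 0 ~ 1L, TRUE ~ 2L)) %>%
  group_by(ID) %>%
  filter(tier == min(tier)) %>%                            # best available tier per ID
  filter(tier == 3L | abs(gap) == min(abs(gap))) %>%       # closest DATE within that tier
  slice_sample(n = 1) %>%                                  # random pick among ties
  ungroup() %>%
  select(ID, INDEX_DATE, DATE, VALUE)

result
```

With these rows, ID == 2 resolves to 14/04/2016, ID == 3 to 06/05/2015, ID == 4 keeps its NA row, and ID == 1 gets one of the two 14/06/2017 values.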