R: days since the last non-NA value, by group

Time: 2017-05-24 17:22:55

Tags: r dplyr

I have a data frame that looks like this:

df_raw <- structure(list(date = structure(c(17075, 17076, 17077, 17108, 
17109, 17110, 17111, 17112, 17113, 17221, 17222, 17223, 17224, 
17225, 17226, 17227, 17228, 17229, 17230, 17231, 17232, 17286, 
17075, 17076, 17077, 17078, 17079, 17080, 17081, 17082, 17083, 
17084, 17085, 17086, 17087, 17088, 17089, 17090, 17091), class = "Date"), 
    Req_BU = c("12018", "12018", "12018", "12018", "12018", "12018", 
    "12018", "12018", "12018", "12018", "12018", "12018", "12018", 
    "12018", "12018", "12018", "12018", "12018", "12018", "12018", 
    "12018", "12018", "14004", "14004", "14004", "14004", "14004", 
    "14004", "14004", "14004", "14004", "14004", "14004", "14004", 
    "14004", "14004", "14004", "14004", "14004"), last_rec_date = c(1L, 
    1L, 1L, 1L, 1L, NA, NA, 3L, 1L, 1L, 1L, NA, 2L, 1L, 1L, 1L, 
    1L, 1L, NA, NA, 3L, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, NA, NA, 
    3L, 1L, 1L, 1L, 1L, NA, 2L, 1L)), .Names = c("date", "Req_BU", 
"last_rec_date"), row.names = c(NA, -39L), class = "data.frame")


> head(df_raw, 10)
         date Req_BU last_rec_date
1  2016-10-01  12018             1
2  2016-10-02  12018             1
3  2016-10-03  12018             1
4  2016-11-03  12018             1
5  2016-11-04  12018             1
6  2016-11-05  12018            NA
7  2016-11-06  12018            NA
8  2016-11-07  12018             3
9  2016-11-08  12018             1
10 2017-02-24  12018             1

> df_raw[22:30, ]
         date Req_BU last_rec_date
22 2017-04-30  12018             1
23 2016-10-01  14004            NA
24 2016-10-02  14004            NA
25 2016-10-03  14004             1
26 2016-10-04  14004             1
27 2016-10-05  14004             1
28 2016-10-06  14004             1
29 2016-10-07  14004             1
30 2016-10-08  14004            NA

What I need to do is replace the NA values in the last_rec_date column with the number of days since the last non-NA value. This all needs to be done per group, using the grouping variable Req_BU. My data starts on 2016-10-01, and if a particular Req_BU starts with NA, I need to fill with 1 and keep doing so until there is a non-NA value, at which point the normal logic takes over.

I am looking for something like this:

> head(df_hope, 10)
         date Req_BU last_rec_date
1  2016-10-01  12018             1
2  2016-10-02  12018             1
3  2016-10-03  12018             1
4  2016-11-03  12018             1
5  2016-11-04  12018             1
6  2016-11-05  12018             1
7  2016-11-06  12018             2
8  2016-11-07  12018             3
9  2016-11-08  12018             1
10 2017-02-24  12018             1

> df_hope[22:30, ]
         date Req_BU last_rec_date
22 2017-04-30  12018             1
23 2016-10-01  14004             1
24 2016-10-02  14004             1
25 2016-10-03  14004             1
26 2016-10-04  14004             1
27 2016-10-05  14004             1
28 2016-10-06  14004             1
29 2016-10-07  14004             1
30 2016-10-08  14004             1

library(dplyr)
df_not_working <- df_raw %>%
  group_by(Req_BU) %>%
  mutate(last_rec_date = ifelse(is.na(last_rec_date), 
                                c(NA, diff(date)), 
                                  last_rec_date))

> df_not_working
Source: local data frame [39 x 3]
Groups: Req_BU [2]

# A tibble: 39 x 3
         date Req_BU last_rec_date
       <date>  <chr>         <dbl>
 1 2016-10-01  12018             1
 2 2016-10-02  12018             1
 3 2016-10-03  12018             1
 4 2016-11-03  12018             1
 5 2016-11-04  12018             1
 6 2016-11-05  12018             1
 7 2016-11-06  12018             1
 8 2016-11-07  12018             3
 9 2016-11-08  12018             1
10 2017-02-24  12018             1

I tried the code above (df_not_working), but it does not handle even the first part of the logic I need.

The rest of my analysis is very dplyr-heavy, so I can use that or a base solution (if one exists). Thanks.

1 answer:

Answer 0 (score: 1)

Maybe this will work? It is not very R-ish, so maybe someone has a better approach.

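fill_na <- function(df, colname){
  # x counts consecutive NA rows; it restarts at 1 after every non-NA row
  x <- 1
  col <- as.character(colname)
  for(i in seq_len(nrow(df))){
    if(is.na(df[i, col])){
      # Replace the NA with its position in the current run of NAs
      df[i, col] <- x
      x <- x + 1
    } else {
      x <- 1
    }
  }
  return(df)
}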

df_hope <- unsplit(lapply(split(df_raw, f = df_raw$Req_BU), fill_na, colname = "last_rec_date"), f = df_raw$Req_BU)
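
The same counter can probably be computed without an explicit loop. The following is only a sketch in base R: `toy` and `count_na_run` are hypothetical names introduced here, and it assumes the rows are already sorted by date within each Req_BU. Like fill_na, it numbers each NA by its position in the current run of NAs.

```r
# Toy data (hypothetical), with the same NA pattern as a two-group example
toy <- data.frame(
  Req_BU = rep(c("12018", "14004"), each = 5),
  last_rec_date = c(1L, 1L, 1L, NA, NA, NA, NA, NA, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within one group, number each NA by its position
# in the current run of NAs; non-NA values are kept as they are.
count_na_run <- function(x) {
  run_id  <- cumsum(!is.na(x))                                 # new id at every non-NA value
  counter <- ave(as.integer(is.na(x)), run_id, FUN = cumsum)   # 1, 2, 3, ... over the NAs of a run
  ifelse(is.na(x), counter, x)
}

# ave() applies the helper separately to each Req_BU and keeps the row order
toy$filled <- ave(toy$last_rec_date, toy$Req_BU, FUN = count_na_run)
toy
```

Because `ave()` returns its result in the original row order, no split/unsplit step is needed in this variant.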

Edit: made a clearer example for testing the approach:

example_df <- structure(list(date = structure(c(17075, 17076, 17077, 17108, 
17109, 17083, 17084, 17085, 17086, 17087), class = "Date"), Req_BU = c("12018", 
"12018", "12018", "12018", "12018", "14004", "14004", "14004", 
"14004", "14004"), last_rec_date = c(1L, 1L, 1L, NA, NA, NA, 
NA, NA, 1L, 1L)), .Names = c("date", "Req_BU", "last_rec_date"
), row.names = c(1L, 2L, 3L, 4L, 5L, 31L, 32L, 33L, 34L, 35L), class = "data.frame")

> example_df
         date Req_BU last_rec_date
1  2016-10-01  12018             1
2  2016-10-02  12018             1
3  2016-10-03  12018             1
4  2016-11-03  12018            NA
5  2016-11-04  12018            NA
31 2016-10-09  14004            NA
32 2016-10-10  14004            NA
33 2016-10-11  14004            NA
34 2016-10-12  14004             1
35 2016-10-13  14004             1

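Starting from a data frame in which the NA values cross the "boundary" between Req_BU 12018 and 14004, split that data frame by Req_BU value into a list of separate data frames. Then use lapply to apply the function above to each individual data frame, before using unsplit to return a single data frame.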

df_ex <- unsplit(lapply(split(example_df, f = example_df$Req_BU), fill_na, colname = "last_rec_date"), f = example_df$Req_BU)

> df_ex
         date Req_BU last_rec_date
1  2016-10-01  12018             1
2  2016-10-02  12018             1
3  2016-10-03  12018             1
4  2016-11-03  12018             1
5  2016-11-04  12018             2
31 2016-10-09  14004             1
32 2016-10-10  14004             2
33 2016-10-11  14004             3
34 2016-10-12  14004             1
35 2016-10-13  14004             1
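
Note that the leading NAs of group 14004 come out as 1, 2, 3 here, whereas the question asks for leading NAs to be filled with 1. Since the rest of the analysis uses dplyr, a grouped sketch along the following lines might cover both rules. It is an illustration only: `toy`, `run` and `filled` are hypothetical names, and it assumes rows are sorted by date within each Req_BU.

```r
library(dplyr)

# Toy data (hypothetical): group 14004 starts with NA values
toy <- data.frame(
  date = as.Date("2016-10-01") + c(0:5, 0:4),
  Req_BU = rep(c("12018", "14004"), c(6, 5)),
  last_rec_date = c(1L, 1L, NA, NA, 3L, 1L, NA, NA, 1L, NA, 2L),
  stringsAsFactors = FALSE
)

result <- toy %>%
  group_by(Req_BU) %>%
  # run increases at every non-NA row; run == 0 marks leading NAs
  mutate(run = cumsum(!is.na(last_rec_date))) %>%
  group_by(Req_BU, run) %>%
  mutate(filled = case_when(
    !is.na(last_rec_date) ~ last_rec_date,          # keep existing values
    run == 0L             ~ 1L,                     # leading NAs are filled with 1
    TRUE                  ~ as.integer(date - first(date))  # days since the last non-NA row
  )) %>%
  ungroup()

result
```

Each (Req_BU, run) group with run of 1 or more starts at a non-NA row, so `date - first(date)` is the number of days elapsed since that row. This uses the dates rather than a row count, which gives the same numbers whenever the NA rows fall on consecutive days.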