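Efficiency in R

Time: 2017-07-09 05:11:47

Tags: r lubridate

I have a function that classifies the difference between two dates into recency bands. My dataset has more than 400K records, and I am having trouble getting the function to work at that scale.

The function: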

library(dplyr)      # provides case_when()
library(lubridate)  # provides %m+% and months()
get_recency <- function(last_gift_date, refresh_date) {
  last_gift_date <- as.Date(last_gift_date)
  refresh_date <- as.Date(refresh_date)

  case_when(
    is.na(last_gift_date) ~ "ERROR",
    last_gift_date > refresh_date ~ "ERROR",
    last_gift_date %m+% months(12) >= refresh_date ~ "0-12",
    last_gift_date %m+% months(24) >= refresh_date ~ "13-24",
    last_gift_date %m+% months(36) >= refresh_date ~ "25-36",
    last_gift_date %m+% months(48) >= refresh_date ~ "37-48",
    last_gift_date %m+% months(60) >= refresh_date ~ "49-60",
    last_gift_date %m+% months(72) >= refresh_date ~ "61-72",
    last_gift_date %m+% months(84) >= refresh_date ~ "73-84",
    TRUE ~ "85+")
}

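If I pass a single date as the refresh_date argument, it seems to run fine, but when I pass a vector of the same length as last_gift_date, it takes an extremely long time.

Any ideas on how to improve this would be appreciated.

Example of running the code: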

a <- c("2014-01-29", "2015-04-07", "2015-04-10")
b <- c(NA, "2014-01-29", "2015-04-07")
get_recency(b, a)

# OUTPUT
#[1] "ERROR" "13-24" "0-12" 

Update 2017-07-10: I took @Akrun's suggestion and used the cut() function. It has the benefit of being both faster and more concise. The result is below.

library(zoo)  # provides as.yearmon()

get_recency <- function(last_gift_date, refresh_date) {
  last_gift_date <- as.Date(last_gift_date)
  refresh_date <- as.Date(refresh_date)

  # Difference in calendar months (as.yearmon() drops the day of month)
  x <- (as.yearmon(refresh_date) - as.yearmon(last_gift_date)) * 12

  # Missing dates fall into the lowest ("ERROR") bin
  x <- replace(x, is.na(x), -Inf)

  cut(x, breaks = c(-Inf, -0.000001, 12, 24, 36, 48, 60, 72, 84, Inf),
      labels = c("ERROR", "0-12", "13-24", "25-36", "37-48",
                 "49-60", "61-72", "73-84", "85+"),
      include.lowest = TRUE)
}
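For readers without zoo, the same month-difference-plus-cut() idea can be sketched in base R. This is only an illustration under stated assumptions: months_between() and get_recency_base() are hypothetical names, and, like the as.yearmon() version, the sketch compares calendar months and ignores the day of month, so it is not an exact replacement for the original %m+% logic.

```r
# Hypothetical helper: number of calendar months from `from` to `to`.
# The day of month is ignored, mirroring the as.yearmon() approach.
months_between <- function(from, to) {
  from <- as.POSIXlt(as.Date(from))
  to   <- as.POSIXlt(as.Date(to))
  (to$year - from$year) * 12 + (to$mon - from$mon)
}

# Hypothetical base-R variant of the updated get_recency().
get_recency_base <- function(last_gift_date, refresh_date) {
  x <- months_between(last_gift_date, refresh_date)

  # Missing dates go to the lowest ("ERROR") bin
  x[is.na(x)] <- -Inf

  # x is integer-valued, so -1 cleanly separates negative differences from 0
  bins <- cut(x,
              breaks = c(-Inf, -1, 12, 24, 36, 48, 60, 72, 84, Inf),
              labels = c("ERROR", "0-12", "13-24", "25-36", "37-48",
                         "49-60", "61-72", "73-84", "85+"),
              include.lowest = TRUE)
  as.character(bins)
}

a <- c("2014-01-29", "2015-04-07", "2015-04-10")
b <- c(NA, "2014-01-29", "2015-04-07")
get_recency_base(b, a)
```

Everything here is plain vector arithmetic, which is presumably why this style scales better than repeated %m+% calls, though that is worth measuring on the real data.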

1 answer:

Answer 0: (score: 3)

library(lubridate)
library(dplyr)

a <- c("2014-01-29", "2015-04-07", "2015-04-10", "2025-04-10")
b <- c(NA, "2014-01-29", "2015-04-07", "2015-04-07")
intervals <- 12 * 1:7

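get_recency <- function(last_gift_date, refresh_date, intervals) {
  last_gift_date <- as.Date(last_gift_date)
  refresh_date <- as.Date(refresh_date)

  # Labels: "ERROR", "0-12", "13-24", ..., then an open-ended final bin
  intervals_chr <- c(
    "ERROR",
    paste(c(0, intervals[-length(intervals)] + 1), intervals, sep = "-"),
    paste0(tail(intervals, 1) + 1, "+")
  )

  # One column per threshold; each row counts the thresholds already passed.
  # With a single date pair, sapply() returns a plain vector and rowSums() fails.
  code <- sapply(c(0, intervals), function(n) {
    last_gift_date %m+% months(n) < refresh_date
  }) %>%
    rowSums()

  # code 0 (gift on or after refresh_date) maps to "ERROR"; NA dates do too
  if_else(condition = is.na(code), true = "ERROR",
          false = intervals_chr[code + 1])
}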

get_recency(b, a, intervals)

[1] "ERROR" "13-24" "0-12"  "85+"  

Is this faster?
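One way to check is to time each candidate on synthetic data of roughly the same size as the real dataset. The harness below is a rough sketch, not a rigorous benchmark: time_recency() is a hypothetical name, and the date range, seed, and default size are arbitrary assumptions.

```r
# Hypothetical harness: time a recency function f(last_gift_date, refresh_date)
# on n random date pairs and return the elapsed seconds.
time_recency <- function(f, n = 4e5, seed = 1) {
  set.seed(seed)
  start <- as.Date("2005-01-01")  # arbitrary start of the synthetic range

  # Random dates within roughly 11 years of `start`
  last_gift_date <- start + sample(0:4000, n, replace = TRUE)
  refresh_date   <- start + sample(0:4000, n, replace = TRUE)

  system.time(f(last_gift_date, refresh_date))[["elapsed"]]
}

# Smoke run with a trivial stand-in function on a small sample
time_recency(function(last, refresh) last > refresh, n = 1000)
```

A two-argument version can be passed directly, e.g. time_recency(get_recency); the three-argument version above needs a wrapper such as time_recency(function(l, r) get_recency(l, r, intervals)). Starting with a smaller n is sensible if a candidate is expected to be slow.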