Using lapply with lists of variable arguments

Asked: 2016-02-04 06:04:37

Tags: r subset lapply

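Oh man, I feel dumb. This is beginner stuff, but I am completely lost on how to subset the arguments inside lapply. At this point I am just randomly trying different combinations of [[ and friends, and cursing my clumsiness in the RStudio debugger.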

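Problem: I have a dataset collected from SQL Server that contains several columns of date data. Some of them are legitimate datetime values, others are strings. Most have (valid) missing data, and some come in more than one format. Often the date 1900-01-01 is used as a stand-in for NULL. I am trying to solve this idiomatically and concisely rather than brute-forcing it with copy/pasted calls.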

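My ParseDates() function seems to work fine when called column by column, but I can't get it to work with lapply. I can see that I am sending the complete lists of orders and threshold values when I only want to pass the entry for the current column, but I can't get my head around how lapply iterates, or how to line up multiple lists so that the right arguments go with the right call.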
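As an illustration of the iteration behaviour described above: lapply() steps through its first argument only, and anything passed after the function is handed unchanged to every call. Map() (a thin wrapper around mapply()) steps through several lists in parallel instead. The following is a minimal base R sketch; `vals` and `offsets` are made-up values, not part of the dataset.

```r
# Two small named lists (hypothetical values for illustration only)
vals    <- list(a = 1, b = 2)
offsets <- list(a = 10, b = 20)

# lapply() iterates over `vals`; `off = offsets` is passed whole to each call,
# so every call sees the full two-element list.
whole <- lapply(vals, function(x, off) length(off), off = offsets)

# Map() walks both lists together, pairing elements by position.
paired <- Map(function(x, off) x + off, vals, offsets)
```

Note that Map() pairs elements by position, not by name, so it only lines things up when both lists have one entry per column in the same order.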

I need to end up with all values properly stored as dates (or POSIXct in this instance), with anything near 1900-01-01 set to NA.

library(lubridate)

# build sample data extract
events <-
  structure(
    list(
      ReservationDate = structure(
        c(4L, 2L, 3L, NA,
          1L), .Label = c(
            "18/12/2006", "1/1/1900", "May 11 2004 12:00AM",
            "May 17 2004 12:00AM"
          ), class = "factor"
      ), OrigEnquiryDate = structure(
        c(1094565600,
          937404000, 1089295200, NA, NA), class = c("POSIXct", "POSIXt"), tzone = ""
      ), UnconditionalDate = structure(
        c(1092146400, 935676000,
          1087740000, NA, 1168952400), class = c("POSIXct", "POSIXt"), tzone = ""
      ),
      ContractsExchangedDate = structure(
        c(NA, NA, NA, NA, 1171544400), class = c("POSIXct", "POSIXt"), tzone = ""
      )
    ), .Names = c(
      "ReservationDate",
      "OrigEnquiryDate", "UnconditionalDate", "ContractsExchangedDate"
    ), row.names = c(54103L, 54090L, 54057L, 135861L, 73433L), class = "data.frame"
  )

ParseDates <- function(x, orders=NULL, threshold=10) {
  # converts to POSIXct if required and replaces 1900-01-01 or similar with na
  if(!is.null(orders)) {
    x <- parse_date_time(x, orders)
  }
  x[abs(difftime(x, as.POSIXct("1900-01-01"), units="days")) < threshold] <- NA
  return(x)
}

# only consider these columns
date.cols <- names(events) %in% c(
  "ReservationDate", "UnconditionalDate", "ContractsExchangedDate", "OrigEnquiryDate"
)

# columns other than these should use the default threshold of 10
date.thresholds <- list("UnconditionalDate"=90, "ContractsExchangedDate"=400)

# columns *other* than these should use the default order of NULL,
#   they skip parsing and go straight to threshold testing
date.orders <- list(
  "SettlementDate"=c("dmY", "bdY I:Mp"),
  "ReservationDate"=c("dmY", "bdY I:Mp")
)

events[date.cols] <- lapply(events[date.cols],
                            ParseDates(events[date.cols],
                                       orders = date.orders,
                                       threshold = date.thresholds))
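The call above evaluates ParseDates() once on the whole data frame, with the complete lists, before lapply ever runs. One possible way to pair each column with its own settings is to iterate over column names and look the settings up by name, falling back to the defaults when a name is absent. This is a minimal sketch under stated assumptions, not a tested solution for the full dataset: `get_or`, `parse_stub` and `demo_events` are hypothetical names, and `parse_stub` is a base R stand-in for ParseDates() that takes a single strptime format instead of lubridate orders.

```r
# Hypothetical helper: per-column setting if present, otherwise a default.
get_or <- function(lst, name, default) {
  if (name %in% names(lst)) lst[[name]] else default
}

# Stand-in for ParseDates() (assumption: one strptime format, base R only).
parse_stub <- function(x, orders = NULL, threshold = 10) {
  if (!is.null(orders)) {
    x <- as.POSIXct(strptime(as.character(x), format = orders, tz = "UTC"))
  }
  sentinel <- as.POSIXct("1900-01-01", tz = "UTC")
  # which() drops NA comparisons, so already-missing values stay untouched
  near <- which(abs(difftime(x, sentinel, units = "days")) < threshold)
  x[near] <- NA
  x
}

# Tiny made-up data frame: one string column, one POSIXct column.
demo_events <- data.frame(a = c("18/12/2006", "01/01/1900", NA),
                          stringsAsFactors = FALSE)
demo_events$b <- as.POSIXct(c("2004-08-10", "1900-03-01", "2007-01-16"),
                            tz = "UTC")

cols             <- c("a", "b")
orders_by_col    <- list(a = "%d/%m/%Y")  # "b" is absent, so it skips parsing
threshold_by_col <- list(b = 90)          # "a" is absent, so it uses 10

# Iterate over names so each call receives only its own column's settings.
demo_events[cols] <- lapply(cols, function(nm) {
  parse_stub(demo_events[[nm]],
             orders    = get_or(orders_by_col, nm, NULL),
             threshold = get_or(threshold_by_col, nm, 10))
})
```

With the objects from the question, the same pattern would iterate over names(events)[date.cols] and call ParseDates() with lookups into date.orders and date.thresholds. Looking up by name avoids having to keep the settings lists in the same order as the columns.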

0 answers:

No answers