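Oh man, I feel dumb. This is beginner stuff at its core, but I have no idea how to subset the arguments inside lapply. At this point I'm just randomly trying different combinations of [[ and friends, and cursing my clumsiness in the RStudio debugger.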
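The problem: I have a dataset pulled from SQL Server that contains several columns of date data. Some are legitimate datetime values, others are strings. Most have (valid) missing data, and some have more than one format. Often the date 1900-01-01 is used as a stand-in for NULL. I'm trying hard to solve this idiomatically and concisely, rather than brute-forcing it with copy/pasted calls.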
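My ParseDates() function seems to work fine when called column by column, but I can't get it to work with lapply. I can see that I'm sending the full lists of orders and threshold values when I only want to pass the ones that apply to the current column, but I can't figure out how lapply iterates, or how to align multiple lists so that the right arguments go with the right call.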
I need to end up with all values correctly stored as dates (or POSIXct in this instance), and with anything near 1900-01-01 set to NA.
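For reference, here is a minimal sketch of the behaviour I think I'm running into. It uses base R only; the names cols and offsets are made up for illustration and are not part of my real data. As far as I understand it, lapply calls the function once per element of its first argument and hands anything passed through ... over whole on every call, while Map walks several lists in parallel.

```r
# Toy inputs: hypothetical names, not the real dataset
cols    <- list(a = 1:3, b = 4:6)
offsets <- list(a = 10, b = 100)

# lapply() iterates over `cols` only; `off` receives the complete
# `offsets` list on every single call
whole <- lapply(cols, function(x, off) length(off), off = offsets)

# Map() steps through both lists together, pairing elements by position
paired <- Map(function(x, off) x + off, cols, offsets)

str(whole)
str(paired)
```

Note that Map pairs by position, not by name, so both lists would need one entry per column in the same order.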
library(lubridate)
# build sample data extract
events <-
  structure(
    list(
      ReservationDate = structure(
        c(4L, 2L, 3L, NA, 1L),
        .Label = c(
          "18/12/2006", "1/1/1900", "May 11 2004 12:00AM",
          "May 17 2004 12:00AM"
        ), class = "factor"
      ),
      OrigEnquiryDate = structure(
        c(1094565600, 937404000, 1089295200, NA, NA),
        class = c("POSIXct", "POSIXt"), tzone = ""
      ),
      UnconditionalDate = structure(
        c(1092146400, 935676000, 1087740000, NA, 1168952400),
        class = c("POSIXct", "POSIXt"), tzone = ""
      ),
      ContractsExchangedDate = structure(
        c(NA, NA, NA, NA, 1171544400),
        class = c("POSIXct", "POSIXt"), tzone = ""
      )
    ),
    .Names = c(
      "ReservationDate",
      "OrigEnquiryDate", "UnconditionalDate", "ContractsExchangedDate"
    ),
    row.names = c(54103L, 54090L, 54057L, 135861L, 73433L),
    class = "data.frame"
  )
ParseDates <- function(x, orders=NULL, threshold=10) {
  # converts to POSIXct if required and replaces 1900-01-01 or similar with na
  if(!is.null(orders)) {
    x <- parse_date_time(x, orders)
  }
  x[abs(difftime(x, as.POSIXct("1900-01-01"), units="days")) < threshold] <- NA
  return(x)
}
# only consider these columns
date.cols <- names(events) %in% c(
  "ReservationDate", "UnconditionalDate", "ContractsExchangedDate", "OrigEnquiryDate"
)
# columns other than these should use the default threshold of 10
date.thresholds <- list("UnconditionalDate"=90, "ContractsExchangedDate"=400)
# columns *other* than these should use the default order of NULL,
# they skip parsing and go straight to threshold testing
date.orders <- list(
  "SettlementDate"=c("dmY", "bdY I:Mp"),
  "ReservationDate"=c("dmY", "bdY I:Mp")
)
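# my attempt, which does not work: it sends the full lists of orders and
# thresholds to ParseDates() instead of the values for the current column
events[date.cols] <- lapply(events[date.cols],
                            ParseDates(events[date.cols],
                                       orders = date.orders,
                                       threshold = date.thresholds))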
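One direction that might work is to iterate over the column names rather than the columns, and look up each column's settings by name, falling back to the defaults when a name is missing. The sketch below is only an illustration under stated assumptions: toy, parse_dates, formats and thresholds are hypothetical stand-ins, and parse_dates uses base R's as.POSIXct with a single format per column instead of lubridate's parse_date_time, so that it runs without extra packages.

```r
# Hypothetical miniature of the data: one string column, one POSIXct column
toy <- data.frame(
  ReservationDate   = c("18/12/2006", "1/1/1900", NA),
  UnconditionalDate = as.POSIXct(c("2004-08-10", "1900-02-15", "2007-01-16"),
                                 tz = "UTC"),
  stringsAsFactors  = FALSE
)

# Stand-in for ParseDates(): base R only, one strptime format per column
parse_dates <- function(x, format = NULL, threshold = 10) {
  if (!is.null(format)) {
    x <- as.POSIXct(as.character(x), format = format, tz = "UTC")
  }
  sentinel <- as.POSIXct("1900-01-01", tz = "UTC")
  # blank out anything within `threshold` days of the sentinel date
  x[abs(difftime(x, sentinel, units = "days")) < threshold] <- NA
  x
}

# Per-column overrides; columns not listed here use the defaults
formats    <- list(ReservationDate = "%d/%m/%Y")
thresholds <- list(UnconditionalDate = 90)

cols <- c("ReservationDate", "UnconditionalDate")

# Iterate over names so each call can look up its own settings;
# [[ on a list returns NULL when the name is absent
toy[cols] <- lapply(cols, function(nm) {
  thr <- thresholds[[nm]]
  if (is.null(thr)) thr <- 10
  parse_dates(toy[[nm]], format = formats[[nm]], threshold = thr)
})

str(toy)
```

If this pattern holds, the same shape should carry over to the real objects by swapping in events, ParseDates, date.orders and date.thresholds, though I have not verified that against the full dataset.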