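What is the best way to build and evaluate a table of conditions that are checked against a dataset?
For example, suppose I want to identify invalid rows in a dataset like the following: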
library("data.table")
# notional example -- some observations are wrong, some missing
set.seed(1)
n = 100 # Number of customers.
# Also included are "non-customers" where values except cust_id should be NA.
cust <- data.table( cust_id = sample.int(n+1),
                    first_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-01-01"), n), NA),
                    last_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-04-01"), n), NA),
                    largest_purch_amt =
                      c(sample(c(50:100, NA), n, replace=TRUE), NA),
                    last_purch_amt =
                      c(sample(c(1:65,NA), n, replace=TRUE), NA)
                  )
setkey(cust, cust_id)
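For each observation, the errors I want to check for are any occurrence of last_purch_dt < first_purch_dt or largest_purch_amt < last_purch_amt, as well as any pattern of missing values other than all or none. (For non-purchasers, having every value missing is fine.)
Rather than writing a series of hard-coded expressions (which get very verbose and hard to document and maintain), I would like to store the expressions as strings in a single table of conditions: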
checks <- data.table( cond_id = c(1L:3L),
                      cond_txt = c("last_purch_dt < first_purch_dt",
                                   "largest_purch_amt < last_purch_amt",
                                   paste("( is.na(first_purch_dt) + is.na(last_purch_dt) +",
                                         "is.na(largest_purch_amt) +",
                                         "is.na(last_purch_amt) ) %% 4 != 0") # hacky XOR
                                  ),
                      cond_msg = c("Error: last purchase prior to first purchase.",
                                   "Error: largest purchase less than last purchase.",
                                   "Error: partial transaction record.")
                    )
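To see what the third condition (the "hacky XOR") computes, here is a small base R sketch. The vector n_missing is a hypothetical stand-in for the sum of the four is.na() flags in one row; it is not part of the original tables.

```r
# Hypothetical stand-in: number of missing fields in a row (0 to 4)
n_missing <- 0:4

# A row is a partial record when some, but not all, of the four fields are NA.
# 0 %% 4 and 4 %% 4 are both 0, so "none missing" and "all missing" pass.
partial <- n_missing %% 4 != 0

print(data.frame(n_missing, partial))
```

Only rows with one, two, or three missing fields are flagged. The trick depends on there being exactly four fields, so the modulus would need to change if columns were added.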
I know I can loop over the rows of the conditions table and rbindlist the resulting subsets, e.g.:
err_obs <-
  rbindlist(
    lapply(1:nrow(checks), function(i) {
      err_set <- cust[eval( parse(text= checks[i,cond_txt]) ) , ]
      cbind(err_set,
            checks[i, .(err_id = rep.int(cond_id, times = nrow(err_set)),
                        err_msg = rep.int(cond_msg, times = nrow(err_set))
                       )]
           )
    } )
  )
print(err_obs) # returns desired result
This appears to work correctly and to handle NA values properly during evaluation.
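The NA behavior follows from how data.table subsets rows: an NA in a logical i is treated as FALSE, so a comparison involving a missing value never selects the row. A minimal sketch on a hypothetical toy table (dt, a, and b are illustrative names only):

```r
library(data.table)

# Hypothetical toy table with missing values in both columns
dt <- data.table(id = 1:4, a = c(1, 5, NA, 2), b = c(2, 3, 4, NA))

# The raw comparison is TRUE, FALSE, NA, NA
cmp <- dt[, a < b]

# In a logical i, data.table treats NA as FALSE, so only the first row is kept
res <- dt[a < b]
print(res)
```

This is why the two comparison rules skip the all-NA non-customer rows without an explicit is.na() guard.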
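When I say "what is the best way", I am asking: is there a more efficient or idiomatic alternative to rbindlist(lapply(...))? Something along the lines of cust inner join checks on eval(checks.condition(cust.values)) == TRUE?

Answer 0 (score: 3)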
Here is how I would do it:
checks[, cust[eval(parse(text = cond_txt), .SD)][, err_msg := cond_msg], by = cond_id]
The only really important part of the above is the presence of .SD; see this question for an explanation.
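If the scoping of eval inside the data.table brackets feels opaque, one alternative is to compute the logical vector outside the brackets, passing the data explicitly as the evaluation environment. This is only a sketch under that assumption, not the method of the answer above; toy and rules are hypothetical stand-ins for cust and checks.

```r
library(data.table)

# Hypothetical toy data and rule table (stand-ins for cust and checks)
toy <- data.table(row_id = 1:4, x = c(1, 5, NA, 7), y = c(2, 3, 4, 6))
rules <- data.table(
  cond_id  = 1:2,
  cond_txt = c("x > y", "is.na(x)"),
  cond_msg = c("x exceeds y", "x is missing")
)

errs <- rbindlist(lapply(seq_len(nrow(rules)), function(i) {
  # Evaluate the rule text with the data's columns as the environment
  hit <- eval(parse(text = rules$cond_txt[i]), envir = toy)
  # which() drops NA results, so unknown comparisons are not flagged
  toy[which(hit)][, `:=`(err_id = rules$cond_id[i], err_msg = rules$cond_msg[i])]
}))
print(errs)
```

Because the rule is evaluated by base eval with an explicit envir, the result does not depend on how data.table rewrites eval calls in i. The trade-off is that it keeps the explicit loop rather than expressing the work as one grouped call.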