Question

What is the best way to build and evaluate a table of conditions against a dataset?

For example, suppose I want to identify invalid rows in a dataset like the following:

library("data.table")

# notional example -- some observations are wrong, some missing
set.seed(1)
n = 100 # Number of customers.
        # Also included are "non-customers" where values except cust_id should be NA.
cust <- data.table( cust_id = sample.int(n+1),
                    first_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-01-01"), n), NA),
                    last_purch_dt = 
                      c(sample(as.Date(c(1:n, NA), origin="2000-04-01"), n), NA),
                    largest_purch_amt = 
                      c(sample(c(50:100, NA), n, replace=TRUE), NA),
                    last_purch_amt = 
                      c(sample(c(1:65,NA), n, replace=TRUE), NA)
                    )
setkey(cust, cust_id)

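For each observation, the errors I want to check for are any occurrence of last_purch_dt < first_purch_dt or largest_purch_amt < last_purch_amt, as well as any pattern of missing values other than all or none. (All values missing is fine for non-purchasers.)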

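I would like to store the expressions as strings in a single table of conditions, rather than as a series of hard-coded expressions (which become very verbose and hard to document and maintain):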

checks <- data.table( cond_id = c(1L:3L),
                      cond_txt = c("last_purch_dt < first_purch_dt",
                                  "largest_purch_amt < last_purch_amt",
                                  paste("( is.na(first_purch_dt) + is.na(last_purch_dt) +",
                                          "is.na(largest_purch_amt) +",
                                          "is.na(last_purch_amt) ) %% 4 != 0") # hacky XOR  
                                  ),
                      cond_msg = c("Error: last purchase prior to first purchase.",
                                   "Error: largest purchase less than last purchase.",
                                   "Error: partial transaction record.")
                     )
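As an aside, the modulo trick in the third condition can be checked against a more explicit "some but not all missing" test. The following is a minimal sketch on a three-row toy table; the names toy, cols and n_missing are illustrative and not part of the original example.

```r
library(data.table)

# Toy rows: complete record, partial record, fully missing record
toy <- data.table(
  first_purch_dt    = as.Date(c("2000-01-05", "2000-01-09", NA)),
  last_purch_dt     = as.Date(c("2000-04-02", NA, NA)),
  largest_purch_amt = c(80L, 75L, NA),
  last_purch_amt    = c(40L, NA, NA)
)
cols <- c("first_purch_dt", "last_purch_dt", "largest_purch_amt", "last_purch_amt")

# Count the missing values per row across the four transaction columns
n_missing <- toy[, Reduce(`+`, lapply(.SD, is.na)), .SDcols = cols]

# Modulo form from the checks table vs. an explicit range test
partial_modulo   <- n_missing %% 4 != 0
partial_explicit <- n_missing > 0 & n_missing < length(cols)

print(data.table(n_missing, partial_modulo, partial_explicit))
```

Both forms flag only the row with a partial record; the explicit form may be easier to document if the number of columns changes.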

I know I can loop over the rows of the conditions table and rbindlist the resulting subsets, for example:

err_obs <- 
  rbindlist(
    lapply(1:nrow(checks), function(i) {
      err_set <- cust[eval( parse(text= checks[i,cond_txt]) ) ,  ]
      cbind(err_set, 
            checks[i, .(err_id = rep.int(cond_id, times = nrow(err_set)),
                        err_msg = rep.int(cond_msg, times = nrow(err_set))
                        )]
            )                
    } )
  )
print(err_obs) # returns desired result

This seems to evaluate correctly and to handle NA properly.
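The NA handling can be illustrated in isolation: data.table treats NA in a logical i as FALSE, so a comparison involving a missing value does not select the row. A minimal sketch, assuming a toy table with illustrative columns a and b:

```r
library(data.table)

toy <- data.table(id = 1:4,
                  a  = c(1, NA, 3, 5),
                  b  = c(2, 1, NA, 4))

# The raw comparison yields NA wherever either side is missing
flag <- toy[, a > b]

# Used as a row filter, NA behaves like FALSE, so only complete TRUE rows survive
hits <- toy[a > b]

print(flag)
print(hits)
```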

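When I ask "what is the best way", I mean:

Is this the best approach, or is there a more efficient or idiomatic alternative to rbindlist(lapply(...))?
Are there pitfalls in my current approach?
Could this be written as a merge or join, e.g. cust inner join checks on eval(checks.condition(cust.values)) == TRUE?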

Answer 1

This is how I would do it:

checks[, cust[eval(parse(text = cond_txt), .SD)][, err_msg := cond_msg], by = cond_id]

The only really important part of the above is the presence of .SD - see this question for an explanation.
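Because the scoping of eval inside [ is the delicate part here, one alternative is to make the evaluation environment explicit with base R's eval(expr, envir, enclos). The sketch below is not the answerer's code; run_checks, toy and toy_checks are hypothetical names, and the toy data is a cut-down version of the question's tables.

```r
library(data.table)

# Hypothetical helper (not part of data.table): evaluate each condition string
# against dt and stack the offending rows with the check id and message attached.
run_checks <- function(dt, checks) {
  rbindlist(lapply(seq_len(nrow(checks)), function(i) {
    # Columns of dt are looked up first; anything else resolves in base R only
    flag <- eval(parse(text = checks$cond_txt[i]), envir = dt, enclos = baseenv())
    idx  <- which(flag)                  # which() drops NA as well as FALSE
    cbind(dt[idx],
          err_id  = rep(checks$cond_id[i],  length(idx)),
          err_msg = rep(checks$cond_msg[i], length(idx)))
  }))
}

toy <- data.table(
  cust_id        = 1:4,
  first_purch_dt = as.Date(c("2000-01-05", "2000-03-01", "2000-01-09", NA)),
  last_purch_dt  = as.Date(c("2000-04-02", "2000-02-01", NA, NA))
)
toy_checks <- data.table(
  cond_id  = 1:2,
  cond_txt = c("last_purch_dt < first_purch_dt",
               "(is.na(first_purch_dt) + is.na(last_purch_dt)) %% 2 != 0"),
  cond_msg = c("Error: last purchase prior to first purchase.",
               "Error: partial transaction record.")
)

res <- run_checks(toy, toy_checks)
print(res)
```

Restricting enclos to baseenv() is a deliberate choice in this sketch so that a condition cannot accidentally pick up a variable from the calling scope; it would need widening if conditions call functions from other packages. As with any eval(parse(...)) approach, the condition strings are executed as code, so they should come from a trusted source.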

Table-driven evaluation using R data.table
