Appending rows to a data.table (passed as an argument to a function)

Date: 2019-02-01 17:18:47

Tags: r data.table

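I am passing some data.tables to a function and want to collect a growing set of results in the passed data.tables over multiple function calls. The rows are added (appended) inside the function.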

Is it possible to append rows to a data.table "by reference" (in place)?

If not, is there a workaround?

Edit: My goal is to add multiple rows at once inside the function, and the number of rows can be very large (which is why I use data.table).
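To make the problem concrete, here is a minimal sketch of the behaviour in question. The helper names `add_column` and `add_row` are hypothetical and exist only for this illustration: a column added with `:=` inside a function is visible to the caller, whereas `rbind` builds a new table and only rebinds the local name.

```r
library(data.table)

add_row <- function(dt) {
  # rbind() creates a new table; only the local name "dt" is rebound
  dt <- rbind(dt, data.table(a = 99L))
  invisible(NULL)
}

add_column <- function(dt) {
  # := modifies the caller's table in place
  dt[, flag := TRUE]
  invisible(NULL)
}

dt <- data.table(a = 1:3)
add_row(dt)
nrow(dt)      # still 3: the appended row never reaches the caller
add_column(dt)
names(dt)     # "a" "flag": the new column does
```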

library(data.table)

validate <- function(data, rule, valid.result, checked.rules) {
  # ... find errors

  # How to append "rule" to "checked.rules"?

  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")  # just a contrived example
  # How to append all "findings" to "valid.result"?
}

data          <- data.table(a=1:10, b=21:30)
valid.result  <- data.table(err.code = integer(0), msg       = character(0))  # empty validation results table
checked.rules <- data.table(rule.id  = integer(0), rule.name = character(0))  # empty table
rules         <- data.table(rule.id  = 1:4,        rule.name = c("too big", "too small", "too late", "empty"))

validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)

Expected result:

checked.rules
#    rule.id rule.name
# 1:       3  too late
# 2:       1   too big
# 3:       4     empty

valid.result
#    err.code            msg
# 1:        3 some blah blah
# 2:        1 some blah blah
# 3:        4 some blah blah

2 answers:

Answer 0 (score: 1)

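As already mentioned in the link provided by @Henrik, rows currently cannot be added to a data.table by reference. I would therefore go with rbindlist (which also works well for adding multiple rows):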

library(data.table)

validate <- function(data, rule, valid.result, checked.rules) {
  # ... find errors

  # Append "rule" to "checked.rules".
  # Note: <<- assigns to the variable of the same name in the enclosing (here: global)
  # environment, not to the function argument itself.
  checked.rules <<- rbindlist(list(checked.rules, rule))

  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")  # just a contrived example
  # Append all "findings" to "valid.result" (again via <<-)
  valid.result <<- rbindlist(list(valid.result, findings))
}

data          <- data.table(a=1:10, b=21:30)
valid.result  <- data.table(err.code = integer(0), msg       = character(0))  # empty validation results table
checked.rules <- data.table(rule.id  = integer(0), rule.name = character(0))  # empty table
rules         <- data.table(rule.id  = 1:4,        rule.name = c("too big", "too small", "too late", "empty"))

validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)

print(checked.rules)
print(valid.result)
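The `<<-` assignments above only work because the global variables have the same names as the function arguments. If that coupling is a concern, one alternative is to return the updated tables and reassign them in the caller. This is a sketch only, not part of the original answer; `validate2` and `state` are hypothetical names.

```r
library(data.table)

# Hypothetical variant of validate(): returns the updated tables instead of using <<-
validate2 <- function(data, rule, valid.result, checked.rules) {
  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")
  list(valid.result  = rbindlist(list(valid.result, findings)),
       checked.rules = rbindlist(list(checked.rules, rule)))
}

rules <- data.table(rule.id = 1:4, rule.name = c("too big", "too small", "too late", "empty"))
state <- list(valid.result  = data.table(err.code = integer(0), msg = character(0)),
              checked.rules = data.table(rule.id = integer(0), rule.name = character(0)))

# The caller reassigns the returned state after every call
for (i in c(3, 1, 4)) {
  state <- validate2(NULL, rules[i, ], state$valid.result, state$checked.rules)
}

print(state$checked.rules)
print(state$valid.result)
```

This still copies the accumulated tables on every call, so it addresses the scoping issue but not the cost of growing a large table step by step.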

Answer 1 (score: 1)

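After reading the links in the comments and @ismirsehregal's suggestion to use a list, I ended up using an environment so that I can collect multiple results "by reference".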
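As a minimal sketch of the idea (the function name `collect` is hypothetical): environments have reference semantics in R, so an assignment made to an element of the environment inside a function is visible to the caller.

```r
library(data.table)

collect <- function(results, findings) {
  # "results" is an environment, so this assignment is visible to the caller
  results$valid.result <- rbindlist(list(results$valid.result, findings))
  invisible(NULL)
}

results <- new.env()
results$valid.result <- data.table(err.code = integer(0), msg = character(0))

collect(results, data.table(err.code = 3L, msg = "some blah blah"))
collect(results, data.table(err.code = 1L, msg = "some blah blah"))

print(results$valid.result)   # two rows, collected across both calls
```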

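I benchmarked two variants:

  1. rbind the intermediate result to the "accumulated" result at the end of each function call ("append inside the function").

  2. Collect the intermediate results of each function call and rbindlist them only once at the end ("append outside the function").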

The code is simplified and produces about 9 million rows after 20 function calls:

library(data.table)
library(microbenchmark)

validate.rbind <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1))  # just a contrived example
  # grow the accumulated result on every call
  results$valid.result <- rbind(results$valid.result, findings) # same as: rbindlist(list(results$valid.result, findings))
}

validate.rbindlist <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1))  # just a contrived example
  # store this call's result under a numbered name (res01, res02, ...) in the environment
  assign(paste0("res", sprintf("%02d", results$counter)), findings, envir = results)
  results$counter <- results$counter + 1
}

microbenchmark(
  rbind.per.call = {
    set.seed(0815)   # make random numbers reproducible
    data                 <- data.table(a=1:100, b=21:30)
    results              <- new.env()   # use an environment to pass arguments by reference
    results$valid.result <- data.table(err.code = integer(0), msg = character(0))  # empty validation results table
    for (i in 1:20) {
      validate.rbind(data, results)
    }
  },
  rbindlist.once = {
    set.seed(0815)   # make random numbers reproducible
    data                 <- data.table(a=1:100, b=21:30)
    results              <- new.env()   # use an environment to pass arguments by reference
    results$counter      <- 1
    for (i in 1:20) {
      validate.rbindlist(data, results)
    }
    result.vars <- ls(envir = results, pattern = "^res.*")  # identify the result tables via the used naming pattern
    results$valid.result <- rbindlist(mget(result.vars, envir = results))
    rm(list = result.vars, envir = results)  # remove the intermediate result tables (keep only the total result)
  },
  times = 10)

Variant 2 is about four times faster:

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval
 rbind.per.call 1021.2956 1114.8187 1198.7033 1153.7775 1324.6672 1477.5669    10
 rbindlist.once  231.0477  249.7195  305.0974  260.2499  275.3446  713.1155    10

The peak memory footprint (the "max used" column of gc()) is lower as well, at about 298 Mb versus 399 Mb of Vcells:

# Memory consumption for rbind.per.call:
#            used (Mb)  gc trigger  (Mb) max used  (Mb)
# Ncells   510152  27.3     940480  50.3   847768  45.3
# Vcells 19636460 149.9   55027624 419.9 52254173 398.7

# Memory consumption for rbindlist.once:
#            used (Mb)  gc trigger  (Mb) max used  (Mb)
# Ncells   604335  32.3    1168576  62.5   940480  50.3
# Vcells 19859703 151.6   55503896 423.5 39082073 298.2
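Variant 2 does not depend on generating variable names with assign(). A sketch of the same idea, keeping the intermediate results in a plain list inside the environment, could look like this. The names `validate.list` and `results$parts` are hypothetical, the row counts are kept tiny for illustration, and this version was not part of the benchmark above.

```r
library(data.table)

validate.list <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(10, 1)))
  # append this call's result to the list held by the environment
  results$parts[[length(results$parts) + 1L]] <- findings
  invisible(NULL)
}

set.seed(815)
results       <- new.env()
results$parts <- list()

for (i in 1:20) {
  validate.list(NULL, results)
}

# bind all intermediate results once, then drop them
results$valid.result <- rbindlist(results$parts)
rm("parts", envir = results)

nrow(results$valid.result)
```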

PS: I did not test the linked set variant, since I did not expect better performance from it and it is more complicated to use.
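For readers wondering what a set-based approach might involve, here is a rough sketch. It is not the linked answer's code, only an illustration under the assumption that an upper bound on the number of rows is known in advance. The names `n.max`, `used` and `add.findings` are hypothetical. The table is preallocated once and filled in place with data.table's `set()`.

```r
library(data.table)

n.max <- 10L   # assumed upper bound on the total number of rows
res   <- data.table(err.code = rep(NA_integer_, n.max),
                    msg      = rep(NA_character_, n.max))
used  <- 0L    # number of rows filled so far

add.findings <- function(res, used, findings) {
  stopifnot(used + nrow(findings) <= nrow(res))   # no room left would require a reallocation
  idx <- used + seq_len(nrow(findings))
  for (col in names(findings)) {
    set(res, i = idx, j = col, value = findings[[col]])   # writes into "res" by reference
  }
  used + nrow(findings)   # the caller has to keep track of the fill level
}

used <- add.findings(res, used, data.table(err.code = 3L, msg = c("x", "y")))
used <- add.findings(res, used, data.table(err.code = 1L, msg = "z"))

print(res[seq_len(used)])   # only the rows filled so far
```

The extra bookkeeping (preallocation size, fill counter, trimming at the end) is the kind of added complexity the PS refers to.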