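I am passing a data.table to a function and want to collect a growing set of results in that data.table across multiple function calls. The rows are added (appended) inside the function.
Is it possible to append rows to a data.table "by reference / in place"?
If not, is there a workaround?
Edit: My goal is to add multiple rows at once inside the function, and the number of rows can be very large (which is why I use data.table).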
library(data.table)
validate <- function(data, rule, valid.result, checked.rules) {
# ... find errors
# How to append "rule" to "checked.rules"?
findings <- data.table(err.code = rule$rule.id, msg = "some blah blah") # just a simple example
# How to append all "findings" to "valid.result"?
}
data <- data.table(a=1:10, b=21:30)
valid.result <- data.table(err.code = integer(0), msg = character(0)) # empty validation results table
checked.rules <- data.table(rule.id = integer(0), rule.name = character(0)) # empty table
rules <- data.table(rule.id = 1:4, rule.name = c("too big", "too small", "too late", "empty"))
validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)
Expected result:
checked.rules
#    rule.id rule.name
# 1:       3  too late
# 2:       1   too big
# 3:       4     empty
valid.result
#    err.code            msg
# 1:        3 some blah blah
# 2:        1 some blah blah
# 3:        4 some blah blah
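Answer 0 (score: 1)
As already mentioned in the link provided by @Henrik, rows currently cannot be added to a data.table by reference. I would therefore go with rbindlist (which also works well for adding multiple rows at once):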
library(data.table)
validate <- function(data, rule, valid.result, checked.rules) {
# ... find errors
# Append the checked rule. `<<-` assigns to the variable in the enclosing (here: global)
# environment; a plain `<-` would only rebind the function's local variable.
checked.rules <<- rbindlist(list(checked.rules, rule))
findings <- data.table(err.code = rule$rule.id, msg = "some blah blah") # just a simple example
# Append all findings (possibly many rows) to "valid.result"
valid.result <<- rbindlist(list(valid.result, findings))
}
data <- data.table(a=1:10, b=21:30)
valid.result <- data.table(err.code = integer(0), msg = character(0)) # empty validation results table
checked.rules <- data.table(rule.id = integer(0), rule.name = character(0)) # empty table
rules <- data.table(rule.id = 1:4, rule.name = c("too big", "too small", "too late", "empty"))
validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)
print(checked.rules)
print(valid.result)
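Note that `<<-` only works as intended here because `validate` is defined at top level and the result tables are global variables. A minimal sketch of an alternative that avoids `<<-` by returning the updated tables is shown below; the names `validate2` and `state` are hypothetical and not part of the original answer.

```r
library(data.table)

# Hypothetical variant: return the updated tables instead of assigning with `<<-`
validate2 <- function(data, rule, state) {
  # ... find errors
  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")
  list(
    checked.rules = rbindlist(list(state$checked.rules, rule)),
    valid.result  = rbindlist(list(state$valid.result, findings))
  )
}

data  <- data.table(a = 1:10, b = 21:30)
rules <- data.table(rule.id = 1:4, rule.name = c("too big", "too small", "too late", "empty"))

# Both result tables travel together in one list
state <- list(
  checked.rules = data.table(rule.id = integer(0), rule.name = character(0)),
  valid.result  = data.table(err.code = integer(0), msg = character(0))
)

# The caller re-assigns the returned state after every call
for (i in c(3, 1, 4)) state <- validate2(data, rules[i, ], state)

print(state$checked.rules)
print(state$valid.result)
```

The trade-off is that the caller has to re-assign the result after each call, but the function no longer depends on where it is defined.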
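Answer 1 (score: 1)
After reading the links in the comments and @ismirsehregal's suggestion to use a list, I ended up using an environment so that I can collect multiple results "by reference".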
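As a minimal illustration of why an environment helps: unlike most R objects, environments are not copied when they are passed to a function, so assignments made inside the function are visible to the caller. The helper name `collect` is hypothetical and only serves this sketch.

```r
library(data.table)

# Hypothetical helper: appends findings to the table stored in the environment
collect <- function(results, findings) {
  # `results` is an environment, so this assignment is visible to the caller
  results$valid.result <- rbind(results$valid.result, findings)
  invisible(NULL)
}

results <- new.env()
results$valid.result <- data.table(err.code = integer(0), msg = character(0))

collect(results, data.table(err.code = 3L, msg = "some blah blah"))
collect(results, data.table(err.code = 1L, msg = "some blah blah"))

print(results$valid.result)  # contains both appended rows
```

The table itself is still rebuilt by `rbind` on every call; only the container holding it is shared by reference.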
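I benchmarked two variants:
1. rbind the intermediate result onto the "accumulated" result at the end of each function call ("append inside the function").
2. Collect the intermediate results of each function call and combine them with rbindlist only once at the end ("append outside the function").
The code is simplified and produces about 9 million rows after 20 function calls: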
library(data.table)
library(microbenchmark)
# Variant 1: append to the accumulated result inside every call
validate.rbind <- function(data, results) {
findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1)) # just a simple example
results$valid.result <- rbind(results$valid.result, findings) # same as: rbindlist(list(results$valid.result, findings))
}
# Variant 2: only store the intermediate result under a numbered name (res01, res02, ...)
validate.rbindlist <- function(data, results) {
findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1)) # just a simple example
assign(paste0("res", sprintf("%02d", results$counter)), findings, envir = results)
results$counter <- results$counter + 1
}
microbenchmark(
rbind.per.call = {
set.seed(0815) # make random numbers reproducible
data <- data.table(a=1:100, b=21:30)
results <- new.env() # use an environment to pass arguments by reference
results$valid.result <- data.table(err.code = integer(0), msg = character(0)) # empty validation results table
for (i in 1:20) {
validate.rbind(data, results)
}
},
rbindlist.once = {
set.seed(0815) # make random numbers reproducible
data <- data.table(a=1:100, b=21:30)
results <- new.env() # use an environment to pass arguments by reference
results$counter <- 1
for (i in 1:20) {
validate.rbindlist(data, results)
}
result.vars <- ls(envir = results, pattern = "^res.*") # identify the result tables via the used naming pattern
results$valid.result <- rbindlist(mget(result.vars, envir = results))
rm(list = result.vars, envir = results) # remove the intermediate result tables (keep only the total result)
},
times = 10)
Variant 2 is about four times faster:
Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval
 rbind.per.call 1021.2956 1114.8187 1198.7033 1153.7775 1324.6672 1477.5669    10
 rbindlist.once  231.0477  249.7195  305.0974  260.2499  275.3446  713.1155    10
Peak memory use (the "max used" column of gc()) is lower as well, roughly 298 Mb versus 399 Mb of Vcells:
# Memory consumption for rbind.per.call:
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells   510152  27.3     940480  50.3   847768  45.3
# Vcells 19636460 149.9   55027624 419.9 52254173 398.7
# Memory consumption for rbindlist.once:
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells   604335  32.3    1168576  62.5   940480  50.3
# Vcells 19859703 151.6   55503896 423.5 39082073 298.2
PS: I did not test the linked set variant because I do not expect it to perform better and it is more complicated to use.
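For reference, a rough sketch of what a `set`-based approach might look like. The linked variant is not shown in this thread, so this assumes it means preallocating a table and filling rows by reference with `set()`; `capacity`, `buf`, `state` and `append.rows` are hypothetical names.

```r
library(data.table)

capacity <- 10L  # hypothetical upper bound on the total number of rows

# Preallocated buffer with NA placeholders
buf <- data.table(err.code = rep(NA_integer_, capacity),
                  msg      = rep(NA_character_, capacity))

# The number of rows filled so far is tracked in an environment
state <- new.env()
state$n <- 0L

append.rows <- function(buf, state, findings) {
  k <- nrow(findings)
  stopifnot(state$n + k <= nrow(buf))  # the buffer cannot grow by reference
  rows <- state$n + seq_len(k)
  # set() overwrites the selected rows in place, column by column
  for (col in names(findings)) {
    set(buf, i = rows, j = col, value = findings[[col]])
  }
  state$n <- state$n + k
  invisible(buf)
}

append.rows(buf, state, data.table(err.code = c(3L, 3L), msg = "some blah blah"))
append.rows(buf, state, data.table(err.code = 1L, msg = "some blah blah"))

result <- buf[seq_len(state$n)]  # keep only the rows that were filled
print(result)
```

This shows the extra bookkeeping: the maximum number of rows has to be known or estimated up front, and the fill position must be tracked separately.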