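Background: this runs inside a swap-optimization algorithm. The line in question sits in an inner while loop, so it is executed many times. Everything else in the loop runs very fast.
The example data table `Inventory_test` is created below: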
NestCount2 <- c(
"1","1","1","1","1","1","1","1","2","2","3","3","3","3","3","3",
"3","3","3","4","4","4","5","5","5","5","5","5","5","5","5","6",
"6","6","6","6","6","6","6","6","",""
)
Part2 <- c(
"Shroud","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
"Shroud","S1Nozzle","S1Nozzle","Shroud","Shroud","Shroud","Shroud",
"Shroud","Shroud","Shroud","Shroud","Shroud","S2Nozzle","S2Nozzle",
"S2Nozzle","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
"Shroud","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
"Shroud","Shroud","Shroud","Shroud","Shroud","*","*"
)
library(data.table)
Inventory_test <- data.table(data.frame(NestCount2, Part2))
# Methods already tried (essentially identical performance according to the profiler):
ptcts <- table(unique(Inventory_test[,c("Part2","NestCount2")])$Part2)
ptcts2 <- Inventory_test[, .(count = uniqueN(NestCount2)), by = Part2]$count
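I have noticed (using the RStudio profiler) that roughly half of the time spent on the `ptcts` line goes into just indexing the columns with `Inventory_test[,c("Part2","NestCount2")]`. I have been looking for a faster way but have not found one :(. Any help would be greatly appreciated!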
Answer 0 (score: 0)
I ran some benchmarks. So far, the fastest approach seems to be not using `by` at all, and just using `table()` instead: `Inventory_test[, rowSums(table(Part2, NestCount2) > 0L)]`.
library(data.table)
library(microbenchmark)
library(ggplot2)
setkey(Inventory_test, Part2)
microbenchmark(
unit = "relative",
m1 = table(unique(Inventory_test[, c("Part2", "NestCount2")])$Part2),
m2 = Inventory_test[, .(count = uniqueN(NestCount2)), by = Part2]$count,
m3 = Inventory_test[, .N, by = .(Part2, NestCount2)][, .N, by = Part2],
m4 = Inventory_test[, uniqueN(NestCount2), by = Part2]$V1,
m5 = Inventory_test[, uniqueN(paste(Part2, NestCount2)), by = Part2],
m6 = Inventory_test[, length(unique(NestCount2)), Part2],
m7 = Inventory_test[, rowSums(table(Part2, NestCount2) > 0L)]
) -> mb
print(mb, digits = 3)
#> Unit: relative
#> expr min lq mean median uq max neval cld
#> m1 1.26 1.27 1.37 1.32 1.60 1.12 100 b
#> m2 1.28 1.18 1.29 1.16 1.20 5.93 100 b
#> m3 2.21 2.05 2.14 1.98 2.10 3.92 100 c
#> m4 1.25 1.16 1.23 1.14 1.16 3.97 100 ab
#> m5 1.34 1.23 1.28 1.22 1.18 4.27 100 ab
#> m6 1.48 1.37 1.35 1.33 1.35 1.18 100 b
#> m7 1.00 1.00 1.00 1.00 1.00 1.00 100 a
autoplot(mb)
Created on 2018-07-27 by the reprex package (v0.2.0.9000).
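To make it clearer why the `table()` expression in `m7` yields the number of distinct nests per part, here is a minimal base R sketch. The toy vectors are hypothetical and much smaller than `Inventory_test`; only the technique is the same.

```r
# Hypothetical toy data, not the real Inventory_test
Part2      <- c("Shroud", "Shroud", "S1Nozzle", "Shroud", "S2Nozzle", "*")
NestCount2 <- c("1",      "1",      "2",        "3",      "4",        "")

# Contingency table: one row per part, one column per nest,
# each cell holds how many input rows have that (part, nest) pair
tab <- table(Part2, NestCount2)

# TRUE wherever a (part, nest) pair occurs at least once
present <- tab > 0L

# Summing the TRUEs in each row counts the distinct nests for that part
counts <- rowSums(present)
print(counts)
```

The result is a named numeric vector, so an individual part can be looked up with `counts[["Shroud"]]`. Note that the empty string `""` is treated as a nest of its own, just as it is by `uniqueN()` in the other methods.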
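PS. Interestingly, `data.table(data.frame(NestCount2, Part2))` is actually faster than `data.table(NestCount2, Part2)`. That is because `data.frame()` coerces strings to factors (its default behaviour before R 4.0.0), and these operations seem to be somewhat faster on factors. `stringsAsFactors = TRUE` did something good for once. Go figure!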
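Since R 4.0.0, `data.frame()` no longer converts strings to factors by default, so the effect described in the PS has to be requested explicitly. The following is a small sketch, on hypothetical toy data, of asking for factor columns and confirming that the counts are unchanged. It makes no claim about timings, which would need to be benchmarked on the real data.

```r
# Hypothetical toy data, not the real Inventory_test
Part2      <- c("Shroud", "Shroud", "S1Nozzle", "Shroud")
NestCount2 <- c("1",      "1",      "2",        "3")

# Character columns (the default since R 4.0.0)
df_chr <- data.frame(NestCount2, Part2, stringsAsFactors = FALSE)

# Factor columns, requested explicitly so the result does not depend on the R version
df_fct <- data.frame(NestCount2, Part2, stringsAsFactors = TRUE)

# Same table()-based count of distinct nests per part on both versions
counts_chr <- rowSums(table(df_chr$Part2, df_chr$NestCount2) > 0L)
counts_fct <- rowSums(table(df_fct$Part2, df_fct$NestCount2) > 0L)

# Compare by name so the check does not rely on the order of the levels
same <- all(counts_chr[names(counts_fct)] == counts_fct)
print(same)
```

Wrapping `df_fct` in `data.table()` keeps the factor columns, which is the setup the benchmark above effectively ran on under older R versions.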