我使用环境作为哈希表。键是来自常规文本文档的单词,值是单个整数(指向其他结构的索引)。
当我加载数百万个元素时,更新和查找都会慢慢爬行。下面是一些显示行为的代码。
看起来,从O开始的行为在O(n)中比在O(1)中更多,即使哈希表填充量很小且冲突很小。
有关为什么更新和查找时间不断增加以及如何获得更好性能的任何建议?
谢谢
testhInt = function (I=100L,N=100000L) {
cat("create initial table: ", I*N,
system.time(h <<- new.env(hash=TRUE,size=I*N,parent=emptyenv())), "\n")
for ( i in 1L:I){
cat("Updated", i*N,": ",
system.time(for (j in ((i-1L)*N+1):(i*N)) {
h[[as.character(j)]] <- j
}),
"\n")
p = env.profile(h)
cat(sprintf("Hash size: %i, nchains: %i, 1:200:%s\n",
p$size, p$nchains, toString(p$counts[1:200])))
cat("Lookup 1000 hash:",
system.time(resultv <<- sapply(sample(1L:(i*N), 1000L),
function(i) h[[as.character(i)]])),
"\n")
}
}
testhInt()
create initial table: 10000000 0.089 0.081 0.169
Updated 100000 : 2.352 0.045 2.392
Hash size: 10000000, nchains: 100000, 1:200:[text removed]
Lookup 1000 hash: 0.016 0.001 0.018
Updated 200000 : 3.587 0.057 3.622
Hash size: 10000000, nchains: 200000, 1:200:[text removed]
Lookup 1000 hash: 0.014 0.002 0.017
Updated 300000 : 4.649 0.064 4.695
Hash size: 10000000, nchains: 300000, 1:200:[text removed]
Lookup 1000 hash: 0.024 0.003 0.027
Updated 400000 : 5.76 0.076 5.8
Hash size: 10000000, nchains: 400000, 1:200:[text removed]
Lookup 1000 hash: 0.023 0.003 0.026
...
Updated 1200000 : 12.299 0.167 12.469
Hash size: 10000000, nchains: 1200000, 1:200:[text removed]
Lookup 1000 hash: 0.071 0.01 0.084
...
Updated 2600000 : 28.537 0.273 28.836
Hash size: 10000000, nchains: 2600000, 1:200:[text removed]
Lookup 1000 hash: 0.138 0.02 0.158
答案 0 :(得分:4)
环境更新和查找不是问题。加载数百万个元素时变慢的原因是:1)生成序列,2)将它们转换为字符,3)计算随机样本。
如果您将这些操作移到时间之外,并且只有哈希表更新和查找的时间,您将看到哈希表的性能不会降低。
testhInt2 <- function (I=100L, N=100000L) {
t1 <- system.time(h <<- new.env(hash=TRUE,size=I*N,parent=emptyenv()))
cat("create initial table: ", I*N, t1, "\n")
for ( i in 1L:I) {
jSeq <- ((i-1L)*N+1):(i*N)
jName <- as.character(jSeq)
t2 <- system.time(for(j in seq_along(jSeq)) h[[jName[i]]] <- jSeq[i])
cat("Updated", i*N,": ", t2, "\n")
p <- env.profile(h)
cat(sprintf("Hash size: %i, nchains: %i, 1:200: %s\n",
p$size, p$nchains, "..."))
v <- as.character(sample(1L:(i*N), 1000L))
t3 <- system.time(resultv <<- sapply(v, function(i) h[[i]]))
cat("Lookup 1000 hash:", t3, "\n")
}
}
testhInt2()
create initial table: 10000000 0.012 0.028 0.04 0 0
Updated 100000 : 6.148 0.004 6.174 0 0
Hash size: 10000000, nchains: 1, 1:200: ...
Lookup 1000 hash: 0.084 0 0.086 0 0
Updated 200000 : 6.872 0 6.9 0 0
Hash size: 10000000, nchains: 2, 1:200: ...
Lookup 1000 hash: 0.088 0 0.089 0 0
...
Updated 2500000 : 5.528 0.012 5.557 0 0
Hash size: 10000000, nchains: 25, 1:200: ...
Lookup 1000 hash: 0.052 0 0.052 0 0
Updated 2600000 : 4.844 0 4.863 0 0
Hash size: 10000000, nchains: 26, 1:200: ...
Lookup 1000 hash: 0.052 0 0.052 0 0