R环境/哈希表随着它增长到数百万而变慢

时间:2014-03-18 01:23:48

标签: r hash

我使用环境作为哈希表。键是来自常规文本文档的单词,值是单个整数(指向其他结构的索引)。

当我加载数百万个元素时,更新和查找都会慢慢爬行。下面是一些显示行为的代码。

看起来,从O开始的行为在O(n)中比在O(1)中更多,即使哈希表填充量很小且冲突很小。

有关为什么更新和查找时间不断增加以及如何获得更好性能的任何建议?

谢谢

testhInt = function (I=100L,N=100000L) {
   cat("create initial table: ", I*N,
       system.time(h <<- new.env(hash=TRUE,size=I*N,parent=emptyenv())), "\n")
   for ( i in 1L:I){
       cat("Updated", i*N,": ", 
           system.time(for (j in ((i-1L)*N+1):(i*N)) {
              h[[as.character(j)]] <- j
                }), 
           "\n")
       p = env.profile(h)
       cat(sprintf("Hash size: %i, nchains: %i, 1:200:%s\n",
                   p$size, p$nchains, toString(p$counts[1:200])))
       cat("Lookup 1000  hash:",
           system.time(resultv <<- sapply(sample(1L:(i*N), 1000L),
                                          function(i) h[[as.character(i)]])),
           "\n")
        } 
}
testhInt()
create initial table:  10000000 0.089 0.081 0.169 
Updated 100000 :  2.352 0.045 2.392 
Hash size: 10000000, nchains: 100000, 1:200:[text removed]
Lookup 1000  hash: 0.016 0.001 0.018 
Updated 200000 :  3.587 0.057 3.622 
Hash size: 10000000, nchains: 200000, 1:200:[text removed]
Lookup 1000  hash: 0.014 0.002 0.017 
Updated 300000 :  4.649 0.064 4.695 
Hash size: 10000000, nchains: 300000, 1:200:[text removed]
Lookup 1000  hash: 0.024 0.003 0.027 
Updated 400000 :  5.76 0.076 5.8 
Hash size: 10000000, nchains: 400000, 1:200:[text removed]
Lookup 1000  hash: 0.023 0.003 0.026 
...
Updated 1200000 :  12.299 0.167 12.469 
Hash size: 10000000, nchains: 1200000, 1:200:[text removed]
Lookup 1000  hash: 0.071 0.01 0.084 
...
Updated 2600000 :  28.537 0.273 28.836 
Hash size: 10000000, nchains: 2600000, 1:200:[text removed]
Lookup 1000  hash: 0.138 0.02 0.158 

1 个答案:

答案 0 :(得分:4)

环境更新和查找不是问题。加载数百万个元素时变慢的原因是:1)生成序列,2)将它们转换为字符,3)计算随机样本。

如果您将这些操作移到时间之外,并且只有哈希表更新和查找的时间,您将看到哈希表的性能不会降低。

testhInt2 <- function (I=100L, N=100000L) {
   t1 <- system.time(h <<- new.env(hash=TRUE,size=I*N,parent=emptyenv()))
   cat("create initial table: ", I*N, t1, "\n")

   for ( i in 1L:I) {
       jSeq <- ((i-1L)*N+1):(i*N)
       jName <- as.character(jSeq)
       t2 <- system.time(for(j in seq_along(jSeq)) h[[jName[i]]] <- jSeq[i])
       cat("Updated", i*N,": ", t2, "\n")
       p <- env.profile(h)
       cat(sprintf("Hash size: %i, nchains: %i, 1:200: %s\n",
                   p$size, p$nchains, "..."))
       v <- as.character(sample(1L:(i*N), 1000L))
       t3 <- system.time(resultv <<- sapply(v, function(i) h[[i]]))
       cat("Lookup 1000  hash:", t3, "\n")
   }
}
testhInt2()
create initial table:  10000000 0.012 0.028 0.04 0 0 
Updated 100000 :  6.148 0.004 6.174 0 0 
Hash size: 10000000, nchains: 1, 1:200: ...
Lookup 1000  hash: 0.084 0 0.086 0 0 
Updated 200000 :  6.872 0 6.9 0 0 
Hash size: 10000000, nchains: 2, 1:200: ...
Lookup 1000  hash: 0.088 0 0.089 0 0 
...
Updated 2500000 :  5.528 0.012 5.557 0 0 
Hash size: 10000000, nchains: 25, 1:200: ...
Lookup 1000  hash: 0.052 0 0.052 0 0 
Updated 2600000 :  4.844 0 4.863 0 0 
Hash size: 10000000, nchains: 26, 1:200: ...
Lookup 1000  hash: 0.052 0 0.052 0 0