Question

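I have a dataset that can be filtered on multiple dimensions, and I would like to know whether there is a fast way to do the filtering...

To give some background: I have 6 dimensions of different sizes. To represent this I created a simulated grid of data in R, with the last dimension made smaller so that it fits in RAM.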

library(data.table)

d1 <- 3
d2 <- 6
d3 <- 24
d4 <- 7
d5 <- 12
d6 <- 10000 # the real dataset has d6 = 500000

full.DT <- data.table(expand.grid(h1=seq(d1),
                                  h2=seq(d2),
                                  h3=seq(d3),
                                  h4=seq(d4),
                                  h5=seq(d5),
                                  h6=seq(d6),stringsAsFactors=FALSE))
# each permutation and combination produces a specific score...
# which I am mimicking using the runif function
full.DT[,score:=runif(nrow(full.DT))]

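I want to filter it on 2000 unique values of h6 and a single value of h2. At the moment I am using the line below... but it feels a bit slow, and it certainly won't scale when I use my real d6 value of 500000.

It currently takes 32 seconds on OSX 10.10.3 with a 2.8 GHz Intel Core i7 and 16 GB of 1600 MHz DDR3. Would using keys help? Is there a way to speed this up? Also, is there a way to deal with the object size vs. RAM problem (perhaps by splitting the data into separate tables that are read in individually)? Is C++ a potential answer?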

 no.unique.h6.vals <- 2000
 h2.val <- 1
 chosen.h6.vals <- sample.int(d6,no.unique.h6.vals)

 system.time(ff <- full.DT[(h6 %in% chosen.h6.vals) & (h2==h2.val),])
    user  system elapsed 
  15.600   9.571  31.661
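
On the question of whether keys would help, here is a minimal sketch of a keyed version of the filter above. It assumes a much smaller grid than the real one (the sizes below and the names small.DT, ff.keyed and ff.scan are made up for illustration), and it keys only on the two columns being filtered, h2 and h6, so the lookup table only has to hold the (h2, h6) pairs. No timings are claimed; the sketch only shows that the keyed join returns the same rows as the vector scan.

```r
library(data.table)

set.seed(1)

# Hypothetical small sizes so the example runs quickly
d1 <- 3; d2 <- 6; d3 <- 4; d4 <- 7; d5 <- 2; d6 <- 100

# CJ() builds the full cross join, like expand.grid() but as a data.table
small.DT <- CJ(h1 = seq(d1), h2 = seq(d2), h3 = seq(d3),
               h4 = seq(d4), h5 = seq(d5), h6 = seq(d6))
small.DT[, score := runif(.N)]

h2.val <- 1L
chosen.h6.vals <- sample.int(d6, 20)

# Key only on the columns used in the filter
setkey(small.DT, h2, h6)

# Join against the (h2, h6) pairs; the key lets data.table use binary search
ff.keyed <- small.DT[CJ(h2 = h2.val, h6 = chosen.h6.vals), nomatch = 0L]

# The original vector-scan filter, for comparison
ff.scan <- small.DT[(h6 %in% chosen.h6.vals) & (h2 == h2.val)]
```

Both results should contain the same rows, although not necessarily in the same order.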

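Also, while this is one type of filter, ideally I would like to be able to filter the data flexibly on any given dimension, e.g. on specific values of h3, h4, h5 and so on. For example...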

x <- 10000
h1 <- sample.int(d1,x,replace=TRUE)
h2 <- sample.int(d2,x,replace=TRUE)
h3 <- sample.int(d3,x,replace=TRUE)
h4 <- sample.int(d4,x,replace=TRUE)
h5 <- sample.int(d5,x,replace=TRUE)
h6 <- sample.int(d6,x,replace=TRUE)

g.full <- data.table(h1,h2,h3,h4,h5,h6)

and then filter full.DT down to the rows that match the rows of g.full...
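
A minimal sketch of that kind of lookup, assuming a data.table version that supports the on= argument (1.9.6 or later) and using a tiny stand-in grid. The names grid.DT, lookup, hits and hits.unique are illustrative and not part of the code above. Because the combinations are sampled with replacement, the lookup table can contain duplicate rows, and the join returns one row per lookup row.

```r
library(data.table)

set.seed(1)

# Hypothetical small grid standing in for full.DT
grid.DT <- CJ(h1 = 1:3, h2 = 1:6, h3 = 1:4)
grid.DT[, score := runif(.N)]

# Requested combinations, sampled with replacement (duplicates are possible)
x <- 50
lookup <- data.table(h1 = sample.int(3, x, replace = TRUE),
                     h2 = sample.int(6, x, replace = TRUE),
                     h3 = sample.int(4, x, replace = TRUE))

# Ad hoc join on the shared columns: one output row per row of `lookup`
hits <- grid.DT[lookup, on = c("h1", "h2", "h3"), nomatch = 0L]

# De-duplicate the lookup first if each grid row should appear at most once
hits.unique <- grid.DT[unique(lookup), on = c("h1", "h2", "h3"), nomatch = 0L]
```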

Edit

Following David Arenberg's comment, running the following line...

setkey(full.DT, h1,h2,h3,h4,h5,h6)

seems to help in terms of speed...

system.time(ff3 <- full.DT[data.table(expand.grid(h1=seq(d1),
                                                   h2=h2.val,
                                                   h3=seq(d3),
                                                   h4=seq(d4),
                                                   h5=seq(d5),
                                                   h6=chosen.h6.vals,
                                                   stringsAsFactors=FALSE)),])

  user  system elapsed 
 2.762   0.420   3.182

Setting the key also helps with the other case:

 system.time(ff4 <- full.DT[g.full])
   user  system elapsed 
  0.024   0.031   0.055

That looks really fast... Are there any potential solutions to the RAM problem?
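
One idea on the RAM side, sketched under a strong assumption: the grid is complete and stored in expand.grid() order (h1 varying fastest). In that case the six index columns are redundant, because a row's position can be computed from its h1..h6 values. Storing only the score vector takes 8 bytes per row instead of 32 (six 4-byte integers plus one 8-byte double), so the 362,880,000-row grid above would need about 2.9 GB rather than about 11.6 GB. The names dims, strides and grid_index below are hypothetical, and the sizes are reduced so the example runs quickly.

```r
# Hypothetical small dimensions
dims <- c(h1 = 3, h2 = 6, h3 = 4, h4 = 7, h5 = 2, h6 = 100)

# Only the scores are stored; the position in the vector encodes h1..h6
score <- runif(prod(dims))

# Strides matching expand.grid() ordering (the first column varies fastest)
strides <- c(1, cumprod(dims)[-length(dims)])

# Map rows of (h1..h6) combinations to positions in `score`.
# The arithmetic is done in doubles, which avoids integer overflow on large grids.
grid_index <- function(h) {
  as.vector(1 + (as.matrix(h) - 1) %*% strides)
}

# Look up the score for one combination
wanted <- data.frame(h1 = 2, h2 = 1, h3 = 4, h4 = 7, h5 = 1, h6 = 57)
score[grid_index(wanted)]
```

This would not make the d6 = 500000 case fit by itself: 18,144,000,000 scores at 8 bytes each is roughly 145 GB, so some form of splitting across files would still be needed.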
