Using the data.table package

Date: 2017-05-09 16:04:52

Tags: r, data.table

The full context for this question can be found at https://github.com/ropensci/plotly/issues/981, but in this example I have tried to strip out as much extraneous information as possible.

For the plotly package to build shape objects from an input data frame `data` that contains grouping information, the first row of each group needs to be repeated at the end of that group so that the edges of a shape connect up, and an empty row needs to be added between groups so that no edge is drawn between separate groups. After this manipulation, the output data `d` then needs its class and attributes restored to match those of the input `data`.
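
To make the target shape of the output concrete, here is a tiny hand-built illustration (the data frame, its columns, and the values are invented for this example):

```r
## Invented two-group input, just to illustrate the required transformation
dd <- data.frame(x = 1:5, y = 11:15, group = c("a", "a", "a", "b", "b"))

## Target shape of the output, built by hand here: each group is "closed" by
## repeating its first row, and an all-NA row separates the groups
rbind(dd[1:3, ], dd[1, ], NA, dd[4:5, ], dd[4, ])
```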

The package maintainer (Carson Sievert) identified this operation as one of the most time-consuming and memory-intensive steps when generating certain kinds of charts, and asked for help optimizing it, for example with C++, in place of the existing `dplyr::do(data, dplyr::arrange_(., allVars))` and `dplyr::do(data, rbind.data.frame(., .[1,], NA))` operations.

Since the majority of the time is spent subsetting and ordering rows, this looked like an application where data.table's keyed indexing and binary search would yield a substantial improvement without resorting to C++.
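
The intuition behind that, on a small invented table: setting a key sorts the table once by reference, after which subsets on the key column use binary search instead of a full vector scan.

```r
library(data.table)

## Invented table, just to show the mechanism behind the previous paragraph
DT <- data.table(group = sample(letters, 1e5, replace = TRUE), val = rnorm(1e5))
setkey(DT, group)   # physically reorders DT and marks 'group' as the key
DT["m"]             # keyed (binary search) subset of one group
```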

I started on a data.table-based replacement, drawing largely on answers to existing questions:

```r
library(data.table)
options(datatable.verbose = TRUE)

local_group2NA <- function(data, groupNames = "group", nested = NULL, ordered = NULL,
                           retrace.first = inherits(data, "GeomPolygon")) {

  ## store class information from function input
  retrace  <- force(retrace.first)
  datClass <- class(data)

  allVars <- c(nested, groupNames, ordered)

  ## if retrace.first is TRUE, repeat the first row of each group and add an empty row of NA's after each group
  ## if retrace.first is FALSE, just add an empty row after each group
  d <- if (retrace.first) {
    data.table::setDT(data, key = allVars)[, index := .GRP, by = allVars][, .SD[c(1:(.N), 1, (.N + 1))], keyby = index][, index := NULL]
  } else {
    data.table::setDT(data, key = allVars)[, index := .GRP, by = allVars][, .SD[1:(.N + 1)], keyby = index][, index := NULL]
  }

  ## delete last row if all NA's
  if (all(is.na(d[.N, ]))) d <- d[-.N, ]

  ## return d with the original class
  structure(d, class = datClass)
}
```

but I have not been able to get a performance improvement that would justify swapping the package dependency. (Depending on the size of the data frame, the number of groups, etc., this version runs only about 1-10x faster than the existing function used in the package.)

Using the `mtcars` dataset produces the following output (with data.table verbose output turned on):

```r
> local_group2NA(mtcars, "vs", "cyl", retrace.first = TRUE)
forder took 0 sec
x is already ordered by these columns, no need to call reorder
Detected that j uses these columns: <none>
Finding groups using uniqlist ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
Optimization is on but left j unchanged (single plain symbol): '.GRP'
Making each group and running j (GForce FALSE) ...
  memcpy contiguous groups took 0.000s for 5 groups
  eval(j) took 0.000s for 5 calls
  0 secs
Finding groups using forderv ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as '.SD[c(1:(.N), 1, (.N + 1))]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
  The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
  dogroups: growing from 32 to 57 rows
  Wrote less rows (42) than allocated (57).
  memcpy contiguous groups took 0.000s for 5 groups
  eval(j) took 0.005s for 5 calls
  0.004 secs
Detected that j uses these columns: index
Assigning to all 42 rows
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3    NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
4  21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
5  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
6  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
7  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
10 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
11 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
12 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
13 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
14 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
15   NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
16 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
17 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
18 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
19 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
20   NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
21 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
22 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
23 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
24 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
25 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
26   NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
27 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
28 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
29 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
30 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
31 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
32 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
33 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
34 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
35 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
36 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
37 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
38 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
39 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
40 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
41 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
```

One part of the verbose output in particular caught my eye as a potential explanation for the lower-than-expected performance:

```
Making each group and running j (GForce FALSE) ...
  The result of j is a named list. It's very inefficient to create the same names
  over and over again for each group. When j=list(...), any names are detected,
  removed and put back after grouping has completed, for efficiency. Using
  j=transform(), for example, prevents that speedup (consider changing to :=).
  This message may be upgraded to warning in future.
```
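
A small invented example of the pattern that message describes, where `j` returns `.SD` (a named list of columns) for every group, so the same column names are re-created once per group:

```r
library(data.table)

## Invented data, only to reproduce the warned-about pattern
DT <- data.table(g = rep(c("a", "b"), each = 3), v = 1:6)
DT[, .SD[c(seq_len(.N), 1L)], by = g]   # j returns .SD (a named list) per group
```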

Is there a better way I should be expressing `j` that would eliminate the issue explained in the verbose message?

Benchmarking with the steps below shows that the runtime appears to grow roughly in proportion to the size of the input data frame, and I was hoping that, with the right approach, it could be improved by an order of magnitude or more.

```r
library(data.table)
options(datatable.verbose = FALSE)
library(microbenchmark)

exampleData <- function(nChunks = 100, nPerChunk = 100) {
  vals <- c(replicate(nChunks, rnorm(nPerChunk)))
  ids  <- replicate(nChunks, basename(tempfile("")))
  ids  <- rep(ids, each = nPerChunk)
  data.frame(vals, group = ids, stringsAsFactors = FALSE)
}

x1 <- exampleData(1e1)
x2 <- exampleData(1e2)
x3 <- exampleData(1e3)
x4 <- exampleData(1e4)
x5 <- exampleData(1e5)

res <- microbenchmark(
  local_group2NA(x1, retrace.first = TRUE),
  local_group2NA(x2, retrace.first = TRUE),
  local_group2NA(x3, retrace.first = TRUE),
  local_group2NA(x4, retrace.first = TRUE),
  local_group2NA(x5, retrace.first = TRUE),
  times = 1
)
res
```

```
Unit: milliseconds
                                     expr         min          lq        mean      median          uq         max neval
 local_group2NA(x1, retrace.first = TRUE)    41.12776    41.12776    41.12776    41.12776    41.12776    41.12776     1
 local_group2NA(x2, retrace.first = TRUE)    30.07690    30.07690    30.07690    30.07690    30.07690    30.07690     1
 local_group2NA(x3, retrace.first = TRUE)   270.07541   270.07541   270.07541   270.07541   270.07541   270.07541     1
 local_group2NA(x4, retrace.first = TRUE)  2779.03229  2779.03229  2779.03229  2779.03229  2779.03229  2779.03229     1
 local_group2NA(x5, retrace.first = TRUE) 28920.51861 28920.51861 28920.51861 28920.51861 28920.51861 28920.51861     1
```

In addition to optimizing the approach I am currently using, I would also appreciate any suggestions for other ways of getting to the result more quickly (e.g. using C++, `rbindlist()`, etc.).
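
For instance, one rough, unbenchmarked sketch of the kind of `rbindlist()`-based variant meant here (the helper name `sketch_group2NA` and its arguments are invented for illustration):

```r
library(data.table)

## Rough sketch only: split by group, "close" each piece by repeating its first
## row and appending an all-NA row, then bind everything back together
sketch_group2NA <- function(data, groupNames = "group") {
  dt     <- data.table::as.data.table(data)
  pieces <- split(dt, by = groupNames)
  closed <- lapply(pieces, function(p) rbind(p, p[1L], p[NA_integer_]))
  out    <- data.table::rbindlist(closed)
  out[-.N]  # drop the trailing NA row after the last group
}
```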

Update

Many thanks to Frank, whose comment below has now made this about an order of magnitude faster. Running the same benchmark produces the following results:

```
Unit: milliseconds
                                        expr         min          lq        mean      median          uq         max neval
 plotly:::group2NA(x1, retrace.first = TRUE)    3.106003    3.106003    3.106003    3.106003    3.106003    3.106003     1
 plotly:::group2NA(x2, retrace.first = TRUE)    4.583826    4.583826    4.583826    4.583826    4.583826    4.583826     1
 plotly:::group2NA(x3, retrace.first = TRUE)   10.821644   10.821644   10.821644   10.821644   10.821644   10.821644     1
 plotly:::group2NA(x4, retrace.first = TRUE)   93.619315   93.619315   93.619315   93.619315   93.619315   93.619315     1
 plotly:::group2NA(x5, retrace.first = TRUE) 1195.372013 1195.372013 1195.372013 1195.372013 1195.372013 1195.372013     1
```

The updated function:

```r
local_group2NA <- function(data, groupNames = "group", nested = NULL, ordered = NULL,
                           retrace.first = inherits(data, "GeomPolygon")) {

  ## store class information from function input
  retrace  <- force(retrace.first)
  datClass <- class(data)

  allVars <- c(nested, groupNames, ordered)

  ## if retrace.first is TRUE, repeat the first row of each group and add an empty row of NA's after each group
  ## if retrace.first is FALSE, just add an empty row after each group
  d <- if (retrace.first) {
    data.table::setDT(data, key = allVars)[ data[, .I[c(seq_along(.I), 1L, .N + 1L)], by = allVars]$V1 ]
  } else {
    data.table::setDT(data, key = allVars)[ data[, .I[c(seq_along(.I), .N + 1L)], by = allVars]$V1 ]
  }

  ## delete last row if all NA's
  if (all(is.na(d[.N, ]))) d <- d[-.N, ]

  ## return d with the original class
  structure(d, class = datClass)
}
```
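
To spell out why the `.I`-based `j` produces the NA separator rows, here is a small invented example (not code from the package):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), v = 1:3)

## For each group: all of its row numbers, then its first row number again,
## then .I[.N + 1L], which is out of range for the group and therefore NA
idx <- DT[, .I[c(seq_along(.I), 1L, .N + 1L)], by = g]$V1
idx       # 1 2 1 NA 3 3 NA
DT[idx]   # subsetting by an NA index yields the all-NA separator rows
```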

Also, for reference, the verbose output for the `mtcars` example above with the updated function is:

```
forder took 0 sec
reorder took 0.006 sec
Detected that j uses these columns: <none>
Finding groups using uniqlist ... 0.001 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as '.I[c(seq_along(.I), 1L, .N + 1L)]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
  dogroups: growing from 32 to 57 rows
  Wrote less rows (42) than allocated (57).
  memcpy contiguous groups took 0.000s for 5 groups
  eval(j) took 0.000s for 5 calls
  0.001 secs
```

0 Answers

There are no answers yet.