Question

我有一个庞大的数据集（> 250万）。一个小子集看起来像这样（代码可重现）

temp <- data.frame(list(col1 = c("424", "560", "557"), 
                        col2 = c("276", "427", "V46"), 
                        col3 = c("780", "V45", "584"), 
                        col4 = c("276", "V45", "995"), 
                        col5 = c("428", "799", "427")))

> temp
  col1 col2 col3 col4 col5
1  424  276  780  276  428
2  560  427  V45  V45  799
3  557  V46  584  995  427

我正在尝试删除每行重复项，并使用此代码

向左移动值

library(plyr)
temp <- apply(temp,1,function(x) unique(unlist(x)))
temp <- ldply(temp, rbind)

> temp
      1   2   3   4    5
  1 424 276 780 428 <NA>
  2 560 427 V45 799 <NA>
  3 557 V46 584 995  427

我成功地做到了这一点，但是当我将上述代码扩展到我原来庞大的数据集时，我面临着性能问题。因为我使用apply，代码需要花费大量时间来执行

我能改进吗？

Answer 1

如果你只有字符串，你应该使用矩阵而不是数据帧。也许转置它也很有用。

temp <- data.frame(list(col1 = c("424", "560", "557"), 
                        col2 = c("276", "427", "V46"), 
                        col3 = c("780", "V45", "584"), 
                        col4 = c("276", "V45", "995"), 
                        col5 = c("428", "799", "427")),
                   stringsAsFactors = FALSE)

p <- ncol(temp)

myf <- compiler::cmpfun(
  function(x) {
    un <- unique(x)
    d <- p - length(un)
    if (d > 0) {
      un <- c(un, rep(NA_character_, d))
    }
    un
  }
)

microbenchmark::microbenchmark(
  privefl = as.data.frame(t(apply(t(temp), 2, myf))),
  OP = plyr::ldply(apply(temp, 1, function(x) unique(unlist(x))), rbind)
)

小尺寸的结果：

Unit: microseconds
    expr     min       lq      mean   median       uq       max neval
 privefl 278.775 301.7855  376.2803 320.8235 409.0580  1705.428   100
      OP 567.152 619.7950 1027.1277 658.2010 792.6225 29558.777   100

有100,000次观察（temp <- temp[sample(nrow(temp), size = 1e5, replace = TRUE), ]）：

Unit: milliseconds
    expr       min        lq      mean    median       uq      max neval
 privefl  975.1688  975.1688  988.2184  988.2184 1001.268 1001.268     2
      OP 9196.5199 9196.5199 9518.3922 9518.3922 9840.264 9840.264     2

提高每行删除重复值的效率并在R中移位值

1 个答案: