data.table version of tidyr::unite

Date: 2016-05-31 11:53:59

Tags: r data.table tidyr

I need to unite several columns of a huge data.table into one, joined with a separator. For now I use unite from the tidyr package to do it.

Do you know whether there is a data.table-optimized version?

library(data.table)
data <- data.table(id=1:10, col1=11:20, col2=21:30, col3=31:40)
print(data)

library(tidyr)
data <- unite(data, "col_test", col1, col2, col3)
print(data)

1 answer:

Answer 0 (score: 6)

We can use do.call with paste:
data[, .(id, col_test=do.call(paste, c(.SD, sep="_"))), .SDcols= col1:col3]
#     id col_test
# 1:  1 11_21_31
# 2:  2 12_22_32
# 3:  3 13_23_33
# 4:  4 14_24_34
# 5:  5 15_25_35
# 6:  6 16_26_36
# 7:  7 17_27_37
# 8:  8 18_28_38
# 9:  9 19_29_39
#10: 10 20_30_40
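
For a huge table it may also be worth adding the united column by reference with := rather than building a new table. The following is only a sketch on the question's example data; the variable name cols is introduced here for illustration, and dropping the source columns is an assumption that mirrors unite's default remove = TRUE.

```r
library(data.table)

# Same example data as in the question
data <- data.table(id = 1:10, col1 = 11:20, col2 = 21:30, col3 = 31:40)

# Columns to unite (illustrative name, not from the original answer)
cols <- c("col1", "col2", "col3")

# Add the united column by reference, without copying the whole table
data[, col_test := do.call(paste, c(.SD, sep = "_")), .SDcols = cols]

# Drop the source columns, mirroring unite(..., remove = TRUE)
data[, (cols) := NULL]

print(data)
```

Because := modifies data in place, there is no need to assign the result back to data.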

Benchmarks

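library(microbenchmark)
# data1 is a larger data.table with the same columns (id, col1, col2, col3);
# its construction is not shown here.
microbenchmark(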
  tidyr_unite = {
    unite(data1, "col_test", col1, col2, col3)
  },
  dt_docallPaste = {
    data1[, .(id = data1[["id"]], col_test = do.call(paste, c(.SD, sep="_"))),
         .SDcols= col1:col3]
  },
  apply_Paste = {
    cbind.data.frame(id = data1$id,
                     col_test = apply(data1[, -1, with = FALSE], 1,
                                      paste, collapse = "_"))
  },
  times = 10
)

# Unit: seconds
#            expr       min        lq      mean    median        uq       max neval cld
#     tidyr_unite  7.501491  7.521328  7.720600  7.647506  7.756273  8.219710    10  a 
#  dt_docallPaste  7.530711  7.558436  7.910604  7.618165  8.429796  8.497932    10  a 
#     apply_Paste 44.743782 45.797092 46.791288 46.325188 47.330887 51.155663    10   b

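Compared with base apply, the tidyr and data.table versions appear to be equally efficient. This is expected, since unite is just a wrapper around do.call("paste", ...), as you can see from its source code: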
unite_.data.frame <- function(data, col, from, sep = "_", remove = TRUE) {
  united <- do.call("paste", c(data[from], list(sep = sep)))

  first_col <- which(names(data) %in% from)[1]

  data2 <- data
  if (remove) {
    data2 <- data2[setdiff(names(data2), from)]
  }

  append_col(data2, united, col, after = first_col - 1)
}
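
The same logic could be wrapped in a small data.table helper. The sketch below is hypothetical: dt_unite is not part of data.table or tidyr, and it assumes the new column name does not already exist in the table. It follows the steps of the source above (paste the columns, optionally remove them, place the new column where the first source column was), but updates the table by reference.

```r
library(data.table)

# Hypothetical helper, not part of data.table or tidyr.
# Simplifying assumption: `col` is not already a column of `dt`.
dt_unite <- function(dt, col, from, sep = "_", remove = TRUE) {
  stopifnot(is.data.table(dt), all(from %in% names(dt)), !(col %in% names(dt)))

  # Position of the first source column, as in unite_.data.frame
  first_col <- which(names(dt) %in% from)[1]

  # Vectorised paste over the selected columns, added by reference
  dt[, (col) := do.call(paste, c(.SD, sep = sep)), .SDcols = from]

  # Optionally drop the source columns
  if (remove) {
    dt[, (from) := NULL]
  }

  # Move the new column to where the first source column used to be
  others <- setdiff(names(dt), col)
  setcolorder(dt, append(others, col, after = first_col - 1))

  invisible(dt)
}

# Usage on the question's example data
data <- data.table(id = 1:10, col1 = 11:20, col2 = 21:30, col3 = 31:40)
dt_unite(data, "col_test", c("col1", "col2", "col3"))
print(data)
```

Unlike unite, this modifies its argument in place, which avoids a copy but means the caller's table changes; take a copy() first if the original must be kept.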