get(columnname) vs [[columnname]] in data.table

Date: 2018-03-18 09:22:21

Tags: r data.table

In many cases where I need to pass a column whose name is stored in a variable, I see two options: myDT[[myCol]] and myDT[, get(myCol)]. For example:

# get() ####
cast_num_get <- function(inpDT, cols2cast){
  for (thisCol in cols2cast){
    inpDT[, (thisCol):=as.numeric(get(thisCol))]
  }
  return(inpDT);
}

# [[ #### 
cast_num_b <- function(inpDT, cols2cast){
  for (thisCol in cols2cast){
    inpDT[[thisCol]] <- as.numeric(inpDT[[thisCol]])
  }
  return(inpDT);
}


# two more options added from the comments: 

# lapply(.SD) ####
cast_num_apply <- function(inpDT, cols2cast){
  inpDT[, (cols2cast) := lapply(.SD, as.numeric), .SDcols = cols2cast]
  return(inpDT);
}

# set() ####
cast_num_for_set <- function(inpDT, cols2cast){
  for (thisCol in cols2cast){
    set(inpDT, j = thisCol, value = as.numeric(inpDT[[thisCol]]))
  }
  return(inpDT);
}
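As a quick check that the two access styles really fetch the same data, here is a minimal sketch on a toy table. The names dt and col are illustrative only and are not part of the code above.

```r
library(data.table)

# Hypothetical toy table; dt and col are made-up names for illustration
dt <- data.table(id = 1:3, V1 = c("1.5", "2", "3.25"))
col <- "V1"

by_brackets <- dt[[col]]        # list-style extraction by column name
by_get      <- dt[, get(col)]   # get() evaluated inside j

# Both expressions return the same character vector
identical(by_brackets, by_get)
```

So the difference between the two is not what they return but how they combine with assignment, which is what the functions above exercise.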

1 answer:

Answer 0 (score: 3)

For this example, I would use:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Two alternatives (based on an earlier answer of mine) that use a for loop:

# alternative 1 with 'set'
for (col in cols) set(DT, j = col, value = as.numeric(DT[[col]]))

# alternative 2 with ':='
for (col in cols) DT[, (col) := as.numeric(DT[[col]])]

None of these three approaches is necessarily better than the others. They all share the same advantage: they update DT by reference.
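One way to see the by-reference behaviour is to compare address() before and after the update. This is only a sketch on a small made-up table (dt and the V1/V2 values are assumptions for illustration), not the benchmark data:

```r
library(data.table)

# Small illustrative table; names and values are made up for this sketch
dt <- data.table(id = 1:3, V1 = c("1", "2", "3"), V2 = c("4", "5", "6"))
cols <- c("V1", "V2")

before <- address(dt)

# lapply(.SD) with := updates the columns in place
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
after_lapply <- address(dt)

# Reset to character, then repeat the conversion with set() in a for loop
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]
for (col in cols) set(dt, j = col, value = as.numeric(dt[[col]]))
after_set <- address(dt)

# The table stays at the same address after either update
identical(before, after_lapply)
identical(before, after_set)
sapply(dt, class)
```

Because the table is modified in place, there is no need to assign the result back to dt.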

Comparing the different approaches with a benchmark:

library(microbenchmark)

microbenchmark(vasily_get = {inpDT <- copy(DT); cast_num_get(inpDT, cols)},
               vasily_b = {inpDT <- copy(DT); inpDT <- cast_num_b(inpDT, cols)},
               jaap_lapply = {inpDT <- copy(DT); inpDT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]},
               jaap_for_set1 = {inpDT <- copy(DT); for (col in cols) set(inpDT, j = col, value = as.numeric(inpDT[[col]]))},
               jaap_for_set2 = {inpDT <- copy(DT); for (col in cols) inpDT[, (col) := as.numeric(inpDT[[col]])]},
               times = 100)

which gives:

Unit: milliseconds
          expr      min       lq     mean   median       uq      max
    vasily_get 399.0723 414.2708 530.3024 429.5070 663.3513 1194.827
      vasily_b 388.7294 408.0004 528.4039 418.9236 664.5881 1441.941
   jaap_lapply 401.8001 424.1902 562.9259 453.5073 668.3900 1376.654
 jaap_for_set1 399.2213 433.9918 568.7211 628.4220 668.1248 1198.950
 jaap_for_set2 395.1966 405.5584 510.2038 421.3801 652.1263 1097.931

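None of the approaches stands out in terms of speed. However, the cast_num_b approach has one big disadvantage: to make the changes permanent, you have to assign the result of the function back to the input data.table.

When you run the following code: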

inpDT <- copy(DT)
address(inpDT)
inpDT <- cast_num_b(inpDT, cols)
address(inpDT)

you get:

> inpDT <- copy(DT)
> address(inpDT)
[1] "0x145eb6a00"
> inpDT <- cast_num_b(inpDT, cols)
> address(inpDT)
[1] "0x12a632ce8"

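As you can see, the object's location in memory has changed, which indicates that the table was copied. It can therefore be considered the less efficient approach.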
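The practical consequence shows up when the result is not assigned back to the same name. The sketch below uses a version of cast_num_b that converts with as.numeric() and a small made-up table (dt and res are illustrative names):

```r
library(data.table)

# A version of cast_num_b that converts each column with as.numeric()
cast_num_b <- function(inpDT, cols2cast) {
  for (thisCol in cols2cast) {
    inpDT[[thisCol]] <- as.numeric(inpDT[[thisCol]])
  }
  inpDT
}

# Small illustrative table, not the benchmark data
dt <- data.table(id = 1:3, V1 = c("1", "2", "3"))

res <- cast_num_b(dt, "V1")   # keep the result under a different name

class(dt$V1)    # the caller's table still holds the character column
class(res$V1)   # only the returned copy was converted
```

With the := or set() variants, the same call would have changed dt itself.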

Data used:

library(data.table)

DT <- data.table(lets = sample(LETTERS, 1e6, TRUE),
                 V1 = as.character(rnorm(1e6)),
                 V2 = as.character(rnorm(1e6)),
                 V3 = as.character(rnorm(1e6)),
                 V4 = as.character(rnorm(1e6)))

cols <- names(DT)[2:5]