Question

I have a data frame with numeric columns. I'd like to check the range across these columns row by row and create a new column containing that range:

tool1   tool2   tool3    range
1       34      12       33
na      19      23       4

It has to be able to handle NAs by simply ignoring them.

How can this be done? Paul.

Answer 1

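I decided to expand on this, because operating on rows in R is always painful. So I compared base R with two very efficient packages, data.table and dplyr (I'm not a dplyr expert, so if anyone wants to revise my answer, please do).

Note: your case is not a classic row-wise operation, because it can be solved with the vectorized pmax and pmin, which is not always an option.

So, creating a data set bigger than the example: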
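As a small illustration of the vectorized route, here is a base R sketch on the two rows from the question. Passing na.rm = TRUE to pmax and pmin is what makes the NA in the second row get ignored; the data frame name `ex` is only for illustration.

```r
# The two example rows from the question (NA in tool1 of the second row)
ex <- data.frame(tool1 = c(1, NA),
                 tool2 = c(34, 19),
                 tool3 = c(12, 23))

# Row-wise range via vectorized parallel max/min; na.rm = TRUE ignores NAs
ex$range <- with(ex, pmax(tool1, tool2, tool3, na.rm = TRUE) -
                     pmin(tool1, tool2, tool3, na.rm = TRUE))

ex$range
# [1] 33  4
```

Without na.rm = TRUE, the second row would come back as NA instead of 4.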

n <- 1e4
set.seed(123)
df <- data.frame(tool1 = sample(100, n, replace = TRUE),
                 tool2 = sample(100, n, replace = TRUE),
                 tool3 = sample(100, n, replace = TRUE))

Loading the necessary packages

library(data.table)
library(dplyr)
library(microbenchmark)

Defining the functions

apply1 <- function(y) apply(y, 1, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
apply2 <- function(y) apply(y, 1, function(x) diff(range(x, na.rm = TRUE)))
trans <- function(y) transform(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
DTfunc <- function(y) setDT(y)[, range := pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3)]
DTfunc2 <- function(y) set(y, j = "range", value = with(y, pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))) # Thanks to @Arun for this
dplyrfunc <- function(y) mutate(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))

df2 <- as.data.table(df) # This is in order to avoid overriding df by `setDT` during benchmarking
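The pmax/pmin functions above hard-code tool1 to tool3 and do not pass na.rm, which is fine for the NA-free benchmark data but would return NA for rows with missing values. A possible generalization, as a sketch in base R only: the helper name `row_range` and the small data frame `d` are hypothetical, not part of the benchmark.

```r
# Hypothetical helper: row-wise range over any set of numeric columns
row_range <- function(d, cols = names(d)) {
  # Build the argument list: one vector per column, plus na.rm = TRUE
  args <- c(unname(as.list(d[cols])), na.rm = TRUE)
  do.call(pmax, args) - do.call(pmin, args)
}

set.seed(123)
d <- data.frame(tool1 = sample(100, 10, replace = TRUE),
                tool2 = sample(100, 10, replace = TRUE),
                tool3 = sample(100, 10, replace = TRUE))
d$tool1[c(2, 5)] <- NA   # inject some missing values

# Compare the vectorized result with the row-by-row apply() version
vec   <- row_range(d)
byrow <- apply(d, 1, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
all.equal(vec, unname(byrow))
```

Using do.call keeps the computation vectorized while letting the column set vary, so the same helper could be dropped into transform, mutate, or a data.table `:=` call.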

Running some benchmarks

microbenchmark(apply1(df), apply2(df), trans(df), DTfunc(df2), DTfunc2(df2), dplyrfunc(df), times = 100)
Unit: microseconds
          expr        min          lq      median         uq        max neval
    apply1(df)  37221.513  40699.3790  44103.3495  46777.305  94845.463   100
    apply2(df) 262440.581 278239.6460 287478.4710 297301.116 343962.869   100
     trans(df)   1088.799   1178.3355   1234.9940   1287.503   1965.328   100
   DTfunc(df2)   2068.750   2221.8075   2317.5680   2400.400   5935.883   100
  DTfunc2(df2)    903.981    959.0435    986.3355   1026.395   1235.951   100
 dplyrfunc(df)   1040.280   1118.9635   1159.9815   1200.680   1509.189   100

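It seems the second data.table approach is the most efficient. Base R transform and dplyr perform almost identically, and both are faster than the first data.table approach because of the overhead of calling [.data.table.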
