Question

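I am currently working with a large data set. The only thing I need to do in this task is preprocess the data.

When I run my code, I see my computer's memory usage climb very quickly at this line: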

binary <- ifelse(subset_variables1 == "0", 0, 1)

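All this line is supposed to do is convert every value to binary. Can this be done faster? Or is this already a good approach, and I simply have to deal with the memory problem?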

Answer 1

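Logical values and conditions can be used with arithmetic operators, in which case they are interpreted as 1 or 0 (TRUE and FALSE, respectively). Since "0" == 0 evaluates to TRUE (the number is coerced to the string "0" before comparing), +("0" == 0) returns 1 and 1 - ("0" == 0) returns 0.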
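As a small illustration of that coercion, the sketch below uses a made-up four-element vector (`x` and `is_zero` are names chosen only for this example) to show what arithmetic on a logical vector produces:

```r
# Hypothetical toy input, not the asker's data
x <- c("0", "1", "1", "0")

is_zero <- x == "0"   # logical vector: TRUE where the value is "0"

+is_zero              # unary plus coerces TRUE/FALSE to integer 1/0
sum(is_zero)          # counts the TRUE values
1 - is_zero           # flips the coding: "0" becomes 0, anything else becomes 1
```

Note that `1 - is_zero` is a double vector because the literal `1` is a double, while `+is_zero` is an integer vector.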

If you have a vector like this:

set.seed(666)
subset_variables1 <- sample(c("0", "1"), 10000, replace = TRUE)

you can use 1 - (subset_variables1 == "0") to get the desired result.
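One way to convince yourself that the arithmetic form is a drop-in replacement is to compare it against the original ifelse() call on the same sample vector. This is only a sanity check under the assumption that the input has no names or other attributes; `via_ifelse` and `via_arith` are illustrative names:

```r
set.seed(666)
subset_variables1 <- sample(c("0", "1"), 10000, replace = TRUE)

via_ifelse <- ifelse(subset_variables1 == "0", 0, 1)   # original approach
via_arith  <- 1 - (subset_variables1 == "0")           # arithmetic on the logical

# Both should be plain double vectors holding the same 0/1 values
identical(via_ifelse, via_arith)
```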

I benchmarked it against some of the suggestions from the comments, and it was the fastest.

library(microbenchmark)

microbenchmark(ifelse = ifelse(subset_variables1 == "0", 0, 1),
               as.numeric = as.numeric(subset_variables1),
               if_else = dplyr::if_else(subset_variables1 == "0", 0, 1),
               plus = 1 - (subset_variables1 == "0"),
               times = 1000
)

Unit: microseconds
       expr     min       lq     mean   median       uq      max neval
     ifelse 686.668 701.3440 977.0863 910.6570 1170.816 3222.192  1000
 as.numeric 631.813 642.5910 715.8687 677.3830  720.841 1819.925  1000
    if_else 347.409 377.0665 537.3344 482.7055  657.468 1603.241  1000
       plus  97.170  98.8845 129.9091 107.8545  146.303  741.557  1000
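The benchmark above measures time only. Since the question is also about memory, one option worth trying (an assumption on my part, not something measured in this answer) is to subtract from the integer literal 1L so that the result is stored as an integer vector, which takes 4 bytes per element instead of the 8 used by a double:

```r
set.seed(666)
subset_variables1 <- sample(c("0", "1"), 10000, replace = TRUE)

binary_dbl <- 1  - (subset_variables1 == "0")   # double result
binary_int <- 1L - (subset_variables1 == "0")   # integer result, same 0/1 values

typeof(binary_dbl)
typeof(binary_int)

# Compare the storage used by each version
object.size(binary_dbl)
object.size(binary_int)
```

If downstream code only needs a yes/no flag, keeping the logical vector `subset_variables1 != "0"` avoids the conversion step entirely.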

Answer 2

Here is a slower but more general solution:

v <- rep(1,length(subset_variables1))
v[subset_variables1 =="0"] <- 0
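One difference to keep in mind with the indexed-assignment approach is how missing values are handled. The sketch below uses a hypothetical three-element vector `x` containing an NA to compare it with ifelse() and with the arithmetic form:

```r
# Hypothetical input with a missing value
x <- c("0", "1", NA)

# Indexed assignment: positions where the test is NA are left untouched
v <- rep(1, length(x))
v[x == "0"] <- 0

# ifelse() and the arithmetic form both propagate the NA
via_ifelse <- ifelse(x == "0", 0, 1)
via_arith  <- 1 - (x == "0")

v
via_ifelse
via_arith
```

With indexed assignment the NA entry keeps the default value 1, so if missing values must stay missing, one of the other two forms is the safer choice.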

And a version of ifelse for numeric vectors:

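ifelse_sign <- function(test, yes, no) {

    # recycle scalar yes/no values to the length of test
    if (length(yes) == 1) yes <- rep(yes, length(test))
    if (length(no)  == 1) no  <- rep(no,  length(test))

    # zero out the branch that does not apply at each position
    yes[!test] <- 0
    no[test]   <- 0

    # test * 0 adds 0 for TRUE/FALSE and propagates NA where test is NA
    yes + no + test * 0
}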

ifelse_sign(subset_variables1=="0",0,1)

Write ifelse() in a faster way (less memory)
