How to avoid a loop in a specific multi-condition statement in R

Time: 2017-01-07 21:18:10

Tags: r performance loops bigdata

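I am using R to make inferences from electronic medical records (EMRs). I have written a loop that works, but it becomes very slow when processing millions of EMRs. Could anyone convert my code into something faster (perhaps a vectorized computation or some other approach)? My goal is to find out whether a set of columns (here, p324 through p3274) contains any of a set of code prefixes (here, I25.2, I21. and I22.). Here is a sample of my data: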

test <- data.frame(p324 = c("I24.001", "I10.x04", "I48.x02", "I48.x01", "I25.201", "I25.201", "I25.101", "I25.101", "NA", "I50", "I25.101", "I25.101", "I25.101", "I45.102", "I50.902"),
p327 = c("I20.000", "K76.000", "E11.900", "I44.200", "NA", "I49.904", "I45.102", "I50.910", "NA", "I10  05", "J98.402", "NA", "NA", "R57.0", "I10.x04"),
p3291 = c("I50.903", "K80.100", "N39.000", "I25.103", "NA", "I50.908", "NA", "I10  04", "NA", "I25.101", "I10  03", "NA", "NA", "I25.101", "I10.x05"),
p3294 = c("I10.x05", "K76.807", "J98.414", "K81.100", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "J43", "I10.x06"),
p3297 = c("NA", "I83.900", "E87.801", "NA", "NA", "I21.620", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "I10.x07"),
p3281 = c("K80.100", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "I10.x08"),
p3284 = c("K76.807", "I21.620", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "I10.x09"),
p3287 = c("I83.900", "I10.x3", "I10.x2", "I10.x1", "I10.x0", "I10.x1", "I10.x2", "I10.x3", "I10.x4", "I10.x5", "I10.x6", "I10.x7", "I10.x8", "I10.x9", "I10.x10"),
p3271 = c("I50.908", "NA", "I10.x1", "I10.x0", "I10.x1", "I10.x2", "I10.x3", "I10.x4", "I10.x5", "I10.x6", "I10.x7", "I10.x8", "I10.x9", "I10.x10", "I10.x11"),
p3274 = c("NA", "I10.x1", "I10.x0", "I10.x1", "I10.x2", "I10.x3", "I10.x4", "I10.x5", "I10.x6", "I10.x7", "I10.x8", "I10.x9", "I10.x10", "I10.x11", "I10.x12"))

Here is my code:

for (i in 1:15)
{
  if (any(
    c(
      substr(test$p324[i], 1, 5),
      substr(test$p327[i], 1, 5),
      substr(test$p3291[i], 1, 5),
      substr(test$p3294[i], 1, 5),
      substr(test$p3297[i], 1, 5),
      substr(test$p3281[i], 1, 5),
      substr(test$p3284[i], 1, 5),
      substr(test$p3287[i], 1, 5),
      substr(test$p3271[i], 1, 5),
      substr(test$p3274[i], 1, 5)
    ) %in% c("I25.2")
  ) |
  any(
    c(
      substr(test$p324[i], 1, 4),
      substr(test$p327[i], 1, 4),
      substr(test$p3291[i], 1, 4),
      substr(test$p3294[i], 1, 4),
      substr(test$p3297[i], 1, 4),
      substr(test$p3281[i], 1, 4),
      substr(test$p3284[i], 1, 4),
      substr(test$p3287[i], 1, 4),
      substr(test$p3271[i], 1, 4),
      substr(test$p3274[i], 1, 4)
    ) %in% c("I21.", "I22.")
  ))
  test$MI[i] = 1
  else
    test$MI[i] = 0
}

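So, could anyone convert my code, or give me some advice, so that it runs quickly and efficiently even with more than 1 million cases? Many thanks.

3 answers:

Answer 0: (score: 2)

I suggest using a regular expression and then vectorizing with sapply.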

t_test <- as.data.frame(t(test))
chk <- function(x){
  grepl("I25\\.2|I21\\.|I22\\.",x)
}

sapply(t_test,chk)

The result is returned as TRUE/FALSE values, which can easily be converted to 0 or 1.
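As a minimal sketch of that conversion, assuming a small made-up data frame `toy` (not the OP's data): because the data frame is transposed first, `sapply` returns one column per original record, so `colSums(...) > 0` gives a per-record flag that `as.integer` turns into 0/1. Note that `grepl` matches anywhere in the string, whereas the OP's `substr` test only looks at the start of each code.

```r
# Toy data (hypothetical): three records, two diagnosis columns
toy <- data.frame(a = c("I25.201", "I10.x04", "I50"),
                  b = c("K76.000", "I21.620", "NA"),
                  stringsAsFactors = FALSE)

chk <- function(x) grepl("I25\\.2|I21\\.|I22\\.", x)

# After transposing, each column of t_toy holds one original record
t_toy <- as.data.frame(t(toy), stringsAsFactors = FALSE)
hits <- sapply(t_toy, chk)            # logical matrix, 2 x 3

# Any match within a record -> 1, otherwise 0
MI <- as.integer(colSums(hits) > 0)
MI
```

The same `> 0` comparison works on any per-record match count, so it can be reused with the other approaches in this thread.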

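EDIT1: My bad, I did not notice that the check is row-based. I have updated the code above.

EDIT2: Changed the regex pattern: 1. Escaped . as \\. because a bare . matches any character. 2. Replaced [] with |, because [] matches any single one of the characters inside it.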
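The two regex points in EDIT2 can be checked directly. The snippet below is only an illustration; the character class `"[I21.]"` is a made-up example, not the pattern the answer originally used.

```r
# An unescaped "." matches any character, so "I21." also matches "I210"
dot_unescaped <- grepl("I21.", "I210")
dot_escaped   <- grepl("I21\\.", "I210")
dot_literal   <- grepl("I21\\.", "I21.620")

# "[...]" is a character class: "[I21.]" matches any single one of
# the characters I, 2, 1 or ".", so "K11" matches because it contains "1"
class_match <- grepl("[I21.]", "K11")

# "|" is alternation between whole sub-patterns
alt_match <- grepl("I25\\.2|I21\\.|I22\\.", c("I25.201", "I25.101", "I22.000"))

dot_unescaped; dot_escaped; dot_literal; class_match; alt_match
```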

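Answer 1: (score: 2)

If you are looking for a performance improvement:

  1. Don't loop row by row (sapply is also a loop)
  2. Don't run vectorized operations inside a loop (why run substr row by row when you can run it just once?)
  3. Avoid regular expressions where possible: they are slow. If you are dealing with exact matches here, use == or %in% instead
  4. Here is a simple, vectorized possible solution to your problem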

    res <- (substr(unlist(test), 1, 5) == "I25.2") | 
           (substr(unlist(test), 1, 4) %in% c("I21.", "I22."))
    dim(res) <- dim(test)
    test$MI <- rowSums(res)
    

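    This basically runs substr only once for each k / n combination in substr(x, k, n) (2 in total) and compares the result against the values of interest. Then (because %in% has no data.frame method) we convert the resulting vector back to the right dimensions and sum the matches per row (in a vectorized way). The result is the number of matches in each row. If you like, it can easily be converted to binary (also in a vectorized way).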
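The binary conversion mentioned above could look like the following sketch. The data frame `toy` is made up for illustration; the logic mirrors the `substr` code in this answer, with `n_match > 0` added to collapse counts to a 0/1 flag like the OP's `MI` column.

```r
# Toy data (hypothetical): four records, two diagnosis columns
toy <- data.frame(a = c("I25.201", "I10.x04", "I21.000", "I50"),
                  b = c("I22.100", "I21.620", "NA", "K76.000"),
                  stringsAsFactors = FALSE)

codes <- unlist(toy, use.names = FALSE)   # one long vector, column by column
res <- (substr(codes, 1, 5) == "I25.2") |
       (substr(codes, 1, 4) %in% c("I21.", "I22."))
dim(res) <- dim(toy)                      # back to records x columns

n_match <- rowSums(res)                   # number of matching cells per record
MI <- as.integer(n_match > 0)             # 1 if the record has any match
MI
```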

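    Benchmark

    The OP mentioned benchmarking, so here are some benchmarks on a 10K x 10 (rows x columns) data set

    1. The grepl / sapply solution is about 10x slower than the vectorized solutions
    2. I suggest vectorizing the grepl solution, which improves its performance by about 10x
    3. My own solution performs very similarly to the vectorized grepl solution, but I believe it will generalize better: the regex gets more complex with every additional match, whereas %in% has almost no marginal cost for additional values
    4. Setup (using the OP's test data)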

      set.seed(123)
      big.df <- as.data.frame(matrix(sample(unlist(test, use.names = FALSE), 1e5, replace = TRUE), ncol = 10))
      
      # sapply / grepl
      SixHu <- function(df) { 
        t_test <- as.data.frame(t(df))
        chk <- function(x){
          grepl("I25\\.2|I21\\.|I22\\.",x)
        }
        unname(colSums(sapply(t_test, chk)))
      }
      
      # Vectorized grepl
      SixHuVec <- function(df) { 
        res <- grepl("I25\\.2|I21\\.|I22\\.", unlist(df))
        dim(res) <- dim(df) 
        rowSums(res)
      }
      
      # Vectorized substr
      David <- function(df) { 
        tmp <- unlist(df)
        res <- (substr(tmp, 1, 5) == "I25.2") | (substr(tmp, 1, 4) %in% c("I21.", "I22."))
        dim(res) <- dim(df)
        rowSums(res)
      }
      

      Validation

      identical(SixHu(test), SixHuVec(test))
      ## [1] TRUE
      identical(SixHu(test), David(test))
      ## [1] TRUE
      

      Benchmark results

      microbenchmark::microbenchmark(SixHu(big.df),
                                     SixHuVec(big.df),
                                     David(big.df))
      # Unit: milliseconds
      #             expr       min         lq       mean     median        uq       max neval cld
      #    SixHu(big.df) 989.55655 1021.17121 1047.63956 1041.94771 1062.7705 1151.4196   100   b
      # SixHuVec(big.df)  67.52131   72.39233   84.61193   75.31462   85.5352  147.0646   100  a 
      #    David(big.df)  63.48242   68.20945   88.73896   75.19159  115.3958  147.0867   100  a 
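To illustrate the point about `%in%` generalizing cheaply, here is a rough sketch of how the `substr` approach might be wrapped for an arbitrary set of prefixes. The helper name `count_prefix_hits` and the data frame `toy` are hypothetical and not part of the answer; the loop runs once per distinct prefix length, not once per row.

```r
# Hypothetical helper: per record, count the cells that start with
# any of the given prefixes
count_prefix_hits <- function(df, prefixes) {
  codes <- as.character(unlist(df, use.names = FALSE))
  hit <- logical(length(codes))
  # One substr() call per distinct prefix length, then a %in% lookup
  for (n in unique(nchar(prefixes))) {
    hit <- hit | (substr(codes, 1, n) %in% prefixes[nchar(prefixes) == n])
  }
  dim(hit) <- dim(df)
  rowSums(hit)
}

# Toy data (hypothetical)
toy <- data.frame(a = c("I25.201", "I10.x04", "I21.000", "I50"),
                  b = c("I22.100", "I21.620", NA, "K76.000"),
                  stringsAsFactors = FALSE)

counts <- count_prefix_hits(toy, c("I25.2", "I21.", "I22."))
counts
```

Using `%in%` for every comparison means a missing value simply counts as a non-match, so no `NA` reaches `rowSums`.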
      

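Answer 2: (score: 0)

Update 1

I compared "sapply & grepl()" from @Sixiang.Hu, "grepl()" from @David Arenburg, and "substr" from @David Arenburg, and it looks like the sapply code has the best performance. However, the "substr" code from @David Arenburg produces many NA values. Can anyone explain why these NA values appear?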

> # sapply & grepl()
> start.time <- Sys.time()
> test <- subset(I61, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
> MIchk <- function(x){
+   grepl("I25\\.2|I21\\.|I22\\.",x)
+ }
> test1 <- sapply(test,MIchk)
> test$MI <- rowSums(test1)
> end.time <- Sys.time()
> end.time - start.time
Time difference of 2.363007 secs
> table(test$MI,exclude = NULL)

     0      1      2   <NA> 
254495   3523     15      0 
> 
> # grepl() 
> start.time <- Sys.time()
> test <- subset(I61, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
> res <- grepl("I25\\.2|I21\\.|I22\\.", unlist(test))
> dim(res) <- dim(test)
> test$MI1 <- rowSums(res)
> end.time <- Sys.time()
> end.time - start.time
Time difference of 2.51223 secs
> table(test$MI1,exclude = NULL)

     0      1      2   <NA> 
254495   3523     15      0 
> 
> # substr
> start.time <- Sys.time()
> test <- subset(I61, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
> res <- (substr(unlist(test), 1, 5) == "I25.2") | (substr(unlist(test), 1, 4) %in% c("I21.", "I22."))
> dim(res) <- dim(test)
> test$MI2 <- rowSums(res)
> end.time <- Sys.time()
> end.time - start.time
Time difference of 3.473388 secs
> table(test$MI2,exclude = NULL)

     0      1      2   <NA> 
154897   2461     11 100664

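Update 2

The substr operation generated many NA values because my dataset contains NA values. So I ran the following code, after which the three operations above gave consistent results: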

library(dplyr)
test %>% mutate_if(is.factor, as.character) -> test 
test[is.na(test)]<-0
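A small sketch, on a made-up vector `x`, of why only the `substr` version is likely to produce `NA`: `==` propagates missing values, while `%in%` and `grepl` treat them as non-matches. Besides replacing the missing values as above, using `%in%` for both comparisons or passing `na.rm = TRUE` to `rowSums` would be other ways to get consistent counts.

```r
x <- c("I25.201", NA, "I21.620")   # hypothetical codes with one missing value

a <- substr(x, 1, 5) == "I25.2"               # NA in, NA out
b <- substr(x, 1, 4) %in% c("I21.", "I22.")   # NA counts as "no match"
combined <- a | b                             # NA | FALSE is NA
via_grepl <- grepl("I25\\.2|I21\\.|I22\\.", x)

# NA-safe variant 1: use %in% for the first comparison as well
safe <- (substr(x, 1, 5) %in% "I25.2") | b

# NA-safe variant 2: keep ==, but drop NAs when summing
total <- rowSums(matrix(combined, nrow = 1), na.rm = TRUE)

combined; via_grepl; safe; total
```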

Then I ran the three pieces of code:

> #=================================
> # sapply & grepl()
> start.time <- Sys.time()
> MIchk <- function(x){
+   grepl("I25\\.2|I21\\.|I22\\.",x)
+ }
> test1 <- sapply(test,MIchk)
> test$MI <- rowSums(test1)
> end.time <- Sys.time()
> end.time - start.time
Time difference of 5.864876 secs
> table(test$MI,exclude = NULL)

     0      1      2   <NA> 
520339   3081     21      0 
> #=================================
> # grepl() 
> start.time <- Sys.time()
> test1 <- subset(test, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
> res <- grepl("I25\\.2|I21\\.|I22\\.", unlist(test1))
> dim(res) <- dim(test1)
> test$MI1 <- rowSums(res)
> end.time <- Sys.time()
> end.time - start.time
Time difference of 17.20333 secs
> table(test$MI1,exclude = NULL)

     0      1      2   <NA> 
520339   3081     21      0 
> #=================================
> # substr
> start.time <- Sys.time()
> test2 <- subset(test, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
> tmp <- unlist(test2)
> res <- (substr(tmp, 1, 5) == "I25.2") | (substr(tmp, 1, 4) %in% c("I21.", "I22."))
> dim(res) <- dim(test2)
> test$MI2 <- rowSums(res)
> end.time <- Sys.time()
> end.time - start.time
Time difference of 4.386484 secs
> table(test$MI2,exclude = NULL)

     0      1      2   <NA> 
520339   3081     21      0 

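Finally, I also ran a benchmark. It shows that the substr approach is slightly better than sapply / grepl and clearly better than the vectorized grepl on its own. Here are my code and results: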

#--------------------------------
SixHu <- function(df) { 
  MIchk <- function(x){
    grepl("I25\\.2|I21\\.|I22\\.",x)
  }
  test1 <- sapply(df,MIchk)
  rowSums(test1)
}
#--------------------------------
# Vectorized grepl
SixHuVec <- function(df) { 
  res <- grepl("I25\\.2|I21\\.|I22\\.", unlist(df))
  dim(res) <- dim(df) 
  rowSums(res)
}
#--------------------------------
David <- function(df) { 
  tmp <- unlist(df)
  res <- (substr(tmp, 1, 5) == "I25.2") | (substr(tmp, 1, 4) %in% c("I21.", "I22."))
  dim(res) <- dim(df)
  rowSums(res)
}
> microbenchmark::microbenchmark(SixHu(test),
+                                SixHuVec(test),
+                                David(test))
Unit: seconds
           expr       min        lq      mean    median        uq       max neval cld
    SixHu(test)  4.323772  4.598328  4.836165  4.760263  4.988194  5.801979   100  b 
 SixHuVec(test) 11.867062 12.826925 13.342357 13.243638 13.635339 18.705615   100   c
    David(test)  3.728264  4.180152  4.389600  4.344938  4.519908  6.396018   100 a  

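So the vectorized substr() from @David Arenburg is the best answer, and the sapply / grepl code from @Sixiang.Hu does better than the vectorized grepl() from @David Arenburg. In any case, all three methods are much better than the OP's loop :(. Thanks, everyone! @David Arenburg @Sixiang.Hu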