Optimizing a for loop with nested if in R

Asked: 2017-04-24 17:08:46

Tags: r for-loop nested-loops

I am trying to merge multiple csv files into a single data frame and to manipulate the resulting data frame with a for loop. The resulting data frame may have anywhere between 1,500,000 and 2,000,000 rows.

I am using the following code.

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) 
{ 
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

But the code takes a very long time to run. Is there a better, faster way to do the same operation?

3 Answers:

Answer 0 (Score: 2)

We can optimize this by reading the files with fread from data.table and then binning the values with cut/findInterval. The gain is even more pronounced on a machine or server node with multiple cores, where fread can use all of them and read in parallel.

library(data.table)
merge_data <- function(path) {
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  # read only columns 8, 9 and 17 from each file and stack them into one data.table
  rbindlist(lapply(files, fread, select = c(8, 9, 17)))
}

Data <- merge_data("D:/Projects")
Data[, Category := cut(Days, breaks = c(-Inf, 30, 60, 90, Inf),
                       labels = c("<=30", "31-60", "61-90", ">90"))]
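The answer also mentions findInterval; for completeness, here is a minimal sketch of the same binning done with findInterval, assuming the same Data as above and that Days is one of the selected columns (the labs vector is my own helper):

# left.open = TRUE makes each bin (lo, hi], matching the cut() call above
labs <- c("<=30", "31-60", "61-90", ">90")
Data[, Category := labs[findInterval(Days, c(30, 60, 90), left.open = TRUE) + 1L]]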

Answer 1 (Score: 1)

You are already using dplyr, so why not:

Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = cut(Days,
                        breaks = c(-Inf, 30, 60, 90, Inf),
                        labels = c("<=30", "31-60", "61-90", ">90")))
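If you prefer to write the bins out as explicit conditions rather than cut, dplyr's case_when is a readable alternative; a minimal sketch, assuming the same merge_data and Days column as above:

Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = case_when(
    Days <= 30 ~ "<=30",    # same bins as the cut() version
    Days <= 60 ~ "31-60",
    Days <= 90 ~ "61-90",
    TRUE       ~ ">90"
  ))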

Answer 2 (Score: 0)

Akrun is indeed correct that fread is far faster than read.csv.

However, in addition to his post, I would add that your for loop is completely unnecessary. He replaced it with cut/findInterval, which I am less familiar with. In terms of basic R programming, a for loop is needed when some factor in the calculation changes row by row. That is not the case in your code, so no for loop is required.

You are essentially running the calculation up to 2 million times when you only need to run it on the column once.

You can replace the for loop with the following:

Data1$Category = ifelse(Data1$Days <= 30, "<=30",
                 ifelse(Data1$Days <= 60, "31-60",
                 ifelse(Data1$Days <= 90, "61-90", ">90")))

and your code will run waaaaaay faster.
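If you want to see the difference for yourself, here is a rough timing sketch with system.time, assuming Data1 as built in the question (the actual numbers depend entirely on your data and hardware):

# row-by-row loop from the question
system.time(
  for (i in 1:nrow(Data1)) {
    Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<=30",
                        ifelse(Data1$Days[i] <= 60, "31-60",
                        ifelse(Data1$Days[i] <= 90, "61-90", ">90")))
  }
)

# single vectorised call over the whole column
system.time(
  Data1$Category <- ifelse(Data1$Days <= 30, "<=30",
                    ifelse(Data1$Days <= 60, "31-60",
                    ifelse(Data1$Days <= 90, "61-90", ">90")))
)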