Question

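I'm studying R and learning how to code. I've written a piece of code that uses a for loop, but it turns out to be slow. I'd like some help converting it to use the sapply or lapply functions. Here is my working R code: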

library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332)   {
      files_list <- list.files(directory, full.names=TRUE)   #creates a list of files
      dat <- data.frame()                             #creates an empty data frame
      for (i in seq_along(files_list)) {
            #loops through the files, rbinding them together
            dat <- rbind(dat, read.csv(files_list[i]))
      }
      dat_subset <- filter(dat, dat$ID %in% id) #subsets the rows that match the 'ID' argument
      mean(dat_subset[, pollutant], na.rm=TRUE)      #identifies the Mean of a Pollutant
}

pollutantmean("specdata", "sulfate", 1:10)

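This code takes almost 20 seconds to return, which is unacceptable for 332 records. Imagine if I had a dataset with 10,000 records and wanted the mean of these variables?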

Answer 1

You can rbind all elements of a list using do.call, and you can read all the files into that list using lapply:

mean(
  filter( # here's the filter that will be applied to the rbind-ed data
    do.call("rbind", # call "rbind" on all elements of a list
      lapply( # create a list by reading in the files from list.files()
        # add any necessary args to read.csv:
        list.files("[::DIR_PATH::]"), function(x) read.csv(file=x, ...)
      )
    ),
  ID %in% id)$pollutant, # make sure id is replaced with what you want
  na.rm = TRUE
)
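The same idea can be wrapped in a function with the signature used in the question. This is only a sketch under a few assumptions that are not in the answer above: the name pollutantmean_lapply is made up for illustration, list.files is called with full.names = TRUE so that read.csv receives usable paths, and the column is selected with [[pollutant]] so that the column name can be passed as a string.

```r
# Sketch: base-R version of pollutantmean using lapply + do.call(rbind).
# `pollutantmean_lapply` is a hypothetical name, not from the original thread.
pollutantmean_lapply <- function(directory, pollutant, id = 1:332) {
  # full.names = TRUE returns paths that read.csv can open directly
  files_list <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

  # Read every file into a list of data frames (nothing is grown in a loop)
  dat_list <- lapply(files_list, read.csv)

  # Bind all data frames together in a single call
  dat <- do.call(rbind, dat_list)

  # Keep only rows whose ID was requested, then average the chosen column
  dat_subset <- dat[dat$ID %in% id, ]
  mean(dat_subset[[pollutant]], na.rm = TRUE)
}
```

It would be called the same way as the original, for example pollutantmean_lapply("specdata", "sulfate", 1:10).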

Answer 2

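Your code is slow because you are growing the data frame incrementally inside the loop. One approach, using dplyr together with map_df from purrr, could be: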

library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332)   {

    files_list <- list.files(directory, full.names=TRUE)  
    purrr::map_df(files_list, read.csv) %>%
                  filter(ID %in% id) %>%
                  summarise_at(pollutant, mean, na.rm = TRUE)

}
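To see the cost of growing a data frame inside a loop, the two patterns can be timed side by side. The sketch below uses synthetic data (200 made-up data frames of 500 rows each) rather than the specdata files, so the numbers it prints are only indicative and will vary by machine.

```r
# Synthetic stand-ins for the contents of the CSV files (assumed sizes)
set.seed(1)
pieces <- lapply(1:200, function(i) data.frame(ID = i, sulfate = rnorm(500)))

# Pattern 1: grow the data frame inside the loop.
# Each rbind copies everything accumulated so far.
grow_time <- system.time({
  grown <- data.frame()
  for (p in pieces) grown <- rbind(grown, p)
})

# Pattern 2: collect the pieces in a list and bind them once
bind_time <- system.time({
  bound <- do.call(rbind, pieces)
})

# Both patterns produce the same data
stopifnot(identical(dim(grown), dim(bound)))
stopifnot(identical(grown$sulfate, bound$sulfate))

# Elapsed seconds for each pattern
print(c(grow = grow_time[["elapsed"]], bind_once = bind_time[["elapsed"]]))
```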

Need help converting a for loop to lapply or sapply
