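I'm studying R and learning how to code. I wrote a piece of code that uses a for loop, but it turns out to be slow. I'd like some help converting it to use sapply or lapply. Here is my working R code: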
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names=TRUE) #creates a list of files
  dat <- data.frame() #creates an empty data frame
  for (i in seq_along(files_list)) {
    #loops through the files, rbinding them together
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- filter(dat, dat$ID %in% id) #subsets the rows that match the 'ID' argument
  mean(dat_subset[, pollutant], na.rm=TRUE) #identifies the Mean of a Pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
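This code takes nearly 20 seconds to return, which is unacceptable for 332 records. What if I had a dataset with 10,000 records and wanted the mean of these variables?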
Answer 0 (score: 1)
You can rbind all elements of a list using do.call, and you can read all the files into that list using lapply:
mean(
  filter( # here's the filter that will be applied to the rbind-ed data
    do.call("rbind", # call "rbind" on all elements of a list
      lapply( # create a list by reading in the files from list.files()
        # add any necessary args to read.csv:
        list.files("[::DIR_PATH::]", full.names = TRUE), function(x) read.csv(file = x)
      )
    ),
    ID %in% id # make sure id is replaced with what you want
  )[[pollutant]], # pollutant holds the column name as a string, e.g. "sulfate"
  na.rm = TRUE
)
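To make the do.call and lapply idea concrete, here is a minimal sketch of the question's function rewritten in base R. The name pollutantmean_lapply and the temporary demo files are made up for illustration, and the sketch assumes every CSV has an ID column plus the pollutant column.

```r
# Sketch: base-R version of pollutantmean() using lapply() + do.call(rbind, ...).
# pollutantmean_lapply is a hypothetical name, not part of the original answer.
pollutantmean_lapply <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  # Read every file into a list of data frames (nothing grows inside a loop)
  frames <- lapply(files_list, read.csv)
  # Bind all data frames together in a single call
  dat <- do.call(rbind, frames)
  # Keep the rows whose ID was requested, then average the chosen column
  dat_subset <- dat[dat$ID %in% id, , drop = FALSE]
  mean(dat_subset[[pollutant]], na.rm = TRUE)
}

# Hypothetical demo data: three small CSV files in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(
    data.frame(ID = k, sulfate = c(k, k + 1, NA)),
    file.path(demo_dir, sprintf("%03d.csv", k)),
    row.names = FALSE
  )
}

# Mean of sulfate over the files with ID 1 and 2, ignoring NA values
demo_mean <- pollutantmean_lapply(demo_dir, "sulfate", 1:2)
print(demo_mean)
```

With the real data the call would look like pollutantmean_lapply("specdata", "sulfate", 1:10). Using [[pollutant]] rather than $pollutant matters here because pollutant is a variable holding the column name.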
Answer 1 (score: 0)
Your code is slow because you are growing the data frame incrementally inside the loop. One approach, using dplyr together with map_df from purrr, is:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names=TRUE)
  purrr::map_df(files_list, read.csv) %>%
    filter(ID %in% id) %>%
    summarise_at(pollutant, mean, na.rm = TRUE)
}
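To see the cost of growing a data frame inside a loop, the following sketch times both strategies on made-up in-memory data. The sizes (200 pieces of 50 rows) and the helper names grow_in_loop and bind_once are arbitrary choices for illustration. Timings vary by machine, and with real files part of the run time is read.csv itself.

```r
# Sketch: compare growing a data frame in a loop with binding once at the end.
# The data are synthetic; grow_in_loop and bind_once are hypothetical helpers.
set.seed(1)
pieces <- lapply(1:200, function(k) data.frame(ID = k, sulfate = runif(50)))

grow_in_loop <- function(pieces) {
  dat <- data.frame()
  for (p in pieces) {
    dat <- rbind(dat, p)  # re-copies everything accumulated so far on each pass
  }
  dat
}

bind_once <- function(pieces) {
  do.call(rbind, pieces)  # a single rbind call over the whole list
}

time_loop <- system.time(res_loop <- grow_in_loop(pieces))
time_once <- system.time(res_once <- bind_once(pieces))

# Elapsed seconds for each strategy (values depend on the machine)
print(time_loop["elapsed"])
print(time_once["elapsed"])
```

Both functions return the same rows, so the choice between them affects only speed and memory use.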