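I'm new to R. I wrote the function below to compute the mean of a dataset spread across 332 CSV files, and I'm looking for advice on how to improve it. It takes about 38 seconds to run, which seems inefficient to me.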
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE) # creates a list of files
  dat <- data.frame() # creates an empty data frame
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i])) # combine all the monitor data together
  }
  good <- complete.cases(dat) # flag rows that have no NA in any column
  mean(dat[good, pollutant]) # calculate the mean over the complete rows
} # run time ~ 37sec - NEED TO OPTIMISE THE CODE
Answer 0 (score: 4)
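Instead of creating an empty data.frame and calling rbind on every iteration of a for loop, you can store all the data.frames in a list and combine them in a single step. You can also use the na.rm option of the mean function so that NA values are ignored.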
pollutantmean <- function(directory, pollutant, id = 1:332)
{
  # keep only the files selected by id
  files_list = list.files(directory, full.names = TRUE)[id]
  # read every file into a list, then bind the list once
  df = do.call(rbind, lapply(files_list, read.csv))
  mean(df[[pollutant]], na.rm=TRUE)
}
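One caveat: the two versions may not return identical numbers. The question's code uses complete.cases, which drops a row if any column is NA, whereas na.rm = TRUE only drops NAs in the requested column. A minimal sketch of the difference, using a toy data frame whose column names (sulfate, nitrate) are assumptions for illustration only:

```r
# Toy data; the column names are hypothetical, not taken from the real files.
dat <- data.frame(
  sulfate = c(1, 2, NA, 4),
  nitrate = c(NA, 10, 20, 30)
)

# Question's approach: keep only rows where every column is non-missing.
good <- complete.cases(dat)
mean_complete <- mean(dat[good, "sulfate"])          # averages rows 2 and 4

# na.rm approach: ignore NAs in the requested column only.
mean_na_rm <- mean(dat[["sulfate"]], na.rm = TRUE)   # averages rows 1, 2 and 4

print(c(complete_cases = mean_complete, na_rm = mean_na_rm))
```

Which behaviour is wanted depends on the task; if only the chosen pollutant matters, na.rm = TRUE is probably the better fit.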
Optionally, I would use magrittr to improve readability:
library(magrittr)
pollutantmean <- function(directory, pollutant, id = 1:332)
{
  list.files(directory, full.names = TRUE)[id] %>%
    lapply(read.csv) %>%
    do.call(rbind,.) %>%
    extract2(pollutant) %>%
    mean(na.rm=TRUE)
}
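To make the piped steps easier to follow, here is a small sketch showing that the pipeline is just the nested call written left to right. The two in-memory data frames stand in for the results of read.csv, and the column name sulfate is an assumption for illustration:

```r
library(magrittr)

# Two toy data frames standing in for per-file results of read.csv
# (hypothetical data; 'sulfate' is an assumed column name).
dfs <- list(
  data.frame(sulfate = c(1, NA)),
  data.frame(sulfate = c(3, 5))
)

# Piped form: '.' marks where the left-hand side is placed,
# and extract2() is magrittr's alias for `[[`.
piped <- dfs %>%
  do.call(rbind, .) %>%
  extract2("sulfate") %>%
  mean(na.rm = TRUE)

# Equivalent nested form.
nested <- mean(do.call(rbind, dfs)[["sulfate"]], na.rm = TRUE)

print(c(piped = piped, nested = nested))
```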
Answer 1 (score: 1)
You can speed this up with the fread function from data.table (see "Quickly reading very large tables as dataframes in R"). Binding the results with data.table::rbindlist is also faster.
library(data.table)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list = list.files(directory, full.names = TRUE)[id]
  # fread reads each file quickly; rbindlist binds the whole list in one step
  DT = rbindlist(lapply(files_list, fread))
  mean(DT[[pollutant]], na.rm=TRUE)
}
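If you want to check the effect on your own machine, a rough base-R sketch along these lines may help. It writes a few throwaway CSV files to a temporary directory (the file layout, the 20-file count, and the sulfate column are all assumptions, not the real data) and times the grow-in-a-loop approach against the read-into-a-list approach. No timings are quoted here because they depend on the machine and the data size.

```r
# Build a small throwaway directory of CSV files (hypothetical data).
tmp_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(tmp_dir, showWarnings = FALSE)
set.seed(1)
for (i in 1:20) {
  monitor <- data.frame(ID = i, sulfate = c(rnorm(99), NA))
  write.csv(monitor, file.path(tmp_dir, sprintf("%03d.csv", i)),
            row.names = FALSE)
}

files_list <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Approach 1: grow the data frame inside the loop (copies dat every iteration).
time_loop <- system.time({
  dat <- data.frame()
  for (f in files_list) dat <- rbind(dat, read.csv(f))
})

# Approach 2: read everything into a list, then bind once.
time_list <- system.time({
  df <- do.call(rbind, lapply(files_list, read.csv))
})

# Both approaches should produce the same mean.
mean_loop <- mean(dat[["sulfate"]], na.rm = TRUE)
mean_list <- mean(df[["sulfate"]], na.rm = TRUE)

print(rbind(loop = time_loop, list = time_list))
```

With only 20 small files the gap will likely be negligible; the cost of repeated rbind grows as the accumulated data frame gets larger, so the difference should become more visible with hundreds of files.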