Question

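I am trying to build a function that imports/reads multiple data tables stored in .csv files and then computes statistics on the selected files. Each of the 332 .csv files contains a table with the same column names: Date, Pollutant, and id. There are many missing values.

This is the function I have written so far to compute the mean of a pollutant: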

pollutantmean <- function(directory, pollutant, id = 1:332) { 

  library(dplyr)
  setwd(directory)
  good<-c()

  for (i in (id)){
    task1<-read.csv(sprintf("%03d.csv",i))
  }

  p<-select(task1, pollutant)
  good<-c(good,complete.cases(p))
  mean(p[good,]) 
}

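The problem I have is that every time the loop runs, a new file is read and the data already read is replaced by the data from the new file. So I end up with a function that works perfectly with a single file, but not when I want to select multiple files. For example, if I ask for id = 10:20, I end up with the mean calculated on file 20 only.

How can I change the code so that I can select multiple files?

Thanks!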

Answer 1

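My answer offers a way to do what you want without using a loop (if I understood everything correctly). My two assumptions are: (1) you have 332 *.csv files with the same header (column names), so all files have the same structure, and (2) you can combine your tables into one big data frame.

If both assumptions hold, I would use a list of your files to import them as data frames (so this answer does not contain a loop function!).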

library(data.table)  # provides rbindlist()

# This creates a character vector with the names of your files. You have to provide the path to the folder.
file_list <- list.files(path = "<path to the folder where your *.csv files are saved>", full.names = TRUE)

# This will create a list of data frames.
mylist <- lapply(file_list, read.csv)

# This will row-bind the data frames of the list into one big data.table.
mydata <- rbindlist(mylist)

# Now you can perform your calculation on this big data frame, using your column information to filter or subset to get information of just a subset of this table (if necessary).

I hope this helps.
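
To make the last step concrete, here is a minimal sketch of the filter-then-average idea on a combined table. It is only an illustration under stated assumptions: the folder `demo_dir` and its three tiny files are made up so the code runs anywhere, the column names Date, Pollutant, and id are taken from the question, and base R's `do.call(rbind, ...)` stands in for `rbindlist` so that no extra package is needed.

```r
# Assumption: a throwaway folder with three tiny files replaces the real 332 files.
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  df <- data.frame(
    Date = as.character(as.Date("2020-01-01") + 0:1),
    Pollutant = c(i, NA),   # one missing value per file
    id = i
  )
  write.csv(df, file.path(demo_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}

# Read all files and stack them into one data frame.
file_list <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)
mylist <- lapply(file_list, read.csv)
mydata <- do.call(rbind, mylist)

# Keep only the monitors of interest, then average the non-missing values.
wanted <- mydata[mydata$id %in% 2:3, ]
result <- mean(wanted$Pollutant, na.rm = TRUE)
print(result)
```

Because the id column travels with every row, selecting files becomes an ordinary row filter on the combined table, and `na.rm = TRUE` takes care of the missing values.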

Answer 2

Maybe something like this?

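library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
    # Switch to the data directory and restore the previous one on exit.
    od <- setwd(directory)
    on.exit(setwd(od))

    # Read every requested file into a list of data frames.
    task_list <- lapply(sprintf("%03d.csv", id), read.csv)
    # Extract the pollutant column from each file as a plain vector.
    p_list <- lapply(task_list, function(x) select(x, all_of(pollutant))[[1]])
    # Pool the values from all files and average the non-missing ones.
    mean(unlist(p_list), na.rm = TRUE)
}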

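Notes:
- Put all library calls at the top of the script, where they are easier to find. Never put them inside a function.
- Setting the working directory inside a function is also a bad idea. When the function returns, the change is still in effect and you may get lost. A better approach is to set the working directory outside the function, but since you already set it inside the function, I have added code to handle that.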
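
As an illustration of the second note, one possible alternative is to skip `setwd` altogether and build full file paths with `file.path`. This is only a sketch, not the code from the answer: the name `pollutantmean_nowd`, the folder `demo_dir`, and the sample files are hypothetical and exist only so the example runs on its own.

```r
# Hypothetical variant that never changes the working directory.
pollutantmean_nowd <- function(directory, pollutant, id = 1:332) {
  # Build full paths such as "<directory>/010.csv" instead of calling setwd().
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Read each file and keep only the requested column.
  values <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
  # Pool all observations and ignore the missing ones.
  mean(values, na.rm = TRUE)
}

# Assumption: throwaway data so the sketch can be run as-is.
demo_dir <- file.path(tempdir(), "nowd_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 10:12) {
  write.csv(
    data.frame(Date = "2020-01-01", Pollutant = c(i, NA), id = i),
    file.path(demo_dir, sprintf("%03d.csv", i)),
    row.names = FALSE
  )
}

wd_before <- getwd()
result <- pollutantmean_nowd(demo_dir, "Pollutant", id = 10:12)
print(result)
```

Since the working directory is never touched, there is nothing to restore, and the function behaves the same no matter where it is called from.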

R 3.4.1: reading data from multiple .csv files

2 answers: