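How to loop through a folder of CSV files in R

Asked: 2016-10-15 19:48:10

Tags: r file loops csv

I have a folder containing a bunch of files titled "yob1980", "yob1981", "yob1982", etc.

I have to use a for loop to go through each file and put its contents into a data frame. The columns in the data frame should be "1980", "1981", "1982", etc.

This is what I have: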

file_list <- list.files()

temp = list.files(pattern="*.txt")
babynames <- do.call(rbind, lapply(temp, read.csv, header = FALSE))

names(babynames) <- c("Name", "Gender", "Count")

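I feel like I need a for loop, but I'm not sure how to loop through the files. Can anyone point me in the right direction?

4 Answers:

Answer 0: (score: 2)

Consider an anonymous function in lapply():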

files = list.files(pattern="*.txt")

dfList <- lapply(files, function(i) {
     df <- read.csv(i, header=FALSE, col.names=c("Name", "Gender", "Count"))
     # strip the "yob" prefix and the ".txt" extension, leaving only the year
     df$Year <- gsub("yob|\\.txt$", "", i)
     return(df)
})

finaldf <- do.call(rbind, dfList)

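Answer 1: (score: 1)

My favourite approach is to use ldply from the plyr package. It has the advantage of returning a data frame, so you don't need a separate rbind step afterwards: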

library( plyr )
babynames <- ldply( .data = list.files(pattern="*.txt"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count") )

As an added bonus, you can very easily run the import in parallel, which can make importing large multi-file data sets faster:

library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="*.txt"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count"),
                    .parallel = TRUE )

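To change the above slightly so that the resulting data frame includes a Year column, you can first create a function and then execute it within ldply, the same way you would execute read.csv: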

readFun <- function( filename ) {

    # read in the data
    data <- read.csv( filename, 
                      header = FALSE, 
                      col.names = c( "Name", "Gender", "Count" ) )

    # add a "Year" column by removing both "yob" and ".txt" from file name
    data$Year <- gsub( "yob|\\.txt", "", filename )

    return( data )
}

# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="*.txt"),
                          .fun = readFun,
                          .parallel = TRUE )

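This gives you your data in a concise, tidy format, which is how I'd recommend moving forward from here. While it is possible to separate each year's data into its own column, that is probably not the best way to go.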
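If you do want the layout from the question, with one column per year, the long data frame can be reshaped afterwards. Here is a minimal sketch using base R's reshape(). The `long` data frame and its values are made up for illustration and only stand in for the combined result:

```r
# Hypothetical stand-in for the combined long-format data (one row per name/year)
long <- data.frame(
  Name   = c("Mary", "John", "Mary", "John"),
  Gender = c("F", "M", "F", "M"),
  Count  = c(100L, 90L, 95L, 85L),
  Year   = c("1980", "1980", "1981", "1981"),
  stringsAsFactors = FALSE
)

# Spread Count into one column per Year; Name and Gender identify each row
wide <- reshape(long,
                idvar     = c("Name", "Gender"),
                timevar   = "Year",
                v.names   = "Count",
                direction = "wide")

# reshape() names the new columns "Count.1980", "Count.1981", ...;
# drop the "Count." prefix so the columns are just the years
names(wide) <- sub("^Count\\.", "", names(wide))

print(wide)
```

A name that does not appear in a given year ends up with NA in that year's column, which is one reason the long format is usually easier to work with.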

Note: depending on your preference, it might be a good idea to convert the Year column to the integer class. But that's up to you.
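A minimal sketch of that conversion, assuming Year holds strings such as "1980" with the "yob" prefix and ".txt" extension already stripped. The data frame below is a made-up stand-in, not the real data:

```r
# Hypothetical stand-in for the imported data; Year is character
# because it was derived from the file name with gsub()
babynames <- data.frame(
  Name   = c("Mary", "John"),
  Gender = c("F", "M"),
  Count  = c(100L, 90L),
  Year   = c("1980", "1981"),
  stringsAsFactors = FALSE
)

# Convert the character years to integers so they sort and compare numerically
babynames$Year <- as.integer(babynames$Year)

str(babynames)
```

If any file name leaves something non-numeric behind, as.integer() returns NA with a warning for those rows, so it is worth checking anyNA(babynames$Year) afterwards.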

Answer 2: (score: 1)

Using purrr:

library(tidyverse)

files <- list.files(path = "./data/", pattern = "*.csv")

df <- files %>% 
    map(function(x) {
        read.csv(paste0("./data/", x))
    }) %>%
    reduce(rbind)

Answer 3: (score: 0)

A for loop might be more appropriate than lapply in this case.

file_list = list.files(pattern="*.txt")
data_list <- vector("list", length = length(file_list))

for (i in seq_along(file_list)) {
    filename = file_list[[i]]

    # Read data in
    df <- read.csv(filename, header = FALSE, col.names = c("Name", "Gender", "Count"))

    # Extract year from filename
    year = gsub("yob|\\.txt$", "", filename)
    df[["Year"]] = year

    # Add year to data_list
    data_list[[i]] <- df
}

babynames <- do.call(rbind, data_list)