So I have a directory of folders (one per year, from 1990 to 2015), and each folder contains 100+ CSVs:
data/1990/alabama.csv
data/1990/alaska.csv
data/1990/arizona.csv
...
data/1991/alabama.csv
data/1991/alaska.csv
data/1991/arizona.csv
...etc.
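I'm applying a function that cleans each CSV and saves it to another folder.
So far I have this for loop, which grabs all the file names and puts them in a data frame, with one row per year: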
filepath <- "~/Desktop/project/data"
setwd(dir = filepath)
top_file_dir = dir()
indi_file_name = sapply(top_file_dir, dir)
filename = as.data.frame("", nrow = length(top_file_dir), ncol = 5000, stringsAsFactors = FALSE )
for (i in 1:length(top_file_dir)){
  indi_file_name = sapply(top_file_dir[i], dir)
  for (j in 1:length(indi_file_name))
    filename[i,j] = paste(top_file_dir[i],indi_file_name[j],sep="/")
}
Then I have a fairly simple function that tidies up each dataset:
general_clean <- function(currfile=filename) {
  geo <- read.csv(file=paste(filepath,currfile,sep="/") , stringsAsFactors=FALSE, colClasses = c("area_fips"="character"))
  # remove unwanted columns
  keep <- c("area_fips", "year", "area_title")
  geoClean <- geo[keep]
  # export new data into csv
  save_file = paste("~/Desktop/project/output",substring(currfile,21,last=1000),sep="/")
  write.csv(geoClean, file=save_file)
}
# apply function, input each year by hand...[1,]=1990, [2,]=1991, etc.
sapply(filename[1,], general_clean)
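This sort of works, but I'd like to add a step that combines all the smaller CSVs for a year into one new CSV per year. That seems to involve creating an empty list and using rbind inside the "general_clean" function to keep appending new data? But that's beyond my R abilities, and everything I've tried so far has been guesswork. Any suggestions?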
Answer 0 (score: 0)
This should get you close. Make use of list.files(..., full.names = TRUE) to save yourself a bunch of paste() calls.
years <- list.dirs("~/Desktop/project/data", full.names = TRUE, recursive = FALSE)
# list only the folders directly inside "data"
general_clean <- function(file) {
  geo <- read.csv(file = file,
                  stringsAsFactors = FALSE,
                  colClasses = c("area_fips" = "character"))
  keep <- c("area_fips", "year", "area_title") # move all cleaning into your fxn
  geoClean <- geo[keep]
  return(geoClean)
}
# move all your cleaning steps into your fxn
for (y in years) {
  year_name <- sub(".*data/(\\d{4}).*", "\\1", y) # pull the 4-digit year out of the path
  states <- dir(y, full.names = TRUE) # now list all files in each year
  readin_list <- lapply(states, general_clean) # list of small data frames
  readin_dataframe <- do.call(rbind, readin_list) # make it into a big one
  # write one csv per year, e.g. output/1990.csv
  write.csv(readin_dataframe,
            file.path("~/Desktop/project/output", paste0(year_name, ".csv")))
}
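I'm 99% sure this won't work perfectly the first time, but since I can't see all your data, this is my best guess and a good starting point. Let me know what goes wrong and we can get it working with your data :)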
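If you want to try the read, clean, rbind, write pattern without touching your real files, here is a minimal sketch that builds a throwaway copy of the folder layout in a temp directory first. The toy values, the `extra` column, the `project_demo` folder, and the helper name `clean_one` are all made up for illustration; only the three kept column names come from the question.

```r
# Build a tiny fake data/<year>/<state>.csv layout in a temp dir.
# Every path and value below is hypothetical demo data.
root <- file.path(tempdir(), "project_demo")
data_dir <- file.path(root, "data")
out_dir <- file.path(root, "output")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

for (yr in c("1990", "1991")) {
  dir.create(file.path(data_dir, yr), recursive = TRUE, showWarnings = FALSE)
  for (st in c("alabama", "alaska")) {
    toy <- data.frame(area_fips = "01000", year = as.integer(yr),
                      area_title = st, extra = 1, stringsAsFactors = FALSE)
    write.csv(toy, file.path(data_dir, yr, paste0(st, ".csv")),
              row.names = FALSE)
  }
}

# Read one file and keep only the wanted columns (same idea as general_clean).
clean_one <- function(path) {
  geo <- read.csv(path, stringsAsFactors = FALSE,
                  colClasses = c("area_fips" = "character"))
  geo[c("area_fips", "year", "area_title")]
}

# For each year folder: clean every csv, stack them, write one csv per year.
year_dirs <- list.dirs(data_dir, full.names = TRUE, recursive = FALSE)
for (y in year_dirs) {
  files <- list.files(y, pattern = "\\.csv$", full.names = TRUE)
  combined <- do.call(rbind, lapply(files, clean_one))
  write.csv(combined, file.path(out_dir, paste0(basename(y), ".csv")),
            row.names = FALSE)
}

list.files(out_dir)
```

Here `basename(y)` stands in for the regex as a way to get the year from the folder path, and `row.names = FALSE` keeps write.csv from adding an extra index column; drop it if you want the row names. Note that rbind only works if every cleaned data frame has the same columns, which selecting the same `keep` columns from each file takes care of.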