Merging datasets from subfolders into one csv file in R, assigning the subfolder to each row

Asked: 2018-10-09 13:24:44

Tags: r

I have a folder with many datasets:

C:/path/folder

The folder contains subfolders:

/1
/2
/3
...

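Each subfolder has 1-20 csv files.

I need to merge all the csv files from the subfolders into a single csv file, but each observation must be tagged with the subfolder it came from.

For example, if I merge the csv files in subfolder 1 and subfolder 2, I get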

newdata=structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "02.01.2018", class = "factor"), 
    Revenue = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Budget = c(6.25, 6.25, 5.92, 
    6.25, 5.92, 6.25, 5.92, 5.92, 5.92, 6.25, 6.25, 6.25, 5.92, 
    6.25, 6.25, 5.92, 5.92, 5.92, 6.25, 5.92)), .Names = c("Date", 
"Revenue", "Budget"), class = "data.frame", row.names = c(NA, 
-20L))

This is not quite what I want: I need each observation to carry the number of its subfolder. So the desired output is

Date    Revenue Budget  subfolder
02.01.2018  0   6,25    1
02.01.2018  0   6,25    1
02.01.2018  0   5,92    1
02.01.2018  0   6,25    1
02.01.2018  0   5,92    1
02.01.2018  0   6,25    1
02.01.2018  0   5,92    1
02.01.2018  0   5,92    1
02.01.2018  0   5,92    1
02.01.2018  0   6,25    1
02.01.2018  0   6,25    1
02.01.2018  0   6,25    2
02.01.2018  0   5,92    2
02.01.2018  0   6,25    2
02.01.2018  0   6,25    2
02.01.2018  0   5,92    2
02.01.2018  0   5,92    2
02.01.2018  0   5,92    2
02.01.2018  0   6,25    2
02.01.2018  0   5,92    2

So observations 1:11 come from subfolder 1 (f1 below has 11 rows) and observations 12:20 come from subfolder 2 (f2 below has 9 rows).

The two subfolders separately. Subfolder 1:

C:/path/folder/subfolder1

f1=structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = "02.01.2018", class = "factor"), Revenue = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Budget = c(6.25, 6.25, 
5.92, 6.25, 5.92, 6.25, 5.92, 5.92, 5.92, 6.25, 6.25)), .Names = c("Date", 
"Revenue", "Budget"), class = "data.frame", row.names = c(NA, 
-11L))

C:/path/folder/subfolder2

f2=structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "02.01.2018", class = "factor"), Revenue = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Budget = c(6.25, 5.92, 6.25, 
6.25, 5.92, 5.92, 5.92, 6.25, 5.92)), .Names = c("Date", "Revenue", 
"Budget"), class = "data.frame", row.names = c(NA, -9L))
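For reference, the target table can be built in memory from the two data frames. This is only a sketch of the desired result, not a way of reading the files: f1 and f2 are re-created compactly here (with Date as character rather than factor, which is an assumption made for brevity), and `wanted` is a hypothetical name.

```r
# Compact re-creation of f1 and f2 from the dput output above
# (assumption: Date kept as character instead of factor)
f1 <- data.frame(Date = "02.01.2018", Revenue = 0L,
                 Budget = c(6.25, 6.25, 5.92, 6.25, 5.92, 6.25,
                            5.92, 5.92, 5.92, 6.25, 6.25),
                 stringsAsFactors = FALSE)
f2 <- data.frame(Date = "02.01.2018", Revenue = 0L,
                 Budget = c(6.25, 5.92, 6.25, 6.25, 5.92, 5.92,
                            5.92, 6.25, 5.92),
                 stringsAsFactors = FALSE)

# Tag each data frame with its subfolder number, then stack them
f1$subfolder <- 1L
f2$subfolder <- 2L
wanted <- rbind(f1, f2)

# Row counts per subfolder
table(wanted$subfolder)
```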

1 answer:

Answer 0 (score: 1)

Assuming you have the following folder structure:

master
 |
 +-- folder1
     | 
     +-- file1.csv
     +-- file2.csv
 +-- folder2
     |
     +-- file1.csv
     +-- file2.csv

and your working directory is "master", then you can do the following:

# List the immediate subdirectories of master (the working directory)
dirs <- list.dirs(full.names = FALSE, recursive = FALSE)

# Create the empty data frame that will be filled
all_data <- data.frame(Date = character(),
                       Revenue = integer(),
                       Budget = numeric(),
                       dirname = character(),
                       stringsAsFactors = FALSE)

# Loop over directories
for (dirname in dirs) {
  # Get all csv files inside this directory, with their relative paths
  all_csv <- list.files(dirname, pattern = "[.]csv$", full.names = TRUE)

  # Loop over files in the directory
  for (file in all_csv) {
    # read.csv assumes comma-separated files with a header row; for
    # semicolon-separated files with decimal commas use read.csv2
    tempdata <- read.csv(file, stringsAsFactors = FALSE)
    # Record which subfolder these rows came from
    tempdata$dirname <- dirname
    all_data <- rbind(all_data, tempdata)
  }
}
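The question also asks for a single csv file as the end result, so the merged data frame still has to be written out. Below is a minimal sketch of a loop-free variant that does both. It is not part of the answer above: `read_tagged` is a hypothetical helper name, the files are assumed to be comma-separated with a header row, and the demo builds a throwaway folder tree under `tempdir()` so it can run anywhere.

```r
# Hypothetical helper: read every csv in the immediate subfolders of `root`
# and tag each row with the name of the subfolder its file came from
read_tagged <- function(root = ".") {
  subdirs <- list.dirs(root, full.names = TRUE, recursive = FALSE)
  files <- list.files(subdirs, pattern = "[.]csv$", full.names = TRUE)
  pieces <- lapply(files, function(f) {
    d <- read.csv(f, stringsAsFactors = FALSE)
    d$subfolder <- basename(dirname(f))
    d
  })
  # Bind once at the end instead of growing a data frame inside a loop
  do.call(rbind, pieces)
}

# Demo on a temporary tree with subfolders "1" and "2" (made-up contents)
root <- file.path(tempdir(), "master")
for (sub in c("1", "2")) {
  dir.create(file.path(root, sub), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(Date = "02.01.2018", Revenue = 0L, Budget = 6.25),
            file.path(root, sub, "file1.csv"), row.names = FALSE)
}

merged <- read_tagged(root)

# Write the combined data to one csv file next to the subfolders
write.csv(merged, file.path(root, "merged.csv"), row.names = FALSE)
```

Binding once with do.call(rbind, ...) avoids copying the accumulated data frame on every iteration, which can matter with many files. Because only the subfolders are scanned, merged.csv in the root is not picked up if the code is run again.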