Merge multiple Excel files starting at a specific row in R

Date: 2018-10-24 20:54:35

Tags: r excel merge

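I have multiple Excel files that I need to merge into a single file, but only from certain rows onward. The Excel files look like this...

All files have the same column headers. I also need to add a new column A to the newly generated file, so I created a separate Excel file that contains only the headers plus the new column A. My script first reads this file (below) and writes it to the workbook.

Next, I need to read each file starting at row 9 and append all the data, one file after another. So the final result should look like this (minus the "Member Site" column; I haven't attempted the logic for that yet, but I think it will be a substring of the "Specimen ID" value)...

However, my current result is...

I'm starting with just 3 files (a few dozen rows each), but the ultimate goal is to merge roughly 15-30 files (25 to 200 rows each). So...

1) I know my code is incorrect, but I'm not sure how to get the expected result. For one thing, my loop overwrites the data because it always starts writing at row/column 2. However, I can't think of how to rewrite it.

2) Dates come back in General format ("43008" instead of "9/30/2017").

3) Some column data ends up under the wrong column (e.g., "Nucleic Acid Concentration" contains values from the tissue date column).

Any suggestions or help would be greatly appreciated!

My code...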

library(openxlsx)   # Excel and csv files
library(svDialogs)   # Dialog boxes

setwd("C:/Users/Work/Combined Manifest")

# Create and load Excel file
wb <- createWorkbook()

# Add worksheet
addWorksheet(wb, "Template")

# Read in & write header file
df.headers <- read.xlsx("headers.xlsx", sheet = "Template")

writeData(wb, "Template", df.headers, colNames = TRUE)

# Function to get user path
getPath <- function() { 
  # Ask for path
  path <- dlgInput("Enter path to files: ", Sys.info()["user"])$res
  if (dir.exists(path)) {
    # If path exists, return it (the caller sets the working directory)
    return(path)
  } else {
    # If not, issue an error and recall the getPath function
    dlg_message("Error: The path you entered is not a valid directory. Please try again.")$res
    getPath()
  }
}

# Call getPath function
folder <- getPath()

setwd(folder)

# Get list of files in directory
pattern.ext <- "\\.xlsx$"
files <- dir(folder, full=TRUE, pattern=pattern.ext)

# Get basenames and remove extension 
files.nms <- basename(files)
files.nms <- gsub(pattern.ext, "", files.nms)

# Set the names
names(files) <- files.nms

# Iterate to read in files and write to new file
for (nm in files.nms) {

  # Read in files 
  df <- read.xlsx((files[nm]), sheet = "Template", startRow = 9, colNames = FALSE)

  # Write data to sheet
  writeData(wb, "Template", df, startCol = 2, startRow = 2, colNames = FALSE)
}

saveWorkbook(wb, "Combined.xlsx", overwrite = TRUE)

EDIT: With the loop below, I successfully read in the files and merged them. Thanks for all your help!

for (nm in files.nms) {

  # Read in files 
  df <- read.xlsx(files[nm], sheet = "Template", startRow = 8, colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE,
                  skipEmptyCols = FALSE)

  # Append the data
  allData <- rbind(allData, df)
}
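
On the date issue from point 2: Excel stores dates as day counts, and `detectDates = TRUE` in the loop above asks openxlsx to convert them while reading. As a minimal sketch of the same conversion done by hand in base R, assuming the workbook uses Excel's default 1900 date system (for which the usual R origin is "1899-12-30"):

```r
# Excel serial date, as returned when the column is read as a plain number
serial <- 43008

# Assumption: the workbook uses the 1900 date system, so the origin is 1899-12-30
converted <- as.Date(serial, origin = "1899-12-30")

# Show the date in R's default form and in month/day/year form
print(converted)
print(format(converted, "%m/%d/%Y"))
```

openxlsx also provides `convertToDate()` for this purpose, which may be more convenient when a column has already been read as numbers.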

EDIT: Final solution. Thanks, everyone, for your help!

library(openxlsx)   # Excel and csv files
library(svDialogs)   # Dialog boxes

# Create and load Excel file
wb <- createWorkbook()

# Add worksheet
addWorksheet(wb, "Template")

# Function to get user path
getPath <- function() { 
  # Ask for path
  path <- dlgInput("Enter path to files: ", Sys.info()["user"])$res
  if (dir.exists(path)) {
    # If path exists, return it (the caller sets the working directory)
    return(path)
  } else {
    # If not, issue an error and recall the getPath function
    dlg_message("Error: The path you entered is not a valid directory. Please try again.")$res
    getPath()
  }
}

# Call getPath function
folder <- getPath()

# Set working directory
setwd(folder)

# Get list of files in directory
pattern.ext <- "\\.xlsx$"
files <- dir(folder, full=TRUE, pattern=pattern.ext)

# Get basenames and remove extension 
files.nms <- basename(files)

# Set the names
names(files) <- files.nms

# Create empty dataframe
allData <- data.frame()

# Create list (reserve memory)
f.List <- vector("list",length(files.nms))

# Loop and load files
for (nm in 1:length(files.nms)) {

  # Read in files
  f.List[[nm]] <- read.xlsx(files[nm], sheet = "Template", startRow = 8, colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE,
                  skipEmptyCols = FALSE)
}

# Append the data
allData <- do.call("rbind", f.List)

# Add a new column as 'Member Site'
allData <- data.frame('Member Site' = "", allData)

# Take the substring of the Specimen.ID column for Member Site
allData$Member.Site <- sapply(strsplit(allData$Specimen.ID, "-"), "[", 2)

# Write data to sheet
writeData(wb, "Template", startCol = 1, allData)

# Save workbook
saveWorkbook(wb, "Combined.xlsx", overwrite = TRUE)
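
To make the "Member Site" step easier to check in isolation, here is a small sketch of the `strsplit` / `"["` idiom used above. The specimen IDs are made up for illustration; the real ID format is not shown in the question, so the position of the site code (second piece) is an assumption carried over from the code above.

```r
# Hypothetical specimen IDs; the last one has no "-" separator
specimen_id <- c("AB-101-0001", "AB-102-0002", "NOSITE")

# Split each ID on "-" and keep the second piece of each split
member_site <- sapply(strsplit(specimen_id, "-"), "[", 2)

# IDs without a second piece come back as NA rather than raising an error
print(member_site)
```

Because missing pieces silently become `NA`, it may be worth checking `anyNA(member_site)` before writing the workbook.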

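1 Answer:

Answer 0 (score: 2)

First, you provide a lot of information in your question, which is generally a good thing, but I wonder whether you could make the problem easier to solve by recreating it with fewer and smaller files. Can you first work out how to merge two files that each contain only a small amount of data?

Regarding the first challenge you raised:

1) Yes, you are overwriting the workbook on every loop iteration. I suggest you load the data and append it to a data.frame, then store the final result once all files have been loaded. See the example below. Note that this example uses rbind, which is inefficient if you are merging a large number of files. So if you have many files, you may need to use a different structure.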

# Create an empty data frame
allData <- data.frame()

# Loop and load files
for(nm in files.nms) {

    # Read in files 
    df <- read.xlsx((files[nm]), sheet = "Template", startRow = 9, colNames = FALSE)

    # Append the data
    allData <- rbind(allData, df)

}

# Write data to sheet
writeData(wb, "Template", allData, startCol = 2, startRow = 2, colNames = FALSE)
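
If you would rather keep writing each file straight into the workbook, the overwrite can also be avoided by advancing the start row by the number of rows already written. The following is only a rough base R sketch of that bookkeeping, using made-up row counts; in a real loop, `start_rows[i]` would be passed as `startRow` to `writeData` for the i-th file.

```r
# Hypothetical row counts for three input files
n_rows <- c(30, 45, 25)

# Row 1 holds the headers, so the first block of data starts at row 2
first_data_row <- 2

# Each block starts right after all the rows written before it
start_rows <- first_data_row + cumsum(c(0, head(n_rows, -1)))

print(start_rows)
```

Binding everything in memory and writing once is usually simpler, which is why the example above takes that route.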

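Hope this gets you closer to what you need!

EDIT: Updated answer to address the comments posted

With more files, rbind becomes slow; as @Parfait mentioned, this is because the data is copied over and over. The way to avoid this is to first reserve space in memory by creating an empty list with enough room for your data, then fill that list, and finally merge all the data at once using do.call("rbind", ...). I've put together some sample code below that is consistent with what you provided in your question.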

# Create list (reserve memory)
f.List <- vector("list",length(files.nms))

# Loop and load files
for(eNr in 1:length(files.nms)) {

    # Read in files (index the full paths in `files`, not the bare names in `files.nms`)
    f.List[[eNr]] <- read.xlsx(files[eNr], sheet = "Template", startRow = 9, colNames = FALSE)

}

# Append the data
allData <- do.call("rbind", f.List)
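
The same read-into-a-list-then-bind pattern can be written more compactly with `lapply`. The sketch below is an illustration rather than a drop-in replacement: it uses small CSV files in a temporary folder as a stand-in for the Excel files so that it runs without openxlsx, and the file and column names are invented. With real workbooks, `read.csv` would be swapped for `read.xlsx` with the arguments shown above.

```r
# Stand-in data: two small CSV files in a temporary folder
# (CSV is used only so the sketch runs without any Excel files)
tmp_dir <- tempfile("manifests")
dir.create(tmp_dir)
write.csv(data.frame(id = 1:2, value = c("a", "b")),
          file.path(tmp_dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c("c", "d")),
          file.path(tmp_dir, "file2.csv"), row.names = FALSE)

# List the files, read each one into a list element, then bind once
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
pieces <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, pieces)

print(combined)
```

`lapply` already returns a list of the right length, so there is no need to pre-allocate it by hand.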

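To illustrate this further, below is a small reproducible example. It uses only a couple of data frames, but it shows the process of creating the list, filling it, and, as the final step, merging the data.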

# Sample data
df1 <- data.frame(x=1:3, y=3:1)
df2 <- data.frame(y=4:6, x=3:1)
df.List <- list(df1,df2)

# Create list
d.List <- vector("list",length(df.List))

# Loop and add data
for(eNr in 1:length(df.List)) {
    d.List[[eNr]] <- df.List[[eNr]] 
}

# Bind all at once
dfAll <- do.call("rbind", d.List)
print(dfAll)
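
Note that `df1` and `df2` above list their columns in a different order, yet the result lines up, because `rbind` on data frames matches columns by name. The sketch below contrasts that with positional matching. This may be related to point 3 of the question: when files are read with `colNames = FALSE`, the columns only get positional names, so a file in which a column is dropped or shifted would bind under the wrong header. That connection is an assumption, since the input files are not shown.

```r
df1 <- data.frame(x = 1:3, y = 3:1)
df2 <- data.frame(y = 4:6, x = 3:1)   # same columns, different order

# Columns are matched by name, so x stays with x and y with y
by_name <- rbind(df1, df2)

# Renaming df2's columns to df1's names mimics purely positional matching
df2_pos <- setNames(df2, names(df1))
by_position <- rbind(df1, df2_pos)

print(by_name)
print(by_position)
```

In `by_position`, the values that belonged to `y` in `df2` land under `x`, which is the kind of misalignment described in the question.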

Hope this helps! Thanks!