Question

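I have 537 .txt files that I need to import into R, either as a list or as separate data frames. I do not want to append any data, because it is critical to keep everything separate.

I have renamed each file so that the file names are uniform. Each file starts with a header section containing a lot of miscellaneous information. This header section is 12-16 lines long, depending on the file. For the data, I have 5 to 7 columns. The data is all tab-delimited. The number of columns varies between 5 and 9, and the columns are not always in the same order, so it is very important that I can import the data with its column names (the column names are uniform across files). The files are formatted as follows: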

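Header

Header... up to 16 lines

((the amount of blank space between the header and the column names varies))

Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4

((no blank line between the column names and the units))

mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units

((1 blank line between the units and the data))

1/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6

((data repeats for up to 4000 rows))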

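To recap what I need: import all the files into separate data frames or a list of data frames. Skip the header information down to the line that starts with "Date" (and possibly drop the two lines after it, the units and the blank line), leaving one row of column names followed by the data.

Here is a rough copy of the code I have been working with. The idea is that, after importing all the files into R, I determine the maximum value of 1-2 columns in each file. Then I export a single file with 1 row per input file and 2 columns holding the 2 maxima for that file.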

##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()

##Null list for final data to be extracted to
results <- NULL

##add names to results list (file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)

##loop to read in data files and calculate max 
for (i in seq_along(path)) {
   ##read files
   files[[i]] <- read.delim(path[[i]], header = FALSE, sep = "\t", skip = 18)

   ##will have to add code:
     ##"if columnx exists do this; if columny exists do this"
   ##convert 2 columns for calculation to numeric 
   x.x <- as.numeric(as.character(files[[i]]$columnx))
   x.y <- as.numeric(as.character(files[[i]]$columny))

   ##will have to add code: 
     ##"if column x exists, do this....if not, "NA"
   ##get max value for 2 specific columns (one entry per file)
   results$max.x[i] <- max(x.x)
   results$max.y[i] <- max(x.y)
}

##add results to data frame (named so it does not mask base::max)
max_results <- data.frame(results)

##export to .csv
write.csv(max_results, file = "PATH")

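I now realize that my code simply skips everything down into the data (the maxima do not occur until late in each file, so skipping 1 or 2 data rows does not hurt me), and that it assumes the columns are in the same order in every file. This is terrible practice and gave bad results for about 5% of my data points, so I want to do it properly. My main concern is getting the data into R in a usable format. After that, I can add the other calculations and transformations. I am new to R, and after 2 days of searching I have not found anything already posted on a forum that helps.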

Answer 1

Assuming the file follows the structure Line \n Line \n Data, we can use grep to find the line number of the row containing "mm/dd/yyyy".

So:

system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"

We can then strsplit the output on ":" to get the number in front of the first colon, and use that number as the value for skip.

as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
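
For readers who would rather not shell out to grep (for example on a system without it), the same line number can be found from inside R with readLines and grep. This is only a minimal sketch: the sample file contents and the names `sample_file`, `top`, and `units_line` are made up for illustration.

```r
# Write a small sample file that mimics the layout described in the question.
sample_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Header line 1",
  "Header line 2",
  "",
  "Date\tTime\tdataCol1\tdataCol2",
  "mm/dd/yyyy\thh:mm:ss\tunits\tunits",
  "",
  "1/31/2016\t14:32:02\t14.9\t25.3"
), sample_file)

# Read the top of the file and locate the units row by its leading text.
top <- readLines(sample_file, n = 20)
units_line <- grep("^mm/dd/yyyy", top)
units_line
```

For this sample file the units row is the fifth line, so `units_line` is 5; that number plays the same role as the 6 returned by the system call above.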

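Because there is one blank line between the row we matched and the actual data, we can add 1 to that number and start reading the data from there.

So: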

##list files and create list for data.frames
path <- list.files("Path",pattern = NULL, all.files=FALSE,full.names=TRUE)
files <- list()

##Null list for final data to be extracted to
results <- NULL

##add names to results list (file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)

##loop to read in data files and calculate max 
for (i in seq_along(path)) {
  ##read files
  # Calculate the number of rows to skip.
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("^mm/dd/yy", header)
  # Add one to also skip the blank line after the units row
  skip <- skip + 1
  files[[i]] <- read.delim(path[[i]],
                           header = FALSE,
                           sep = "\t",
                           skip = skip)

  ##will have to add code:
  ##"if columnx exists do this; if columny exists do this"
  ##convert 2 columns for calculation to numeric 
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))

  ##will have to add code: 
  ##"if column x exists, do this....if not, "NA"
  ##get max value for 2 specific columns (one entry per file)
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}

##add results to data frame (named so it does not mask base::max)
max_results <- data.frame(results)

##export to .csv
write.csv(max_results, file = "PATH")

I think that covers everything.
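
The code above reads the data with header = FALSE, so the columns come in as V1, V2, and so on. Since the question asks for the column names to be kept, one possible extension is to take the names from the "Date" row and pass them to read.delim, then guard the max calculation against columns that a given file lacks. This is a sketch under assumptions, not part of the original answer: the helper names `read_logger_file` and `col_max`, the `n_header` argument, and the demo file are all hypothetical.

```r
# Hypothetical helper: read one file, taking column names from the "Date" row.
read_logger_file <- function(file, n_header = 30) {
  top <- readLines(file, n = n_header)
  name_line <- grep("^Date\t", top)[1]   # row holding the column names
  if (is.na(name_line)) stop("no 'Date' row found in ", file)
  col_names <- strsplit(top[name_line], "\t")[[1]]
  # Skip everything through the units row; read.delim drops blank lines by default.
  read.delim(file, header = FALSE, sep = "\t", skip = name_line + 1,
             col.names = col_names, stringsAsFactors = FALSE)
}

# Hypothetical helper: max of a column, or NA when the column is absent.
col_max <- function(df, col) {
  if (col %in% names(df)) max(as.numeric(df[[col]]), na.rm = TRUE) else NA_real_
}

# Demo on a small made-up file with the layout from the question.
demo <- tempfile(fileext = ".txt")
writeLines(c(
  "Header line 1", "Header line 2", "",
  "Date\tTime\tdataCol1\tdataCol2",
  "mm/dd/yyyy\thh:mm:ss\tunits\tunits",
  "",
  "1/31/2016\t14:32:02\t14.9\t25.3",
  "1/31/2016\t14:33:02\t15.8\t25.6"
), demo)

dat <- read_logger_file(demo)
col_max(dat, "dataCol1")   # largest value in dataCol1
col_max(dat, "dataCol9")   # NA, because this file has no such column
```

Inside the loop, `files[[i]] <- read_logger_file(path[[i]])` would then give each data frame its real column names, so column order no longer matters when picking out the columns to summarize.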

Answer 2

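I wanted to add this here in case anyone else runs into a similar problem. @TJGorrie's solution helped me solve a slightly different challenge. I had several .rad files that needed to be read in, tagged, and merged. The header row of each .rad file started on a different line, so I needed a way to find the line containing the header. Apart from creating the tag column, I did not need to perform any other calculations. Hopefully this helps someone in the future, and thanks to @TJGorrie for the excellent answer!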

##list files and create list for data.frames
path <- list.files(pattern = "\\.rad$")  # pattern is a regular expression, not a glob
files <- list()

##loop to read in data files 
for (i in seq_along(path)) {

  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("Sample", header)

  # Subtract one row to keep the row with "Sample" in it as the header
  skip <- skip - 1

  files[[i]] <- read.table(path[[i]],
                           header = TRUE,
                           fill = TRUE,
                           skip = skip,
                           stringsAsFactors = FALSE)

  # Name each list element after the original file, minus the extension.
  names(files)[i] <- sub("\\.rad$", "", path[i])

  files[[i]] <- na.omit(as.data.frame(files[[i]]))

  # Create new column that includes the file name to act as a tag
  # when the dfs get merged through rbind
  files[[i]]$Tag <- names(files)[i]
}

# bind all the dfs in the list into a single df (once, after the loop)
df <- do.call("rbind",
              c(files, make.row.names = FALSE))

##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)
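
The same read, tag, and bind steps can also be written without an explicit loop. The following is only a sketch on made-up data: the helper name `read_tagged`, the demo directory, and the two tiny whitespace-delimited files are assumptions for illustration, not the poster's actual .rad files.

```r
# Hypothetical helper: read one file from its "Sample" header row and tag it.
read_tagged <- function(file) {
  top <- readLines(file, n = 20)
  header_line <- grep("Sample", top)[1]   # first line mentioning "Sample"
  df <- read.table(file, header = TRUE, fill = TRUE,
                   skip = header_line - 1, stringsAsFactors = FALSE)
  df$Tag <- sub("\\.rad$", "", basename(file))  # file name without extension
  df
}

# Two made-up files whose header rows start on different lines.
demo_dir <- tempfile("rad_demo")
dir.create(demo_dir)
writeLines(c("junk", "Sample Value", "a 1", "b 2"),
           file.path(demo_dir, "first.rad"))
writeLines(c("junk", "more junk", "Sample Value", "c 3"),
           file.path(demo_dir, "second.rad"))

rad_files <- list.files(demo_dir, pattern = "\\.rad$", full.names = TRUE)

# Read every file, then bind the tagged data frames together once.
combined <- do.call(rbind, lapply(rad_files, read_tagged))
combined
```

Binding once after all files are read avoids rebuilding the combined data frame on every iteration, which matters more as the number of files grows.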

Importing multiple .txt files into R and skipping to the actual data rows
