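Read a directory of txt files line by line into an R data frame, with the file name as a column

Time: 2017-04-23 15:31:09

Tags: r dplyr tidyverse readr

I have a directory of text files. I want to read the contents of these text files line by line into an R data frame. The text files contain unstructured text. The desired data frame output is: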

file; line
1.txt; "line 1 in 1.txt"
1.txt; "line 2 in 1.txt"
2.txt; "line 1 in 2.txt"
...

I have written the code below, but it results in an error. I also suspect there is a simpler way, for example with readr or dplyr.

files <- list.files(path="./data", pattern = "*.txt", full.names = TRUE) # read data folder txt files

my_lines <-list() # create temp list for reading lines
df <- data_frame( "file" = character(0), "line" = character(0))

for (file in files){
    my_lines <- readLines(file) # read lines from file into a list
    for (line in my_lines){
        df$file<-file
        df$fline<-line
    }
}

2 answers:

Answer 0 (score: 1)

A simple (but inefficient) solution is:

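# pattern is a regular expression, not a glob: "\\.txt$" matches names ending in ".txt"
files <- list.files(path="./data", pattern = "\\.txt$", full.names = TRUE)
fls <- NULL  # file name for each line
lns <- NULL  # text of each line
for (file in files) {
  my_lines <- readLines(file)
  for (line in my_lines) {
    # append one element per line (this repeated copying is the inefficient part)
    fls <- c(fls, file)
    lns <- c(lns, line)
  }
}
df <- data.frame(file=fls, fline=lns)
print(df)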

   file          fline
1 1.txt line1_in_1.txt
2 1.txt line2_in_1.txt
3 2.txt line1_in_2.txt
4 2.txt line2_in_2.txt
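
The loop above is slow on large inputs mainly because each c() call copies the whole vector it extends. A minimal sketch of one way to avoid that, assuming it is acceptable to build one small data frame per file and bind them once at the end. The temporary directory and demo files are hypothetical stand-ins for ./data, and the file column holds the paths exactly as list.files() returns them:

```r
# Hypothetical demo data: two small text files in a temporary directory.
data_dir <- file.path(tempdir(), "txt_demo_base")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("line 1 in 1.txt", "line 2 in 1.txt"), file.path(data_dir, "1.txt"))
writeLines(c("line 1 in 2.txt", "line 2 in 2.txt"), file.path(data_dir, "2.txt"))

# "\\.txt$" is a regular expression matching names that end in ".txt".
files <- list.files(path = data_dir, pattern = "\\.txt$", full.names = TRUE)

# Build one data frame per file, repeating the file path once per line.
per_file <- lapply(files, function(f) {
  lines <- readLines(f)
  data.frame(file = rep(f, length(lines)), fline = lines,
             stringsAsFactors = FALSE)
})

# Bind all per-file data frames together in a single step.
df <- do.call(rbind, per_file)
print(df)
```

Each file is read once and nothing is grown element by element, so the work stays roughly proportional to the total number of lines.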

Answer 1 (score: 1)

An alternative solution without loops:

> file = list.files(path="C:/...", pattern = "*.txt",full.names=T)
> line = lapply(file,readLines) 
> file = unlist(mapply(rep,file,sapply(line,length),SIMPLIFY=FALSE,USE.NAMES=FALSE))
> df=data.frame(file=file,line=unlist(line))

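Setting full.names to TRUE produces very long file names. If you set the working directory beforehand, the path and full.names arguments of list.files() are not needed, and your data frame will contain only the actual file names without their paths.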
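
Since the question asks about readr and dplyr, here is a hedged sketch along those lines. It assumes the readr, tibble and dplyr packages are installed, and it uses basename() to keep only the short file names without changing the working directory. The temporary directory and demo files are hypothetical stand-ins for the real data folder:

```r
library(readr)   # read_lines()
library(tibble)  # tibble()
library(dplyr)   # bind_rows()

# Hypothetical demo data: two small text files in a temporary directory.
data_dir <- file.path(tempdir(), "txt_demo_tidy")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("line 1 in 1.txt", "line 2 in 1.txt"), file.path(data_dir, "1.txt"))
writeLines(c("line 1 in 2.txt", "line 2 in 2.txt"), file.path(data_dir, "2.txt"))

# Full paths are needed to open the files from any working directory.
files <- list.files(path = data_dir, pattern = "\\.txt$", full.names = TRUE)

# One tibble per file: basename() strips the directory part, and the
# single file name is recycled to match the number of lines read.
df <- bind_rows(lapply(files, function(f) {
  tibble(file = basename(f), line = read_lines(f))
}))
print(df)
```

Keeping full.names = TRUE for reading and applying basename() only to the stored column avoids having to call setwd() first.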