Question

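I have several HTML files in a folder on my computer. I would like to read them into R while keeping as much of the original formatting as possible. They contain only text, by the way. I have tried two approaches, but both failed: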

## first approach
library(tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))

## second approach
list_files_path <- list.files(path = './gazzetes.presihtml')
a <- paste0(list_files_path, names) # vector names contain the names of the file with the .HTML extension
rawHTML <- readLines(a)

Any guesses? All the best.

Answer 1

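Your second approach is close to working. The problem is that readLines accepts only a single connection (or file path), but you are passing it a vector that contains multiple files. You can use lapply together with readLines to read each file in turn. Here is an example: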

# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')

# readLines for each file and put them in a list
lineList <- lapply(files, readLines)

# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)

# collapse the character vector into a single string
html <- paste(lineVector , collapse = '\n')

# print the string with original formatting
cat(html)
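
If the file paths should come from the folder itself rather than being typed by hand, the same idea can be sketched as below. This is only a minimal sketch under some assumptions: read_html_folder is a hypothetical helper name (it is not part of base R or tm), the files are assumed to end in .html or .htm, and each file is kept as its own string instead of being merged into one. Passing full.names = TRUE to list.files returns paths that readLines can open directly, which avoids pasting folder and file names together by hand.

```r
# Sketch: read every HTML file in a folder, one string per file.
# read_html_folder is a hypothetical helper, not a base R or tm function.
read_html_folder <- function(dir) {
  # full.names = TRUE gives paths that readLines can open directly
  files <- list.files(path = dir, pattern = "\\.html?$",
                      full.names = TRUE, ignore.case = TRUE)

  # read each file and collapse its lines into a single string
  pages <- vapply(files, function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1))

  # label each element with its file name
  names(pages) <- basename(files)
  pages
}

# Usage (folder name taken from the question; adjust as needed):
# pages <- read_html_folder("./gazzetes.presihtml")
# cat(pages[[1]])
```

Keeping one element per file makes it possible to inspect or process a single document later; unlist and paste can still be applied afterwards if one combined string is preferred.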
