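I have several HTML files in a folder on my computer. I want to read them into R while preserving as much of the original formatting as possible. The files contain only text, by the way. I have tried two approaches, but both failed: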
## first approach
library(tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))
## second approach
list_files_path <- list.files(path = './gazzetes.presihtml')
a <- paste0(list_files_path, names) # the vector `names` contains the names of the files with the .HTML extension
rawHTML <- readLines(a)
Any guesses? All the best.
Answer 0 (score: 1)
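Your second approach is close to working, except that readLines accepts only a single connection (or file path), whereas you are giving it a vector containing multiple files. You can use lapply together with readLines to read each file in turn. Here is an example: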
# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')
# readLines for each file and put them in a list
lineList <- lapply(files, readLines)
# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)
# collapse the character vector into a single string
html <- paste(lineVector, collapse = '\n')
# print the string with original formatting
cat(html)
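If you would rather keep each file as its own string instead of merging everything, a variation along the following lines may help. It is only a sketch under stated assumptions: the folder `html_demo` and the files `a.html` and `b.html` are hypothetical stand-ins created in a temporary directory so the example runs on its own, and you would point `html_dir` at your real folder instead. Note that `list.files` returns bare file names by default, so `full.names = TRUE` is used here to get paths that `readLines` can open regardless of the working directory.

```r
# Hypothetical demo folder with two small files; replace html_dir with your own folder.
html_dir <- file.path(tempdir(), "html_demo")
dir.create(html_dir, showWarnings = FALSE)
writeLines(c("<html>", "<p>first</p>", "</html>"), file.path(html_dir, "a.html"))
writeLines(c("<html>", "<p>second</p>", "</html>"), file.path(html_dir, "b.html"))

# full.names = TRUE returns paths that include the folder;
# the pattern keeps only .htm/.html files, ignoring case (so .HTML also matches)
files <- list.files(html_dir, pattern = "\\.html?$",
                    full.names = TRUE, ignore.case = TRUE)

# read each file separately and collapse its lines into one string per file
docs <- vapply(
  files,
  function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
  character(1)
)

# label each element with its file name rather than the full path
names(docs) <- basename(files)

# print one document with its original line breaks
cat(docs[["a.html"]])
```

Keeping one element per file means the line breaks inside each document are preserved and you can still tell which text came from which file.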