Reading multiple local HTML files from a folder in R

Time: 2017-08-02 18:56:05

Tags: html r tm

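I have several HTML files in a folder on my computer. I would like to read them into R while preserving as much of the original formatting as possible. Only the text matters, by the way. I have tried two approaches, but both failed: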

## first approach
library(tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))

## second approach
list_files_path <- list.files(path = './gazzetes.presihtml')
a <- paste0(list_files_path, names) # the vector `names` contains the file names with the .HTML extension
rawHTML <- readLines(a)

Any ideas? All the best.

1 answer:

Answer 0 (score: 1)

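Your second approach is close to working, except that readLines accepts only a single connection, whereas you are passing it a vector containing multiple files. You can use lapply together with readLines to get around this. Here is an example: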

# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')

# readLines for each file and put them in a list
lineList <- lapply(files, readLines)

# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)

# collapse the character vector into a single string
html <- paste(lineVector, collapse = '\n')

# print the string with original formatting
cat(html)
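
If you would rather not type the paths by hand, the file vector can be built from the folder itself. The following is only a minimal sketch, not part of the original answer: `read_html_dir` is a hypothetical helper name, and the temporary folder with its two small files exists only to make the example self-contained. It relies on `full.names = TRUE`, because without it `list.files` returns bare file names that `readLines` cannot open unless the working directory happens to be that folder.

```r
# Hypothetical helper (not from the original answer): read every .html/.htm
# file in `dir` and return a named list with one string per file.
read_html_dir <- function(dir) {
  # full.names = TRUE yields paths that readLines can open directly
  files <- list.files(path = dir, pattern = "\\.html?$",
                      full.names = TRUE, ignore.case = TRUE)
  pages <- lapply(files, function(f) {
    # collapse the lines of one file into a single string
    paste(readLines(f, warn = FALSE), collapse = "\n")
  })
  names(pages) <- basename(files)
  pages
}

# Demonstration with a temporary folder (an assumption made for this example)
demo_dir <- file.path(tempdir(), "html_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("<html>", "<p>first</p>", "</html>"), file.path(demo_dir, "a.html"))
writeLines(c("<html>", "<p>second</p>", "</html>"), file.path(demo_dir, "b.html"))

pages <- read_html_dir(demo_dir)

# print one file with its original line breaks
cat(pages[["a.html"]])
```

Keeping one element per file, rather than a single collapsed string, makes it easier to tell afterwards which text came from which document.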