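I am using R to clean HTML files stored on my hard drive and then export them as txt files. However, in the output text files I see a lot of strange characters, such as <U+0093>, <U+0094>, <U+0093>. It looks to me as though quotation marks or bullet points (or something else) are not being parsed or displayed correctly. How can I fix this?
Here is the code I have been using: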
library(bitops)
library(RCurl)
library(XML)
rawHTML <- paste(readLines("2488-R20130221-C20121229-F22-0-1.htm"), collapse="\n")
doc = htmlParse(rawHTML, asText=TRUE, encoding="UTF-8")
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
write.table(plain.text, file="2488.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)
Answer 0 (score: 0)
If you only need the text, you can use iconv to convert it to ASCII. You also do not need write.table, since writeLines will do just fine:
library(bitops)
library(RCurl)
library(XML)
rawHTML <- paste(readLines("~/Dropbox/2488-R20130221-C20121229-F22-0-1.htm"), collapse="\n")
doc <- htmlParse(rawHTML, asText=TRUE, encoding="UTF-8")
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
writeLines(iconv(plain.text, to="ASCII"), "~/Dropbox/2488wl.txt")
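One caveat with the call above: by default, iconv returns NA for any element containing a character that cannot be represented in the target encoding, so lines with curly quotes may be written out as NA. Below is a minimal sketch of the sub argument, which substitutes a replacement string instead. The sample vector x is made up for illustration and is not taken from the original file.

```r
# Hypothetical sample: one element with curly quotes, one plain ASCII element.
x <- c("\u201cquoted\u201d text", "plain text")

# Default behaviour: an element that cannot be converted becomes NA.
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "", the non-convertible bytes are dropped instead.
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(strict)
print(dropped)
```

Whether dropping the characters is acceptable depends on what the text is used for; sub also accepts other replacement strings, such as "?".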
You can also use rvest (you will still need iconv):
library(xml2)
library(rvest)
# read_html() replaces html(), which is deprecated in newer rvest releases
pg <- read_html("~/Dropbox/2488-R20130221-C20121229-F22-0-1.htm")
target <- "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]"
pg %>%
  html_nodes(xpath=target) %>%
  html_text() %>%
  iconv(to="ASCII") %>%
  writeLines("~/Dropbox/2488rv.txt")
If you prefer, you can also avoid the pipe:
converted <- iconv(html_text(html_nodes(pg, xpath=target)), to="ASCII")
writeLines(converted, "~/Dropbox/2488rv.txt")
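As for where the odd characters may come from: U+0093 and U+0094 are C1 control code points, and the bytes 0x93 and 0x94 are the left and right curly double quotes in Windows-1252. One possibility, which is only an assumption about this particular file, is that it was saved as Windows-1252 rather than UTF-8. Below is a minimal sketch of that idea on made-up bytes. The encoding name "CP1252" is accepted by common iconv implementations, but the set of supported names is platform-dependent.

```r
# Made-up input: the bytes 0x93 h i 0x94, i.e. curly-quoted "hi" in Windows-1252.
raw_bytes <- as.raw(c(0x93, 0x68, 0x69, 0x94))
x <- rawToChar(raw_bytes)

# Declare the assumed source encoding so the quotes become proper characters.
fixed <- iconv(x, from = "CP1252", to = "UTF-8")

# Show the resulting code points; the quotes should map to U+201C and U+201D.
print(utf8ToInt(fixed))
```

If that assumption holds for the file, passing encoding="windows-1252" to htmlParse (or to read_html) instead of "UTF-8" may be worth trying before falling back to ASCII conversion.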