Question

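I'm using R to clean HTML files stored on my hard drive and then export them as txt files. However, in the output text files I see a lot of weird characters such as <U+0093>, <U+0094>, <U+0093>. It looks to me like quotation marks or bullet points (or maybe something else) are not being parsed or displayed correctly. How do I fix this?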

Here is the original HTML file.

Here is the code I have been using:

library(bitops)
library(RCurl)
library(XML)
rawHTML <- paste(readLines("2488-R20130221-C20121229-F22-0-1.htm"), collapse="\n") 
doc = htmlParse(rawHTML, asText=TRUE, encoding="UTF-8")
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
write.table(plain.text, file="2488.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

Answer 1

If you only need the text, you can use iconv to convert it to ASCII. You also don't need write.table, since writeLines will do the job just fine:

library(bitops)
library(RCurl)
library(XML)

rawHTML <- paste(readLines("~/Dropbox/2488-R20130221-C20121229-F22-0-1.htm"), collapse="\n") 
doc <- htmlParse(rawHTML, asText=TRUE, encoding="UTF-8")
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
writeLines(iconv(plain.text, to="ASCII"), "~/Dropbox/2488wl.txt")
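
One thing worth checking with the approach above is what iconv does with characters that have no ASCII equivalent. The following is a minimal sketch, not part of the original answer: the sample strings are made up, and the exact behaviour depends on the platform's iconv implementation. In R's documented default (sub = NA), an element containing non-convertible bytes comes back as NA, which writeLines would then write out as the text "NA". Passing sub = "" asks iconv to drop those bytes instead.

```r
# Hypothetical sample data: one plain string, one with curly quotes (U+201C, U+201D)
x <- c("plain text", "\u201cquoted\u201d")

# Default sub = NA: an element with non-convertible bytes is typically returned as NA
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# sub = "": non-convertible bytes are replaced with the empty string,
# so the rest of the element is kept
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(strict)
print(dropped)
```

If the second form suits your data, the same sub = "" argument can be added to the iconv call before writeLines.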

You can also use rvest (you will still need iconv):

library(xml2)
library(rvest)

pg <- read_html("~/Dropbox/2488-R20130221-C20121229-F22-0-1.htm")

target <- "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]"

pg %>% 
  html_nodes(xpath=target) %>% 
  html_text() %>% 
  iconv(to="ASCII") %>% 
  writeLines("~/Dropbox/2488rv.txt")

If you prefer, you can also avoid the pipes:

converted <- iconv(html_text(html_nodes(pg, xpath=target)), to="ASCII")
writeLines(converted, "~/Dropbox/2488rv.txt")
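
If you would rather keep the quotation marks than lose them in the ASCII conversion, one option is to map the stray characters to plain ASCII quotes first. This is only a sketch under an assumption the question does not confirm: that the <U+0091> to <U+0094> code points are Windows-1252 curly quotes (bytes 0x91 to 0x94) that ended up as C1 control characters. The helper name fix_c1_quotes and the sample string are hypothetical.

```r
# Hypothetical helper: replace C1 code points U+0091..U+0094 with ASCII quotes.
# Assumes they stand for the Windows-1252 curly quotes at bytes 0x91..0x94.
fix_c1_quotes <- function(s) {
  bad  <- c("\u0091", "\u0092", "\u0093", "\u0094")
  good <- c("'", "'", "\"", "\"")
  for (i in seq_along(bad)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    s <- gsub(bad[i], good[i], s, fixed = TRUE)
  }
  s
}

# Made-up sample resembling the text described in the question
sample_text <- "He said \u0093hello\u0094 and left"
cleaned <- fix_c1_quotes(sample_text)
print(cleaned)
```

Applying such a helper to the extracted text before iconv would leave ordinary quotes in the output file.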

R HTML cleaning - how to get rid of weird characters in the output?
