I am trying to parse a Hebrew HTML web page and am running into problems when using the RCurl tools.
I used the following R code:
library(XML)
library(RCurl)
url_get<-"http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
download.file(url_get, "codes/tmp.html")
txt <- readLines("codes/tmp.html", encoding="UTF-8")
pagetree <- htmlParse(txt, useInternalNodes = TRUE, encoding="UTF-8")
readLines() produces the correct Hebrew (בעלי מקצוע):
txt[345]
[1] "<a id=\"professionals\" href=\"/texts/midrag.asp?parameter=\" target=\"_blank\" title=\"בעלי מקצוע\">"
htmlParse(), however, garbles the Hebrew text:
<a href="http://shlah.agora.co.il/financial/financial1.html">׳׳¦׳׳× ׳׳”׳׳™׳ ׳•׳¡</a><br><br><span class="linkWords">׳׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“ -
Any ideas?
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 LC_MONETARY=Hebrew_Israel.1255
[4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.3 bitops_1.0-6 XML_3.98-1.1
loaded via a namespace (and not attached):
[1] tools_3.1.1
Answer 0 (score: 3)
I can't reproduce your problem. Here are the steps I took:
First, try a very simple HTML5 document:
library(XML)
# This is the simplest valid HTML-5
# http://www.brucelawson.co.uk/2010/a-minimal-html5-document/
hebrew1 <- "
<!doctype html>
<title>בעלי מקצו</title>
"
htmlParse(hebrew1) # NOT OK
#> <!DOCTYPE html>
#> <html><head><title>××¢×× ×קצ×</title></head></html>
#>
htmlParse(hebrew1, encoding = "UTF-8") # OK
#> <!DOCTYPE html>
#> <html><head><title>בעלי מקצו</title></head></html>
#>
hebrew2 <- "
<!doctype html>
<meta charset=utf-8>
<title>בעלי מקצו</title>
"
htmlParse(hebrew2) # OK
#> <!DOCTYPE html>
#> <html><head>
#> <meta charset="utf-8">
#> <title>בעלי מקצו</title>
#> </head></html>
#>
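The garbled title in the first call looks like what you get when UTF-8 bytes are decoded one byte at a time as a single-byte encoding. Here is a minimal sketch of that effect, assuming this is what happens when no encoding is declared. The variable names are only for illustration, and `latin1` is a stand-in for whichever single-byte encoding is actually picked on a given machine:

```r
# Hebrew sample written with \u escapes so the script itself is encoding-safe
x <- "\u05d1\u05e2\u05dc\u05d9"

# The UTF-8 representation uses two bytes per Hebrew letter
utf8_bytes <- charToRaw(enc2utf8(x))

# Treat those bytes as latin1 and convert them "to" UTF-8: this yields mojibake
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
nchar(garbled)  # one character per original byte

# Undo the damage: convert back to the single-byte form, then relabel as UTF-8
restored <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(restored) <- "UTF-8"
restored == x
```

If the round trip recovers the original text, the bytes were never lost, only mislabeled, which is why passing `encoding = "UTF-8"` is enough to fix the output.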
Try it directly from the URL:
url <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
html <- htmlParse(url, encoding = "UTF-8")
XML::getNodeSet(html, "//a")[[1]]
#> <a href="/signIn.asp?source=signIn">התחבר/י</a>
Load from disk:
tmp <- tempfile()
download.file(url, tmp)
html <- htmlParse(tmp, encoding = "UTF-8")
XML::getNodeSet(html, "//a")[[1]]
#> <a href="/signIn.asp?source=signIn">התחבר/י</a>
Load from lines:
lines <- readLines(tmp)
html <- htmlParse(lines, encoding = "UTF-8")
XML::getNodeSet(html, "//a")[[1]]
#> <a href="/signIn.asp?source=signIn">התחבר/י</a>
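If the problem still shows up on a machine with a non-UTF-8 locale (such as the Hebrew_Israel.1255 locale in the question), one thing that might be worth trying is to bypass any locale-dependent text reading: read the file as raw bytes, label the result as UTF-8, and hand it to htmlParse() as text. This is only a sketch under that assumption, using a small hypothetical sample file instead of the real page, and I have not verified that it cures the original symptom:

```r
library(XML)

# Hypothetical sample page, written to disk as UTF-8 bytes
expected  <- "\u05d1\u05e2\u05dc\u05d9"
html_text <- paste0("<!doctype html>\n<meta charset=utf-8>\n<title>",
                    expected, "</title>\n")
tmp <- tempfile(fileext = ".html")
writeLines(enc2utf8(html_text), tmp, useBytes = TRUE)

# Read the file as raw bytes so no locale conversion can take place
raw_bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
txt <- rawToChar(raw_bytes)
Encoding(txt) <- "UTF-8"   # label the bytes, do not convert them

# Parse the string itself (asText = TRUE) with an explicit encoding
doc   <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")
title <- xpathSApply(doc, "//title", xmlValue)
title == expected
```

The point of the design is that every step either moves bytes unchanged or states the encoding explicitly, so the locale never gets a chance to reinterpret the text.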