R 3.1.1 (32-bit): htmlParse() garbles Hebrew text, OS: Win 7

Time: 2014-08-22 11:39:41

Tags: r encoding html-parsing hebrew rcurl

I am trying to parse a Hebrew HTML web page and ran into a problem while using the RCurl tools.

I used the following R code:

    library(XML)
    library(RCurl)
    url_get <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
    download.file(url_get, "codes/tmp.html")
    txt <- readLines("codes/tmp.html", encoding = "UTF-8")
    pagetree <- htmlParse(txt, useInternalNodes = TRUE, encoding = "UTF-8")

readLines() produces correct Hebrew (בעלי מקצוע):

    txt[345]
    [1] "<a id=\"professionals\" href=\"/texts/midrag.asp?parameter=\" target=\"_blank\" title=\"בעלי מקצוע\">"

htmlParse(), however, garbles it:

    <a href="http://shlah.agora.co.il/financial/financial1.html">׳׳¦׳׳× ׳׳”׳׳™׳ ׳•׳¡</a><br><br><span class="linkWords">׳׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“ -

Any ideas?

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255    LC_MONETARY=Hebrew_Israel.1255
[4] LC_NUMERIC=C                   LC_TIME=Hebrew_Israel.1255    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.3 bitops_1.0-6   XML_3.98-1.1  

loaded via a namespace (and not attached):
[1] tools_3.1.1
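
The garbled output above is consistent with UTF-8 bytes being decoded as Windows-1255, the code page shown in the locale. The following is a minimal sketch of that effect, not a diagnosis of the XML package itself. It assumes the platform's `iconv` recognises the name "CP1255", and the sample word was chosen for illustration.

```r
# Assumption: iconv on this platform supports "CP1255" (Windows Hebrew).
# The sample word is illustrative; it is yod + dalet, written with escapes
# so the source file's own encoding does not matter.
word <- "\u05d9\u05d3"

# The UTF-8 encoding of the word: two bytes per Hebrew letter.
utf8_bytes <- charToRaw(enc2utf8(word))

# Decode those same bytes as if they were CP1255, then convert to UTF-8.
# Each letter turns into a geresh followed by a punctuation-like character,
# the same kind of pattern as in the htmlParse() output.
misread <- iconv(rawToChar(utf8_bytes), from = "CP1255", to = "UTF-8")

# One Hebrew letter became two characters.
nchar(word)
nchar(misread)
```

If the character count doubles like this, the text was most likely valid UTF-8 that got reinterpreted in the native code page somewhere along the way.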

1 Answer:

Answer 0 (score: 3)

I can't reproduce your problem. Here are the steps I took:

  1. First, try a very simple HTML5 document:

    library(XML)
    
    # This is the simplest valid HTML-5
    # http://www.brucelawson.co.uk/2010/a-minimal-html5-document/
    hebrew1 <- "
      <!doctype html>
      <title>בעלי מקצו</title>
    "
    
    htmlParse(hebrew1) # NOT OK
    #> <!DOCTYPE html>
    #> <html><head><title>××¢×× ×קצ×</title></head></html>
    #> 
    htmlParse(hebrew1, encoding = "UTF-8") # OK
    #> <!DOCTYPE html>
    #> <html><head><title>בעלי מקצו</title></head></html>
    #> 
    
    hebrew2 <- "
      <!doctype html>
      <meta charset=utf-8>
      <title>בעלי מקצו</title>
    "
    htmlParse(hebrew2) # OK
    #> <!DOCTYPE html>
    #> <html><head>
    #> <meta charset="utf-8">
    #> <title>בעלי מקצו</title>
    #> </head></html>
    #> 
    
  2. Try directly from the URL:

    url <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
    html <- htmlParse(url, encoding = "UTF-8")
    XML::getNodeSet(html, "//a")[[1]]
    #> <a href="/signIn.asp?source=signIn">התחבר/י</a>
    
  3. Load from disk:

    tmp <- tempfile()
    download.file(url, tmp)
    html <- htmlParse(tmp, encoding = "UTF-8")
    XML::getNodeSet(html, "//a")[[1]]
    #> <a href="/signIn.asp?source=signIn">התחבר/י</a>
    
  4. Load from lines:

    lines <- readLines(tmp)
    html <- htmlParse(lines, encoding = "UTF-8")
    XML::getNodeSet(html, "//a")[[1]]
    #> <a href="/signIn.asp?source=signIn">התחבר/י</a>
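
Since the steps above work in one environment but not in the asker's CP1255 locale, one thing worth trying is to keep the locale out of the picture entirely: read the downloaded file as raw bytes and mark the result as UTF-8 before parsing. The sketch below uses base R only; `read_utf8_text` is a hypothetical helper name, not a function from XML or RCurl, and it assumes the page really is served as UTF-8.

```r
# Hypothetical helper (not part of XML or RCurl): read a file as raw bytes
# and mark the text as UTF-8, with no locale-dependent re-encoding.
read_utf8_text <- function(path) {
  size <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = size)
  text <- rawToChar(bytes)
  Encoding(text) <- "UTF-8"
  # Assumption: the file is UTF-8; fail loudly if the bytes say otherwise.
  if (!validUTF8(text)) {
    stop("file is not valid UTF-8: ", path)
  }
  text
}

# Self-contained demonstration with a small UTF-8 file.
hebrew <- "\u05d1\u05e2\u05dc\u05d9 \u05de\u05e7\u05e6\u05d5\u05e2"
demo_file <- tempfile(fileext = ".html")
writeBin(charToRaw(enc2utf8(paste0("<title>", hebrew, "</title>"))), demo_file)

page_text <- read_utf8_text(demo_file)
Encoding(page_text)
```

With the XML package loaded, the resulting single string could then be handed to `htmlParse(page_text, asText = TRUE, encoding = "UTF-8")`. Whether that avoids the garbling on the 32-bit Windows setup from the question is untested here.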