Question

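I would like to extract foreign-language text from a website. The following code (hopefully self-contained) demonstrates the problem: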

require(RCurl)
require(XML)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Chrome 39.0.2171.71 (64-bit)" 
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt' ,useragent = agent,followlocation = TRUE , autoreferer = TRUE , curl = curl)

html <-getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008', maxredirs = as.integer(20), followlocation = TRUE, curl = curl)
work <- htmlTreeParse(html, useInternal = TRUE)
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1' ]//font|//table[@id = 'ctl00_ContPlaceHolderMain_DataList1' ]//p", xmlValue) #this one captured some mess in 13
table[[2]]

The first few characters of the console printout appear as Â¸Ã\u0089Ã\u0092 iÃ\u0089{Ã\u0089xÃ\u0089 Ã\u008aÂºÃ\u0089EÃ²nÃ¹Â®Ãº.

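Note that if I go to the actual page (http://bit.ly/1AcE9Gs), view the page source, and find the second opening <font tag (it corresponds to the second list item in my table; inspecting the element near the first Hindi characters works too), the text in the page source looks like this: ¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É): and that is what I want.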

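Does anyone know why this happens, and/or how to fix it? Is it related to encoding in R or in RCurl? The characters already look different right after the initial getURL call, so it has nothing to do with passing the HTML text on to xpathApply.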

I am using Mac OS X 10.9.3, Chrome (for viewing the actual page), and R 3.1.1.

If you are interested, see this related question about xpathApply: R and xpathApply -- removing duplicates from nested html tags

Thanks!

Answer 1

Add an encoding option to both htmlParse (encoding) and getURL (.encoding):

require(RCurl)
require(XML)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Chrome 39.0.2171.71 (64-bit)" 
curl = getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt' ,useragent = agent,followlocation = TRUE , autoreferer = TRUE , curl = curl)

html <-getURL('http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008'
              , maxredirs = as.integer(20), followlocation = TRUE, curl = curl
              , .encoding = 'UTF-8')
work <- htmlParse(html, encoding = 'UTF-8')
table <- xpathApply(work, "//table[@id = 'ctl00_ContPlaceHolderMain_DataList1' ]//font|//table[@id = 'ctl00_ContPlaceHolderMain_DataList1' ]//p", xmlValue) #this one captured some mess in 13
> table[[2]]
[1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, {É½þ±Éä ÊnùxÉ ¨ÉèÆ ¤ÉÉä±É\r\n®ú½þÉ lÉÉ iÉÉä ¨ÉèÆxÉä =iiÉ®ú {ÉÚ´ÉÒÇ ¦ÉÉ®úiÉ Eòä\r\n+ÉiÉÆEò´ÉÉnù {É®ú =ºÉ ÊnùxÉ nùÉä {É½þ±ÉÖ+ÉäÆ EòÉ =±±ÉäJÉ\r\nÊEòªÉÉ lÉÉ* +ÉVÉ ¦ÉÒ ¨ÉèÆ, ÊVÉºÉ EòÉ®úhÉ ºÉä +ÉiÉÆEò´ÉÉnù\r\n{ÉènùÉ ½þÖ+É, =ºÉEòä Ê´É¹ÉªÉ ¨ÉäÆ lÉÉäc÷É ºÉÉ =±±ÉäJÉ\r\nEò°üÆMÉÉ*"
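
Why the encoding options help can be checked offline. The garbled console output in the question is what you get when the page's UTF-8 bytes are decoded as Latin-1, one character per byte. The sketch below only illustrates that mechanism and is not part of the original answer; the variable names (wanted, garbled, repaired) are made up for the example, and it assumes the platform's iconv recognises the name "latin1".

```r
# Illustration only: reproduce the mojibake from the question offline.
# The first three characters the page source shows: "¸ÉÒ"
wanted <- intToUtf8(c(0xB8, 0xC9, 0xD2))

# Misread the UTF-8 bytes as Latin-1. iconv() trusts `from`, not the
# string's declared encoding, so every byte becomes its own character.
garbled <- iconv(wanted, from = "latin1", to = "UTF-8")
utf8ToInt(garbled)
#> [1] 194 184 195 137 195 146   (that is, "Â¸Ã\u0089Ã\u0092")

# Undo it: write the characters back out as single bytes, then
# declare those bytes to be UTF-8 again.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
stopifnot(identical(utf8ToInt(repaired), utf8ToInt(wanted)))
```

Declaring the encoding up front, as the code above does for getURL and htmlParse, avoids the misreading in the first place, which is usually preferable to repairing strings after the fact.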

Answer 2

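Here is an alternative implementation using rvest. Not only is the code simpler, but you also don't have to do anything about the encoding; rvest works it out for you.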

library("rvest")
url <- "http://164.100.47.132/LssNew/psearch/result12.aspx?dbsl=1008"

search <- read_html(url)  # read_html() supersedes html(), which later rvest releases deprecate
search %>% 
  html_node("#ctl00_ContPlaceHolderMain_DataList1") %>%
  html_nodes("font, p") %>%
  html_text() %>% 
  .[[2]]
#> [1] "¸ÉÒ iÉ{ÉxÉ ÊºÉEònù®ú (nù¨Énù¨É):\r\nºÉ¦ÉÉ{ÉÊiÉ ¨É½þÉänùªÉ, ...
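
The selector chain can also be tried without network access. The following is a minimal sketch on a made-up HTML snippet; the id "demo" and the cell contents are hypothetical stand-ins, not taken from the real page. It assumes an rvest version that provides read_html().

```r
library("rvest")

# Hypothetical stand-in for the real page: similar structure, made-up content.
snippet <- '<table id="demo"><tr><td><font>first</font><p>second</p></td></tr></table>'

texts <- read_html(snippet) %>%
  html_node("#demo") %>%      # the single element with this id
  html_nodes("font, p") %>%   # every <font> or <p> inside it, in document order
  html_text()

texts[[2]]
#> [1] "second"
```

If the automatic encoding detection ever guessed wrong on a real page, read_html() also takes an encoding argument that could be set explicitly.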

Characters from RCurl / getURL differ from those shown in the browser
