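I am trying to scrape a website, but I cannot get past this encoding issue:
library(XML)  # provides htmlParse() and xpathSApply()
# putting together the url: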
search_str <- "allintitle:amphibian richness OR diversity"
url <- paste("http://scholar.google.at/scholar?q=",
search_str, "&hl=en&num=100&as_sdt=1,5&as_vis=1", sep = "")
# get content and parse it:
doc <- htmlParse(url)
# encoding issue, e.g. in element [5] below:
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)
[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
[5] "MÃ RodrÃguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
[7] "D Vallan - Journal of Tropical Ecology, 2002 - Cambridge Univ Press"
[8] "MO Rödel, R Ernst - Ecotropica, 2004 - gtoe.de"
# ...
Any pointers?
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.91-1.1 bitops_1.0-4.1 XML_3.9-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.1
> getOption("encoding")
[1] "native.enc"
Answer 0 (score: 2)
This works for me, to some extent:
doc <- htmlParse(url, encoding = "UTF-8")
head(xpathSApply(doc, '//div[@class="gs_a"]', xmlValue))
#[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
#[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
#[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
#[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
#[5] "MÁ Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
#[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
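A possible explanation for why passing encoding="UTF-8" helps, assuming the page is served as UTF-8 and the parser otherwise falls back to a single-byte encoding: each accented letter is two bytes in UTF-8, and reading those bytes one at a time produces the "Ã" seen in the question. The sketch below reproduces that without touching the network; the variable names are made up for illustration and the string is a stand-in for the scraped value.

```r
# Sketch: reproduce the "Ã" garbling locally (no network access needed).
original <- "Rodr\u00edguez"   # "í" is U+00ED, stored as UTF-8 bytes c3 ad

# Take the raw UTF-8 bytes and (wrongly) declare them to be latin1,
# roughly what happens when a parser assumes the wrong encoding.
misread <- rawToChar(charToRaw(original))
Encoding(misread) <- "latin1"

# Converting the misread string to UTF-8 bakes the damage in:
# byte c3 becomes "Ã" and byte ad becomes an invisible soft hyphen.
garbled <- enc2utf8(misread)
print(garbled)

# Undo it: map the characters back to single bytes, then relabel as UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
print(repaired == original)
```

The invisible soft hyphen is consistent with the question's output, where "RodrÃguez" appears to have lost a character entirely.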
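The element returned by
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)[[81]]
for example, displays incorrectly on my Windows box.
Switching to the DotumChe font via the GUI preferences, however,
displays it correctly, so it may be just a display issue rather than a parsing issue.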
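One way to tell a display problem from a parsing problem is to look at the string's declared encoding and code points, since neither depends on the console font. This is only a sketch: s is a hypothetical stand-in for one element of the xpathSApply() result, not data from the actual page.

```r
# Sketch: inspect a string independently of how the console renders it.
s <- "MO R\u00f6del"        # stand-in value; "ö" is U+00F6

enc   <- Encoding(s)        # declared encoding of the string
codes <- utf8ToInt(s)       # Unicode code points, one per character
bytes <- charToRaw(s)       # underlying bytes

print(enc)
print(codes)
print(bytes)

# If the expected code point is present, the parse was fine and any
# odd glyph on screen comes from the font or console, not from the data.
parsed_ok <- 246L %in% codes   # 246 is decimal for U+00F6
print(parsed_ok)
```

If the code points are already wrong at this stage (for example 195 followed by another value where a single accented letter is expected), the problem lies in parsing rather than display.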