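I am trying to get the number of results returned by a particular Google search. For example, for stackoverflow there are "About 28,200,000 results (0.12 seconds)".
Normally I would use the xpathSApply function from the XML R package, but I am getting errors and do not know how to resolve them, or whether there is an alternative approach.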
library(XML)
googleURL <- "https://www.google.ca/search?q=stackoverflow"
googleInfo <- htmlParse(googleURL, isURL = TRUE)
# Error: failed to load external entity "https://www.google.ca/search?q=stackoverflow"
#use of RCurl which I am not that familiar with
library(RCurl)
getURL(googleURL)
#Error in function (type, msg, asError = TRUE) :
#SSL certificate problem, verify that the CA cert is OK. Details:
#error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
# final effort
library(httr)
x <- GET(googleURL)
# no error but am not sure how to proceed
# the relevant HTML code to parse is
# <div id=resultStats>About 28,200,000 results<nobr> (0.12 seconds) </nobr></div>
Any help with resolving the errors or parsing the httr object would be much appreciated.
Answer 0 (score: 3)
You are requesting a secure HTTP (https) connection:
https://www.google.ca/search?q=stackoverflow
XML is complaining about this, as is RCurl. httr will download the page.
XML
Request a non-secure (http) connection:
library(XML)
googleURL <- "http://www.google.ca/search?q=stackoverflow"
googleInfo <- htmlParse(googleURL, isURL = TRUE)
xpathSApply(googleInfo,'//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
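The call above returns the whole node, while the question asks for the number itself. A minimal sketch of one way to get it, assuming the markup still looks like the fragment quoted in the question (Google may change it at any time): pass xmlValue to xpathSApply to get the node text, then pull out the first run of digits and commas. The variable names are illustrative only, and the HTML is parsed from a string so the example does not need a network connection.

```r
library(XML)

# The HTML fragment quoted in the question, parsed from a string (no network needed).
html <- '<div id=resultStats>About 28,200,000 results<nobr> (0.12 seconds) </nobr></div>'
doc <- htmlParse(html, asText = TRUE)

# Passing xmlValue returns the text of each matching node instead of the node object.
stats_text <- xpathSApply(doc, '//div[@id="resultStats"]', xmlValue)

# Take the first run of digits and commas, then drop the commas and convert.
count_chr <- regmatches(stats_text, regexpr("[0-9][0-9,]*", stats_text))
result_count <- as.numeric(gsub(",", "", count_chr, fixed = TRUE))
result_count
```

The same two extraction lines should work on a googleInfo document obtained by any of the approaches in this answer, as long as the resultStats div is present.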
RCurl
Use ssl.verifypeer = FALSE, though it worked without it for me:
library(RCurl)
googleURL <- "https://www.google.ca/search?q=stackoverflow"
googleInfo <- getURL(googleURL,ssl.verifypeer = FALSE)
googleInfo <- htmlParse(googleInfo)
# or, if you want to verify against a CA certificate bundle instead
# cert <- system.file("CurlSSL/cacert.pem", package = "RCurl")
# googleInfo <- getURL(googleURL, cainfo = cert)
# googleInfo <- htmlParse(googleInfo)
xpathSApply(googleInfo,'//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
httr
Use content to get the response body as text:
library(httr)
x <- GET(googleURL)
googleInfo <- htmlParse(content(x, as = 'text'))
xpathSApply(googleInfo,'//*/div[@id="resultStats"]')
#[[1]]
#<div id="resultStats">About 28,200,000 results</div>
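If the request fails or the page does not contain a resultStats div, the httr code above returns an empty list without any warning. The following is a rough sketch of a more defensive variant, not a definitive implementation: result_stats_text and fetch_result_stats are hypothetical helper names introduced here, and the XPath assumes the markup shown in the question.

```r
library(httr)
library(XML)

# Hypothetical helper: return the text of the resultStats div,
# or NA if the page has no such element.
result_stats_text <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  stats <- xpathSApply(doc, '//div[@id="resultStats"]', xmlValue)
  if (length(stats) == 0) NA_character_ else trimws(stats[[1]])
}

# Hypothetical wrapper: fetch the page with httr and stop on HTTP errors
# before trying to parse the body.
fetch_result_stats <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)
  result_stats_text(content(resp, as = "text", encoding = "UTF-8"))
}

# Offline checks of the parsing step
result_stats_text('<div id="resultStats">About 28,200,000 results</div>')
result_stats_text('<p>no stats here</p>')
```

With a network connection, fetch_result_stats(googleURL) would run the whole pipeline. Keeping the parsing step separate from the download makes it possible to check the XPath against saved HTML without sending a request.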