I wrote a function that uses RCurl to get the effective URL for a list of shortened-URL redirects (bit.ly, t.co, etc.) and to handle errors when the effective URL points at a document (PDFs tend to throw "Error in curlPerform... embedded nul in string.").

I'd like to make this function more efficient if possible (while staying in R). As written, the run time is prohibitively long for de-shortening a thousand or more URLs.

?getURI

tells us that by default, getURI/getURL goes asynchronous when the length of the url vector is > 1. But my performance looks totally linear, presumably because sapply turns the thing into one big for loop and the concurrency is lost.

Is there any way I can speed up these requests? Extra credit for fixing the "embedded nul" issue.
require(RCurl)
options(RCurlOptions = list(verbose = F, followlocation = T,
                            timeout = 500, autoreferer = T, nosignal = T,
                            useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"))
# find successful location (or error msg) after any redirects
getEffectiveUrl <- function(url){
  c <- getCurlHandle()
  h <- basicHeaderGatherer()
  curlSetOpt(.opts = list(header = T, verbose = F), curl = c, .encoding = "CE_LATIN1")
  possibleError <- tryCatch(getURI(url, curl = c, followlocation = T,
                                   headerfunction = h$update, async = T),
                            error = function(e) e)
  if(inherits(possibleError, "error")){
    effectiveUrl <- "ERROR_IN_PAGE" # fails on linked documents (PDFs etc.)
  } else {
    headers <- h$value()
    names(headers) <- tolower(names(headers)) # sometimes cases change on header names?
    statusPrefix <- substr(headers[["status"]], 1, 1) # 1st digit of http status
    if(statusPrefix == "2"){ # status = success
      effectiveUrl <- getCurlInfo(c)[["effective.url"]]
    } else {
      effectiveUrl <- paste(headers[["status"]], headers[["statusmessage"]])
    }
  }
  effectiveUrl
}
testUrls <- c("http://t.co/eivRJJaV4j","http://t.co/eFfVESXE2j","http://t.co/dLI6Q0EMb0",
              "http://www.google.com","http://1.uni.vi/01mvL","http://t.co/05Mz00DHLD",
              "http://t.co/30aM6L4FhH","http://www.amazon.com","http://bit.ly/1fwWZLK",
              "http://t.co/cHglxQkz6Z") # 10th URL redirects to content w/ embedded nul

system.time(
  effectiveUrls <- sapply(X = testUrls, FUN = getEffectiveUrl, USE.NAMES = F)
) # takes 7-10 secs on my laptop

# does Vectorize help?
vGetEffectiveUrl <- Vectorize(getEffectiveUrl, vectorize.args = "url")
system.time(
  effectiveUrls2 <- vGetEffectiveUrl(testUrls)
) # nope, makes it worse
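
For the "embedded nul" extra credit: the error comes from RCurl converting a binary response body (e.g. a PDF) into an R character string. Since only the final URL is needed, one option is to skip downloading the body entirely by issuing a HEAD request (libcurl's nobody option). This is a minimal sketch of that idea, not the poster's code; the name getEffectiveUrlHead is illustrative, and it assumes the target servers answer HEAD requests properly:

# variant that issues HEAD requests (nobody = TRUE), so document
# bodies (PDFs etc.) are never downloaded or converted to strings
getEffectiveUrlHead <- function(url){
  c <- getCurlHandle()
  h <- basicHeaderGatherer()
  curlSetOpt(.opts = list(header = T, nobody = T, verbose = F), curl = c)
  possibleError <- tryCatch(getURI(url, curl = c, followlocation = T,
                                   headerfunction = h$update),
                            error = function(e) e)
  if(inherits(possibleError, "error")) return("ERROR_IN_PAGE")
  headers <- h$value()
  names(headers) <- tolower(names(headers))
  if(substr(headers[["status"]], 1, 1) == "2"){
    getCurlInfo(c)[["effective.url"]]   # final URL after redirects
  } else paste(headers[["status"]], headers[["statusmessage"]])
}

Some servers mishandle HEAD requests, so falling back to the original GET-based function when this one fails would be a sensible safeguard.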
Answer 0 (score: 3)

I've had bad experiences with RCurl and async requests. R would freeze completely (with no error message, and no spike in CPU or RAM) at only 20 concurrent requests.

I recommend switching to the curl package and using the curl_fetch_multi() function. In my case it easily handled 50,000 JSON requests in one pool (with some division into sub-pools under the hood). https://cran.r-project.org/web/packages/curl/vignettes/intro.html#async_requests
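
A minimal sketch of that approach, reusing the testUrls vector from the question; the pool sizes and the environment used for callback bookkeeping are illustrative choices, not requirements of the curl API:

library(curl)

results <- new.env()  # filled in by the per-request callbacks

# cap total and per-host concurrency so the pool doesn't overwhelm servers
pool <- new_pool(total_con = 100, host_con = 6)

for (u in testUrls) {
  local({
    original <- u  # capture this iteration's URL for the callbacks
    curl_fetch_multi(
      original,
      done = function(res) {
        # res$url is the final URL after libcurl followed all redirects
        assign(original, res$url, envir = results)
      },
      fail = function(msg) {
        assign(original, paste("ERROR:", msg), envir = results)
      },
      pool = pool
    )
  })
}

multi_run(pool = pool)  # blocks until every request in the pool finishes
effectiveUrls <- unlist(as.list(results))  # named by original URL

As a side benefit, curl returns response bodies as raw vectors rather than character strings, so binary documents (PDFs etc.) never trigger the "embedded nul" error that RCurl's string conversion produces.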