I've been stuck on this for long enough and can't figure out how to get around it. It's easiest to give dummy code:
require(RCurl)
require(XML)

#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent = "Firefox/23.0"
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  httpauth = 1L, # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)

list1 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')

#note list2 has a new link inserted in 2nd position; this is the link that kills the following getURL calls
list2 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')

for (i in seq_along(list1)) {
  print(list1[i])
  html <- try(
    getURL(
      list1[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),
    TRUE
  )
  # inherits() is the robust way to test for a try-error
  if (inherits(html, "try-error")) {
    print(paste("error accessing", list1[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}
gc()

for (i in seq_along(list2)) {
  print(list2[i])
  html <- try(
    getURL(
      list2[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),
    TRUE
  )
  # inherits() is the robust way to test for a try-error
  if (inherits(html, "try-error")) {
    print(paste("error accessing", list2[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}

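This should run as long as the RCurl and XML libraries are installed. The key point is that when I insert http://timesofindia.indiatimes.com//articleshow/2933019.cms into the second position of the list, every request in the rest of the loop fails (the other links are the same in both lists). This happens consistently, in this case and in others, whenever the link contains a PDF (open the link to check).

Any ideas on how to fix this so that fetching a link that contains a PDF doesn't kill my loop? As you can see, I've tried removing objects that might be causing the problem, calling gc() all over the place, and so on, but I can't figure out why a PDF kills my loop.

Thanks!
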
Just to check, here is the output of my two for loops:
#[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
#[1] "success"
and
#[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
#[1] "success"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933019.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933019.cms"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933131.cms"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933209.cms"
#[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
#[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933277.cms"

Answer 0 (score: 0)
You may find it easier to use httr. It wraps RCurl and sets the options you need by default. Here is the equivalent code with httr:
require(httr)
urls <- c(
  'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933277.cms'
)
responses <- lapply(urls, GET)
sapply(responses, http_status)
sapply(responses, function(x) headers(x)$`content-type`)
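
If you also want the loop to keep going when a single request fails, and to skip anything that isn't HTML (such as the PDF link), something along these lines might work. This is only a rough sketch: `safe_get` and `is_html_type` are hypothetical helper names I made up for illustration, and I'm assuming the PDF response can be recognised from its `content-type` header, which I haven't verified for this site.

```r
library(httr)

urls <- c(
  'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933131.cms'
)

# Hypothetical helper: fetch one URL and return NULL instead of
# stopping the whole run when the request itself errors out.
safe_get <- function(url) {
  tryCatch(
    GET(url),
    error = function(e) {
      message("error accessing ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical helper: TRUE when a Content-Type value looks like HTML.
# Assumes non-HTML responses (e.g. PDFs) declare a different type.
is_html_type <- function(content_type) {
  !is.null(content_type) &&
    grepl("text/html", tolower(content_type), fixed = TRUE)
}

responses <- lapply(urls, safe_get)

# Keep the page text for HTML responses; failed or non-HTML ones become NULL.
pages <- lapply(responses, function(resp) {
  if (is.null(resp)) return(NULL)
  if (!is_html_type(headers(resp)$`content-type`)) return(NULL)
  content(resp, as = "text")
})
```

Each request is handled on its own here, so one bad URL only produces a `NULL` entry in `pages` rather than affecting the requests that follow it.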