Question

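I am trying to scrape data from a website that unfortunately sits on a very unreliable server with highly erratic response times. The first idea, of course, is to loop over the list of (several thousand) URLs and save the downloaded results by filling a list.

The problem, however, is that the server randomly responds very slowly, which leads to timeout errors. That alone would not be a problem, because I can use the tryCatch() function and skip to the next iteration. But doing so, I am missing some files on every run. I know that every URL in the list exists, and I need all of the data.

So my idea was to use tryCatch() to check whether the getURL() request produces an error. If it does, the loop jumps to the next iteration and the failed URL is appended to the end of the URL list the loop runs over. My intuitive solution looks like this: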

dwl = list()

for (i in seq_along(urs)) {
  temp = tryCatch(getURL(url = urs[[i]]), error = function(e) {e})

  if (inherits(temp, "OPERATION_TIMEDOUT")) { # check for timeout error
    # if there is one, the erroneous url is appended at the end of the sequence
    urs[[length(urs) + 1]] = urs[[i]]
    next
  } else {
    dwl[[i]] = temp # if there is no error the data is saved in the list
  }
}

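If this worked, I would eventually be able to download every URL in the list. It does not work, however, because the help page for next states: "The seq in a for loop is evaluated at the start of the loop; changing it subsequently does not affect the loop." Is there a workaround or a trick that lets me achieve what I have in mind? I appreciate every comment!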
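The behaviour quoted from the help page can be reproduced without any network access. The following is a minimal sketch; the vectors `x` and `y` and the counters `n_for` and `n_while` are made up for illustration. It grows a vector inside a `for` loop and inside a `while` loop, and counts how many iterations each one performs.

```r
# for: seq_along(x) is evaluated once, before the first iteration
x <- 1:3
n_for <- 0
for (i in seq_along(x)) {
  if (i == 1) x <- c(x, 99)   # grow the vector inside the loop
  n_for <- n_for + 1
}
# n_for is 3 although x now has 4 elements

# while: the condition is re-evaluated before every iteration
y <- 1:3
n_while <- 0
i <- 1
while (i <= length(y)) {
  if (i == 1) y <- c(y, 99)   # grow the vector inside the loop
  n_while <- n_while + 1
  i <- i + 1
}
# n_while is 4, so the appended element is visited as well
```

A `while` loop with an explicit index therefore does pick up elements appended during iteration, which is the property the question is looking for.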

Answer 1

I would do something like this (explanation in the comments):

library(RCurl)

## RES is a list that contains the final result
## Always try to pre-allocate your results
RES <- vector("list", length(urs))
## Safe getURL returns NA if error, the NA is useful to filter results
get_url <- function(x) tryCatch(getURL(x), error = function(e) NA)
## the parser!
parse_doc <- function(x) {
  ## some code to parse the doc
}

## indices of the urls that are not scraped yet
todo <- seq_along(urs)

## loop while we still have some not scraped urls
## (this keeps retrying until every url has succeeded)
while (length(todo) > 0) {
  ## get the doc for all remaining urls
  l_doc <- lapply(urs[todo], get_url)
  ok <- !is.na(l_doc)
  ## parse each document and put the result in its original slot of RES
  RES[todo[ok]] <- lapply(l_doc[ok], parse_doc)
  ## keep only the urls that failed for the next pass
  todo <- todo[!ok]
}
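Because the loop above only stops once every URL has succeeded, a URL that never responds would keep it running forever. Below is a hedged sketch of the same retry idea with a cap on the number of passes. Everything in it is hypothetical: `fake_fetch()` stands in for `getURL()` and simulates timeouts so the example runs offline, and `urls`, `fails_left`, `max_rounds` and `pause_sec` are illustrative names and values, not part of RCurl.

```r
urls <- c("a", "b", "c")              # hypothetical URLs
fails_left <- c(a = 0, b = 2, c = 1)  # simulated number of timeouts per URL

# Hypothetical stand-in for getURL(): fails until its timeouts are used up
fake_fetch <- function(u) {
  if (fails_left[[u]] > 0) {
    fails_left[[u]] <<- fails_left[[u]] - 1
    stop("simulated timeout")
  }
  paste("content of", u)
}

# Same pattern as get_url(): return NA instead of raising an error
safe_fetch <- function(u) tryCatch(fake_fetch(u), error = function(e) NA)

max_rounds <- 5   # assumed cap on the number of passes
pause_sec <- 0    # assumed pause between passes; raise it for a real server

results <- vector("list", length(urls))
todo <- seq_along(urls)   # indices still to be downloaded
rounds_done <- 0

while (length(todo) > 0 && rounds_done < max_rounds) {
  rounds_done <- rounds_done + 1
  docs <- lapply(urls[todo], safe_fetch)
  ok <- !is.na(docs)
  results[todo[ok]] <- docs[ok]   # store each result in its original slot
  todo <- todo[!ok]               # retry only the failures
  if (length(todo) > 0) Sys.sleep(pause_sec)
}
```

Any indices left in `todo` after the loop belong to URLs that failed on every pass, so they can be logged or retried later instead of blocking the run.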

Dynamically changing the sequence of a loop