Question

I am scraping a website. After a few thousand iterations the server goes down, the system hangs, and I get the following error:

Error in curl::curl(path) : all connections are in use

Is there a way to exit the loop without losing the data I have scraped so far, which took a week to download?

library(rvest)
url <- paste("http://www.example.com",(1:130000))
GNR <- lapply(url,function(i) {
  Sys.sleep(2)
  try(list(html_text(html_nodes(read_html(i), "h7")),
       html_text(html_nodes(read_html(i), "#MainContent_IndividualUC_lblBirth"))
  ))
})

(Sorry for not providing a reproducible example; if I knew how to recreate the error, I wouldn't be posting the question!)

Answer 1

Could you write the scraped data to a file, appending to it each time you get new data?

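url <- paste("http://www.example.com",(1:130000))
GNR <- lapply(url,function(i) {
  Sys.sleep(2)  # pause between requests
  # Download and parse each page once, then reuse it for both selectors
  scrapedTmp <- try({
    page <- read_html(i)
    list(html_text(html_nodes(page, "h7")),
         html_text(html_nodes(page, "#MainContent_IndividualUC_lblBirth")))
  })
  # Append this page's results right away; skip pages whose request failed
  if (!inherits(scrapedTmp, "try-error")) {
    write(unlist(scrapedTmp), file="path_to-text-file", append=TRUE)
  }
  scrapedTmp  # also keep the result in GNR
})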
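
If a flat text file is awkward to read back, another option is to checkpoint the results list itself. The following is only a minimal sketch, not tested against the real site: `scrape_with_checkpoint`, `fetch_page`, `save_every` and the file name `GNR_checkpoint.rds` are hypothetical names introduced here. It swaps `lapply` for a `for` loop, saves with `saveRDS` every few pages, and registers an `on.exit` save so that interrupting the loop should still write out whatever has been collected. It assumes the same `urls` vector is passed on every run.

```r
# Hypothetical helper: apply `fetch` to each URL, keeping progress on disk.
scrape_with_checkpoint <- function(urls, fetch, checkpoint, save_every = 100) {
  # Resume from an earlier run if a checkpoint file already exists.
  if (file.exists(checkpoint)) {
    results <- readRDS(checkpoint)
  } else {
    results <- vector("list", length(urls))
  }
  # Save on any exit from the function, including an interrupt or an error.
  on.exit(saveRDS(results, checkpoint), add = TRUE)
  for (k in seq_along(urls)) {
    if (!is.null(results[[k]])) next  # already scraped in a previous run
    # A failed page stays NULL, so it is retried on the next run.
    value <- tryCatch(fetch(urls[k]), error = function(e) NULL)
    results[k] <- list(value)  # list() wrapper keeps the slot even if NULL
    # Periodic save guards against a hard crash of the R session.
    if (k %% save_every == 0) saveRDS(results, checkpoint)
  }
  results
}

# Hypothetical fetcher mirroring the question's selectors (needs rvest).
fetch_page <- function(u) {
  Sys.sleep(2)
  page <- rvest::read_html(u)
  list(rvest::html_text(rvest::html_nodes(page, "h7")),
       rvest::html_text(rvest::html_nodes(page, "#MainContent_IndividualUC_lblBirth")))
}

# Usage (not run here because it needs network access):
# GNR <- scrape_with_checkpoint(url, fetch_page, "GNR_checkpoint.rds")
```

Running the same call again after a crash or an interrupt picks up at the first page that has no stored result, so the week of downloads is not repeated.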
