Scraping with rvest: getting HTTP error 502

Date: 2019-07-03 23:15:16

Tags: web-scraping rvest http-error

I have an R script that uses rvest to pull some data from AccuWeather. AccuWeather URLs contain IDs that uniquely correspond to cities. I am trying to extract the IDs in a given range along with their associated city names. rvest works perfectly for a single ID, but when I iterate over a for loop it eventually returns this error: "Error in open.connection(x, "rb") : HTTP error 502."

I suspect the error is due to the website blocking me. How do I get around this? I want to scrape a fairly large range (10,000 IDs), and the loop keeps hitting this error after about 500 iterations. I have also tried closeAllConnections() and Sys.sleep(), to no avail. Any help with this problem would be greatly appreciated.
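For reference, here is a minimal way to see what the server is actually returning (a sketch, not code from the original post): fetch one page with httr and inspect the status code before handing it to rvest.

library(httr)

# request a single page and check the HTTP status before parsing
resp <- GET("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/300000")
status_code(resp)  # 502 means the server (or a proxy in front of it) rejected the request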

EDIT: Solved. I found a way around the problem in this thread: Use tryCatch skip to next value of loop upon error?. Using tryCatch() with error = function(e) e as an argument suppresses the error message and allows the loop to continue without interruption. Hopefully this helps anyone else stuck on a similar problem.
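A minimal sketch of that tryCatch() pattern (safe_read is a hypothetical wrapper name; the original, failing loop follows below): on an HTTP error the call returns the error object instead of halting, and the iteration can simply be skipped.

library(rvest)

# return the parsed page, or the error object if the request fails
safe_read <- function(url) {
  tryCatch(read_html(url), error = function(e) e)
}

accu <- safe_read("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/300000")
if (!inherits(accu, "error")) {
  citystate <- accu %>% html_nodes('h1') %>% html_text()
}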

library(rvest)
library(httr)

# create a matrix to store IDs and cities
# each ID corresponds to a single city
id_mat <- matrix(0, ncol = 2, nrow = 10001)

# initialize index for matrix row
j <- 1

for (i in 300000:310000) {
  z <- as.character(i)
  # pull city name from website
  accu <- read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = ""))
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # store values
  id_mat[j, 1] <- i
  id_mat[j, 2] <- citystate
  # increment by 1 (note: reassigning i does not affect R's loop sequence)
  i <- i + 1
  j <- j + 1
  # after every 200 pulls, close connections, wait 5 minutes, then continue
  if (i %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
    next
  } else {
    # sleep for 1 or 2 seconds on every other iteration
    Sys.sleep(sample(2, 1))
  }
}

1 Answer:

Answer 0 (score: 1)

The problem seems to come from scientific notation.

How to disable scientific notation?
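To illustrate the pitfall (a sketch; the values are just examples): R may render large doubles in scientific notation when converting them to text, which corrupts the ID at the end of the URL.

options(scipen = 0)                 # the default penalty
format(300000)                      # "3e+05" -- scientific notation
options(scipen = 999)               # heavily penalize scientific notation
format(300000)                      # "300000"
format(300000, scientific = FALSE)  # "300000" regardless of the option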

I changed your code slightly, and it now seems to work:

library(rvest)
library(httr)

id_mat <- matrix(0, ncol = 2, nrow = 10001)

# try to download the page; return 1 on success, 0 on any error or warning
readUrl <- function(url) {
  out <- tryCatch(
    {
      download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
      return(1)
    },
    error = function(cond) {
      return(0)
    },
    warning = function(cond) {
      return(0)
    }
  )
  return(out)
}

j <- 1

options(scipen = 999)

for (i in 300000:310000) {
  z <- as.character(i)
  # pull city name from website
  url <- paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")
  if (readUrl(url) == 1) {
    # readUrl() has already saved the page to scrapedpage.html,
    # so parse that file instead of downloading a second time
    accu <- read_html("scrapedpage.html")
    citystate <- accu %>% html_nodes('h1') %>% html_text()
    # store values
    id_mat[j, 1] <- i
    id_mat[j, 2] <- citystate
    # increment by 1 (note: reassigning i does not affect R's loop sequence)
    i <- i + 1
    j <- j + 1
    # after every 200 pulls, close connections, wait 5 minutes, then continue
    if (i %% 200 == 0) {
      closeAllConnections()
      Sys.sleep(300)
      next
    } else {
      # sleep for 1 or 2 seconds on every other iteration
      Sys.sleep(sample(2, 1))
    }
  } else {
    er <- 1  # flag that this page was skipped
  }
}
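A possible follow-up step (not part of the answer above): the first character assignment coerces the whole numeric matrix to character, so rows for IDs that readUrl() skipped still hold "0" and can be filtered out afterwards.

# drop the rows for skipped IDs and label the columns
keep    <- id_mat[, 2] != "0"
results <- data.frame(id = id_mat[keep, 1], city = id_mat[keep, 2])
head(results)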