Question

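Simple question: the code x <- read_html(url) hangs and keeps reading the page indefinitely. I don't know how to handle this, for example by setting a maximum response time. I could use try, catch, or whatever to retry, but it just hangs and nothing happens. Does anyone know how to deal with it?

There is nothing wrong with the page itself. The problem only occurs sometimes, and it works when I retry manually.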

Answer 1

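You can fetch the page with the GET function from the httr package, which accepts a timeout, and pass the response to read_html.

For example, if your original code was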

library(rvest)
library(dplyr)

my_url <- "https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest"
x <- my_url %>% read_html(.)

then you can replace it with

library(httr)

# Allow 10 seconds
my_url %>% GET(., timeout(10)) %>% read_html

Example

To test, try setting a very short timeout (e.g. one hundredth of a second)

# Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely

my_url %>% GET(., timeout(0.01)) %>% read_html

# Error in curl::curl_fetch_memory(url, handle = handle) : 
#   Timeout was reached: Resolving timed out after 10 milliseconds

You can find more examples here
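Because the timeout surfaces as an ordinary R error, it can be caught rather than stopping the script. The following is a minimal sketch, not part of rvest or httr: the helper name read_html_timeout and its fetch argument are hypothetical, introduced only so the error handling can be tried offline with a stand-in fetcher.

```r
# Hypothetical helper (not part of rvest or httr): returns the parsed page,
# or NULL if the request raises an error (including a timeout).
# The default `fetch` assumes the rvest and httr packages are installed.
read_html_timeout <- function(url, seconds = 10,
                              fetch = function(u, s) {
                                rvest::read_html(httr::GET(u, httr::timeout(s)))
                              }) {
  tryCatch(
    fetch(url, seconds),
    error = function(e) {
      message("Request failed for ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Offline check with a stand-in fetcher that always fails like a timeout
always_times_out <- function(u, s) stop("Timeout was reached")
result <- read_html_timeout("http://example.com", seconds = 0.01,
                            fetch = always_times_out)
is.null(result)
```

In real use you would call read_html_timeout(my_url, seconds = 10) and check the result for NULL before parsing further.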

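Use in a loop (e.g. skip to the next URL if one times out)

Try running this code. Suppose you want to visit many URLs (3 in this case). The second URL below delays 3 seconds before serving the html, which is a good way to test the desired behaviour. We set the timeout to 2 seconds, so we know that request will fail. If an error occurs, tryCatch() runs the function you supply as its error argument; in this case it simply assigns "Timed out!" to the list element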


my_urls <- c("https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest",
             "http://httpbin.org/delay/3", # This url will delay 3 seconds
             "http://httpbin.org/delay/1") 

x <- list()

# Set timeout for 2 seconds (so second url will fail)
for (i in seq_along(my_urls)) {

  print(paste0("Scraping url number ", i))

  # If the request errors (e.g. times out), store a marker string instead
  x[[i]] <- tryCatch(my_urls[i] %>% GET(., timeout(2)) %>% read_html,
                     error = function(e) "Timed out!")

}

Now we check the output: the first and third sites returned content, and the second site timed out

# > x
# [[1]]
# {xml_document}
# <html itemscope="" itemtype="http://schema.org/QAPage" class="html__responsive">
#   [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
# [2] <body class="question-page unified-theme">\r\n    <div id="notify-container"></div>\r\n    <div id="custom ...
# 
# [[2]]
# [1] "Timed out!"
# 
# [[3]]
# {xml_document}
# <html>
# [1] <body><p>{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": ...

Obviously, you can set the timeout to whatever value you need. Depending on the use case, 30-60 seconds may be appropriate.
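Since the question mentions that a manual retry usually works, the timeout can also be combined with automatic retries. This is only a sketch under assumptions: fetch_with_retry, max_tries and wait are hypothetical names, and the flaky function is a stand-in that simulates two timeouts so the example runs without a network connection.

```r
# Hypothetical helper: call fetch(url) up to max_tries times,
# pausing `wait` seconds between failed attempts.
fetch_with_retry <- function(url, fetch, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  NULL  # every attempt failed
}

# Real use (assumes rvest and httr are installed):
# page <- fetch_with_retry(my_url, function(u) {
#   rvest::read_html(httr::GET(u, httr::timeout(30)))
# })

# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls <- 0
flaky <- function(u) {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "page content"
}
page <- fetch_with_retry("http://example.com", flaky, max_tries = 3, wait = 0)
```

The helper returns NULL when all attempts fail, so the caller can skip that URL and continue with the next one.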
