I have built a function that takes a URL and, after scraping the page, returns the result I need. The function is as follows:
library(httr)
library(curl)
library(rvest)
library(dplyr)
sd_cat <- function(url){
  cat <- curl(url, handle = new_handle("useragent" = "myua")) %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    html_text()
  x <- cat[1]
  #y <- gsub(pattern = "\n", x = x, replacement = " ")
  y <- gsub(pattern = "\t", x = x, replacement = " ")
  y <- gsub("\\d|,|\t", x = y, replacement = "")
  y <- gsub("^ *|(?<= ) | *$", "", y, perl = TRUE)
  z <- gsub("\n*{2,}", "", y)
  z <- gsub(" {2,}", ">", z)
  final <- substring(z, 2)
  final <- substring(final, 1, nchar(final) - 1)
  final
  #sample discontinued url: "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261"
  #sample working url: "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133"
}
This function works fine when used with sapply over a character vector of several URLs, but as soon as one URL points to a discontinued product, the function throws:

Error in open.connection(x, "rb") : HTTP error 404.

I need a way to skip the discontinued URLs so that the function keeps working on the rest.
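For reference, the failing pattern looks roughly like this (a sketch; the urls vector is assumed, built from the two sample URLs in the comments of sd_cat above):

# One discontinued and one live product page (both taken from the
# comments in sd_cat above).
urls <- c("http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261",
          "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133")

# Works until it hits the discontinued URL, then aborts with:
# Error in open.connection(x, "rb") : HTTP error 404.
sapply(urls, sd_cat)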
Answer 0 (score: 5)
A better solution is to use httr and deliberately take action when the request fails:
library(httr)
library(rvest)
sd_cat <- function(url){
  r <- GET(url, user_agent("myua"))
  if (status_code(r) >= 300)
    return(NA_character_)

  r %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    .[[1]] %>%
    html_nodes("span") %>%
    html_text()
}
sd_cat("http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261")
sd_cat("http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133")
(I also replaced your regular expressions with more robust rvest selectors.)
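With this version, a discontinued page simply yields NA instead of aborting the whole run, so the sapply pattern from the question works again (a minimal sketch reusing the two sample URLs; sapply returns a list here because each live page yields several breadcrumb spans):

urls <- c("http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261",
          "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133")

# The 404 page contributes NA; the live page contributes its breadcrumb spans.
results <- sapply(urls, sd_cat)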
Answer 1 (score: 3)
Maybe use tryCatch(); then you can keep using sapply over the URLs without problems, rather than falling back to a for loop.
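A minimal sketch of that approach, wrapping the original sd_cat (the wrapper name and the NA fallback are my choices, not part of the answer):

# Wrap the original sd_cat so any error (e.g. the HTTP 404 raised for a
# discontinued product page) yields NA instead of stopping the whole run.
sd_cat_safe <- function(url){
  tryCatch(sd_cat(url), error = function(e) NA_character_)
}

# sapply now survives discontinued URLs.
results <- sapply(urls, sd_cat_safe)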