I have a list of websites that I want to scrape, for example:
review_links <- c("https://www.filmtotaal.nl/recensie/12882", "https://www.filmtotaal.nl/r")
I want to run the following function on each link:
read_txt <- function(a_review_link){
pg <- read_html(a_review_link)  # parse the page; errors if the URL cannot be fetched
txt <- pg %>% html_nodes(xpath = '//div[@class="text"]//text()') %>%
html_text %>% trimws %>%
grep('^[a-zA-Z]+:|\\|$|^[0-9]*$', .,
invert = TRUE, value = TRUE) %>%
paste(collapse = ' ')
}
However, when I loop over the list with this function, like so:
for(review_link in review_links){
read_txt(review_link)
}
I get an error, so I am now trying to add some error handling. However, when I do this:
for(review_link in review_links){
tryCatch(read_txt(test_error), error=function(e) return ("No valid URL"))
}
I do not get the output I expect (the second link should raise an error). Any ideas about what is going wrong here?
Answer 0 (score: 1)
I looked at the documentation for tryCatch, and this is what I came up with. This is my first time using tryCatch.
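library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
review_links <- c("https://www.filmtotaal.nl/recensie/12882", "https://www.filmtotaal.nl/r")
read_txt <- function(a_review_link){
  # Do the download and the parsing inside tryCatch, so that a failing
  # read_html() is caught and the condition object is returned instead.
  tryCatch({
    pg <- read_html(a_review_link)
    pg %>%
      html_nodes(xpath = '//div[@class="text"]//text()') %>%
      html_text %>%
      trimws %>%
      grep('^[a-zA-Z]+:|\\|$|^[0-9]*$', ., invert = TRUE, value = TRUE) %>%
      paste(collapse = ' ')
  }, error = function(e) e)
}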
for(review_link in review_links){
print(read_txt(review_link))
}
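To make the mechanics easier to see without hitting the network, here is a minimal sketch of the same pattern. It assumes a hypothetical stand-in, `fake_fetch()`, in place of `read_html()`; `safe_fetch()` and `results` are likewise names made up for illustration. The value returned by the `error` handler becomes the value of the whole `tryCatch()` call, and because a `for` loop does not auto-print, that value has to be stored or printed explicitly.

```r
# Hypothetical stand-in for read_html(): it fails on anything that does not
# look like a full review URL, so the example runs without network access.
fake_fetch <- function(url) {
  if (!grepl("/recensie/[0-9]+$", url)) stop("cannot open URL: ", url)
  paste("page for", url)
}

# Wrap the risky call; the handler's return value replaces the result on error.
safe_fetch <- function(url) {
  tryCatch(
    fake_fetch(url),
    error = function(e) "No valid URL"
  )
}

links <- c("https://www.filmtotaal.nl/recensie/12882",
           "https://www.filmtotaal.nl/r")

# Store each value; inside a for loop nothing is printed automatically.
results <- character(length(links))
for (i in seq_along(links)) {
  results[i] <- safe_fetch(links[i])
}
print(results)
```

With these two links, the first element holds the fake page text and the second holds the fallback string "No valid URL".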
Answer 1 (score: 0)
This code runs correctly in my R session:
library(rvest)
review_links <- c("https://www.filmtotaal.nl/recensie/12882",
"https://www.filmtotaal.nl/recensie/12883")
read_txt <- function(a_review_link) {
pg <- read_html(a_review_link)  # use the function argument, not the loop variable
txt <- pg %>% html_nodes(xpath = '//div[@class="text"]//text()') %>%
html_text %>% trimws %>%
grep('^[a-zA-Z]+:|\\|$|^[0-9]*$', ., invert = TRUE, value = TRUE) %>%
paste(collapse = ' ')
}
lst <- vector(length(review_links), mode="list")
k <- 1
for(review_link in review_links) {
lst[[k]] <- read_txt(review_link)  # pass one link at a time, not the whole vector
k <- k+1
}
lst[[1]]
# [1] "Cast : Het Hongaarse lichtabsurdistische liefdesdrama On Body and Soul sleepte ...
lst[[2]]
# [1] "Cast : Janet heeft er hard voor geknokt, maar nu het gelukt is mag ze het ...
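If some of the links may be invalid, the loop above stops at the first failure. One possible variant, sketched here under assumptions, collects the results with `vapply()` and records `NA` for links that fail. `read_all()` and `demo_reader()` are hypothetical helpers, not part of rvest; in real use you would pass `read_txt` as the `reader` argument.

```r
# Apply `reader` to every link; a failing link yields NA instead of an error.
read_all <- function(links, reader) {
  vapply(
    links,
    function(link) tryCatch(reader(link), error = function(e) NA_character_),
    character(1)
  )
}

# Hypothetical reader that mimics read_txt() without network access.
demo_reader <- function(link) {
  if (!grepl("/recensie/[0-9]+$", link)) stop("not a review page")
  paste("text of", basename(link))
}

res <- read_all(
  c("https://www.filmtotaal.nl/recensie/12882", "https://www.filmtotaal.nl/r"),
  demo_reader
)

# The result is named by link, so the failed URLs are easy to list.
failed <- names(res)[is.na(res)]
print(failed)
```

Returning `NA` rather than a message string keeps the result a plain character vector, so failures can be filtered with `is.na()` afterwards.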