Scraping multiple links with JSON using an if condition

Time: 2019-08-23 10:04:51

Tags: r json web-scraping rvest

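I am scraping the content of many (1000) links that return JSON. However, some of the links do not work in JSON format, so there is nothing to scrape from them. As a result, my code stops working as soon as it reaches one of these links.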

I tried using tryCatch to avoid the error, but it does not seem to work.

This is the code I am using:

library(jsonlite)
library(rvest)

lapply(links_jason[1:6], function(x) {
  tryCatch(
    {
      json_data <- read_html(x) %>%
        html_text() %>%
        jsonlite::fromJSON(.) %>%
        select(1)
    },
    error = function(cond) return(NULL),
    finally = print(x)
  )
})

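This is the problem I run into:

Debug location is approximate because the source is not available

Below are some examples of the links I want to scrape.

Links 1, 2, and 6 work fine. Links 3, 4, and 5 need to be skipped.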

> head(links_jason)
[1] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
[2] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
[3] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
[4] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
[5] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
[6] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json"

I also tried using an if statement, but without success. Can anyone help? Thanks!

1 answer:

Answer 0: (score: 1)

Read the JSON directly with jsonlite and test the length of what is returned:

library(jsonlite)
library(rvest)
library(magrittr)

links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")


lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  # Only report links that returned some content
  if (length(json_data) > 0) {
    print(x)
  }
})

Or something like this:

library(jsonlite)
library(rvest)
library(magrittr)

links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")


lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) == 0) {
    # Empty response: mark this link as missing
    json_data <- NA
  } else {
    print('doing something with json_data')
  }
})
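
The length test above can also be combined with tryCatch, so that a link whose body is not valid JSON is skipped instead of stopping the loop. The following is only a minimal sketch under stated assumptions: safe_parse is a hypothetical helper (it is not part of jsonlite or of the code above), and the inline strings stand in for the responses of the real links so the example runs without network access.

```r
library(jsonlite)

# Hypothetical helper: parse one JSON source and return NULL
# when parsing fails or when the parsed result is empty.
safe_parse <- function(txt) {
  parsed <- tryCatch(
    jsonlite::fromJSON(txt),
    error = function(cond) NULL  # swallow the error, signal failure with NULL
  )
  if (length(parsed) == 0) NULL else parsed
}

# Assumed stand-ins for link responses: two usable, one empty, one invalid.
responses <- c(
  ok      = '{"title": "first"}',
  empty   = '[]',
  invalid = 'not json at all',
  ok2     = '{"title": "second"}'
)

results <- lapply(responses, safe_parse)

# Keep only the entries that produced usable content.
results <- Filter(Negate(is.null), results)
names(results)
```

Since jsonlite::fromJSON also accepts a URL, the same helper could in principle be applied to links_jason directly, in which case download errors would be caught by the same handler. Whether links 3, 4, and 5 fail with an error or return an empty body is not something this sketch can confirm.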