Web scraping in R: the same element is scraped multiple times. How can I fix this?

Asked: 2019-10-17 15:14:02

Tags: r web-scraping rvest

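I am trying to scrape some URLs from the Dutch train disruptions website. The problem is that on every page the first URL is scraped 7 times. The HTML contains that URL only once, so I don't understand why it is scraped more than once.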

The problem occurs in the same way on every page: each time, the first URL is scraped 7 times, while every other URL on the page is scraped only once.

I am using the following script:

library(tidyverse)
library(rvest)

scrape_css_attr <- function(css,group,attribute,html_page){
  txt <- html_page %>%
    html_nodes(group) %>%
    lapply(.%>% html_nodes(css) %>% html_attr(attribute) %>% ifelse(identical(.,character(0)),NA,.)) %>%
    unlist()
  return(txt)
}

get_element_data <- function(link){  
  if(!is.na(link)){
    html <- read_html(link)
    Sys.sleep(2)
    datum <- html %>%
      html_node(".disruption-cause") %>%
      html_text()
    return(tibble(datum=datum))
  }
}

get_elements_from_url <- function(url){
  html_page <- read_html(url)
  Sys.sleep(2)
  element_urls <- scrape_css_attr(".resolved","div","href",html_page)
  element_urls <- element_urls[!is.na(element_urls)]
  element_urls <- paste0("https://www.rijdendetreinen.nl", element_urls)
  element_data_detail <- element_urls %>%
    map(get_element_data) %>%
    bind_rows()
  elements_data <- tibble(element_urls=element_urls)
  elements_data_overview <- elements_data[complete.cases(elements_data[,1]), ]
  return(bind_cols(elements_data_overview,element_data_detail))
}

scrape_write_table <- function(url){
  list_of_pages <- str_c(url, 1)
  list_of_pages %>%
    map(get_elements_from_url) %>%
    bind_rows()
}

trainDisruptions <- scrape_write_table("https://www.rijdendetreinen.nl/storingen?lines=&reasons=&date_before=31-12-2018&date_after=01-01-2018&page=")

View(trainDisruptions)
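
A possible explanation, sketched on made-up markup rather than the real page: `scrape_css_attr` loops over every `div`, including outer wrapper divs that contain all of the `.resolved` links. For each such wrapper, `html_attr` returns every href, but `ifelse` with a length-one test keeps only the first element of that vector. If the first link sits inside 7 nested divs on the real page (an assumption, not verified here), this would yield the first URL 7 times. The HTML string and the `/storingen/...` paths below are hypothetical.

```r
library(rvest)

# Hypothetical markup: two links that share the same outer <div> wrappers.
page <- read_html(paste0(
  '<div id="outer"><div id="list">',
  '<div class="row"><a class="resolved" href="/storingen/1">one</a></div>',
  '<div class="row"><a class="resolved" href="/storingen/2">two</a></div>',
  '</div></div>'
))

# Same logic as scrape_css_attr: visit every <div>, look for .resolved inside it.
# ifelse() has a length-one test here, so it returns only the first href
# whenever a div contains several links.
per_div <- page %>%
  html_nodes("div") %>%
  lapply(function(d) {
    href <- d %>% html_nodes(".resolved") %>% html_attr("href")
    ifelse(identical(href, character(0)), NA, href)
  }) %>%
  unlist()

# Alternative: select the .resolved elements directly, one href per element.
direct <- page %>%
  html_nodes(".resolved") %>%
  html_attr("href")

print(per_div)  # first href repeated once per enclosing div
print(direct)   # each href once
```

If that is indeed the cause, replacing the `scrape_css_attr(".resolved","div","href",html_page)` call in `get_elements_from_url` with a direct selection such as `html_page %>% html_nodes(".resolved") %>% html_attr("href")` should return each URL once. Wrapping the original result in `unique()` would hide the duplicates but would still silently drop links when several of them share the same innermost div.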
