Error when scraping multiple pages with purrr

Asked: 2020-01-13 17:19:35

Tags: r tidyverse rvest purrr

I'm trying to scrape several web pages that are laid out the same way (for example: https://www.foreign.senate.gov/hearings/120314am). The function I wrote works when I use a single URL, but it throws an error when I try to map it over multiple pages.

Here is a simplified version of the function.

library(tidyverse)
library(rvest)

scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)

  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")

  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    gsub("\\n", "", .) %>%
    gsub("\\t", "", .)

  tibble(Witness_Name = names)
}

When I store the hearing names in an object and try to map over them, I get an error.

hearing_name <- c("the-ebola-epidemic-the-keys-to-success-for-the-international-response",
                  "120314am")

map_df(hearing_name, scrape)


Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
Expecting a single string value: [type=character; extent=2]. 

I've tried using lapply() and stripping the function down to a minimal version, but no luck. I hope someone can help!

1 Answer:

Answer 0 (score: 1)

Inside the function, the hearing name is hard-coded as `hearing_name` instead of using the `url` argument:

url <- paste0("https://www.foreign.senate.gov/hearings/", hearing_name)
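
Since `hearing_name` is not an argument of the function, R falls back to the global vector of length 2, and because `paste0()` is vectorised, `read_html()` receives two URLs instead of one. A quick check (output sketched from the vector defined above) shows what the function actually builds:

paste0("https://www.foreign.senate.gov/hearings/", hearing_name)
#[1] "https://www.foreign.senate.gov/hearings/the-ebola-epidemic-the-keys-to-success-for-the-international-response"
#[2] "https://www.foreign.senate.gov/hearings/120314am"

`read_html()` expects a single string, which is exactly what the error message ("Expecting a single string value: [type=character; extent=2]") is complaining about.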

If we change it to use `url`:

scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", url)

  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")

  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    gsub("\\n", "", .) %>%
    gsub("\\t", "", .)

  tibble(Witness_Name = names)
}

the code works as expected:

out <- map_df(hearing_name, scrape)
dim(out)
#[1] 8 1
out
# A tibble: 8 x 1
#  Witness_Name        
#  <chr>               
#1 Ellen JohnsonSirleaf
#2 PaulFarmer          
#3 AnnePeterson        
#4 PapeGaye            
#5 JavierAlvarez       
#6 DanielRussel        
#7 Richard C.Bush III  
#8 SophieRichardson
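
As a side note, the witness names come back with their internal spaces stripped (e.g. `PaulFarmer`), because the `gsub()` calls delete the newline and tab separators outright. A minimal tweak, assuming the missing spaces are exactly those stripped whitespace characters, is to collapse all whitespace to single spaces with `stringr::str_squish()` instead (a sketch, untested against the live pages):

library(stringr)

scrape <- function(url){
  url <- paste0("https://www.foreign.senate.gov/hearings/", url)

  product <- url %>%
    read_html() %>%
    html_nodes("#main_column")

  names <- product %>%
    html_nodes(".fn") %>%
    html_text() %>%
    str_squish()   # trims the ends and collapses \n, \t, and runs of spaces to one space

  tibble(Witness_Name = names)
}

The lapply() approach mentioned in the question also works once the function itself is fixed, e.g. dplyr::bind_rows(lapply(hearing_name, scrape)).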