Question

My end goal is to get all 310 articles from this page and its subsequent pages, and run each of them through this function:

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
library(lubridate)
library(dplyr)

scrape_docs <- function(URL){
  doc <- read_html(URL)

  speaker <- html_nodes(doc, ".diet-title a") %>% 
    html_text()

  date <- html_nodes(doc, ".date-display-single") %>%
    html_text() %>%
    mdy()

  title <- html_nodes(doc, "h1") %>%
    html_text()

  text <- html_nodes(doc, "div.field-docs-content") %>%
    html_text()

  all_info <- list(speaker = speaker, date = date, title = title, text = text)

  return(all_info)
}

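I think the way forward is to somehow build a list of the required URLs and then iterate over that list with the scrape_docs function. As it stands, I am having trouble working out how to do that. I thought something like the code below would work, but I seem to be missing a key point, because it fails with the following error:

no applicable method for 'xml_attr' applied to an object of class "character"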

source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"

pages <- 4
all_links <- tibble()

for(i in seq_len(pages)){
  page <- paste0(source_col,i) %>%
    read_html() %>%
    html_attr("href") %>%
    html_attr()

  tmp <- page[[1]]

  all_links <- bind_rows(all_links, tmp)
}

all_links

Answer 1

You can get all the URLs on the first results page with:

library(rvest)

source_col <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page=0"

all_urls <- source_col %>%
              read_html() %>%
              html_nodes("td a") %>%          # every link inside a table cell
              html_attr("href") %>%
              .[c(FALSE, TRUE)] %>%           # keep every second link only
              paste0("https://www.presidency.ucsb.edu", .)

Now repeat the same steps for the remaining pages by changing the page number at the end of source_col.
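Changing the page number can be sketched as follows. This is only an illustration under stated assumptions: build_page_urls and get_doc_urls are hypothetical helper names, result pages are assumed to be numbered from 0 (as the page=0 in the original URL suggests), and the 310 results are assumed to be split into pages of 100. Note that the base URL ends in "page=" with no number, so that appending the index does not produce "page=01".

```r
# Search URL from the question with the trailing page number removed,
# so that a page index can be appended.
base_url <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space%20exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100&page="

# Hypothetical helper: one search URL per results page.
# Assumes pages are numbered from 0.
build_page_urls <- function(base_url, n_results, per_page = 100) {
  n_pages <- ceiling(n_results / per_page)
  paste0(base_url, seq_len(n_pages) - 1)
}

# Hypothetical helper: the same extraction as above, for one results page.
# Performs a network request when called.
get_doc_urls <- function(page_url) {
  page  <- rvest::read_html(page_url)
  links <- rvest::html_attr(rvest::html_nodes(page, "td a"), "href")
  # Keep every second link, as in the single-page version.
  paste0("https://www.presidency.ucsb.edu", links[c(FALSE, TRUE)])
}

# Pure string manipulation, no network access yet.
page_urls <- build_page_urls(base_url, n_results = 310)
```

The full list of document URLs would then come from something like all_urls <- unlist(lapply(page_urls, get_doc_urls)), which requests each results page in turn.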

You can then extract all the data using a for loop or map.

purrr::map(all_urls, scrape_docs)
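With a few hundred pages, a single failed request would abort the whole map call. As a hedged alternative, here is a base-R sketch that pauses between requests and records NULL for any URL that fails. The name scrape_all and the one-second default delay are assumptions for illustration, not part of the original answer.

```r
# Hypothetical wrapper: apply `scraper` to each URL, waiting `delay`
# seconds before each call and returning NULL for URLs that raise an error.
scrape_all <- function(urls, scraper, delay = 1) {
  lapply(urls, function(u) {
    Sys.sleep(delay)
    tryCatch(
      scraper(u),
      error = function(e) {
        message("Failed: ", u, " (", conditionMessage(e), ")")
        NULL
      }
    )
  })
}

# Intended usage (performs network requests):
# results <- scrape_all(all_urls, scrape_docs)
# results <- Filter(Negate(is.null), results)  # drop the failed URLs
```

The result has one element per URL, in the same order as the input, so failed entries can be matched back to their URLs before being filtered out.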

Testing the scrape_docs function on a single URL:

scrape_docs(all_urls[1])

#$speaker
#[1] "Dwight D. Eisenhower"

#$date
#[1] "1958-04-02"

#$title
#[1] "Special Message to the Congress Relative to Space Science and Exploration."

#$text
#[1] "\n    To the Congress of the United States:\nRecent developments in long-range 
#    rockets for military purposes have for the first time provided man with new mac......

R Webscraping: how to feed URLs into a function
