Web scraping across multiple links with R

Posted: 2021-01-20 21:50:12

Tags: html r web-scraping rvest

I'm trying to build a tidy data frame of press releases from several websites. Most of the sites are structured the same way: a landing page of headlines with short intros, each linking to the full article. I want to scrape every full article starting from the landing page. Here is my approach so far. Any help would be greatly appreciated.

library(tidyverse)
library(rvest)
library(xml2)

url_1 <- read_html("http://lifepointhealth.net/news")


## seems to grab the lists
url_1 %>% 
  html_nodes("li") %>% 
  html_text() %>% 
  str_squish() %>% 
  str_trim() %>% 
  enframe()

# A tibble: 80 x 2
    name value                      
   <int> <chr>                      
 1     1 Who We Are Our Company Mis…
 2     2 Our Company Mission, Visio…
 3     3 Mission, Vision, Values an…
 4     4 Giving Quality a Voice     
 5     5 How We Operate             
 6     6 Leadership                 
 7     7 Awards                     
 8     8 20th Anniversary           
 9     9 Our Communities Explore Ou…
10    10 Explore Our Communities    
# … with 70 more rows


# this grabs the titles but there should be many more
url_1 %>% 
  html_nodes("li .title") %>% 
  html_text() %>% 
  str_squish() %>% 
  str_trim() %>% 
  enframe() 

# A tibble: 20 x 2
    name value                      
   <int> <chr>                      
 1     1 LifePoint Health Names Elm…
 2     2 David Steitz Named Chief E…
 3     3 LifePoint Health Receives …
 4     4 Thousands of Top U.S. Hosp…
 5     5 Conemaugh Nason Medical Ce…
 6     6 Vicki Parks Named CEO of W…
 7     7 LifePoint Health Honors Ka…
 8     8 Ennis Regional Medical Cen…
 9     9 LifePoint Health Business …
10    10 LifePoint Health and R1 RC…
# … with 10 more rows

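Getting from the landing page to the full articles requires each title's href as well as its text; a minimal sketch of that step (the `li .title a` selector and the base URL passed to `url_absolute()` are assumptions about the page markup — if `.title` is itself the anchor, use `li a.title` instead):

```r
library(tidyverse)
library(rvest)
library(xml2)

url_1 <- read_html("http://lifepointhealth.net/news")

# grab each title's link, make it absolute, then parse every article page
article_pages <- url_1 %>% 
  html_nodes("li .title a") %>%                     # assumed: anchor inside each title node
  html_attr("href") %>% 
  url_absolute("http://lifepointhealth.net") %>%    # resolve relative hrefs
  map(read_html)                                    # list of parsed article documents
```

From `article_pages` the body text of each article can then be pulled with another `html_nodes()`/`html_text()` pass, but note this only reaches the 20 articles rendered on the initial page; the answer below explains why the rest are missing.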
1 answer:

Answer 0 (score: 1)

Open the Network tab of your browser's developer tools and you will see that every time you click "Load More" the page sends a request to http://lifepointhealth.net/api/posts. Mimic that request as below and you will be able to scrape the details of all 332 posts:

items <- httr::POST(
  "http://lifepointhealth.net/api/posts",
  config = httr::add_headers(`Content-Type` = "application/x-www-form-urlencoded"),
  # the API pages with skip/take; take=332 fetches all posts in one call
  body = "skip=0&take=332&Type=News&tagFilter="
) %>% 
  httr::content() %>%
  .$Items

# each item is a list with some NULL fields; drop those before row-binding
items <- dplyr::bind_rows(lapply(items, function(f) {
  as.data.frame(Filter(Negate(is.null), f))
}))
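Because the request body carries `skip`/`take` paging parameters, the same endpoint can also be walked in smaller pages, which avoids hard-coding the total of 332 if it changes; a sketch under the assumption that the API returns an empty `Items` list once the posts are exhausted:

```r
library(httr)

# fetch one page of posts from the API; skip/take mirror the observed request body
fetch_page <- function(skip, take = 100) {
  POST(
    "http://lifepointhealth.net/api/posts",
    add_headers(`Content-Type` = "application/x-www-form-urlencoded"),
    body = sprintf("skip=%d&take=%d&Type=News&tagFilter=", skip, take)
  ) %>% 
    content() %>% 
    .$Items
}

# page through in chunks of 100 until a page comes back empty
all_items <- list()
skip <- 0
repeat {
  page <- fetch_page(skip)
  if (length(page) == 0) break
  all_items <- c(all_items, page)
  skip <- skip + 100
}
```

The accumulated `all_items` list can then be flattened with the same `Filter`/`bind_rows` step shown above.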