I am trying to build a tidy data frame of press releases from several websites. Most of the sites are structured the same way: a main page of headlines with short introductions, each linking to the full article. I want to scrape all of the main articles from that landing page. Here is my approach so far; any help would be greatly appreciated.
library(tidyverse)
library(rvest)
library(xml2)
url_1 <- read_html("http://lifepointhealth.net/news")
## seems to grab the lists
url_1 %>%
  html_nodes("li") %>%
  html_text() %>%
  str_squish() %>%
  str_trim() %>%
  enframe()
# A tibble: 80 x 2
name value
<int> <chr>
1 1 Who We Are Our Company Mis…
2 2 Our Company Mission, Visio…
3 3 Mission, Vision, Values an…
4 4 Giving Quality a Voice
5 5 How We Operate
6 6 Leadership
7 7 Awards
8 8 20th Anniversary
9 9 Our Communities Explore Ou…
10 10 Explore Our Communities
# … with 70 more rows
# this grabs the titles but there should be many more
url_1 %>%
  html_nodes("li .title") %>%
  html_text() %>%
  str_squish() %>%
  str_trim() %>%
  enframe()
# A tibble: 20 x 2
name value
<int> <chr>
1 1 LifePoint Health Names Elm…
2 2 David Steitz Named Chief E…
3 3 LifePoint Health Receives …
4 4 Thousands of Top U.S. Hosp…
5 5 Conemaugh Nason Medical Ce…
6 6 Vicki Parks Named CEO of W…
7 7 LifePoint Health Honors Ka…
8 8 Ennis Regional Medical Cen…
9 9 LifePoint Health Business …
10 10 LifePoint Health and R1 RC…
Answer (score: 1):

Open the Network tab in your browser's developer tools and you will see that each time you click "Load More", the page sends a request to http://lifepointhealth.net/api/posts. Mimic that request as shown below and you can retrieve the details of all 332 posts:
items <- httr::POST(
  "http://lifepointhealth.net/api/posts",
  config = httr::add_headers(`Content-Type` = "application/x-www-form-urlencoded"),
  body = "skip=0&take=332&Type=News&tagFilter=",
  encode = "multipart"
) %>%
  httr::content() %>%
  .$Items

# Each post arrives as a nested list; drop NULL fields so every post
# can be coerced to a one-row data frame before binding them together.
items <- dplyr::bind_rows(lapply(items, function(f) {
  as.data.frame(Filter(Negate(is.null), f))
}))
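If you prefer a purrr-style pipeline, the same tidy-up can be sketched like this. Note the field names (`Title`, `Date`, `Url`) are hypothetical; inspect `names(items[[1]])` on the live response and substitute the actual keys before relying on them:

```r
library(purrr)
library(dplyr)

# Assumed field names -- check names(items[[1]]) against the real API
# response, since the column names below are illustrative only.
posts <- map_dfr(items, function(item) {
  tibble(
    title = item$Title %||% NA_character_,  # %||% falls back when a field is NULL
    date  = item$Date  %||% NA_character_,
    url   = item$Url   %||% NA_character_
  )
})
```

The `%||%` operator (exported by purrr) keeps NULL fields from silently dropping rows, which is the same problem the `Filter(Negate(is.null), f)` call in the answer works around.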