I'm trying to scrape a list of links from multiple pages of the following password-protected website, to which I have a subscription: https://policinginsight.com/media-monitor/
I wrote the code below, which collects all the links into a data frame.
library(rvest)
library(tidyverse)  # loads dplyr, purrr and stringr
url_base <- "https://policinginsight.com/media-monitor/page"
# number of pages
l_out <- 100
urls <- paste0(url_base, seq(0, by = 20, length.out = l_out))
# function to parse the overview nodes
parse_overview <- function(x) {
  tibble(Date = html_text(html_nodes(x, ".td-data"), trim = TRUE),
         Text = html_text(html_nodes(x, ".td-link a"), trim = TRUE),
         Link = html_attr(html_nodes(x, xpath = "//td/a"), "href"))
}
# function to trim whitespace and collapse nodes to text
collapse_to_text <- function(x) {
  p <- html_text(x, trim = TRUE)
  p <- p[p != ""]  # drop empty lines
  paste(p, collapse = "\n")
}
# function to get the article text from each link
parse_result <- function(x) {
  tibble(Article = html_text(html_node(x, "p"), trim = TRUE))
}
# put the links in a data frame
overview_content <- urls %>%
  map(read_html) %>%
  map_df(parse_overview)
This last part is meant to scrape the links themselves, but the result is just a data frame in which every row reads "Premium Subscription (Annual)".
# scrape the links in the data frame
detail_content <- links_1$links %>%
  map(read_html) %>%
  map_df(parse_result)
# create a data frame of both
out <- bind_cols(overview_content, detail_content)
I've used the code below to log in to the site with my password, but it seems to produce the same result as above.
# login
url <- "https://policinginsight.com/log-in/"
session <- html_session(url)
form <- html_form(read_html(url))[[1]]
filled_form <- set_values(form,
                          "login[user_email]" = "email",
                          "login[user_password]" = "password")
submit_form(session, filled_form)
url <- jump_to(session, "https://policinginsight.com/media-monitor/")
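I suspect the problem is that `read_html()` opens a fresh, unauthenticated connection for each URL instead of reusing the logged-in session, and that the result of `submit_form()` is never captured. Below is a sketch of what I think should work, assuming `submit_form()` returns the authenticated session and `jump_to()` carries its cookies to each page (the email/password values and form field names are placeholders from the form above):

```r
library(rvest)
library(tidyverse)

# log in once and keep the authenticated session (with its cookies)
session <- html_session("https://policinginsight.com/log-in/")
form <- html_form(session)[[1]]
filled_form <- set_values(form,
                          "login[user_email]" = "email",
                          "login[user_password]" = "password")
session <- submit_form(session, filled_form)  # capture the returned session

# navigate within the same session so every request stays authenticated
overview_content <- urls %>%
  map(~ jump_to(session, .x)) %>%
  map(read_html) %>%
  map_df(parse_overview)
```

Is reusing the session like this the right approach, or is there something else I'm missing?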
Any help would be much appreciated!
Thanks.