Scraping URLs to other pages on a password-protected website with rvest

Date: 2019-01-17 09:30:40

标签: web-scraping rstudio lapply rvest

I am trying to scrape a list of links spread across multiple pages of the following password-protected website, to which I have a subscription: https://policinginsight.com/media-monitor/

I wrote the code below, which collects all the links in a data frame.

library(plyr)
library(rvest)
library(tidyverse)
library(stringr)
library(dplyr)
library(purrr)
url_base = "https://policinginsight.com/media-monitor/page"

# number of pages (offsets step by 20)
l_out = 100

urls = paste0(url_base, seq(0, by = 20, length.out = l_out))
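As a sanity check on the URL construction above, `paste0()` recycles `url_base` across the sequence of offsets, so one URL is produced per page (nothing site-specific is assumed here, only string construction):

```r
url_base = "https://policinginsight.com/media-monitor/page"

# seq() produces the offsets 0, 20, 40, ..., 1980 (100 values),
# and paste0() appends each offset to the base URL
urls = paste0(url_base, seq(0, by = 20, length.out = 100))

head(urls, 2)  # "...page0" "...page20"
length(urls)   # 100
```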

# function to parse the overview nodes
parse_overview = function(x) {
  tibble(Date = html_text(html_nodes(x, ".td-data"), TRUE),
         Text = html_text(html_nodes(x, ".td-link a"), TRUE),
         Link = html_attr(html_nodes(x, xpath = "//td/a"), "href"))
}

# function to trim space
collapse_to_text = function(x) {
  p = html_text(x, trim = TRUE)
  p = p[p != ""] # drop empty lines
  paste(p, collapse = "\n")
}
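For illustration, here is how `collapse_to_text()` behaves on a small in-memory document (the definition is repeated so the snippet is self-contained; the HTML is a toy example, not the target site):

```r
library(rvest)

# same helper as above
collapse_to_text = function(x) {
  p = html_text(x, trim = TRUE)
  p = p[p != ""] # drop empty lines
  paste(p, collapse = "\n")
}

doc <- read_html("<div><p>first</p><p>   </p><p>second</p></div>")
collapse_to_text(html_nodes(doc, "p"))
# "first\nsecond" (the whitespace-only paragraph is dropped)
```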

# function to get text from links
parse_result <- function(x) {
  tibble(Article = html_text(html_node(x, "p"), trim = TRUE))
}

# put the links in a data frame
overview_content = urls %>%
  map(read_html) %>%
  map_df(parse_overview)

This last part is meant to scrape the links, but the result is just a data frame in which every row reads "Premium Subscription (Annual)".

# scrape the collected links
detail_content <- overview_content$Link %>%
  map(read_html) %>%
  map_df(parse_result)

# create df of both
out <- bind_cols(overview_content, detail_content)

I have logged in to the site with the code below, but it seems to produce the same result as above.

#login
url <- "https://policinginsight.com/log-in/"
session <- html_session(url)

form <- html_form(read_html(url))[[1]]

filled_form <- set_values(form,
                          "login[user_email]" = "email",
                          "login[user_password]" = "password")

submit_form(session, filled_form)

url <- jump_to(session, "https://policinginsight.com/media-monitor/")
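One thing worth noting: the scraping pipeline earlier calls `read_html(url)` directly, which makes anonymous requests and never reuses the authenticated `session`; that would explain the "Premium Subscription (Annual)" rows. Below is a minimal sketch (untested against the live site) of fetching each page through the session instead, assuming the form login above succeeds. `html_session()`, `jump_to()`, and `submit_form()` are rvest's pre-1.0 API; in rvest >= 1.0 they are `session()`, `session_jump_to()`, and `session_submit()`.

```r
# Keep the post-login session: submit_form() returns the session state
# after the POST, but the original code discards its return value.
session <- submit_form(session, filled_form)

# Navigate within the logged-in session so its cookies are sent with
# every request, then parse each page as before.
overview_content <- urls %>%
  map(~ read_html(jump_to(session, .x))) %>%
  map_df(parse_overview)
```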

Any help would be greatly appreciated!

Thanks.

0 Answers:

No answers yet.