Web scraping with R across successive pages loaded by a "View more" button

Asked: 2018-10-13 23:10:09

Tags: r web-scraping rvest

I'm new to R and need to scrape the titles and dates from this website: https://www.healthnewsreview.org/news-release-reviews/

Using rvest, I was able to write basic code to get the information:

library(rvest)

url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)

# Dates
date_data_html <- html_nodes(webpage,'span.date')
date_data <- html_text(date_data_html)
head(date_data)

# Titles (reuses the page already read above)
title_data_html <- html_nodes(webpage,'h2')
title_data <- html_text(title_data_html)
head(title_data)

But the site initially shows only 10 items, and then you have to click "View more", so I don't know how to scrape the whole site. Thank you!!

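1 Answer:

Answer 0: (score: 4)

Third-party dependencies should be introduced only as a last resort. RSelenium (which r2evans originally postulated as the only solution) is not required the vast majority of the time, including now. (It is required for gnarly sites that use horrible tech such as SharePoint, since maintaining state without a browser context is more pain than it's worth.)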

If we start with the main page:

library(rvest)

pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")

We can get the first set of links (10 of them):

pg %>%
  html_nodes("div.item-content") %>%
  html_attr("onclick") %>%
  gsub("^window.location.href='|'$", "", .)
##  [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"                
##  [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
##  [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"                                             
##  [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"                        
##  [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"                           
##  [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"                          
##  [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"                                  
##  [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"                                               
##  [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"                                  
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"

I guess you want to scrape the content of those ^^, so have at it.
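To see what that pipeline does to each element, here is a small offline sketch. The HTML fragment and its example.org URLs are made up for illustration; only the `div.item-content` selector and the `onclick` pattern are taken from the code above.

```r
library(rvest)

# Made-up fragment imitating the page markup (assumption: each item is a
# div.item-content whose onclick attribute sets window.location.href)
snippet <- "<html><body>
  <div class='item-content' onclick=\"window.location.href='https://example.org/a/'\">A</div>
  <div class='item-content' onclick=\"window.location.href='https://example.org/b/'\">B</div>
</body></html>"

onclick <- read_html(snippet) %>%
  html_nodes("div.item-content") %>%
  html_attr("onclick")

# Strip the JavaScript wrapper, leaving the bare URL
links <- gsub("^window.location.href='|'$", "", onclick)
links
```

The two alternatives in the regular expression remove the leading `window.location.href='` and the trailing quote, so what remains is the URL itself.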

But there is that pesky "View more" button.

When you click it, it issues the following POST request:

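[Image: the captured POST request]

We can convert that request into a callable httr function (which may not exist, given the impossibility of this task) and wrap that function call in another function that takes a pagination parameter: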

view_more <- function(current_offset=10) {

  httr::POST(
    url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest"
    ),
    body = list(
      action = "viewMore",
      current_offset = as.character(as.integer(current_offset)),
      page_id = "22332",
      btn = "btn btn-gray",
      active_filter = "latest"
    ),
    encode = "form"
  ) -> res

  list(
    links = httr::content(res) %>%
      html_nodes("div.item-content") %>%
      html_attr("onclick") %>%
      gsub("^window.location.href='|'$", "", .),
    next_offset = current_offset + 4
  )

}

Now we can run it (the default offset of 10 is the one sent on the first "View more" click):

x <- view_more()

str(x)
## List of 2
##  $ links      : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
##  $ next_offset: num 14

We can pass the new offset to another call:

y <- view_more(x$next_offset)

str(y)
## List of 2
##  $ links      : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
##  $ next_offset: num 18

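You can do the hard part yourself: scrape the initial article count (it's on the main page) and do the math to put the calls in a loop that stops efficiently.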
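One possible shape for that loop is sketched below. `collect_links()` is a hypothetical helper, not part of rvest or httr, and `fake_view_more()` is an offline stand-in with the same return contract as `view_more()` so the loop can be tried without touching the site. The total of 23 is an arbitrary value for the demonstration, not the real article count.

```r
# Hypothetical helper: keep requesting pages until `total` links are collected
# or a request comes back empty. `fetch_page` is any function that takes an
# offset and returns list(links = <character>, next_offset = <number>).
collect_links <- function(fetch_page, first_links, total,
                          offset = length(first_links)) {
  links <- first_links
  while (length(links) < total) {
    res <- fetch_page(offset)
    if (length(res$links) == 0) break  # nothing more to load
    links <- c(links, res$links)
    offset <- res$next_offset
  }
  unique(links)
}

# Offline stand-in for view_more(): serves up to 4 fake links per call
# from a made-up pool of 23
fake_pool <- sprintf("https://example.org/review-%02d/", 1:23)
fake_view_more <- function(current_offset = 10) {
  idx <- seq_along(fake_pool)
  idx <- idx[idx > current_offset & idx <= current_offset + 4]
  list(links = fake_pool[idx], next_offset = current_offset + 4)
}

all_links <- collect_links(fake_view_more,
                           first_links = fake_pool[1:10],
                           total = 23)
length(all_links)
```

Against the real site, the call would presumably be `collect_links(view_more, first_links, total)`, with `first_links` coming from the main-page pipeline and `total` from the scraped article count. Adding a `Sys.sleep()` pause between requests would be considerate.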

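NOTE: If you are scraping to archive the whole site (whether for them or independently) because it is shutting down at the end of the year, you should comment to that effect, since I have better suggestions for that use case than manual coding in any programming language. There are free, industrial-strength "site preservation" frameworks designed to preserve these kinds of dying resources. If you just need the article content, then an iterator and a custom scraper are likely an OK choice (but obviously impossible).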

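Also note that the pagination increment of 4 is what the site itself does when you literally press the button, so this just mimics that behavior.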