Automatically expanding a collapsible HTML list in R

Asked: 2016-05-26 00:06:07

Tags: html r xml web-scraping

I am trying to scrape a bunch of information from a website that contains lists of different species belonging to the molluscs. The main page is http://emollusks.myspecies.info/taxonomy/term/8

Extracting the information itself is not a problem once I get to a specific species (e.g. http://emollusks.myspecies.info/taxonomy/term/12257). However, if you navigate to the main page above, you will notice that it contains a collapsible menu starting at "Mollusca". So, to reach a particular species, I currently have to expand that menu by hand, save the .html page, and then parse it in R using XML. I would like to develop an R script that starts from the main page and automatically expands all the collapsible boxes, so that I can later visit each species' information in one pass. I have no idea where to start.

Any help would be much appreciated.

Here is a potential solution based on the accepted answer from @hrbrmstr:
library(rvest)
library(httr)
library(plyr)
library(dplyr)

tinyTaxUrl  <-  function(ID) {
    sprintf('http://emollusks.myspecies.info/tinytax/get/%s', ID)
}

termTaxUrl  <-  function(ID) {
    sprintf('http://emollusks.myspecies.info/taxonomy/term/%s', ID)
}

extractContent  <-  function(...) {
    content(GET(url = tinyTaxUrl(...), 
                add_headers(Referer = termTaxUrl(...)), 
                set_cookies(has_js = '1')))[[2]]$data
}

readHtmlAndReturnTaxID  <-  function(..., verbose = TRUE) {
    if(verbose) {
        cat(..., '\n')
    }
    # use 'try' so that a failed connection doesn't break the loop;
    # note that this retries indefinitely until the request succeeds
    pg  <-  try(read_html(extractContent(...)), silent = TRUE)
    while(inherits(pg, 'try-error')) {
        pg  <-  try(read_html(extractContent(...)), silent = TRUE)
    }
    taxaList  <-  pg %>% html_nodes('li > a')
    data.frame(taxa             =  taxaList %>% html_text(),
               ids              =  basename(taxaList %>% html_attr('href')),
               stringsAsFactors =  FALSE)
}
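The retry in readHtmlAndReturnTaxID() loops forever if the site is down for good. A bounded variant (a sketch — the attempt cap and 1-second pause are arbitrary choices, not part of the original answer) can wrap any flaky call:

```r
# Retry 'expr_fn()' up to 'maxTries' times, pausing between attempts;
# stops with an error once the attempts are exhausted. In the script
# above this could wrap function() read_html(extractContent(ID)).
withRetries <- function(expr_fn, maxTries = 5L, pause = 1) {
    for (i in seq_len(maxTries)) {
        res <- try(expr_fn(), silent = TRUE)
        if (!inherits(res, 'try-error')) {
            return(res)
        }
        Sys.sleep(pause)  # brief pause before the next attempt
    }
    stop('still failing after ', maxTries, ' attempts')
}
```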


startTaxaID  <-  '8'
eBivalvia    <-  readHtmlAndReturnTaxID(startTaxaID)
eBivalvia2   <-  ldply(eBivalvia$ids, readHtmlAndReturnTaxID)
n            <-  1
while(nrow(eBivalvia2) > 0) {
    cat(n, '\n')
    n           <-  n + 1
    eBivalvia   <-  rbind(eBivalvia, eBivalvia2)
    eBivalvia2  <-  ldply(eBivalvia2$ids, readHtmlAndReturnTaxID)
}

eBivalvia$urls  <-  termTaxUrl(eBivalvia$ids)
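With the full taxon table assembled, each term page can then be visited directly. A minimal sketch of that last step, reusing the eBivalvia table built above (the 'h1' selector here only grabs the page title; the nodes that actually hold the species details depend on the site's layout and would need adjusting):

```r
library(rvest)

# visit each term page collected above and pull out its title;
# 'eBivalvia' is the table built by the loop in the question
speciesTitles <- lapply(eBivalvia$urls, function(u) {
    pg <- read_html(u)
    html_text(html_nodes(pg, 'h1'))
})
```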

1 Answer:

Answer 0: (score: 3)

Open your browser's developer tools and watch the XHR requests as you tick a [+]. You can "Copy as cURL" straight into my curlconverter package, which will help you convert it into an httr request. You will then be able to pull the remaining species and their URLs out of the data element of the XHR response:

library(curlconverter)
library(rvest)

cURL <- "curl 'http://emollusks.myspecies.info/tinytax/get/8' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: no-cache' -H 'X-Requested-With: XMLHttpRequest' -H 'Cookie: has_js=1' -H 'Connection: keep-alive' -H 'Referer: http://emollusks.myspecies.info/taxonomy/term/8' --compressed"

req <- make_req(straighten(cURL))
pg <- read_html(httr::content(req[[1]](), as="parsed")[[2]]$data)

html_nodes(pg, "li > a")

## {xml_nodeset (10)}
##  [1] <a href="/taxonomy/term/12" class="">Conchifera</a>
##  [2] <a href="/taxonomy/term/18" class="">Placophora</a>
##  [3] <a href="/taxonomy/term/9" class="">Bivalvia</a>
##  [4] <a href="/taxonomy/term/10" class="">Caudofoveata</a>
##  [5] <a href="/taxonomy/term/11" class="">Cephalopoda</a>
##  [6] <a href="/taxonomy/term/14" class="">Gastropoda</a>
##  [7] <a href="/taxonomy/term/16" class="">Monoplacophora</a>
##  [8] <a href="/taxonomy/term/19" class="">Polyplacophora</a>
##  [9] <a href="/taxonomy/term/20" class="">Scaphopoda</a>
## [10] <a href="/taxonomy/term/21" class="">Solenogastres</a>
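The hrefs above all end in the numeric taxonomy term ID, so base R's basename() — which returns everything after the last '/' — is enough to turn an href into an ID for the next tinytax request:

```r
# basename() treats the href like a file path and returns its last
# component, which here is the numeric taxonomy term ID
basename("/taxonomy/term/12")
## [1] "12"
```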

Here is a cleaned-up version of the httr call that curlconverter generated:

library(httr)

GET(url = "http://emollusks.myspecies.info/tinytax/get/8", 
    add_headers(Referer = "http://emollusks.myspecies.info/taxonomy/term/8"), 
    set_cookies(has_js = "1"))

You should (eventually) be able to work out the URL pattern and grab whatever you need (you can browse the Drupal tinytax module documentation to see how it works).
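For completeness, the level-by-level rbind() loop in the question's edit can also be written as a single queue-based (breadth-first) walk over the tree. A sketch — 'getKids' stands for any function mapping a term ID to a data.frame with 'taxa' and 'ids' columns, which in the script above would be readHtmlAndReturnTaxID; the 0.5-second delay is just a politeness assumption:

```r
# Breadth-first walk over the taxonomy starting from 'rootID'.
# Each node's children are appended to the queue; leaf nodes return
# zero rows, so the queue eventually drains and the walk ends.
crawlTaxa <- function(rootID, getKids) {
    seen  <- NULL
    queue <- rootID
    while (length(queue) > 0) {
        id    <- queue[1]
        queue <- queue[-1]
        kids  <- getKids(id)
        seen  <- rbind(seen, kids)
        queue <- c(queue, kids$ids)  # leaf nodes contribute nothing
        Sys.sleep(0.5)               # be polite to the server
    }
    seen
}
```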