I am trying to scrape a bunch of information from a particular website that contains lists of different species belonging to the mollusks. The main page is http://emollusks.myspecies.info/taxonomy/term/8
Once I get to a particular species (e.g. http://emollusks.myspecies.info/taxonomy/term/12257), extracting the information itself is not a problem. However, if you navigate to the main page above, you will notice that it contains a collapsible menu that starts at "Mollusca". So, to reach a particular species, I first have to expand that menu by hand, save the .html page, and then parse it in R using XML. I would like to develop an R script that starts from the main page and automatically expands all the possible boxes, so that I can later access the information for each species in one pass. I have no idea where to start.
Many thanks for your assistance.
Edit: following the directions in the answer below, here is a script that walks the whole taxonomy and does the trick:
library(rvest)
library(httr)
library(plyr)
library(dplyr)

# URL of the AJAX endpoint that returns the children of a taxon
tinyTaxUrl <- function(ID) {
  sprintf('http://emollusks.myspecies.info/tinytax/get/%s', ID)
}

# URL of the regular page for a taxon
termTaxUrl <- function(ID) {
  sprintf('http://emollusks.myspecies.info/taxonomy/term/%s', ID)
}

# fetch the HTML fragment listing a taxon's children from the XHR response
extractContent <- function(...) {
  content(GET(url = tinyTaxUrl(...),
              add_headers(Referer = termTaxUrl(...)),
              set_cookies(has_js = '1')))[[2]]$data
}

# return a data frame with the names and term IDs of a taxon's children
readHtmlAndReturnTaxID <- function(..., verbose = TRUE) {
  if (verbose) {
    cat(..., '\n')
  }
  # use 'try' (and retry) so that a failed connection won't break the crawl
  pg <- try(read_html(extractContent(...)), silent = TRUE)
  while (inherits(pg, 'try-error')) {
    pg <- try(read_html(extractContent(...)), silent = TRUE)
  }
  taxaList <- pg %>% html_nodes('li > a')
  data.frame(taxa = taxaList %>% html_text(),
             ids = basename(taxaList %>% html_attr('href')),
             stringsAsFactors = FALSE)
}

# crawl the taxonomy level by level, starting from Mollusca (term 8),
# until an entire level comes back with no children
startTaxaID <- '8'
eBivalvia <- readHtmlAndReturnTaxID(startTaxaID)
eBivalvia2 <- ldply(eBivalvia$ids, readHtmlAndReturnTaxID)
n <- 1
while (nrow(eBivalvia2) > 0) {
  cat(n, '\n')
  n <- n + 1
  eBivalvia <- rbind(eBivalvia, eBivalvia2)
  eBivalvia2 <- ldply(eBivalvia2$ids, readHtmlAndReturnTaxID)
}
eBivalvia$urls <- termTaxUrl(eBivalvia$ids)
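Once the loop finishes, eBivalvia holds every taxon name, term ID, and URL. As a minimal sketch of the final step (not part of the script above; the '#page-title' selector is an assumed Drupal default and may need adjusting against the real species pages), each URL can then be visited to pull the species information:
speciesTitles <- sapply(head(eBivalvia$urls), function(u) {
  # sketch: visit each collected URL (first few only) and grab the page title;
  # '#page-title' is an assumed selector, not verified on this site
  html_text(html_node(read_html(u), '#page-title'))
})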
Answer (score: 3):
Open up Developer Tools and look at the XHR requests when you tick a [+]. You can "Copy as cURL" straight into my curlconverter package and it will help you turn the request into an httr one. You'll then be able to get the other species and their URLs from the data element of the XHR response:
library(curlconverter)
library(rvest)
cURL <- "curl 'http://emollusks.myspecies.info/tinytax/get/8' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: no-cache' -H 'X-Requested-With: XMLHttpRequest' -H 'Cookie: has_js=1' -H 'Connection: keep-alive' -H 'Referer: http://emollusks.myspecies.info/taxonomy/term/8' --compressed"
req <- make_req(straighten(cURL))
pg <- read_html(httr::content(req[[1]](), as="parsed")[[2]]$data)
html_nodes(pg, "li > a")
## {xml_nodeset (10)}
## [1] <a href="/taxonomy/term/12" class="">Conchifera</a>
## [2] <a href="/taxonomy/term/18" class="">Placophora</a>
## [3] <a href="/taxonomy/term/9" class="">Bivalvia</a>
## [4] <a href="/taxonomy/term/10" class="">Caudofoveata</a>
## [5] <a href="/taxonomy/term/11" class="">Cephalopoda</a>
## [6] <a href="/taxonomy/term/14" class="">Gastropoda</a>
## [7] <a href="/taxonomy/term/16" class="">Monoplacophora</a>
## [8] <a href="/taxonomy/term/19" class="">Polyplacophora</a>
## [9] <a href="/taxonomy/term/20" class="">Scaphopoda</a>
## [10] <a href="/taxonomy/term/21" class="">Solenogastres</a>
Here's a modified version of the httr call that curlconverter generated:
library(httr)

GET(url = "http://emollusks.myspecies.info/tinytax/get/8",
    add_headers(Referer = "http://emollusks.myspecies.info/taxonomy/term/8"),
    set_cookies(has_js = "1"))
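The response can then be parsed the same way as above (assuming the same response shape, with the HTML fragment sitting in the second element's data field):
res <- GET(url = "http://emollusks.myspecies.info/tinytax/get/8",
           add_headers(Referer = "http://emollusks.myspecies.info/taxonomy/term/8"),
           set_cookies(has_js = "1"))
# the JSON response carries the children as an HTML fragment in [[2]]$data
pg <- read_html(content(res, as = "parsed")[[2]]$data)
html_nodes(pg, "li > a")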
It should be possible to (eventually) figure out the URL pattern and get whatever you need (you can peruse the Drupal tinytax module documentation to get a feel for how it works).