Question

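I am currently trying to collect biodiversity data from a specific website (http://www.faunaeur.org/?no_redirect=1). I have managed to get some results, but not as automated as I would like... The first part, navigating the site, is done: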

Setting up RSelenium:

library(RSelenium)
download.file("https://github.com/mozilla/geckodriver/releases/download/v0.11.1/geckodriver-v0.11.1-win64.zip",destfile="./gecko.zip")
unzip("./gecko.zip",exdir=".",overwrite=T)
checkForServer(update=T)
selfserv = startServer()
mybrowser1 = remoteDriver(browserName="firefox",extraCapabilities = list(marionette = TRUE))
mybrowser1$open()

Then I start navigating (this example is for the Balearic Islands):

mybrowser1$navigate("http://www.faunaeur.org/distribution.php?current_form=species_list")
mybrowser1$findElement(using="xpath","//select[@name='taxon_rank']/option[@value='7']")$clickElement()    # Class
mybrowser1$findElement(using="xpath","//input[@name='taxon_name']")$sendKeysToElement(list('Oligochaeta'))  # Oligochète
mybrowser1$findElement(using="xpath","//select[@name='region']/option[@value='15']")$clickElement()
mybrowser1$findElement(using="xpath","//input[@name='include_doubtful_presence']")$clickElement()
mybrowser1$findElement(using="xpath","//input[@name='submit2']")$clickElement()

From this point, I can download an xls file of the 20 subspecies using:

mybrowser1$findElement(using = "xpath", "//a[@href='JavaScript:document.export_species_list.submit()']")$clickElement()

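But that is not what I want; I don't want to use a "click". Is it possible to download the file from this JavaScript link directly into my R environment, or to use RSelenium to scrape the table of 20 subspecies directly from the page source?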

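I tried both approaches, but hit a dead end... The biggest problem is that the page is a temporary or "results" page, and I can't seem to find any @value, @id, @name or @class in it that corresponds to the table I need.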

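Any hints toward a way to automate this through R? I need it in this form because the script has to be run by people who need to produce the results themselves. Thanks in advance!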

Answer 1

If you only want the table displayed on the website, you can get it with httr by sending the form data as a POST request:

library(rvest)
library(httr)

res <- POST("http://www.faunaeur.org/species_list.php", encode = "form",
            body = list(selected_regions="15", show_what="species list",
                        referring_page="distribution", taxon_rank="7",
                        taxon_name="Oligochaeta", region="15",
                        include_doubtful_presence="yes", submit2="Display Species",
                        show_what="species list", species_or_higher_taxa="species"))
doc <- res %>% read_html()
# html_table() returns every table on the page; the species list is the 9th one
dat <- doc %>% html_table(fill=TRUE) %>% .[[9]]
# the first row holds the column names: promote it, then drop it
colnames(dat) <- as.character(unlist(dat[1, ]))
dat <- dat[-1, ]

This gives you:

            Family                      Species / subspecies
2  Acanthodrilidae       Microscolex dubius (Fletscher 1887)
3    Enchytraeidae      Enchytraeus buchholzi Vejdovsky 1878
4    Enchytraeidae     Fridericia berninii Dozsa-Farkas 1988
5    Enchytraeidae            Fridericia caprensis Bell 1947
...
21        Naididae           Aulophorus furcatus (Oken 1815)
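
Since the results table has no id, name or class, picking it by position (the 9th table) can break if the page layout changes. As a rough sketch of an alternative, the table could be selected by its content instead. The helpers pick_table() and promote_header() below are hypothetical names, not part of rvest or httr, and the toy list only stands in for what html_table() would return:

```r
# Hypothetical helpers (not from rvest/httr); base R only.

# Return the first table whose first row contains the given header text.
pick_table <- function(tables, header = "Family") {
  for (tbl in tables) {
    if (nrow(tbl) > 0 && header %in% unlist(tbl[1, ], use.names = FALSE)) {
      return(tbl)
    }
  }
  stop("no table has '", header, "' in its first row")
}

# Use the first row as column names, then drop that row.
promote_header <- function(tbl) {
  names(tbl) <- as.character(unlist(tbl[1, ], use.names = FALSE))
  tbl[-1, , drop = FALSE]
}

# Toy stand-in for the list returned by html_table():
# one layout table followed by the data table.
tables <- list(
  data.frame(X1 = "menu", stringsAsFactors = FALSE),
  data.frame(X1 = c("Family", "Naididae"),
             X2 = c("Species / subspecies", "Aulophorus furcatus (Oken 1815)"),
             stringsAsFactors = FALSE)
)

dat <- promote_header(pick_table(tables))
print(dat)
```

With the real page, the same call would presumably look like promote_header(pick_table(html_table(doc, fill = TRUE))), assuming the species table is the first one whose first row contains "Family".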


RSelenium - how to scrape data from a web page that has no id or name of any kind
