Trouble extracting a string from rvest results

Time: 2017-07-04 21:41:17

Tags: r regex rvest

I am scraping species data from this site, which offers neither an API nor a downloadable list:

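    library(rvest)

    moltres <- 1:30
    pidgey <- NULL  # must exist before the first rbind() call
    for (i in moltres) {
      # fetch one taxon page and keep the text of its <section> node
      temphtml <- read_html(paste0("http://checklist.aou.org/taxa/", i)) %>%
        html_node("section") %>%
        html_text()
      pidgey <- rbind(pidgey, temphtml)
    }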

For each item in the list, the result looks like this:

"\n  \n      species: \n      Chen canagica (Emperor Goose, Oie empereur)\n  \n\n\n\nNOTE: This is an invalidated taxon. It is a 'synonym' for 12681, which has superseded it.\n\n\n\n\t\n  Compare AOU treatments of \n    \n        Chen canagica,\n in Avibase\n     (1886 to present).\n  \n\n\tSearch for \n    \n        Chen canagica\n at Cornell Birds of North America.\n  \n\n\n\n\n    Annotation: Monotypic.\n\n\n\n\n\n\n\n\n\t"

I am trying to extract the code 12681 from each "It is a 'synonym' for 12681" note (these codes point to the most recent name for the species).

I tried:

pidgey$sub<-sub(".*synonim (.*?)\\,.*", "\\1", pidgey)

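But it mangled the original list I had collected, and the resulting column did not contain what I wanted. I think it has something to do with the text format. I would really appreciate your help.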

1 answer:

Answer 0: (score: 0)

I'm not sure whether the text changes with locale, but this matches either "synonym" or "synonim" and gets the number you want:

library(rvest)
library(dplyr)
library(purrr)
library(stringi)

moltres <- 1:30

pb <- progress_estimated(length(moltres))
map_df(moltres, ~{

  pb$tick()$print()

  Sys.sleep(sample(1:5, 1)) # be kind, you have time and the resource is free

  pg <- read_html(sprintf("http://checklist.aou.org/taxa/%s", .x))

  data_frame(
    res = .x, 
    txt = html_node(pg, "section") %>% html_text() 
  )

}) -> xdf

xdf$synon <- stri_match_first_regex(xdf$txt, "'synon[yi]m' for ([[:digit:]]+)")[,2]

select(xdf, synon) %>% 
  print(n=30)
## # A tibble: 30 x 1
##    synon
##    <chr>
##  1  <NA>
##  2  <NA>
##  3  <NA>
##  4  <NA>
##  5  <NA>
##  6  <NA>
##  7  <NA>
##  8  <NA>
##  9  <NA>
## 10  <NA>
## 11  <NA>
## 12  <NA>
## 13  <NA>
## 14  <NA>
## 15  <NA>
## 16 12681
## 17 12691
## 18 12701
## 19  <NA>
## 20  <NA>
## 21  <NA>
## 22  <NA>
## 23  <NA>
## 24  <NA>
## 25  <NA>
## 26  <NA>
## 27  <NA>
## 28  <NA>
## 29  <NA>
## 30  <NA>
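
To see the pattern work without hitting the site, here is a minimal offline sketch in base R. It assumes the page text has the shape quoted in the question; the first string is a shortened copy of that sample, and the second ("Some other taxon") is a made-up record with no synonym note. It uses `regexec()`/`regmatches()` as a base-R stand-in for `stri_match_first_regex()`, so it is an alternative, not the code above. The first check also hints at why the `sub()` attempt in the question returned its input untouched: the literal "synonim" never occurs in the text, and `sub()` leaves non-matching elements unchanged.

```r
# Shortened copy of the sample text from the question, plus one
# hypothetical record that carries no synonym note.
txt <- c(
  "\n  species: \n  Chen canagica (Emperor Goose, Oie empereur)\n\nNOTE: This is an invalidated taxon. It is a 'synonym' for 12681, which has superseded it.\n",
  "\n  species: \n  Some other taxon\n\n    Annotation: Monotypic.\n"
)

# The misspelled literal from the question's attempt is not in the text.
has_typo <- grepl("synonim", txt[1], fixed = TRUE)
print(has_typo)

# Pattern from the answer: accept either spelling, capture the digits.
pattern <- "'synon[yi]m' for ([[:digit:]]+)"

# regexec() reports the full match and the capture group;
# regmatches() turns those positions into strings.
hits <- regmatches(txt, regexec(pattern, txt))

# Keep the capture group, or NA where the pattern did not match.
synon <- vapply(
  hits,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
print(synon)
```

Building the result as a data frame or tibble column, as the answer does, also avoids assigning with `$` to the matrix that `rbind()` produces in the question, which is probably what mangled the original object.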