R: extracting nested node content with the rvest library

Time: 2018-11-02 02:37:43

Tags: r web-scraping rvest

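Here is the link to the journal page:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9
I am trying to get the following: the author affiliations (for all authors), the corresponding author, and the corresponding author's email. Note: assume the corresponding author is the last author listed in the "Authors" section at the top of the article. I have used SelectorGadget to identify selectors for other elements (such as the abstract and the publication date), but I can't figure out how to get these three. Here is my code for getting the authors as a character vector: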

library(rvest)
library(stringr)
# url is the url for the list of articles on a particular page
s <- html_session(url)
page <- s %>% follow_link(art) %>% read_html()
str_replace_all(str_squish(page %>% html_nodes(".AuthorName") %>% html_text()), "[0-9]|Email author", "")

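This returns a vector of all the article's authors, which in this case has length 8. But now I need to follow the links on their names to get their affiliations and their email. I'm sure all the code I need is right in front of me, but since I'm new to R and web scraping, I'm a bit lost (I have had to learn quickly for my current project).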

Update

The answer below is perfect.

1 Answer:

Answer 0: (score: 1)

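I'm not sure the email address always belongs to the author in the last position, because when I opened view-source in Chrome I found that the email address sits in a somewhat separate list further down the page.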
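
One way to avoid relying on position is to check which author entry actually contains the email link. The following is only a minimal sketch on a made-up HTML fragment: the `.Author` wrapper class, the placeholder names, and the example.org address are assumptions for illustration, and the real page may keep the link in a separate list, in which case this nesting-based lookup would not apply.

```r
library(rvest)

# Hypothetical markup; the real page structure may differ.
html <- '
<ul>
  <li class="Author"><span class="AuthorName">Author A</span></li>
  <li class="Author"><span class="AuthorName">Author B</span>
    <a class="EmailAuthor" href="mailto:b@example.org">Email author</a></li>
</ul>'
page <- read_html(html)

authors <- html_nodes(page, ".Author")

# html_node() (singular) returns exactly one result per author entry,
# so authors without an email link get NA instead of being dropped.
result <- data.frame(
  name  = authors %>% html_node(".AuthorName") %>% html_text(),
  email = authors %>% html_node(".EmailAuthor") %>% html_attr("href") %>%
    sub("^mailto:", "", .),
  stringsAsFactors = FALSE
)

# Keep only the entries that carry an email link.
corresponding <- result[!is.na(result$email), ]
print(corresponding)
```

Using `html_node()` rather than `html_nodes()` inside each entry keeps the name and email columns aligned row by row.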

library(rvest)
#> Loading required package: xml2
library(data.table)
library(tidyverse)
xml <- read_html('https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9')


# get the email address
xml %>%
    html_nodes('.EmailAuthor') %>%
    html_attr('href')
#> [1] "mailto:liuj@cs.uky.edu"

# get the author names
xml %>%
    html_nodes('.AuthorName') %>%
    html_text()
#> [1] "Ye<U+00A0>Yu"  "Jinpeng<U+00A0>Liu" "Xinan<U+00A0>Liu" "Yi<U+00A0>Zhang"
#> [5] "Eamonn<U+00A0>Magner" "Erik<U+00A0>Lehnert" "Chen<U+00A0>Qian" "Jinze<U+00A0>Liu"

data.table(
    name = xml %>% 
        html_nodes('meta') %>% 
        html_attr('name')
    ,content = xml %>% 
        html_nodes('meta') %>% 
        html_attr('content')
) %>% 
    # extract both the meta name and its content so the two stay matched,
    # then keep only the affiliation rows
    filter(name %in% c('citation_author_institution')) %>% 
    select(content)
#>                                                                                    content
#> 1                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 2                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 3                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 4                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 5                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 6                                               Seven Bridges Genomics Inc, Cambridge, USA
#> 7 Department of Computer Engineering, University of California Santa Cruz, Santa Cruz, USA
#> 8                   Department of Computer Science, University of Kentucky, Lexington, USA
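
The affiliation rows above are returned without the author names next to them. If the page emits each `citation_author` meta tag immediately followed by that author's `citation_author_institution` tags (an assumption about the tag order, not something verified against this page), the two can be paired by walking the tags in document order. A minimal sketch on made-up meta tags with placeholder names:

```r
library(rvest)

# Hypothetical meta tags; names and institutions are placeholders.
html <- '
<head>
  <meta name="citation_author" content="Author A">
  <meta name="citation_author_institution" content="Institute X">
  <meta name="citation_author" content="Author B">
  <meta name="citation_author_institution" content="Institute Y">
  <meta name="citation_author_institution" content="Institute Z">
</head>'
page <- read_html(html)

meta <- html_nodes(page, "meta")
tags <- data.frame(
  name    = html_attr(meta, "name"),
  content = html_attr(meta, "content"),
  stringsAsFactors = FALSE
)
tags <- tags[tags$name %in% c("citation_author", "citation_author_institution"), ]

# Each citation_author tag starts a new group; the institution tags
# that follow it (until the next author) are assigned to that author.
tags$author_id <- cumsum(tags$name == "citation_author")
authors <- tags$content[tags$name == "citation_author"]
inst <- tags[tags$name == "citation_author_institution", ]
inst <- inst[inst$author_id > 0, ]  # drop institutions that precede any author

affiliations <- data.frame(
  author      = authors[inst$author_id],
  institution = inst$content,
  stringsAsFactors = FALSE
)
print(affiliations)
```

An author with several affiliations simply appears on several rows, so no information is lost if the counts differ.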

Created on 2018-11-02 by the reprex package (v0.2.1)