Looping through lettered pages (rvest)

Date: 2018-11-25 14:37:44

Tags: css r web-scraping rvest

After spending a lot of time on this issue and carefully looking over the available answers, I wanted to go ahead and pose a new question about my web-scraping problem with R and rvest. I have tried to lay the problem out fully to minimize follow-up questions.

The problem: I am trying to extract author names from a conference webpage. The authors are split up alphabetically by last name, so I need a for loop that calls follow_link() 25 times to go to each page and extract the relevant author text.

Conference website: https://gsa.confex.com/gsa/2016AM/webprogram/authora.html

I have attempted two solutions in R using rvest, both of which have issues.

Solution 1 (calling the links by letter text)

library(rvest)

lttrs <- LETTERS[seq(from = 1, to = 26)] # create character vector of capital letters
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list() # create list to store each page's author information

for(i in 1:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This code works, up to a point. Below is the output. It successfully navigates the lettered pages until the H-to-I and L-to-M transitions, at which point it grabs the wrong pages.

Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home
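
For reference, a minimal sketch of an alternative that avoids follow_link()'s link-text matching entirely by building each letter-page URL directly; the authora.html through authorz.html pattern is assumed from the log above rather than verified:

library(rvest)

# Build each letter-page URL directly instead of matching link text
# (assumes the authora.html ... authorz.html pattern seen in the log above)
letter_urls <- sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/author%s.html",
                       tolower(LETTERS))

author_list <- lapply(letter_urls, function(u) {
  Sys.sleep(2) # small delay between requests to be polite
  read_html(u) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
})
names(author_list) <- LETTERS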

Solution 2 (calling the links via CSS): Using the CSS selectors on the page, each lettered page is identified as "a:nth-child(1-26)". So I rebuilt the loop calling that CSS identifier.

tempList <- list()
for(i in 2:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(css = paste('a:nth-child(',i,')',sep = '')) %>%
    html_nodes(xpath ='//*[@class = "author"]') %>% 
    html_text()
}

This kind of works. Again, it runs into problems at certain transitions (see below).

Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html

Specifically, this approach misses B, C, and D; at those steps the loop goes to the wrong pages. I would greatly appreciate any insight into how the above code could be reconfigured to correctly loop through all 26 lettered pages.
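
For reference, a quick diagnostic sketch to see what each a:nth-child(i) selector actually matches before following it; the range 1:30 is arbitrary and website is the session object created above:

# Print the link text and href that each nth-child selector resolves to;
# this shows where the selector hits a non-letter link
for (i in 1:30) {
  node <- website %>% html_nodes(css = paste0("a:nth-child(", i, ")"))
  cat(i, html_text(node), html_attr(node, "href"), "\n")
}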

Thank you very much!

1 Answer:

Answer 0 (score: 1)

Welcome to SO (and kudos on a well-laid-out first question).

It looks like you got super lucky: the site's robots.txt has quite a few entries, but it does not try to restrict what you are doing.

We can use html_nodes(pg, "a[href^='author']") to pull the href of every lettered pagination link at the bottom of the page. The following then grabs all the paper links for all the authors:

library(rvest)
library(tidyverse)

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>% 
  html_attr("href") %>% 
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>% 
  { pb <<- progress_estimated(length(.)) ; . } %>%  # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>% 
      html_nodes("div.item > div.author") %>% 
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          paper = html_nodes(.x, xpath="../div[@class='papers']/a") %>% 
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath="../div[@class='papers']/a") %>% 
            html_attr("href") %>% 
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

author_papers
## # A tibble: 34,983 x 3
##    author               paper  paper_url                                                    
##    <chr>                <chr>  <chr>                                                        
##  1 Aadahl, Kristopher   296-5  https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
##  2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
##  3 Abbey, Alyssa        54-4   https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
##  4 Abbott, Dallas H.    341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
##  5 Abbott Jr., David M. 38-6   https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
##  6 Abbott, Grant        58-7   https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
##  7 Abbott, Jared        29-10  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
##  8 Abbott, Jared        317-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
##  9 Abbott, Kathryn A.   187-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D.       208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows

I do not know what you need from the individual paper pages, so that part is left to you.
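
If you do end up visiting the individual paper pages, here is a minimal sketch of fetching one of them; the <title> element is just a stand-in, and any other selector on those pages is an assumption you would need to verify yourself:

# Fetch a single paper page and pull its <title>; swap in whatever
# selectors you actually need after inspecting the page structure
paper_pg <- read_html(author_papers$paper_url[1])

html_node(paper_pg, "title") %>%
  html_text(trim = TRUE)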

You also do not have to wait ~3 minutes, since the author_papers data frame is in this RDS file: https://rud.is/dl/author-papers.rds

readRDS(url("https://rud.is/dl/author-papers.rds"))

If you do plan on scraping all 34,983 papers, please keep "don't be rude" in mind and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).
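
If you want to double-check what the site permits before a large crawl, the robotstxt package is handy; a minimal sketch (assuming the robotstxt package is installed):

library(robotstxt)

# Check whether an author page and a paper page are allowed for generic crawlers
paths_allowed(
  paths  = c("/gsa/2016AM/webprogram/authora.html",
             "/gsa/2016AM/webprogram/Paper283542.html"),
  domain = "gsa.confex.com"
)

# Inspect the raw robots.txt rules (including any Crawl-delay entries)
get_robotstxt("gsa.confex.com")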

UPDATE

html_nodes(pg, "a[href^='author']") %>% 
  html_attr("href") %>% 
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>% 
  { pb <<- progress_estimated(length(.)) ; . } %>%  # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>% 
      html_nodes("div.item > div.author") %>% 
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath="../div[@class='papers']") %>% 
            html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
            paste0(collapse=" ") %>% # just in case there are multiple nodes we flatten them into one
            grepl("*", ., fixed=TRUE) # make it TRUE if we find the "*" 
        )
      })
  }) -> author_with_presenter_status

author_with_presenter_status
## # A tibble: 22,545 x 2
##    author               is_presenting
##    <chr>                <lgl>        
##  1 Aadahl, Kristopher   FALSE        
##  2 Aanderud, Zachary T. FALSE        
##  3 Abbey, Alyssa        TRUE         
##  4 Abbott, Dallas H.    FALSE        
##  5 Abbott Jr., David M. TRUE         
##  6 Abbott, Grant        FALSE        
##  7 Abbott, Jared        FALSE        
##  8 Abbott, Kathryn A.   FALSE        
##  9 Abbott, Lon D.       FALSE        
## 10 Abbott, Mark B.      FALSE        
## # ... with 22,535 more rows

You can also retrieve this with:

readRDS(url("https://rud.is/dl/author-presenter.rds"))