How do I collect all the URLs in this table using rvest?

Asked: 2019-06-23 02:11:58

Tags: html r web-scraping rvest

I am trying to get all the links in the first column of the table at the URL in the code below.

I can only get the first link/row.

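library(rvest)
library(purrr)  # provides map_chr()

url <- "https://di.hkex.com.hk/di/NSSrchPersonList.aspx?sa1=pl&scsd=01/01/2018&sced=31/12/2018&pn=wing&src=MAIN&lang=EN"

# The original used an undefined object `wahis.session`;
# reading the page from `url` is assumed here.
l <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="grdPaging"]') %>%
  map_chr(~ html_attr(html_node(., "a"), "href"))

l <- as.data.frame(l)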
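
A likely reason only one link comes back: the XPath matches a single node (the table itself), and html_node() returns just the first match inside each node it is given. The sketch below shows this on a small made-up table, not the real page; the HTML string and the row1.aspx / row2.aspx hrefs are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the real page: one table with two linked rows
html <- '<table id="grdPaging">
  <tr><td><a href="row1.aspx">r1</a></td></tr>
  <tr><td><a href="row2.aspx">r2</a></td></tr>
</table>'

# Same XPath as in the question: it selects the table element only
tbl <- read_html(html) %>%
  html_nodes(xpath = '//*[@id="grdPaging"]')

# html_node() keeps one match per input node, so one table gives one link
first_only <- tbl %>% html_node("a") %>% html_attr("href")

# html_nodes() keeps every match inside the table
every_link <- tbl %>% html_nodes("a") %>% html_attr("href")

print(length(tbl))
print(first_only)
print(every_link)
```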

2 Answers:

Answer 0 (score: 1)

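rvest supports the nth-of-type pseudo-class CSS selector, so you can use the following to select the a tags inside the first-column td cells of the table with the given id: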

library(rvest)
url <- "https://di.hkex.com.hk/di/NSSrchPersonList.aspx?sa1=pl&scsd=01/01/2018&sced=31/12/2018&pn=wing&src=MAIN&lang=EN"   
links <- url %>%
  read_html() %>%
  html_nodes("#grdPaging td:nth-of-type(1) a") %>%
  html_attr("href")
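
To see what the selector does without hitting the live site, here is a minimal sketch on a made-up two-column table. The HTML string and its hrefs are hypothetical; only the selector is taken from the answer above.

```r
library(rvest)

# Hypothetical stand-in for the real page: both columns contain links
html <- '<table id="grdPaging">
  <tr><td><a href="a.aspx?id=1">Alice</a></td><td><a href="other.aspx?id=1">x</a></td></tr>
  <tr><td><a href="a.aspx?id=2">Bob</a></td><td><a href="other.aspx?id=2">y</a></td></tr>
</table>'

page <- read_html(html)

# Only anchors inside the first td of each row
first_col_links <- page %>%
  html_nodes("#grdPaging td:nth-of-type(1) a") %>%
  html_attr("href")

# For comparison: every anchor in the table, whatever its column
all_links <- page %>%
  html_nodes("#grdPaging a") %>%
  html_attr("href")

print(first_col_links)
print(all_links)
```

If the table has links in only one column, the two selectors return the same set; nth-of-type matters once other columns carry links too.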

Answer 1 (score: 0)

One option is to collect all the anchor nodes in the table and get their href attributes.

library(rvest)
url <- "https://di.hkex.com.hk/di/NSSrchPersonList.aspx?sa1=pl&scsd=01/01/2018&sced=31/12/2018&pn=wing&src=MAIN&lang=EN"   

url %>%
   read_html() %>%
   html_nodes(xpath = '//*[@id="grdPaging"]') %>%
   html_nodes("a") %>%
   html_attr("href")

# [1] "NSNoticePersonList.aspx?sa2=np&scpid1=35225&scpid3=0&scpid2=67774&sa1=pl&scsd=01%2f01%2f2018&sced=31%2f12%2f2018&pn=wing&src=MAIN&lang=EN&" 
# [2] "NSNoticePersonList.aspx?sa2=np&scpid1=30212&scpid3=0&scpid2=4677&sa1=pl&scsd=01%2f01%2f2018&sced=31%2f12%2f2018&pn=wing&src=MAIN&lang=EN&"  
# [3] "NSNoticePersonList.aspx?sa2=np&scpid1=32746&scpid3=0&scpid2=8439&sa1=pl&scsd=01%2f01%2f2018&sced=31%2f12%2f2018&pn=wing&src=MAIN&lang=EN&"  
#.....
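
The hrefs in the output above are relative to the search page. If full URLs are needed, one possible approach (an addition, not part of the original answer) is xml2::url_absolute(), using the search page URL as the base. A minimal sketch, with the first href copied from the output above:

```r
library(xml2)

# Base: the search page URL used throughout this thread
base_url <- "https://di.hkex.com.hk/di/NSSrchPersonList.aspx?sa1=pl&scsd=01/01/2018&sced=31/12/2018&pn=wing&src=MAIN&lang=EN"

# First relative href from the output above, copied verbatim
rel_links <- c(
  "NSNoticePersonList.aspx?sa2=np&scpid1=35225&scpid3=0&scpid2=67774&sa1=pl&scsd=01%2f01%2f2018&sced=31%2f12%2f2018&pn=wing&src=MAIN&lang=EN&"
)

# Resolve each relative href against the base URL
abs_links <- url_absolute(rel_links, base_url)

print(abs_links)
```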