Putting URLs into a data frame when scraping a table in R

Date: 2018-02-05 16:59:12

Tags: r web-scraping

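I am trying to scrape a table, but I can only get it to return the displayed text of the hyperlinks. I want the URLs instead of the values shown in the table. I have worked out how to do this for a single hyperlink, but that means going through and finding the XPath of every link. Is there a faster way?

Here is the code I have been using: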

library(rvest)
url <- read_html("https://coinmarketcap.com/coins/views/all/")
# the pipe must end the line; a line that starts with %>% is a syntax error
cryptocurrencies <- url %>% 
    html_nodes(xpath = '//*[@id="currencies-all"]') %>% 
    html_table(fill = TRUE)
cryptocurrencies <- cryptocurrencies[[1]]

I suspect there is an argument to html_nodes() that would let me extract the 'href', but I cannot work out how to do it. Thanks.

1 Answer:

Answer 0 (score: 1)

First, you need to use html_attr() to get an attribute of each node. In your case, the attribute is href.

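library(rvest)
page <- read_html("https://coinmarketcap.com/coins/views/all/")
# html_attr() returns the value of the named attribute for each matched node
relative_paths <- page %>% 
    html_nodes(".currency-name-container") %>% 
    html_attr("href")  # note: these are relative paths
relative_paths[1:3]
[1] "/currencies/bitcoin/"  "/currencies/ethereum/" "/currencies/ripple/"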

Once you have the relative paths, you can use jump_to() or follow_link() to visit each page and scrape it.

# display the first three results
# (in rvest >= 1.0, html_session() and jump_to() are named session() and session_jump_to())
start_session <- html_session("https://coinmarketcap.com/coins/views/all/")
for (path in relative_paths[1:3]) {
    current_session <- start_session %>% 
        jump_to(path)
    # do something here
    print(current_session$url)
}
[1] "https://coinmarketcap.com/currencies/bitcoin/"
[1] "https://coinmarketcap.com/currencies/ethereum/"
[1] "https://coinmarketcap.com/currencies/ripple/"

Alternatively, you can build the absolute paths:

# or get the absolute paths
absolute_path <- paste0("https://coinmarketcap.com", relative_paths)
absolute_path[1:3]
[1] "https://coinmarketcap.com/currencies/bitcoin/"  "https://coinmarketcap.com/currencies/ethereum/" "https://coinmarketcap.com/currencies/ripple/"
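
As a possible alternative to pasting the host name by hand, xml2 (which rvest is built on) provides url_absolute(), which resolves relative URLs against a base URL. The sketch below uses a small hard-coded vector of example paths in place of the scraped result, and assumes the base is the URL of the page the links were taken from.

```r
library(xml2)

# Example relative paths, standing in for the output of html_attr("href")
relative_paths <- c("/currencies/bitcoin/", "/currencies/ethereum/")

# Resolve each path against the URL of the page the links came from
absolute_path <- url_absolute(
  relative_paths,
  "https://coinmarketcap.com/coins/views/all/"
)
print(absolute_path)
```

Unlike paste0(), this should also cope with hrefs that are already absolute or that are relative to the current directory, so it may be the safer choice when a page mixes link styles.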

Finally, you can combine the URLs into the data frame.
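
A minimal sketch of that last step follows. It is not the answerer's code: to keep it runnable offline, it parses a tiny hand-written table (a hypothetical stand-in for the real page, reusing the `currencies-all` id and `currency-name-container` class from the snippets above) and assumes each data row contains exactly one such link, so the URLs line up with the rows.

```r
library(rvest)

# Hypothetical miniature of the page, so the sketch runs without network access
html <- '<table id="currencies-all">
  <tr><th>Name</th><th>Symbol</th></tr>
  <tr><td><a class="currency-name-container" href="/currencies/bitcoin/">Bitcoin</a></td><td>BTC</td></tr>
  <tr><td><a class="currency-name-container" href="/currencies/ethereum/">Ethereum</a></td><td>ETH</td></tr>
</table>'
page <- read_html(html)

# Parse the table as in the question
table_node <- html_nodes(page, xpath = '//*[@id="currencies-all"]')
cryptocurrencies <- html_table(table_node)[[1]]

# Collect the href of every link inside that same table
relative_paths <- table_node %>%
  html_nodes(".currency-name-container") %>%
  html_attr("href")

# Assumption: one link per data row; stop early if the counts disagree
stopifnot(length(relative_paths) == nrow(cryptocurrencies))

# Add the absolute URLs as a new column
cryptocurrencies$url <- paste0("https://coinmarketcap.com", relative_paths)
print(cryptocurrencies)
```

On the live page, `page` would come from read_html() on the site URL instead. The row-count check matters because a table with rows that lack a link, or have several, would otherwise end up with misaligned URLs.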