Using rvest or xml to parse the link to the next page?

Date: 2016-11-06 15:22:08

Tags: r web-scraping rvest

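I'm learning how to scrape data from websites. I've been playing with the rvest package and have a handle on extracting nodes with SelectorGadget and similar tools. For a quick project, I want to pull data from a flight deals website and turn it into a data frame, so that I can later email myself the useful flights. The code I have working so far is below.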

library(rvest)
reg = paste("http://www.secretflying.com/usa-deals/") 

#read the text from the flight deal-----------
fly_deals = read_html(reg)
fly_deals = html_nodes(fly_deals, ".entry-title a")
fly_deals = html_text(fly_deals)
fly_deals = as.data.frame(fly_deals)

#add link (not sure how to access the link)
fly_deals$correpsonding_link = 'corresponding_link'

#last step would filter out for NYC
fly_deals = fly_deals[grepl("NEW YORK", fly_deals$fly_deals),]

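What I'd like to do now is get the page link associated with each row (i.e., each node), so that I can build another column containing the corresponding link, which I can then open directly from my email. The final data frame would have the deal text in one column and its link in the next.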

Any help is appreciated!

1 Answer:

Answer 0: (score: 1)

Try:

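library(rvest)

deals_link <- "http://www.secretflying.com/usa-deals/"

# Select the <a> node inside each deal title
deals_info <- deals_link %>% read_html() %>%
  html_nodes(".entry-title a")

# html_text() gives the title, html_attr(, "href") gives the link of the same node
fly_deals <- data.frame(deals = html_text(deals_info),
                        correpsonding_link = html_attr(deals_info, "href"),
                        stringsAsFactors = FALSE)

# Keep only the deals that mention New York
fly_deals[grepl("NEW YORK", fly_deals$deals),]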

Output:

 deals                                                                  
 NON-STOP FROM NEW YORK TO CARTAGENA, COLOMBIA FOR ONLY $328 ROUNDTRIP  
 XMAS & NEW YEAR: NEW YORK TO THE TURKS & CAICOS FOR ONLY $231 ROUNDTRIP
 NEW YORK TO BOSTON (& VICE VERSA) FOR ONLY $66 ROUNDTRIP               
 correpsonding_link                                                         
 http://www.secretflying.com/2016/new-york-cartagena-colombia-296-roundtrip/
 http://www.secretflying.com/2016/hot-new-york-turks-caicos-58-one-way/     
 http://www.secretflying.com/2016/new-york-boston-vice-versa-66-roundtrip/ 
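
The same text-plus-href pattern can be tried offline before pointing it at the live site. The sketch below is only an illustration: the HTML snippet, the example.com URLs, and the names `sample_html`, `deal_nodes`, and `nyc_deals` are made up here, and it assumes each deal is a link inside an element with class `entry-title`, as the selector above implies.

```r
library(rvest)

# Hypothetical stand-in for the deals page; not the site's real markup.
sample_html <- '<html><body>
  <h2 class="entry-title"><a href="http://example.com/nyc-boston/">NEW YORK TO BOSTON FOR ONLY $66 ROUNDTRIP</a></h2>
  <h2 class="entry-title"><a href="http://example.com/la-tokyo/">LOS ANGELES TO TOKYO FOR ONLY $450 ROUNDTRIP</a></h2>
</body></html>'

# Parse the literal HTML string and select the same nodes as above
deal_nodes <- html_nodes(read_html(sample_html), ".entry-title a")

# One row per node: the visible text and the href attribute
fly_deals <- data.frame(
  deals = html_text(deal_nodes),
  correpsonding_link = html_attr(deal_nodes, "href"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the literal string rather than a regular expression
nyc_deals <- fly_deals[grepl("NEW YORK", fly_deals$deals, fixed = TRUE), ]
print(nyc_deals)
```

Because `html_text()` and `html_attr()` each return one value per matched node, in the same order, the two columns stay aligned row by row. A node without an `href` attribute would get `NA` in the link column.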

I hope this helps.