Web scraping in R: accessing HTML nodes

Asked: 2015-11-13 17:58:48

Tags: r web-scraping rvest

A simple application of the rvest package: I am trying to scrape a class of HTML links from a website.

This code reads in the page from the website:

library(rvest)
library(magrittr)

foo <- "http://www.realclearpolitics.com/epolls/2010/house/2010_elections_house_map.html" %>% 
            read_html

I then found the right nodes using a CSS selector:

foo %>% 
  html_nodes("#states td") %>% 
  extract(2:4)

which returns

{xml_nodeset (3)}
[1] <td>\n  <a class="dem" href="/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html">\n    <span>AR4</span>\n  </a>\n</td>
[2] <td>\n  <a class="dem" href="/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html">\n    <span>CT1</span>\n  </a>\n</td>
[3] <td>\n  <a class="dem" href="/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html">\n    <span>CT2</span>\n  </a>\n</td>

OK, the href attribute is what I am looking for. But this

foo %>% 
  html_nodes("#states td") %>% 
  extract(2:4) %>% 
  html_attr("href")

returns

[1] NA NA NA

How do I access the underlying links?

1 Answer:

Answer 0 (score: 1)

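The href attribute sits on the <a> element inside each <td>, not on the <td> itself. With xml_children() you can step down to it: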

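# xml_children() is defined in xml2; load it explicitly in case rvest does not attach it
library(xml2)

foo %>% 
  html_nodes('#states td') %>% 
  xml_children %>%
  html_attr('href') %>%
  extract(2:4)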

which returns:

[1] "/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html"            
[2] "/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html"     
[3] "/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html"
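
To illustrate why the attribute lookup on the <td> nodes gives NA, here is a minimal, self-contained sketch. The inline HTML is a hypothetical, simplified stand-in for the real page (the hrefs are made up), and the names page, cells, on_cells, via_children and via_selector are only for illustration. It also shows one possible alternative, which is to let the CSS selector target the <a> elements directly:

```r
library(rvest)    # read_html(), html_nodes(), html_attr()
library(xml2)     # xml_children()
library(magrittr) # %>%

# Hypothetical, simplified markup; NOT the real page
page <- read_html('
<table id="states">
  <tr>
    <td><a class="dem" href="/epolls/a.html"><span>AR4</span></a></td>
    <td><a class="dem" href="/epolls/b.html"><span>CT1</span></a></td>
  </tr>
</table>')

cells <- page %>% html_nodes("#states td")

# <td> itself has no href attribute, so html_attr() gives NA for each cell
on_cells <- cells %>% html_attr("href")

# Stepping down to the child <a> elements exposes the attribute
via_children <- cells %>% xml_children() %>% html_attr("href")

# Alternative: select the <a> elements directly with the CSS selector
via_selector <- page %>% html_nodes("#states td a") %>% html_attr("href")
```

In this toy markup both approaches give the same links. On the real page the selector variant may be the more robust choice if some cells contain children other than <a>, but that depends on the page structure.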

You could also put extract before html_attr, and some other orderings would probably work as well.
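
As a rough check of the ordering remark, the sketch below subsets once after and once before html_attr() and compares the results. The inline HTML and the names links, subset_last and subset_first are hypothetical and only serve the illustration:

```r
library(rvest)    # read_html(), html_nodes(), html_attr()
library(xml2)     # xml_children()
library(magrittr) # %>% and extract()

# Hypothetical markup with one <a> per cell
page <- read_html('<table id="states"><tr>
  <td><a href="/one.html">1</a></td>
  <td><a href="/two.html">2</a></td>
  <td><a href="/three.html">3</a></td>
</tr></table>')

links <- page %>% html_nodes("#states td") %>% xml_children()

# Subset the character vector of hrefs ...
subset_last <- links %>% html_attr("href") %>% extract(2:3)

# ... or subset the node set first and read the attribute afterwards
subset_first <- links %>% extract(2:3) %>% html_attr("href")

identical(subset_last, subset_first)
```

Subsetting before xml_children() is a different matter: it picks cells rather than links, so it only matches when every cell has exactly one child.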