A simple application of the rvest package: I am trying to scrape one class of HTML links from a website.
This code reads the page from the website:
library(rvest)
library(magrittr)
foo <- "http://www.realclearpolitics.com/epolls/2010/house/2010_elections_house_map.html" %>%
read_html
I also found the correct nodes using a CSS selector:
foo %>%
html_nodes("#states td") %>%
extract(2:4)
returns
{xml_nodeset (3)}
[1] <td>\n <a class="dem" href="/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html">\n <span>AR4</span>\n </a>\n</td>
[2] <td>\n <a class="dem" href="/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html">\n <span>CT1</span>\n </a>\n</td>
[3] <td>\n <a class="dem" href="/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html">\n <span>CT2</span>\n </a>\n</td>
OK, the href attribute is what I am looking for. But this
foo %>%
html_nodes("#states td") %>%
extract(2:4) %>%
html_attr("href")
returns
[1] NA NA NA
How can I access the underlying links?
Answer 0 (score: 1)
Using xml_children(), you can do:
foo %>%
html_nodes('#states td') %>%
xml_children %>%
html_attr('href') %>%
extract(2:4)
which returns:
[1] "/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html"
[2] "/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html"
[3] "/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html"
You can also put extract ahead of html_attr, and other orderings will probably work as well.
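For what it's worth, here is a minimal sketch of why the original attempt gave NA and of one alternative ordering. It assumes a small inline HTML snippet as a stand-in for the real page (the table contents and the /polls/... paths are made up for illustration), so it runs without network access. It also assumes that selecting the a elements directly with the selector "#states td a" is acceptable in place of xml_children; that is a swapped-in variant, not the only way to do it.

```r
library(rvest)
library(magrittr)

# Hypothetical stand-in for the real page: a #states table whose <td>
# cells each wrap a single <a> element, as in the output shown above.
page <- read_html('
<table id="states"><tr>
  <td><a class="dem" href="/polls/aa1.html"><span>AA1</span></a></td>
  <td><a class="dem" href="/polls/bb2.html"><span>BB2</span></a></td>
  <td><a class="gop" href="/polls/cc3.html"><span>CC3</span></a></td>
</tr></table>')

# The <td> nodes themselves carry no href attribute, so html_attr()
# has nothing to return for them.
td_href <- page %>%
  html_nodes("#states td") %>%
  html_attr("href")

# Selecting the <a> descendants directly reads the attribute from the
# element that actually has it.
a_href <- page %>%
  html_nodes("#states td a") %>%
  html_attr("href")

# Subsetting the nodes first and reading the attribute afterwards.
first_two <- page %>%
  html_nodes("#states td a") %>%
  extract(1:2) %>%
  html_attr("href")
```

Here td_href is all NA because html_attr only looks at the attributes of the nodes it is given, while a_href and first_two hold the link paths. Note that indices passed to extract refer to whichever node set is current at that point in the pipe, so they may need adjusting if a cell has no link or more than one.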