How to parse XML with a varying number of child nodes using rvest

Time: 2016-12-29 13:13:08

Tags: r xml rvest

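How can I parse this XML to get the desired result? Every configuration I have tried combines all five a href= links into a single vector, but I need the two results kept separate, split by <div class="entry-content">. Thanks!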

# xml snippet from
# http://www.electionstudies.org/studypages/download/datacenter_all_NoData.html
my_xml <- 
    '<li class="clearfix">
    <article class="entry-item">
    <div class="entry-content">
    <h4 class="entry-title"><img src="../../images/icons/timeseries.png"><a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm">ANES Time Series Cumulative Data File</a> (1948-2012)</h4>
    <p class="indented_text">Data documentation: &nbsp; <a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm"> Study Page</a> &nbsp; <img src="../../images/icons/circle.png" /> &nbsp; <a href="../anes_timeseries_cdf/anes_timeseries_cdf_errata.htm">Errata</a></p>
    </div><!--entry-content-->
    </article><!--entry-item-->
    </li>
    <li class="clearfix">
    <article class="entry-item">
    <div class="entry-content">
    <h4 class="entry-title"><img src="../../images/icons/pilot.png"><a href="../anes_pilot_2016/anes_pilot_2016.htm">ANES 2016 Pilot Study</a></h4>
    <p class="indented_text">Data documentation: &nbsp; <a href="../anes_pilot_2016/anes_pilot_2016.htm">Study Page</a></p>
    </div><!--entry-content-->
    </article><!--entry-item-->
    </li>'

# desired result
list( 
    c( "../anes_timeseries_cdf/anes_timeseries_cdf.htm" , "../anes_timeseries_cdf/anes_timeseries_cdf.htm" , "../anes_timeseries_cdf/anes_timeseries_cdf_errata.htm" ) ,
    c( "../anes_pilot_2016/anes_pilot_2016.htm" , "../anes_pilot_2016/anes_pilot_2016.htm" )
)

1 Answer:

Answer 0 (score: 2)

library(rvest)
library(purrr)

pg <- read_html("http://www.electionstudies.org/studypages/download/datacenter_all_NoData.html")

html_nodes(pg, "article") %>% 
  map(~html_nodes(., "a") %>% 
        html_attr("href"))

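You will need to ignore the first element of the resulting list. Let me know if you would like a solution that uses a CSS selector or XPath to leave that result out.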
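As a rough sketch of the selector-based variant mentioned above, the grouping can be done on div.entry-content instead of article, which is the split the question asks for. This runs against a shortened copy of the question's snippet rather than the live page, so whether it removes the need to drop the first result there is an assumption that has not been checked. The names doc, hrefs and hrefs_xpath are only illustrative.

```r
library(rvest)
library(purrr)

# Shortened copy of the snippet from the question (images and &nbsp; removed)
my_xml <- '
<li class="clearfix"><article class="entry-item"><div class="entry-content">
<h4 class="entry-title"><a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm">ANES Time Series Cumulative Data File</a> (1948-2012)</h4>
<p class="indented_text"><a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm">Study Page</a> <a href="../anes_timeseries_cdf/anes_timeseries_cdf_errata.htm">Errata</a></p>
</div></article></li>
<li class="clearfix"><article class="entry-item"><div class="entry-content">
<h4 class="entry-title"><a href="../anes_pilot_2016/anes_pilot_2016.htm">ANES 2016 Pilot Study</a></h4>
<p class="indented_text"><a href="../anes_pilot_2016/anes_pilot_2016.htm">Study Page</a></p>
</div></article></li>'

doc <- read_html(my_xml)

# One list element per div.entry-content, each holding that block's hrefs
hrefs <- html_nodes(doc, "div.entry-content") %>%
  map(~ html_nodes(.x, "a") %>% html_attr("href"))

# XPath version of the same selection (assumes the class attribute is exactly "entry-content")
hrefs_xpath <- html_nodes(doc, xpath = "//div[@class='entry-content']") %>%
  map(~ html_nodes(.x, xpath = ".//a") %>% html_attr("href"))

# Number of links found in each block
lengths(hrefs)
```

Because the inner html_nodes() call runs once per matched div, each block keeps its own vector no matter how many links it contains.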