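How can I parse this XML to get the desired result? Every configuration I have tried combines all five a href=
links into a single vector, but I need to split them into two results, one per <div class="entry-content">.
Thanks!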
# xml snippet from
# http://www.electionstudies.org/studypages/download/datacenter_all_NoData.html
my_xml <-
'<li class="clearfix">
<article class="entry-item">
<div class="entry-content">
<h4 class="entry-title"><img src="../../images/icons/timeseries.png"><a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm">ANES Time Series Cumulative Data File</a> (1948-2012)</h4>
<p class="indented_text">Data documentation: <a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm"> Study Page</a> <img src="../../images/icons/circle.png" /> <a href="../anes_timeseries_cdf/anes_timeseries_cdf_errata.htm">Errata</a></p>
</div><!--entry-content-->
</article><!--entry-item-->
</li>
<li class="clearfix">
<article class="entry-item">
<div class="entry-content">
<h4 class="entry-title"><img src="../../images/icons/pilot.png"><a href="../anes_pilot_2016/anes_pilot_2016.htm">ANES 2016 Pilot Study</a></h4>
<p class="indented_text">Data documentation: <a href="../anes_pilot_2016/anes_pilot_2016.htm">Study Page</a></p>
</div><!--entry-content-->
</article><!--entry-item-->
</li>'
# desired result
list(
c( "../anes_timeseries_cdf/anes_timeseries_cdf.htm" , "../anes_timeseries_cdf/anes_timeseries_cdf.htm" , "../anes_timeseries_cdf/anes_timeseries_cdf_errata.htm" ) ,
c( "../anes_pilot_2016/anes_pilot_2016.htm" , "../anes_pilot_2016/anes_pilot_2016.htm" )
)
Answer 0 (score: 2)
library(rvest)
library(purrr)
pg <- read_html("http://www.electionstudies.org/studypages/download/datacenter_all_NoData.html")
html_nodes(pg, "article") %>%
  map(~html_nodes(., "a") %>%
        html_attr("href"))
You will need to ignore the first list result. If you would like a solution that uses a CSS selector or XPath to skip that result, let me know.
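For reference, here is a minimal sketch of the CSS-selector variant, run against the question's snippet rather than the live page. It assumes every group of links sits inside its own `div.entry-content`, as in that snippet. Whether the same selector also skips the unwanted first result on the full page has not been checked here. The snippet is trimmed to the links only, and base R's `lapply` stands in for `purrr::map`.

```r
library(rvest)

# Trimmed copy of the question's snippet: same five links, images and comments dropped
my_xml <-
  '<li class="clearfix"><article class="entry-item"><div class="entry-content">
  <h4 class="entry-title"><a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm">ANES Time Series Cumulative Data File</a> (1948-2012)</h4>
  <p class="indented_text">Data documentation: <a href="../anes_timeseries_cdf/anes_timeseries_cdf.htm">Study Page</a> <a href="../anes_timeseries_cdf/anes_timeseries_cdf_errata.htm">Errata</a></p>
  </div></article></li>
  <li class="clearfix"><article class="entry-item"><div class="entry-content">
  <h4 class="entry-title"><a href="../anes_pilot_2016/anes_pilot_2016.htm">ANES 2016 Pilot Study</a></h4>
  <p class="indented_text">Data documentation: <a href="../anes_pilot_2016/anes_pilot_2016.htm">Study Page</a></p>
  </div></article></li>'

# Select one node per entry-content block
entries <- html_nodes(read_html(my_xml), "div.entry-content")

# Collect the href of every <a> inside each block, one character vector per block
links <- lapply(entries, function(entry) html_attr(html_nodes(entry, "a"), "href"))

print(links)
```

The equivalent XPath selection would be `html_nodes(pg, xpath = "//div[@class='entry-content']")`. It matches only when the class attribute is exactly `entry-content`, which holds for this snippet.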