I have the following XML:
parsed <-
<div class="Matches">
<div class="Match">
<div class="MatchType">Singles Match</div>
<div class="MatchResults">
<a href="?id=2&nr=11408&name=Jason+Jordan">Jason Jordan</a> (w/<a href="?id=2&nr=2250&name=Seth+Rollins">Seth Rollins</a>) defeats <a href="?id=2&nr=257&name=Cesaro">Cesaro</a> (w/<a href="?id=2&nr=2641&name=Sheamus">Sheamus</a>) (13:15)</div>
</div>
<div class="Match">
<div class="MatchRecommended">[<span class="TextHighlight"><a href="?id=111&nr=9099">Recommended, Meltzer: ***3/4, CAGEMATCH users: <span class=" Rating Color7">7.17</span></a></span>]</div>
<div class="MatchType">
<a href="?id=5&nr=16">WWE Intercontinental Title</a> Match</div>
<div class="MatchResults">
<a href="?id=2&nr=9967&name=Roman+Reigns">Roman Reigns</a> (c) defeats <a href="?id=2&nr=676&name=Samoa+Joe">Samoa Joe</a> (24:50) </div>
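I am trying to pull out the text of the elements with class "MatchRecommended", and return NA for each Match node that has no "MatchRecommended" child.
I think I have to use xpathSApply and xmlChildren to extract the relevant data, but with the code below I only get NAs: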
xpathSApply(parsed, "//*[(@class = 'Match')]", function(x) ifelse(is.null(xmlChildren(x)$a), NA, xmlAttrs(xmlChildren(x)$a, 'href')))
[1] NA NA NA NA NA NA NA
Ideally, the result would look like this:
[1] NA "Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17"
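Any ideas on how to do this?
Answer 0 (score: 0)
I would get the Match nodes first, then query each node in that set with an XPath that starts with a leading "." so that it is evaluated relative to the current node.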
library(XML)
parsed <- xmlParse('<div...rest of your XML plus two missing div tags')
nodes <- getNodeSet(parsed, "//div[(@class = 'Match')]")
x <- lapply(nodes, xpathSApply, ".//div[(@class = 'MatchRecommended')]", xmlValue, trim=TRUE)
x
[[1]]
list()
[[2]]
[1] "[Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17]"
There are several ways to replace that empty list with NA.
sapply(x, function(y) ifelse(length(y)==0, NA, y))
[1] NA "[Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17]"
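One alternative to ifelse here is vapply, which checks that every element comes back as a single character value. This is only a sketch: the x below is a hand-built stand-in for the list returned by the lapply call above, and first_or_na is a hypothetical helper name, not something from the XML package.

```r
# Hand-built stand-in for the lapply() result shown above (assumption:
# the first Match had no MatchRecommended child, the second had one).
x <- list(
  list(),
  "[Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17]"
)

# Hypothetical helper: return NA for an empty result, otherwise the first value.
first_or_na <- function(y) {
  if (length(y) == 0) NA_character_ else as.character(y[[1]])
}

# vapply() enforces that each element yields exactly one character value.
rec <- vapply(x, first_or_na, character(1))
rec
```

The plain if/else avoids ifelse, which is meant for vectors, and NA_character_ keeps the result a character vector even if every Match is missing the child.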
You could also use the xml2 package, since it returns NAs instead of empty lists.
library(xml2)
parsed <- read_xml('<div...')
nodes <- xml_find_all(parsed, "//div[(@class = 'Match')]")
sapply(nodes, function(x) xml_text( xml_find_first(x, ".//div[(@class = 'MatchRecommended')]"), trim=TRUE))
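As far as I can tell, xml_find_first also accepts a whole node set and returns one result per node, so the sapply loop may not be needed. A minimal sketch, assuming a trimmed-down sample document (the sample content and the names doc, matches and rec are made up for illustration, not taken from the real page):

```r
library(xml2)

# Hypothetical, shortened sample with the same class structure as the question.
doc <- read_xml('<div class="Matches">
  <div class="Match">
    <div class="MatchType">Singles Match</div>
  </div>
  <div class="Match">
    <div class="MatchRecommended">[Recommended]</div>
    <div class="MatchType">Title Match</div>
  </div>
</div>')

# One entry per Match node.
matches <- xml_find_all(doc, "//div[@class = 'Match']")

# xml_find_first() on a node set yields one (possibly missing) node per input node;
# xml_text() turns the missing ones into NA.
rec <- xml_text(
  xml_find_first(matches, "./div[@class = 'MatchRecommended']"),
  trim = TRUE
)
rec
```

The result has the same length as matches, so it can go straight into a data frame column next to other per-match fields.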