Parsing HTML in R using partial matching

Time: 2017-01-17 01:54:02

Tags: html r xml web-scraping

I need to parse the following HTML document:

<span class="revision-gradient shadowed">90</span>
<span class="revision-gradient not_shadowed">75</span>
<span class="revision-gradient shadowed">85</span>
<span class="revision-gradient blurred">60</span>

to return the list:

[90, 75, 85, 60]

I would normally use this code, but I'm not sure how to match on only part of the class attribute:

document <- htmlParse(url)
myList <- unlist(lapply(document['//span[@class="revision-gradient"]'],xmlValue))

2 Answers:

Answer 0: (score: 1)

You can use XML::xpathSApply:

myList <- xpathSApply(document, "//span", xmlValue)

If the page contains other span elements as well, the following is more robust:

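myList <- unlist(xpathSApply(document, "//span", function(x) {
    # default = "" guards against spans without a class attribute;
    # xmlGetAttr() would otherwise return NULL and if() would fail
    if (grepl("revision-gradient", xmlGetAttr(x, "class", default = ""), fixed = TRUE)) {
        return(xmlValue(x))
    }
    NULL
}))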

HTH

Answer 1: (score: 1)

library(rvest)

pg <- read_html('
<span class="revision-gradient shadowed">90</span>
<span class="revision-gradient not_shadowed">75</span>
<span class="revision-gradient shadowed">85</span>
<span class="revision-gradient blurred">60</span>
')

html_nodes(pg, "span.revision-gradient") %>% 
  html_text()
## [1] "90" "75" "85" "60"

html_nodes(pg, xpath=".//descendant-or-self::span[@class and 
           contains(concat(' ', normalize-space(@class), ' '), 
           ' revision-gradient ')]") %>% 
  html_text()
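
The longer XPath above tests for `revision-gradient` as a whole class token rather than as a substring. A minimal sketch of the difference, assuming a hypothetical second span whose class merely begins with the same text (that markup and the names `pg2`, `loose` and `strict` are made up for illustration):

```r
library(rvest)

# Hypothetical markup: the second span's class only starts with "revision-gradient"
pg2 <- read_html('
<span class="revision-gradient shadowed">90</span>
<span class="revision-gradient-old">10</span>
')

# Plain substring test on @class: matches both spans
loose <- html_text(html_nodes(pg2,
  xpath = "//span[contains(@class, 'revision-gradient')]"))

# Whole-token test: pads the class list with spaces and looks for
# ' revision-gradient ', so only the first span matches
strict <- html_text(html_nodes(pg2,
  xpath = "//span[contains(concat(' ', normalize-space(@class), ' '), ' revision-gradient ')]"))
```

With this input, `loose` holds both values while `strict` holds only "90", which is the same behaviour the CSS selector `span.revision-gradient` gives.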

If you're stuck in XML-land:

library(XML)

doc <- htmlParse('
<span class="revision-gradient shadowed">90</span>
<span class="revision-gradient not_shadowed">75</span>
<span class="revision-gradient shadowed">85</span>
<span class="revision-gradient blurred">60</span>
')

xpathSApply(doc, "//descendant-or-self::span[@class and 
            contains(concat(' ', normalize-space(@class), ' '), 
            ' revision-gradient ')]", xmlValue)

If you want numeric values, just call as.numeric() on the resulting vector.
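
As a minimal sketch of that last step, reusing the markup from the question (the variable name `scores` is an arbitrary choice for illustration):

```r
library(rvest)

pg <- read_html('
<span class="revision-gradient shadowed">90</span>
<span class="revision-gradient not_shadowed">75</span>
<span class="revision-gradient shadowed">85</span>
<span class="revision-gradient blurred">60</span>
')

# html_text() returns a character vector; as.numeric() converts it
scores <- as.numeric(html_text(html_nodes(pg, "span.revision-gradient")))
scores
## [1] 90 75 85 60
```

Note that as.numeric() yields NA (with a warning) for any element that is not a plain number, so it may be worth checking the text first if the real page is messier than this sample.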