Associating the values of child nodes with their parent node using the XML package

Time: 2015-04-08 01:28:58

Tags: xml r

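When using R's XML package, how can I keep the data from specific nodes associated with their parent node, for example in the same list? I am trying to turn data scraped from the web into a data frame, with related information grouped into rows. The <span> elements have no class attribute to tell them apart, and each related group (one row of the data frame) may contain either one or two <span> elements.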

Here is some sample HTML, which I saved as html_example.html.

<!DOCTYPE html>
<html>
    <body>
        <div class="foo">
            <div class="fooname">Name of 1st foo</div>
            <span>1st span in 1st foo</span>
            <span>2nd span in 1st foo</span>
        </div>

        <div class="foo">
            <div class="fooname">Name of 2nd foo</div>
            <span>Only 1 span in 2nd foo</span>
        </div>
    </body>
</html>

Here is the current parsing code and its output:

library(XML)

html <- readLines("html_example.html")
parse <- htmlParse(html)

fooname <- xpathSApply(parse, "//div[@class='foo']/div[@class='fooname']"
    , xmlValue)
print(fooname)

    # > print(fooname)
    # [1] "Name of 1st foo" "Name of 2nd foo"

span <- xpathSApply(parse, "//div[@class='foo']/span"
    , xmlValue)
print(span)

    # > print(span)
    # [1] "1st span in 1st foo"    "2nd span in 1st foo"    "Only 1 span in 2nd foo"

As it stands, there is no way to associate the values of "fooname" and "span" with each other. Is there a way to make the scraped output look like this?

foo1 <- list(fooname[1], span[1:2])
foo2 <- list(fooname[2], span[3])
list1 <- list(foo1, foo2)
list1

    # > list1
    # [[1]]
    # [[1]][[1]]
    # [1] "Name of 1st foo"
    # 
    # [[1]][[2]]
    # [1] "1st span in 1st foo" "2nd span in 1st foo"
    # 
    # 
    # [[2]]
    # [[2]][[1]]
    # [1] "Name of 2nd foo"
    # 
    # [[2]][[2]]
    # [1] "Only 1 span in 2nd foo"

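Ultimately, though not necessarily during the scraping step itself, I would like to create a data frame that looks like this. Related discussion here.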

FooNames <- c(fooname[1], fooname[2])
Span1 <- c(span[1], span[3])
Span2 <- c(span[2], NA)
df <- data.frame(FooNames, Span1, Span2, stringsAsFactors = FALSE)
df

    # > df
    #          FooNames                  Span1               Span2
    # 1 Name of 1st foo    1st span in 1st foo 2nd span in 1st foo
    # 2 Name of 2nd foo Only 1 span in 2nd foo                <NA>

1 answer:

Answer 0 (score: 2)

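You can apply a function to each node of interest (in this case, div[@class='foo']). A simple example takes each such node and applies xmlValue to its div[@class='fooname'] child and to its span children, then returns those values as a data.frame. You can then bind the resulting data.frames together to get the desired result: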

'<!DOCTYPE html>
<html>
    <body>
        <div class="foo">
            <div class="fooname">Name of 1st foo</div>
            <span>1st span in 1st foo</span>
            <span>2nd span in 1st foo</span>
        </div>

        <div class="foo">
            <div class="fooname">Name of 2nd foo</div>
            <span>Only 1 span in 2nd foo</span>
        </div>
    </body>
</html>' -> appData
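library(XML)

# asText = TRUE makes explicit that appData holds HTML content, not a file name
doc <- htmlParse(appData, asText = TRUE)

# For a single <div class="foo"> node, return a one-row data frame
myFunc <- function(x){
  div <- xpathSApply(x, "./div[@class='fooname']", fun = xmlValue)
  span <- xpathSApply(x, "./span", fun = xmlValue)
  # span[2] is NA when the node contains only one <span>
  data.frame(FooNames = div, Span1 = span[1], Span2 = span[2],
             stringsAsFactors = FALSE)
}
# Apply myFunc to every <div class="foo"> node; res is a list of data frames
res <- doc["//*/div[@class='foo']", fun = myFunc]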

> do.call(rbind, res)
         FooNames                  Span1               Span2
1 Name of 1st foo    1st span in 1st foo 2nd span in 1st foo
2 Name of 2nd foo Only 1 span in 2nd foo                <NA>
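
The question also asked for a nested list with one entry per foo, and the function above hard-codes two span columns. The following is a minimal sketch of one way to cover both, assuming every foo has exactly one fooname and at least one span. The names html_text, foo_nodes, foo_list, max_spans and span_matrix are made up for this illustration and do not come from the original post.

```r
library(XML)

# Same sample HTML as above, inlined so the sketch is self-contained.
html_text <- '<!DOCTYPE html>
<html>
    <body>
        <div class="foo">
            <div class="fooname">Name of 1st foo</div>
            <span>1st span in 1st foo</span>
            <span>2nd span in 1st foo</span>
        </div>
        <div class="foo">
            <div class="fooname">Name of 2nd foo</div>
            <span>Only 1 span in 2nd foo</span>
        </div>
    </body>
</html>'
doc <- htmlParse(html_text, asText = TRUE)

# One list entry per <div class="foo">: its name, then all of its spans.
foo_nodes <- getNodeSet(doc, "//div[@class='foo']")
foo_list <- lapply(foo_nodes, function(node) {
  list(
    xpathSApply(node, "./div[@class='fooname']", xmlValue),
    xpathSApply(node, "./span", xmlValue)
  )
})

# Indexing past the end of a character vector yields NA, so every span
# vector is padded to the length of the longest group before row-binding.
max_spans <- max(sapply(foo_list, function(foo) length(foo[[2]])))
span_matrix <- do.call(rbind, lapply(foo_list, function(foo) {
  foo[[2]][seq_len(max_spans)]
}))
colnames(span_matrix) <- paste0("Span", seq_len(max_spans))

df <- data.frame(
  FooNames = sapply(foo_list, function(foo) foo[[1]]),
  span_matrix,
  stringsAsFactors = FALSE
)
print(df)
```

Here foo_list has the same shape as list1 in the question, and df gains as many SpanN columns as the largest group needs, so a foo with three spans would not require changing the code.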