Extracting DOM element values grouped by parent element with XPath in R

Asked: 2016-05-04 14:50:08

Tags: r dom

I need a list in which each element is a character vector holding the author names of one book from the XML data pasted below, e.g.:

[[1]]
"Giada De Laurentiis"
[[2]]
"J K. Rowling"
[[3]]
"James McGovern", "Giada De Laurentiis", ...

and so on.

I started with this:

my_titles_nodeset <- xpathSApply(doc = my_dom, path = "//book")

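I assumed this gave me a separate DOM for each book, and I wanted to do the following with every book (I show the operation on the third book only and skip the apply function):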

  > (title <- my_titles_nodeset[[3]])
    <book category="WEB">
      <title lang="en">XQuery Kick Start</title>
      <author>James McGovern</author>
      <author>Per Bothner</author>
      <author>Kurt Cagle</author>
      <author>James Linn</author>
      <author>Vaidyanathan Nagarajan</author>
      <year>2003</year>
      <price>49.99</price>
    </book> 

It looks like I got what I wanted: the third book only. So I tried to extract its authors:

> (author_group <- xpathSApply(title, path = "//book/author", xmlValue))

But I got all the authors of all the books in one pile again! See below:

 > (author_group <- xpathSApply(title, path = "//book/author", xmlValue))
[1] "Giada De Laurentiis"    "J K. Rowling"           "James McGovern"        
[4] "Per Bothner"            "Kurt Cagle"             "James Linn"            
[7] "Vaidyanathan Nagarajan" "Erik T. Ray"    
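
  1. How can I get the desired list (see above) as simply as possible?
  2. What is happening here? Is this an issue with the XML package, with R in general, or with XPath?
  3. This is my first time using XPath and I can only code in R, so please do not explain with other programming languages.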

    XML data

    <?xml version="1.0" encoding="UTF-8"?>
    
    <bookstore>
    
    <book category="COOKING">
      <title lang="en">Everyday Italian</title>
      <author>Giada De Laurentiis</author>
      <year>2005</year>
      <price>30.00</price>
    </book>
    
    <book category="CHILDREN">
      <title lang="en">Harry Potter</title>
      <author>J K. Rowling</author>
      <year>2005</year>
      <price>29.99</price>
    </book>
    
    <book category="WEB">
      <title lang="en">XQuery Kick Start</title>
      <author>James McGovern</author>
      <author>Per Bothner</author>
      <author>Kurt Cagle</author>
      <author>James Linn</author>
      <author>Vaidyanathan Nagarajan</author>
      <year>2003</year>
      <price>49.99</price>
    </book>
    
    <book category="WEB">
      <title lang="en">Learning XML</title>
      <author>Erik T. Ray</author>
      <year>2003</year>
      <price>39.95</price>
    </book>
    
    </bookstore>  
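
The question does not show how `my_dom` was created. The following is only a sketch for reproducing the behaviour described above, assuming the XML package is installed; the variable name `xml_text` is made up here, and the data is shortened to two books:

```r
library(XML)

# Hypothetical variable holding a shortened copy of the XML data
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
</book>
<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
</book>
</bookstore>'

# Parse the string into an internal document
my_dom <- xmlParse(xml_text, asText = TRUE)

# The nodes returned here are still attached to my_dom
books <- getNodeSet(my_dom, "//book")

# An absolute path ("//...") evaluated on a single book
# still searches the whole document that the book belongs to
all_authors <- xpathSApply(books[[2]], "//book/author", xmlValue)
```

With this input, `all_authors` contains the authors of both books, even though the query was issued on the second book only.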
    

1 answer:

Answer 0 (score: 0)

You can retrieve the author information in two steps: first at the book level, then at the author level.

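# Step 1: serialize each <book> node to a standalone XML string
listBooks <- xpathApply(my_dom, "//book", saveXML)
# Step 2: parse each string as its own document, so "//author" only sees one book
listAuthors <- lapply(listBooks, function(book) unlist(xpathSApply(xmlInternalTreeParse(book), "//author/text()", saveXML)))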
listAuthors
[[1]]
[1] "Giada De Laurentiis"

[[2]]
[1] "J K. Rowling"

[[3]]
[1] "James McGovern"         "Per Bothner"            "Kurt Cagle"             "James Linn"             "Vaidyanathan Nagarajan"

[[4]]
[1] "Erik T. Ray"
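
As for what happened in the original attempt: a path that begins with `//` is absolute, so it is evaluated from the root of the document that the node belongs to, not from the node itself. A node pulled out of `my_dom` is still attached to that document, which is why `//book/author` matched the authors of every book. This is how XPath works, not a quirk of R. A path relative to the current node, such as `./author`, should avoid the serialize-and-reparse round trip. A minimal sketch, assuming the XML package and a shortened copy of the data (`xml_text` and `authors_by_book` are illustrative names, not from the question):

```r
library(XML)

# Hypothetical variable holding a shortened copy of the XML data
xml_text <- '<bookstore>
<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
</book>
<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
</book>
</bookstore>'

my_dom <- xmlParse(xml_text, asText = TRUE)

# One list element per <book>; "./author" is relative to the current book,
# so it only matches <author> children of that book
authors_by_book <- lapply(
  getNodeSet(my_dom, "//book"),
  function(book) xpathSApply(book, "./author", xmlValue)
)
```

Using `.//author` instead would also match `<author>` elements nested deeper inside a book, which makes no difference for this data.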