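I am parsing an XML file with getNodeSet(). Suppose I have an XML file for a bookstore that lists 4 different books, but for one book the tag "authors" is missing.
If I parse the "authors" tag with data.nodes.2 <- getNodeSet(data,'//*/authors'), R returns a list of 3 elements.
However, that is not what I want. How can I make getNodeSet() return a list of 4 elements instead of 3, i.e. with a missing value for the book where the "authors" tag does not exist?
I appreciate any help.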
library(XML)
file <- "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n<!-- Edited by XMLSpy® -->\r\n<bookstore>\r\n<book category=\"cooking\">\r\n<title lang=\"en\">Everyday Italian</title>\r\n<authors>\r\n<author>Giada De Laurentiis</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>30.00</price>\r\n</book>\r\n<book category=\"children\">\r\n<title lang=\"en\">Harry Potter</title>\r\n<authors>\r\n<author>J K. Rowling</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>29.99</price>\r\n</book>\r\n<book category=\"web\">\r\n<title lang=\"en\">XQuery Kick Start</title>\r\n<authors>\r\n<author>James McGovern</author>\r\n<author>Per Bothner</author>\r\n<author>Kurt Cagle</author>\r\n<author>James Linn</author>\r\n<author>Vaidyanathan Nagarajan</author>\r\n</authors>\r\n<year>2003</year>\r\n<price>49.99</price>\r\n</book>\r\n<book category=\"web\" cover=\"paperback\">\r\n<title lang=\"en\">Learning XML</title>\r\n\r\n<year>2003</year>\r\n<price>39.95</price>\r\n</book>\r\n</bookstore>"
data <- xmlParse(file)
data.nodes.1 <- getNodeSet(data,'//*/book')
data.nodes.2 <- getNodeSet(data,'//*/authors')
# Data
# <?xml version="1.0" encoding="ISO-8859-1"?>
# <!-- Edited by XMLSpy® -->
# <bookstore>
# <book category="cooking">
# <title lang="en">Everyday Italian</title>
# <authors>
# <author>Giada De Laurentiis</author>
# </authors>
# <year>2005</year>
# <price>30.00</price>
# </book>
# <book category="children">
# <title lang="en">Harry Potter</title>
# <authors>
# <author>J K. Rowling</author>
# </authors>
# <year>2005</year>
# <price>29.99</price>
# </book>
# <book category="web">
# <title lang="en">XQuery Kick Start</title>
# <authors>
# <author>James McGovern</author>
# <author>Per Bothner</author>
# <author>Kurt Cagle</author>
# <author>James Linn</author>
# <author>Vaidyanathan Nagarajan</author>
# </authors>
# <year>2003</year>
# <price>49.99</price>
# </book>
# <book category="web" cover="paperback">
# <title lang="en">Learning XML</title>
# <year>2003</year>
# <price>39.95</price>
# </book>
# </bookstore>
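Answer 0 (score: 3)
One option is to use R's list processing to extract the authors from each book node
books <- getNodeSet(data, "//book")
authors <- lapply(books, xpathSApply, ".//author", xmlValue)
# books without an author yield an empty list(); replace those with NA
authors[sapply(authors, is.list)] <- NA
and combine this with the book-level information
title <- sapply(books, xpathSApply, "string(.//title/text())")
giving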
> data.frame(Title=rep(title, sapply(authors, length)),
+ Author=unlist(authors))
              Title                 Author
1  Everyday Italian    Giada De Laurentiis
2      Harry Potter           J K. Rowling
3 XQuery Kick Start         James McGovern
4 XQuery Kick Start            Per Bothner
5 XQuery Kick Start             Kurt Cagle
6 XQuery Kick Start             James Linn
7 XQuery Kick Start Vaidyanathan Nagarajan
8      Learning XML                   <NA>
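To get back to the literal question (a list with one element per book, with a missing value where "authors" is absent), the same per-book idea can be sketched as follows. This is only a minimal sketch: the shortened XML string and the names xml_string, doc and authors_per_book are assumptions made for illustration, not part of the original post.

```r
library(XML)

# Hypothetical, shortened stand-in for the bookstore document in the question
xml_string <- "<bookstore>
  <book><title>Everyday Italian</title><authors><author>Giada De Laurentiis</author></authors></book>
  <book><title>Learning XML</title></book>
</bookstore>"
doc <- xmlParse(xml_string, asText = TRUE)

# One element per <book>: the <authors> node if present, otherwise NA
books <- getNodeSet(doc, "//book")
authors_per_book <- lapply(books, function(book) {
  found <- getNodeSet(book, "./authors")  # XPath relative to this book node
  if (length(found) == 0) NA else found[[1]]
})

length(authors_per_book)  # same length as 'books'
```

Because the XPath query runs once per book rather than once over the whole document, the result stays aligned with the list of books even when a tag is missing.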
Answer 1 (score: 1)
You can use the plyr library
library(plyr)
> ldply(xpathApply(data, '//book', getChildrenStrings), rbind)
title authors year price
1 Everyday Italian Giada De Laurentiis 2005 30.00
2 Harry Potter J K. Rowling 2005 29.99
3 XQuery Kick Start James McGovernPer BothnerKurt CagleJames LinnVaidyanathan Nagarajan 2003 49.99
4 Learning XML <NA> 2003 39.95
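Answer 2 (score: 1)
You can also try xmlToDataFrame, which works for some XML
x <- xmlToDataFrame(data)
If you don't like the authors being run together, you can sometimes fix that with pattern matching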
x$authors <- gsub("([a-z]{2})([A-Z])", "\\1, \\2", x$authors)
x
title authors year price
1 Everyday Italian Giada De Laurentiis 2005 30.00
2 Harry Potter J K. Rowling 2005 29.99
3 XQuery Kick Start James McGovern, Per Bothner, Kurt Cagle, James Linn, Vaidyanathan Nagarajan 2003 49.99
4 Learning XML <NA> 2003 39.95
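The pattern-matching fix only works "sometimes" because it inserts a separator wherever two lowercase letters are followed by an uppercase letter. A small base-R sketch of that behaviour is below; the second input name is made up purely to show a failure case.

```r
# Same pattern as above: two lowercase letters followed by an uppercase letter
pattern <- "([a-z]{2})([A-Z])"

# Works when every name boundary is "lowercase, lowercase, Uppercase"
ok <- gsub(pattern, "\\1, \\2", "James McGovernPer Bothner")

# Breaks on a (hypothetical) surname with an inner capital after two lowercase letters
broken <- gsub(pattern, "\\1, \\2", "Ian MacDonald")

print(ok)      # "James McGovern, Per Bothner"
print(broken)  # "Ian Mac, Donald"
```

If names like the second one can occur in the data, extracting the author nodes individually is the more robust route.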
Other options are to loop over the book nodes (see ?getNodeSet on creating and freeing subdocuments) or to follow Martin's answer (try this if you want 4 rows)
authors <- sapply(authors, paste, collapse=",")
data.frame(title, authors)
Answer 3 (score: 1)
Here is an xml2 approach.
The code is quite readable and therefore easy to maintain.
Code
library( xml2 )
library( dplyr )  # provides %>% and left_join()
#read the xml file
data <- xml2::read_xml( file )
#get all book-titles and store them in a data.frame
books <- data.frame(
  title = xml_find_all( data, ".//book/title" ) %>% xml_text(),
  stringsAsFactors = FALSE
)
#find all author-nodes
authors <- xml_find_all( data, ".//author" )
#create a data.frame with all authors, and the book they wrote
authors <- data.frame(
  #for each author-node, get the title from the ancestor-node (i.e. book)
  title = xml_find_first( authors, ".//ancestor::book/title" ) %>% xml_text(),
  #get the text from the author-node
  author = xml_text( authors ),
  stringsAsFactors = FALSE
)
#left_join the books with the authors
left_join( books, authors, by = "title" )
Output
#               title                 author
# 1  Everyday Italian    Giada De Laurentiis
# 2      Harry Potter           J K. Rowling
# 3 XQuery Kick Start         James McGovern
# 4 XQuery Kick Start            Per Bothner
# 5 XQuery Kick Start             Kurt Cagle
# 6 XQuery Kick Start             James Linn
# 7 XQuery Kick Start Vaidyanathan Nagarajan
# 8      Learning XML                   <NA>
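Joining on the title assumes that titles are unique across books. If that might not hold, a variant that walks the book nodes one at a time avoids the join altogether. The following is only a sketch under that assumption: the tiny XML string and the names xml_string, doc, rows and result are hypothetical and chosen for illustration.

```r
library(xml2)

# Hypothetical, shortened stand-in for the sample data
xml_string <- "<bookstore>
  <book><title>Book A</title><authors><author>Ann</author></authors></book>
  <book><title>Book B</title><authors><author>Bob</author><author>Cy</author></authors></book>
  <book><title>Book C</title></book>
</bookstore>"
doc <- read_xml(xml_string)
books <- xml_find_all(doc, ".//book")

# Build one small data.frame per book, using NA when no author is listed
rows <- lapply(seq_along(books), function(i) {
  book <- books[[i]]
  authors <- xml_text(xml_find_all(book, "./authors/author"))
  if (length(authors) == 0) authors <- NA_character_
  data.frame(
    title = xml_text(xml_find_first(book, "./title")),
    author = authors,
    stringsAsFactors = FALSE
  )
})

# Stack the per-book pieces into a single data.frame
result <- do.call(rbind, rows)
print(result)
```

Each book contributes at least one row, so a book without authors still shows up, with NA in the author column.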
Sample data
file <- '<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited by XMLSpy® -->
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<authors>
<author>Giada De Laurentiis</author>
</authors>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<authors>
<author>J K. Rowling</author>
</authors>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<authors>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
</authors>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web" cover="paperback">
<title lang="en">Learning XML</title>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>'