Question

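I want to access the revision details in the XML output of a Wikipedia article. In other words, I want a data.frame with one row per revision (as far as I can tell, the tree structure should be //page/revision) and one column for each element of the revision sub-lists. (Importantly, different revision sub-lists may contain different elements.)

Data: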

require(XML)
require(httr)
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export", 
          body = "pages=Euroswydd&offset=1&limit=2&action=submit")
stop_for_status(r)
xml <- content(r, "text")
xml_data <- xmlToList(xml)
str(xml_data)

Output:

List of 3
$ siteinfo:List of 6
..$ sitename  : chr "Wikipedia"
..$ dbname    : chr "enwiki"
..$ base      : chr "https://en.wikipedia.org/wiki/Main_Page"
..$ generator : chr "MediaWiki 1.27.0-wmf.17"
..$ case      : chr "first-letter"
..$ namespaces:List of 35
... [not of interest] ...
$ page    :List of 5
..$ title   : chr "Euroswydd"
..$ ns      : chr "0"
..$ id      : chr "86146"
..$ revision:List of 7
.. ..$ id         : chr "4028683"
.. ..$ timestamp  : chr "2002-09-16T03:24:52Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "TUF-KAT"
.. .. ..$ id      : chr "8351"
.. ..$ model      : chr "wikitext"
.. ..$ format     : chr "text/x-wiki"
.. ..$ text       :List of 2
.. .. ..$ text  : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him.  Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "163"
.. ..$ sha1       : chr "ivzrvt6jgoga4ndtrdmz5ldg5elfoma"
..$ revision:List of 9
.. ..$ id         : chr "9228569"
.. ..$ parentid   : chr "4028683"
.. ..$ timestamp  : chr "2004-06-11T02:22:33Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "Gtrmp"
.. .. ..$ id      : chr "38984"
.. ..$ minor      : NULL
.. ..$ model      : chr "wikitext"
.. ..$ format     : chr "text/x-wiki"
.. ..$ text       :List of 2
.. .. ..$ text  : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him.  Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "203"
.. ..$ sha1       : chr "kwd09htf87bjc51y2z9ykpnasu7nqle"
$ .attrs  :Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. ..@ .Data: chr [1:3] "http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" "0.10" "en"

Now

I can access the first revision list with xml_data[['page']][['revision']]. But how do I access the second revision?
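The behaviour described here can be reproduced without any network access. The following is a minimal sketch in base R, assuming a hand-built list `page` that stands in for xml_data[['page']] (the values are copied from the str() output above; the names `page` and `revisions` are illustrative only). It shows that `[[` with a name returns only the first matching element, while a logical subset on names() keeps every element called "revision".

```r
# Hypothetical stand-in for xml_data[["page"]], built by hand from the str() output
page <- list(
  title    = "Euroswydd",
  revision = list(id = "4028683", timestamp = "2002-09-16T03:24:52Z"),
  revision = list(id = "9228569", parentid = "4028683",
                  timestamp = "2004-06-11T02:22:33Z")
)

# `[[` with a name stops at the first element carrying that name
first_id <- page[["revision"]]$id

# Subsetting by a logical vector keeps all elements named "revision"
revisions <- page[names(page) == "revision"]
second_id <- revisions[[2]]$id

print(first_id)
print(second_id)
```

With the real data, the same subset would presumably be written as xml_data[['page']][names(xml_data[['page']]) == 'revision'].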

Answer 1

Using rvest you can do something like the following:

Helper function:

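# Flattens one <revision> node into a named list; children of nested
# elements (e.g. contributor) get a "parent_" name prefix.
# Needs xml2 (xml_children, xml_length, xml_name, xml_text) and the %>% pipe.
parse_nested <- function(x, prefix = ''){
  kids = x %>% xml_children()
  # indices of children that themselves have child elements
  ind = which(sapply(kids, xml_length) != 0)
  if(!length(ind)){
    # leaf level: return the text of every child, named by its tag
    return(setNames(kids %>% xml_text(), 
                    paste0(prefix,kids %>% xml_name())))
  }
  # recurse into the nested children, prefixing names with the parent tag
  nested = parse_nested(kids[ind], 
                        prefix = paste0(prefix, kids[ind] %>% xml_name(), "_"))
  unnested = setNames(kids[-ind] %>% xml_text(), 
                      paste0(prefix, kids[-ind] %>% xml_name()))
  as.list(c(unnested, nested))
}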

Actual code:

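library(httr)
library(xml2)   # xml_children(), xml_length(), xml_name(), xml_text() used by parse_nested()
library(rvest)  # html_nodes() and the %>% pipe

r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export", 
          body = "pages=Euroswydd&offset=1&limit=2&action=submit")
stop_for_status(r)

doc <- read_html(r)
doc %>% 
  html_nodes("revision") %>% 
  lapply(parse_nested) %>%            # parse each revision separately
  data.table::rbindlist(fill = TRUE)  # combine them; missing columns become NA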

Result (a data.table):

        id            timestamp    model      format ---
1: 4028683 2002-09-16T03:24:52Z wikitext text/x-wiki ---
2: 9228569 2004-06-11T02:22:33Z wikitext text/x-wiki ---

Thanks to @Arun for pointing out that data.table::rbindlist accepts a list of lists.

plyr::rbind.fill can be used in place of data.table::rbindlist.
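For readers who want to see what the "fill" step amounts to, here is a rough base-R sketch of the same idea: collect the union of all field names, pad each revision with NA where a field is absent, then stack the rows. It is not how data.table or plyr implement it; `rows`, `all_cols` and `fill_row` are hypothetical names, and the two records are abbreviated from the question's output.

```r
# Two revision records with different sets of fields (abbreviated sample data)
rows <- list(
  list(id = "4028683", timestamp = "2002-09-16T03:24:52Z"),
  list(id = "9228569", parentid = "4028683",
       timestamp = "2004-06-11T02:22:33Z")
)

# Union of all field names, in order of first appearance
all_cols <- unique(unlist(lapply(rows, names)))

# Turn one record into a one-row data.frame, using NA for absent fields
fill_row <- function(row) {
  vals <- lapply(all_cols, function(nm) {
    if (is.null(row[[nm]])) NA_character_ else row[[nm]]
  })
  as.data.frame(setNames(vals, all_cols), stringsAsFactors = FALSE)
}

# Stack the padded rows into a single data.frame
revisions_df <- do.call(rbind, lapply(rows, fill_row))
print(revisions_df)
```

The first revision has no parentid, so that cell comes out as NA, which is the behaviour fill = TRUE provides in the answer's code.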

XML: accessing nested items that share the same name
