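I want to access the revision details in the XML export of a Wikipedia article. In other words, I want a data.frame with one row per revision (as far as I can tell, the tree structure should be //page/revision) and one column for each element of the revision sublist. Importantly, different revision sublists may contain different elements.
Data: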
require(XML)
require(httr)
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export",
body = "pages=Euroswydd&offset=1&limit=2&action=submit")
stop_for_status(r)
xml <- content(r, "text")
xml_data <- xmlToList(xml)
str(xml_data)
Output:
List of 3
$ siteinfo:List of 6
..$ sitename : chr "Wikipedia"
..$ dbname : chr "enwiki"
..$ base : chr "https://en.wikipedia.org/wiki/Main_Page"
..$ generator : chr "MediaWiki 1.27.0-wmf.17"
..$ case : chr "first-letter"
..$ namespaces:List of 35
... [not of interest] ...
$ page :List of 5
..$ title : chr "Euroswydd"
..$ ns : chr "0"
..$ id : chr "86146"
..$ revision:List of 7
.. ..$ id : chr "4028683"
.. ..$ timestamp : chr "2002-09-16T03:24:52Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "TUF-KAT"
.. .. ..$ id : chr "8351"
.. ..$ model : chr "wikitext"
.. ..$ format : chr "text/x-wiki"
.. ..$ text :List of 2
.. .. ..$ text : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him. Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "163"
.. ..$ sha1 : chr "ivzrvt6jgoga4ndtrdmz5ldg5elfoma"
..$ revision:List of 9
.. ..$ id : chr "9228569"
.. ..$ parentid : chr "4028683"
.. ..$ timestamp : chr "2004-06-11T02:22:33Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "Gtrmp"
.. .. ..$ id : chr "38984"
.. ..$ minor : NULL
.. ..$ model : chr "wikitext"
.. ..$ format : chr "text/x-wiki"
.. ..$ text :List of 2
.. .. ..$ text : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him. Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "203"
.. ..$ sha1 : chr "kwd09htf87bjc51y2z9ykpnasu7nqle"
$ .attrs :Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. ..@ .Data: chr [1:3] "http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" "0.10" "en"
Now I can access the first revision list with xml_data[['page']][['revision']]. But how do I access the second revision?
Answer 0 (score: 1)
Using rvest, you can do something like the following.
Helper function:
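# Recursively flatten an XML node into a named list of leaf values.
# Nested elements get their parent's tag as a prefix (e.g. contributor_id).
parse_nested <- function(x, prefix = ''){
  kids = x %>% xml_children()
  # Indices of children that themselves contain child elements
  ind = which(sapply(kids, xml_length) != 0)
  if(!length(ind)){
    # Leaf level: return the text of every child, named by its tag
    return(setNames(kids %>% xml_text(),
                    paste0(prefix, kids %>% xml_name())))
  }
  # Recurse into the nested children, extending the prefix with their tag
  nested = parse_nested(kids[ind],
                        prefix = paste0(prefix, kids[ind] %>% xml_name(), "_"))
  unnested = setNames(kids[-ind] %>% xml_text(),
                      paste0(prefix, kids[-ind] %>% xml_name()))
  as.list(c(unnested, nested))
}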
Actual code:
require(httr)
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export",
          body = "pages=Euroswydd&offset=1&limit=2&action=submit")
require(rvest)
require(xml2)  # provides xml_children(), xml_length(), xml_text(), xml_name()
doc <- read_html(r)
doc %>%
  html_nodes("revision") %>%
  lapply(parse_nested) %>%            # parse each revision separately
  data.table::rbindlist(fill = TRUE)  # combine them, filling missing columns with NA
Result (a data.table):
id timestamp model format ---
1: 4028683 2002-09-16T03:24:52Z wikitext text/x-wiki ---
2: 9228569 2004-06-11T02:22:33Z wikitext text/x-wiki ---
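As a cross-check on the flattening idea, here is a minimal sketch that does roughly the same thing with xml2 and base R only. The toy document is an assumption: it mimics the shape of the export but omits the default namespace that the real export declares, so on real data parsed with read_xml() a call such as xml_ns_strip(doc) would probably be needed before the XPath query. Note that nested names are joined with "." here rather than "_".

```r
library(xml2)

# Toy document (assumption: shaped like the export, but without a namespace).
doc <- read_xml(paste0(
  "<page>",
  "<revision><id>1</id>",
  "<contributor><username>A</username><id>10</id></contributor></revision>",
  "<revision><id>2</id><parentid>1</parentid>",
  "<contributor><username>B</username><id>20</id></contributor></revision>",
  "</page>"))

revs <- xml_find_all(doc, "//revision")

# Flatten each revision into a named character vector;
# unlist() joins nested names with ".", e.g. "contributor.username".
rows <- lapply(revs, function(rev) unlist(as_list(rev)))

# Union of all column names, so revisions with fewer fields get NA.
cols <- unique(unlist(lapply(rows, names)))

df <- do.call(rbind, lapply(rows, function(r) {
  as.data.frame(as.list(setNames(r[cols], cols)), stringsAsFactors = FALSE)
}))
```

Empty elements such as minor carry no text, so this sketch drops them rather than producing an empty column.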
Thanks to @Arun for pointing out that data.table::rbindlist accepts lists.
plyr::rbind.fill may be used as an alternative to data.table::rbindlist.
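On the literal question of reaching the second revision in the xmlToList() output: `[[` with a name only ever returns the first match, so one option is to subset by name and then index by position. The sketch below uses a hand-built toy list (an assumption) with the same duplicate-name shape as xml_data$page; it handles flat scalar fields only, so nested elements such as contributor or text would need flattening first.

```r
# Toy list mimicking xml_data$page: two elements share the name "revision".
page <- list(
  title = "Euroswydd",
  revision = list(id = "4028683", timestamp = "2002-09-16T03:24:52Z"),
  revision = list(id = "9228569", parentid = "4028683",
                  timestamp = "2004-06-11T02:22:33Z")
)

# Keep every element named "revision", then pick by position.
revisions <- page[names(page) == "revision"]
second <- revisions[[2]]

# One row per revision: take the union of field names and fill gaps with NA.
fields <- unique(unlist(lapply(revisions, names)))
rows <- lapply(revisions, function(rev) {
  vals <- lapply(fields, function(f) {
    if (is.null(rev[[f]])) NA_character_ else rev[[f]]
  })
  as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
})
revision_df <- do.call(rbind, unname(rows))
```

With the real object, the same subsetting would read xml_data$page[names(xml_data$page) == "revision"].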