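I'm learning R in order to scrape large XML files (up to 100 MB), so I'm definitely not a pro. The XML files follow a very strict format: each node is a deal involving sellers (one or more), buyers (one or more), and the types of stock being transferred (one or more). Each of these has one or more details (name, address, etc.). Here is an anonymized snippet: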
<deals>
<deal>
<sellers>
<seller>
<name>Dave</name>
<address>Street name</address>
<city>New York, NY</city>
</seller>
</sellers>
<buyers>
<buyer>
<name>John</name>
<city>Denver, CO</city>
<phone>123456789</phone>
</buyer>
<buyer>
<name>Pete</name>
<address>Avenue name</address>
<city>Kansas, MI</city>
</buyer>
</buyers>
<stocks>
<stock>
<id>GOOGL</id>
</stock>
<stock>
<id>MSFT</id>
<id>0000789019</id>
</stock>
</stocks>
</deal>
<deal>
<sellers>
<seller>
<name>Linda</name>
<city>Philadelphia, PA</city>
<phone>876-543-210</phone>
</seller>
<seller>
<name>Anne</name>
<address>Road name</address>
</seller>
</sellers>
<buyers>
<buyer>
<name>Monica</name>
<address>Alley name</address>
<city>Pensacola, CA</city>
</buyer>
</buyers>
<stocks>
<stock>
<id>INTC</id>
<id>0000050863</id>
</stock>
<stock>
<id>DELL</id>
</stock>
<stock>
<id>HPQ</id>
<id>0000047217</id>
</stock>
</stocks>
</deal>
</deals>
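The problem when scraping the data lies in the "one or more". For now, I just want to create a data frame containing the deal number (a sequence number) and the seller information, and I'm using the following code: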
xmldata <- xmlRoot(xmlTreeParse("snippet.xml", useInternalNodes = TRUE))
seller_name <- xpathSApply(xmldata, "//deal/sellers/seller/name", xmlValue)
seller_address <- xpathSApply(xmldata, "//deal/sellers/seller/address", xmlValue)
seller_city <- xpathSApply(xmldata, "//deal/sellers/seller/city", xmlValue)
seller_phone <- xpathSApply(xmldata, "//deal/sellers/seller/phone", xmlValue)
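Unfortunately, this doesn't work, for two reasons. First, I can't tell which seller belongs to which deal. Second, because many of the details are optional (address, city, phone number), the vectors differ in length and I can't tell whom a given street name or phone number belongs to: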
> seller_name
[1] "Dave" "Linda" "Anne"
> seller_address
[1] "Street name" "Road name"
> seller_phone
[1] "876-543-210"
I tried using a for loop to iterate over the individual deals, but it was far too slow. Any help is much appreciated, thanks!
Answer 0 (score: 1)
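Create a function Value that returns the xmlValue of node[[name]], or NA if that child element is missing. Use it to create a function getRow that retrieves one row of data. Finally, apply getRow to the XML input as shown.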
Value <- function(node, name) c(xmlValue(node[[name]]), NA)[1]
getRow <- function(node) sapply(c("name", "address", "city", "phone"), Value, node = node)
t(xpathSApply(xmldata, "//deal/sellers/seller", getRow))
which gives:
     name    address       city               phone
[1,] "Dave"  "Street name" "New York, NY"     NA
[2,] "Linda" NA            "Philadelphia, PA" "876-543-210"
[3,] "Anne"  "Road name"   NA                 NA
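The matrix above does not yet carry the deal number the question asks for. One possible way to add it, offered only as a sketch, is to collect the deal nodes first and run the same per-seller extraction inside each deal. The names `xml_text`, `fields`, `deals` and `sellers_df` are illustrative assumptions, and the inlined XML is a sellers-only excerpt of snippet.xml so the example runs on its own.

```r
library(XML)

# Sellers-only excerpt of snippet.xml, inlined so the sketch is self-contained.
xml_text <- '<deals>
  <deal><sellers>
    <seller><name>Dave</name><address>Street name</address><city>New York, NY</city></seller>
  </sellers></deal>
  <deal><sellers>
    <seller><name>Linda</name><city>Philadelphia, PA</city><phone>876-543-210</phone></seller>
    <seller><name>Anne</name><address>Road name</address></seller>
  </sellers></deal>
</deals>'
doc <- xmlParse(xml_text, asText = TRUE)

fields <- c("name", "address", "city", "phone")  # assumed set of seller details

# Same helper as in the answer: NA when the child element is missing.
Value <- function(node, name) c(xmlValue(node[[name]]), NA)[1]

# One node per deal; the position in this list serves as the deal number.
deals <- getNodeSet(doc, "/deals/deal")

rows <- lapply(seq_along(deals), function(i) {
  # Relative XPath: only the sellers that belong to deal i.
  sellers <- getNodeSet(deals[[i]], "./sellers/seller")
  vals <- t(sapply(sellers, function(s) sapply(fields, Value, node = s)))
  data.frame(deal = i, vals, stringsAsFactors = FALSE)
})
sellers_df <- do.call(rbind, rows)
print(sellers_df)
```

This still loops over deals, but each iteration does a single relative XPath query rather than one query per detail, which may be fast enough; whether it scales to a 100 MB file would need to be measured.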
Note: For future reproducibility, the input file snippet.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<deals>
<deal>
<sellers>
<seller>
<name>Dave</name>
<address>Street name</address>
<city>New York, NY</city>
</seller>
</sellers>
<buyers>
<buyer>
<name>John</name>
<city>Denver, CO</city>
<phone>123456789</phone>
</buyer>
<buyer>
<name>Pete</name>
<address>Avenue name</address>
<city>Kansas, MI</city>
</buyer>
</buyers>
<stocks>
<stock>
<id>GOOGL</id>
</stock>
<stock>
<id>MSFT</id>
<id>0000789019</id>
</stock>
</stocks>
</deal>
<deal>
<sellers>
<seller>
<name>Linda</name>
<city>Philadelphia, PA</city>
<phone>876-543-210</phone>
</seller>
<seller>
<name>Anne</name>
<address>Road name</address>
</seller>
</sellers>
<buyers>
<buyer>
<name>Monica</name>
<address>Alley name</address>
<city>Pensacola, CA</city>
</buyer>
</buyers>
<stocks>
<stock>
<id>INTC</id>
<id>0000050863</id>
</stock>
<stock>
<id>DELL</id>
</stock>
<stock>
<id>HPQ</id>
<id>0000047217</id>
</stock>
</stocks>
</deal>
</deals>
Answer 1 (score: 0)
There are two issues here:
dplyr's [...]
Answer 2 (score: 0)
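I think my strategy would be to use xmlEventParse() to iterate through the file, keeping track of a unique identifier that marks the different states (e.g., 'deal', 'buyer', 'seller') along with the name, address, etc. associated with that identifier.
I use environments to accumulate the information.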
uid <- 0L
key <- new.env(parent=emptyenv())
name <- new.env(parent=emptyenv())
address <- new.env(parent=emptyenv())
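xmlEventParse() allows callbacks to process nodes. The callbacks are supplied as a named list of functions, whose names correspond to the XML elements that trigger them. So to start, I might have a list of 'handlers' that fire when an element is observed. The handlers simply increment the unique identifier and record the corresponding state: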
handlers=list(deal=function(...) {
uid <<- uid + 1L
key[[as.character(uid)]] <- "deal"
}, buyer=function(...) {
uid <<- uid + 1L
key[[as.character(uid)]] <- "buyer"
}, seller=function(...) {
uid <<- uid + 1L
key[[as.character(uid)]] <- "seller"
})
'Branches' are like handlers, except that they receive the XML node for further computation. These are used to extract the leaf-level information:
branches=list(name=function(node) {
name[[as.character(uid)]] <- xmlValue(node)
}, address=function(node) {
address[[as.character(uid)]] <- xmlValue(node)
})
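It is also useful to have some functions (which I call 'final') to post-process the collected data, specifically to coerce each environment to a data.frame:
final=list(key=function() {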
k <- as.list(key)
data.frame(uid=as.integer(names(k)), value=as.character(k))
}, name=function() {
k <- as.list(name)
data.frame(uid=as.integer(names(k)), name=as.character(k),
stringsAsFactors=FALSE)
}, address=function() {
k <- as.list(address)
data.frame(uid=as.integer(names(k)), address=as.character(k),
stringsAsFactors=FALSE)
})
I put all of this together in a factory that I can use to create independent instances for parsing my files:
events_factory <- function() {
uid <- 0L
key <- new.env(parent=emptyenv())
name <- new.env(parent=emptyenv())
address <- new.env(parent=emptyenv())
list(handlers=list(deal=function(...) {
uid <<- uid + 1L
key[[as.character(uid)]] <- "deal"
}, buyer=function(...) {
uid <<- uid + 1L
key[[as.character(uid)]] <- "buyer"
}, seller=function(...) {
uid <<- uid + 1L
key[[as.character(uid)]] <- "seller"
}),
branches=list(name=function(node) {
name[[as.character(uid)]] <- xmlValue(node)
}, address=function(node) {
address[[as.character(uid)]] <- xmlValue(node)
}),
final=list(key=function() {
k <- as.list(key)
data.frame(uid=as.integer(names(k)), value=as.character(k))
}, name=function() {
k <- as.list(name)
data.frame(uid=as.integer(names(k)), name=as.character(k),
stringsAsFactors=FALSE)
}, address=function() {
k <- as.list(address)
data.frame(uid=as.integer(names(k)), address=as.character(k),
stringsAsFactors=FALSE)
}))
}
In use, the code looks like:
library(XML)
fname <- "~/Downloads/snippet.xml"
e <- events_factory()
invisible(xmlEventParse(fname, e$handlers, branches=e$branches))
Reduce(function(x, y) merge(x, y, all.x=TRUE),
lapply(e$final, do.call, list()))
resulting in
> Reduce(function(x, y) merge(x, y, all.x=TRUE),
+ lapply(e$final, do.call, list()))
  uid  value   name     address
1   1   deal   <NA>        <NA>
2   2 seller   Dave Street name
3   3  buyer   John        <NA>
4   4  buyer   Pete Avenue name
5   5   deal   <NA>        <NA>
6   6 seller  Linda        <NA>
7   7 seller   Anne   Road name
8   8  buyer Monica  Alley name
There is a lot of repetition in the code, so there are probably clever ways to make it more compact.
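As one possible way to reduce that repetition, the handlers and branches could be generated from two character vectors instead of being written out by hand. This is a sketch rather than a tested replacement: `compact_factory`, `states`, `fields`, `store` and `result` are hypothetical names, the call to xmlEventParse() mirrors the one above, and the inlined XML is a shortened excerpt of snippet.xml written to a temporary file.

```r
library(XML)

# Hypothetical compact variant of events_factory(): one closure per state/field.
compact_factory <- function(states = c("deal", "buyer", "seller"),
                            fields = c("name", "address")) {
  uid <- 0L
  key <- new.env(parent = emptyenv())
  # One environment per leaf field, indexed by the current uid.
  store <- setNames(lapply(fields, function(f) new.env(parent = emptyenv())),
                    fields)
  # Each handler bumps the shared uid and records which state it saw.
  handlers <- setNames(lapply(states, function(state) {
    force(state)
    function(...) {
      uid <<- uid + 1L
      key[[as.character(uid)]] <- state
    }
  }), states)
  # Each branch stores the leaf value under the uid that is current.
  branches <- setNames(lapply(fields, function(f) {
    env <- store[[f]]
    function(node) env[[as.character(uid)]] <- xmlValue(node)
  }), fields)
  # Assemble one data.frame directly, with NA for missing leaves.
  result <- function() {
    ids <- as.character(seq_len(uid))
    out <- data.frame(uid = seq_len(uid),
                      value = vapply(ids, function(i) key[[i]], character(1),
                                     USE.NAMES = FALSE),
                      stringsAsFactors = FALSE)
    for (f in fields) {
      out[[f]] <- vapply(ids, function(i) {
        v <- store[[f]][[i]]
        if (is.null(v)) NA_character_ else v
      }, character(1), USE.NAMES = FALSE)
    }
    out
  }
  list(handlers = handlers, branches = branches, result = result)
}

# Shortened excerpt of snippet.xml (stocks omitted), written to a temp file.
fname <- tempfile(fileext = ".xml")
writeLines('<deals>
 <deal>
  <sellers><seller><name>Dave</name><address>Street name</address></seller></sellers>
  <buyers>
   <buyer><name>John</name><city>Denver, CO</city></buyer>
   <buyer><name>Pete</name><address>Avenue name</address></buyer>
  </buyers>
 </deal>
 <deal>
  <sellers>
   <seller><name>Linda</name><phone>876-543-210</phone></seller>
   <seller><name>Anne</name><address>Road name</address></seller>
  </sellers>
  <buyers><buyer><name>Monica</name><address>Alley name</address></buyer></buyers>
 </deal>
</deals>', fname)

e <- compact_factory()
invisible(xmlEventParse(fname, e$handlers, branches = e$branches))
res <- e$result()
print(res)
```

Building the result in one pass also avoids the chain of merge() calls; adding another leaf such as city or phone would then only require extending `fields`.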