Using R: scraping an XML file with xpathSApply

Time: 2015-12-16 20:21:40

Tags: xml r

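I am learning R in order to scrape large XML files (up to 100 MB), so I am definitely not a pro. The XML files follow a very strict format: each node is a deal involving sellers (one or more), buyers (one or more), and the types of stock being transferred (one or more). Each of these has one or more details (name, address, etc.). Here is an anonymized snippet: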

<deals>
   <deal>
      <sellers>
         <seller>
            <name>Dave</name>
            <address>Street name</address>
            <city>New York, NY</city>
         </seller>
      </sellers>
      <buyers>
         <buyer>
            <name>John</name>
            <city>Denver, CO</city>
            <phone>123456789</phone>
         </buyer>
         <buyer>
            <name>Pete</name>
            <address>Avenue name</address>
            <city>Kansas, MI</city>
         </buyer>
      </buyers>
      <stocks>
         <stock>
            <id>GOOGL</id>
         </stock>
         <stock>
            <id>MSFT</id>
            <id>0000789019</id>
         </stock>
      </stocks>
   </deal>

   <deal>
      <sellers>
         <seller>
            <name>Linda</name>
            <city>Philadelphia, PA</city>
            <phone>876-543-210</phone>
         </seller>
         <seller>
            <name>Anne</name>
            <address>Road name</address>
         </seller>
      </sellers>
      <buyers>
         <buyer>
            <name>Monica</name>
            <address>Alley name</address>
            <city>Pensacola, CA</city>
         </buyer>
      </buyers>
      <stocks>
         <stock>
            <id>INTC</id>
            <id>0000050863</id>
         </stock>
         <stock>
            <id>DELL</id>
         </stock>
         <stock>
            <id>HPQ</id>
            <id>0000047217</id>
         </stock>
      </stocks>
   </deal>
</deals>

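The problem lies in the "one or more" when trying to scrape the data. For now, I just want to create a data frame with the deal number (a sequence number) and the seller information, and I use the following code: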

library(XML)
xmldata <- xmlRoot(xmlTreeParse("snippet.xml", useInternalNodes = TRUE))
seller_name <- xpathSApply(xmldata, "//deal/sellers/seller/name", xmlValue)
seller_address <- xpathSApply(xmldata, "//deal/sellers/seller/address", xmlValue)
seller_city <- xpathSApply(xmldata, "//deal/sellers/seller/city", xmlValue)
seller_phone <- xpathSApply(xmldata, "//deal/sellers/seller/phone", xmlValue)

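Unfortunately, this does not work, for two reasons. First, I cannot tell which seller belongs to which deal. Second, because many details are optional (address, city, phone number), the vectors have different lengths and I cannot tell to whom a street name or phone number belongs: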

> seller_name
[1] "Dave"  "Linda" "Anne" 
> seller_address
[1] "Street name" "Road name"  
> seller_phone
[1] "876-543-210"

I tried looping over the individual deals with a for loop, but it is far too slow. Any help is much appreciated, thanks!

3 answers:

Answer 0 (score: 1)

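Create a function Value which gives the xmlValue of node[[name]] but returns NA if the result is NULL. Use it to create a function getRow which retrieves one row of data. Finally, apply getRow to the XML input as shown.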

Value <- function(node, name) c(xmlValue(node[[name]]), NA)[1]
getRow <- function(node) sapply(c("name", "address", "city", "phone"), Value, node = node)

t(xpathSApply(xmldata, "//deal/sellers/seller", getRow))

which gives:

     name    address       city               phone        
[1,] "Dave"  "Street name" "New York, NY"     NA           
[2,] "Linda" NA            "Philadelphia, PA" "876-543-210"
[3,] "Anne"  "Road name"   NA                 NA 
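
The matrix above does not yet record which deal each seller belongs to. A minimal sketch of one way to add that, reusing the Value and getRow helpers from this answer: count the sellers inside each deal and repeat the deal's sequence number that many times. The inline XML and the names `doc`, `n_sellers`, `deal_id`, and `seller_df` are illustrative assumptions, not part of the original answer.

```r
library(XML)

# Small inline stand-in for snippet.xml (same structure, fewer fields).
xml_text <- '<deals>
  <deal><sellers>
    <seller><name>Dave</name><address>Street name</address></seller>
  </sellers></deal>
  <deal><sellers>
    <seller><name>Linda</name><phone>876-543-210</phone></seller>
    <seller><name>Anne</name><address>Road name</address></seller>
  </sellers></deal>
</deals>'
doc <- xmlRoot(xmlParse(xml_text, asText = TRUE))

# Helpers from the answer: one value per field, NA when the field is absent.
Value <- function(node, name) c(xmlValue(node[[name]]), NA)[1]
getRow <- function(node) sapply(c("name", "address", "city", "phone"), Value, node = node)

# One row per seller, in document order.
sellers <- t(xpathSApply(doc, "//deal/sellers/seller", getRow))

# Count the sellers in each deal with a relative XPath, then repeat the
# deal's sequence number once per seller.
deals <- getNodeSet(doc, "//deal")
n_sellers <- sapply(deals, function(d) length(getNodeSet(d, "./sellers/seller")))
deal_id <- rep(seq_along(deals), times = n_sellers)

seller_df <- data.frame(deal = deal_id, sellers, stringsAsFactors = FALSE)
print(seller_df)
```

This relies on `//deal/sellers/seller` returning the sellers in document order, so the repeated deal numbers line up with the rows of the matrix.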

Note: for future reproducibility, the input file snippet.xml contains:

<?xml version="1.0" encoding="UTF-8"?>

<deals>
   <deal>
      <sellers>
         <seller>
            <name>Dave</name>
            <address>Street name</address>
            <city>New York, NY</city>
         </seller>
      </sellers>
      <buyers>
         <buyer>
            <name>John</name>
            <city>Denver, CO</city>
            <phone>123456789</phone>
         </buyer>
         <buyer>
            <name>Pete</name>
            <address>Avenue name</address>
            <city>Kansas, MI</city>
         </buyer>
      </buyers>
      <stocks>
         <stock>
            <id>GOOGL</id>
         </stock>
         <stock>
            <id>MSFT</id>
            <id>0000789019</id>
         </stock>
      </stocks>
   </deal>

   <deal>
      <sellers>
         <seller>
            <name>Linda</name>
            <city>Philadelphia, PA</city>
            <phone>876-543-210</phone>
         </seller>
         <seller>
            <name>Anne</name>
            <address>Road name</address>
         </seller>
      </sellers>
      <buyers>
         <buyer>
            <name>Monica</name>
            <address>Alley name</address>
            <city>Pensacola, CA</city>
         </buyer>
      </buyers>
      <stocks>
         <stock>
            <id>INTC</id>
            <id>0000050863</id>
         </stock>
         <stock>
            <id>DELL</id>
         </stock>
         <stock>
            <id>HPQ</id>
            <id>0000047217</id>
         </stock>
      </stocks>
   </deal>
</deals>

Answer 1 (score: 0)

There are two problems here:

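  1. Binding rows that have different numbers of columns -> try dplyr's {{1}}
  2. Losing information when extracting the seller details -> split your XML into individual deals, then loop or apply over those node-set chunks.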
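
The two points can be sketched together as follows. This is only an illustration under stated assumptions: the dplyr function named in the answer is not recoverable from the text, so the sketch pads missing columns in base R instead; the inline XML and the names `rows`, `cols`, and `seller_df` are hypothetical.

```r
library(XML)

# Small inline stand-in for snippet.xml (same structure, fewer fields).
xml_text <- '<deals>
  <deal><sellers>
    <seller><name>Dave</name><address>Street name</address></seller>
  </sellers></deal>
  <deal><sellers>
    <seller><name>Linda</name><phone>876-543-210</phone></seller>
    <seller><name>Anne</name><address>Road name</address></seller>
  </sellers></deal>
</deals>'
doc <- xmlParse(xml_text, asText = TRUE)

# Point 2: split the document into deals, then work on each chunk so the
# deal number is known while the sellers are extracted.
deals <- getNodeSet(doc, "//deal")
rows <- unlist(lapply(seq_along(deals), function(i) {
  lapply(getNodeSet(deals[[i]], "./sellers/seller"), function(s) {
    # Named character vector: the deal number plus whatever fields exist.
    c(deal = as.character(i), sapply(xmlChildren(s), xmlValue))
  })
}), recursive = FALSE)

# Point 1: rows have different fields, so index every row by the union of
# field names; names missing from a row come back as NA.
cols <- unique(unlist(lapply(rows, names)))
mat <- do.call(rbind, lapply(rows, function(r) r[cols]))
colnames(mat) <- cols

seller_df <- as.data.frame(mat, stringsAsFactors = FALSE)
seller_df$deal <- as.integer(seller_df$deal)
print(seller_df)
```

Each seller becomes a named vector with only the fields it actually has, and the padding step turns the ragged list into a rectangular data frame.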

Answer 2 (score: 0)

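I think my strategy would be to use xmlEventParse() to iterate through the file, keeping track of a unique identifier to indicate the different states (e.g., 'deal', 'buyer', 'seller') and of the name, address, etc. associated with that identifier.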

I use environments to accumulate the information.

uid <- 0L
key <- new.env(parent=emptyenv())
name <- new.env(parent=emptyenv())
address <- new.env(parent=emptyenv())

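xmlEventParse() allows callbacks to process nodes. The callbacks are supplied as a named list of functions, with names corresponding to the XML entity that triggers the callback. So to get started I might have a list of 'handlers' that are triggered when the entities are observed. The handlers just increment the unique identifier and record the corresponding state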

handlers=list(deal=function(...) {
    uid <<- uid + 1L
    key[[as.character(uid)]] <- "deal"
}, buyer=function(...) {
    uid <<- uid + 1L
    key[[as.character(uid)]] <- "buyer"
}, seller=function(...) {
    uid <<- uid + 1L
    key[[as.character(uid)]] <- "seller"
})

'Branches' are like handlers, except that they receive the XML node for further computation. These are used to extract the leaf-level information

branches=list(name=function(node) {
    name[[as.character(uid)]] <- xmlValue(node)
}, address=function(node) {
    address[[as.character(uid)]] <- xmlValue(node)
})

It is also useful to have some functions (I call them 'final') to process the collected data, in particular coercing each environment to a data.frame

final=list(key=function() {
    k <- as.list(key)
    data.frame(uid=as.integer(names(k)), value=as.character(k))
}, name=function() {
    k <- as.list(name)
    data.frame(uid=as.integer(names(k)), name=as.character(k),
               stringsAsFactors=FALSE)
}, address=function() {
    k <- as.list(address)
    data.frame(uid=as.integer(names(k)), address=as.character(k),
               stringsAsFactors=FALSE)
})

I put all of this in a factory that I can use to create independent instances for parsing my files

events_factory <- function() {
    uid <- 0L
    key <- new.env(parent=emptyenv())
    name <- new.env(parent=emptyenv())
    address <- new.env(parent=emptyenv())

    list(handlers=list(deal=function(...) {
             uid <<- uid + 1L
             key[[as.character(uid)]] <- "deal"
         }, buyer=function(...) {
             uid <<- uid + 1L
             key[[as.character(uid)]] <- "buyer"
         }, seller=function(...) {
             uid <<- uid + 1L
             key[[as.character(uid)]] <- "seller"
         }),

         branches=list(name=function(node) {
             name[[as.character(uid)]] <- xmlValue(node)
         }, address=function(node) {
             address[[as.character(uid)]] <- xmlValue(node)
         }),

         final=list(key=function() {
             k <- as.list(key)
             data.frame(uid=as.integer(names(k)), value=as.character(k))
         }, name=function() {
             k <- as.list(name)
             data.frame(uid=as.integer(names(k)), name=as.character(k),
                        stringsAsFactors=FALSE)
         }, address=function() {
             k <- as.list(address)
             data.frame(uid=as.integer(names(k)), address=as.character(k),
                        stringsAsFactors=FALSE)
         }))
}

In use, the code looks like

library(XML)
fname <- "~/Downloads/snippet.xml"
e <- events_factory()
invisible(xmlEventParse(fname, e$handlers, branches=e$branches))
Reduce(function(x, y) merge(x, y, all.x=TRUE),
       lapply(e$final, do.call, list()))

resulting in

> Reduce(function(x, y) merge(x, y, all.x=TRUE),
+        lapply(e$final, do.call, list()))
  uid  value   name     address
1   1   deal   <NA>        <NA>
2   2 seller   Dave Street name
3   3  buyer   John        <NA>
4   4  buyer   Pete Avenue name
5   5   deal   <NA>        <NA>
6   6 seller  Linda        <NA>
7   7 seller   Anne   Road name
8   8  buyer Monica  Alley name

There is a lot of repetition in the code, so there are probably clever ways to make it more compact.
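
One possible way to cut the repetition is to generate the closures from vectors of tag names. This is sketched on the assumption that every state only needs to be recorded and every leaf field only needs its text. The names `compact_factory`, `states`, `fields`, `store`, `to_df`, and `result` are hypothetical, and the inline XML stands in for snippet.xml; treat this as a sketch rather than a tested replacement for the factory above.

```r
library(XML)

compact_factory <- function(states = c("deal", "buyer", "seller"),
                            fields = c("name", "address")) {
  uid <- 0L
  key <- new.env(parent = emptyenv())
  # One accumulator environment per leaf field.
  store <- setNames(lapply(fields, function(f) new.env(parent = emptyenv())),
                    fields)

  # Handler: bump the identifier and record which state it belongs to.
  make_handler <- function(state) {
    force(state)
    function(...) {
      uid <<- uid + 1L
      assign(as.character(uid), state, envir = key)
    }
  }
  # Branch: store the node text under the current identifier.
  make_branch <- function(field) {
    force(field)
    function(node) {
      assign(as.character(uid), xmlValue(node), envir = store[[field]])
    }
  }
  # Coerce one environment to a two-column data.frame keyed by uid.
  to_df <- function(env, col) {
    k <- as.list(env)
    df <- data.frame(uid = as.integer(names(k)), value = as.character(k),
                     stringsAsFactors = FALSE)
    names(df)[2] <- col
    df
  }

  list(handlers = setNames(lapply(states, make_handler), states),
       branches = setNames(lapply(fields, make_branch), fields),
       result = function() {
         tables <- c(list(to_df(key, "value")),
                     lapply(fields, function(f) to_df(store[[f]], f)))
         Reduce(function(x, y) merge(x, y, all.x = TRUE), tables)
       })
}

# Inline stand-in for snippet.xml, parsed from text instead of a file.
xml_text <- paste0(
  "<deals><deal>",
  "<sellers><seller><name>Dave</name><address>Street name</address></seller></sellers>",
  "<buyers><buyer><name>John</name></buyer></buyers>",
  "</deal></deals>")

e <- compact_factory()
invisible(xmlEventParse(xml_text, e$handlers, branches = e$branches,
                        asText = TRUE))
res <- e$result()
print(res)
```

Adding another leaf field such as city or phone would then only mean extending the `fields` argument, rather than writing another branch and another 'final' function by hand.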