R: Parsing XML with transactions into a flat file?

Asked: 2016-05-06 09:40:09

Tags: xml r xml-parsing

Unfortunately, I can't parse the following sample, and I couldn't find a similar solution here. For example:

<?xml version="1.0" encoding="UTF-8"?>
<FISC V="1">
<EJ ID="61017">

<DAT V="1" FN="0000000000" ZN="6101201227" TN="000000000000" T="0">
  <C>
    <P C="1" NM="Good1" PRC="2500" Q="2000" SM="5000" TX="1" N="1" />
    <P C="4" NM="Good4" PRC="1000" Q="1000" SM="1000" TX="1" N="2" />
    <M NM="CASH" SM="6000" T="0" N="3" />
    <E CS="2" NO="4730" SM="6000" N="4">
      <TX DTPR="0.00" TX="1" TXPR="0.00" TXSM="0" TXTY="0" />
      <TX DTPR="0.00" TX="0" TXPR="0.00" TXSM="0" TXTY="0" />
    </E>
  </C>
  <TS>20140601101226</TS>
</DAT>

<DAT V="1" FN="0000000000" ZN="6101201227" TN="000000000000" T="0">
  <C>
    <P C="7" NM="Good7" PRC="1200" Q="1000" SM="1200" TX="1" N="1" />
    <M NM="CAH" SM="1200" T="0" N="2" />
    <E CS="2" NO="4731" SM="1200" N="3">
      <TX DTPR="0.00" TX="1" TXPR="0.00" TXSM="0" TXTY="0" />
      <TX DTPR="0.00" TX="0" TXPR="0.00" TXSM="0" TXTY="0" />
    </E>
  </C>
  <TS>20140601104322</TS>
</DAT>

</EJ>
</FISC>

I would like to flatten it to the following:

NO      NM
4730    Good1
4730    Good4
4731    Good7

NO - attribute from DAT/C/E

NM - attribute from DAT/C/P

What I have tried:

require(XML)
test <- xmlParse('data.xml', encoding = 'UTF-8')
NM <- getNodeSet(test, "/FISC/EJ//P")
NO <- getNodeSet(test, "/FISC/EJ//E[@NO]")

require(rvest)
d <- read_html('data.xml', encoding = 'UTF-8')
ids <- data.frame(id = d %>% html_nodes("e") %>% html_attr("no"),
                  name = d %>% html_nodes("p") %>% html_attr("nm"))

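But each DAT node has one or more P child nodes (under C) and only one E node, so the two results have different lengths. That is why I can't bind them together.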
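To make the mismatch concrete, here is a minimal sketch. It assumes the xml2 package and a trimmed copy of the sample in which only the NM and NO attributes are kept; the variable names are just for illustration.

```r
library(xml2)

# Trimmed copy of the sample: two DAT transactions,
# the first with two P items, the second with one.
doc <- read_xml('<FISC V="1"><EJ ID="61017">
<DAT><C><P NM="Good1"/><P NM="Good4"/><E NO="4730"/></C></DAT>
<DAT><C><P NM="Good7"/><E NO="4731"/></C></DAT>
</EJ></FISC>')

# Selecting P and E independently loses the link between them
nm <- xml_attr(xml_find_all(doc, "//DAT/C/P"), "NM")
no <- xml_attr(xml_find_all(doc, "//DAT/C/E"), "NO")

length(nm)  # 3 items
length(no)  # 2 receipts
```

Because the two vectors are collected separately, nothing records which NO belongs to which NM, so they cannot simply be put side by side in a data frame.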

Any help is much appreciated, thanks.

1 answer:

Answer 0 (score: 0)

library(xml2)
library(purrr)
library(tibble)

read_xml('<?xml version="1.0" encoding="UTF-8"?>
<FISC V="1">
<EJ ID="61017">

<DAT V="1" FN="0000000000" ZN="6101201227" TN="000000000000" T="0">
  <C>
    <P C="1" NM="Good1" PRC="2500" Q="2000" SM="5000" TX="1" N="1" />
    <P C="4" NM="Good4" PRC="1000" Q="1000" SM="1000" TX="1" N="2" />
    <M NM="CASH" SM="6000" T="0" N="3" />
    <E CS="2" NO="4730" SM="6000" N="4">
      <TX DTPR="0.00" TX="1" TXPR="0.00" TXSM="0" TXTY="0" />
      <TX DTPR="0.00" TX="0" TXPR="0.00" TXSM="0" TXTY="0" />
    </E>
  </C>
  <TS>20140601101226</TS>
</DAT>

<DAT V="1" FN="0000000000" ZN="6101201227" TN="000000000000" T="0">
  <C>
    <P C="7" NM="Good7" PRC="1200" Q="1000" SM="1200" TX="1" N="1" />
    <M NM="CAH" SM="1200" T="0" N="2" />
    <E CS="2" NO="4731" SM="1200" N="3">
      <TX DTPR="0.00" TX="1" TXPR="0.00" TXSM="0" TXTY="0" />
      <TX DTPR="0.00" TX="0" TXPR="0.00" TXSM="0" TXTY="0" />
    </E>
  </C>
  <TS>20140601104322</TS>
</DAT>

</EJ>
</FISC>') -> doc

# target the "E" nodes and iterate over them

map_df(xml_find_all(doc, "//DAT/C/E"), function(x) {

  # target the sibling nodes of the current "E" node

  p <- xml_find_all(x, "../P")

  # extract the attributes you want

  no <- xml_attr(x, "NO")
  nm <- xml_attr(p, "NM")

  # make a data frame from them
  # map_df() will bind them all together for you

  # tibble() replaces the deprecated data_frame(); the scalar NO is recycled
  tibble(NO = no, NM = nm)

})

## Source: local data frame [3 x 2]
## 
##      NO    NM
##   (chr) (chr)
## 1  4730 Good1
## 2  4730 Good4
## 3  4731 Good7
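
The same table can probably also be built without an explicit loop by starting from the P nodes instead. The following is a sketch, not part of the original answer: it assumes xml2's `xml_find_first()`, which returns one match per input node, and uses a trimmed copy of the sample (only the NM and NO attributes kept). The names `items`, `receipt`, and `result` are illustrative.

```r
library(xml2)

# Trimmed copy of the sample document
doc <- read_xml('<FISC V="1"><EJ ID="61017">
<DAT><C><P NM="Good1"/><P NM="Good4"/><E NO="4730"/></C></DAT>
<DAT><C><P NM="Good7"/><E NO="4731"/></C></DAT>
</EJ></FISC>')

# One row per P node
items <- xml_find_all(doc, "//DAT/C/P")

# For each P, look up the E node that shares its parent C;
# xml_find_first() keeps the result the same length as `items`
receipt <- xml_find_first(items, "../E")

result <- data.frame(
  NO = xml_attr(receipt, "NO"),
  NM = xml_attr(items, "NM"),
  stringsAsFactors = FALSE
)
result
```

Going from P up to E keeps the two columns aligned by construction. A P node whose C has no E would get NA in the NO column rather than being dropped.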