Complex XML parsing in R

Date: 2018-08-24 18:14:16

Tags: r xml parsing

I am trying to parse a nested XML file.

<GENERIC_ROUGHDRAFT>
     <HEADER compName="California" dateCreated="2018-08-07">
      <COMP_INFO/>
    </HEADER>
     <COVERSHEET>
      <ESTIMATE_INFO eName="MATTHEW_ANDERSON" iName="Matthew Anderson" priceList="MAY18" laborEff="Restoration/Service/Remodel" claimNumber="01" policyNumber="00000000000" typeOfLoss="Hail" deprNonMat="1" deprOandP="1" deprTaxes="1" onsite="1" recipientsXNAddress="California_BD" carrierId="111111" estimateType="Structural"/>
      <ADDRESSES>
       <ADDRESS type="Property" street="123 Street Cr" city="Idaho Falls" state="ID" zip="00000" primary="1"/>
       <ADDRESS type="Home" street="123 Street Cr" city="City" state="ID" zip="00000"/>
      </ADDRESSES>
  </COVERSHEET>
</GENERIC_ROUGHDRAFT>

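I am trying to extract information such as iName and priceList.

For my final product, I would like a data frame with a single row containing only the following:

compName | dateCreated | iName | Type | Street | State

There is very little documentation on how to extract several pieces of data from within a single node.

Any suggestions?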

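1 Answer:

Answer 0: (score: 3)

XML files can be deeply nested, which makes them hard to convert directly into a data.frame. In my opinion, the easiest way to extract data from such files is to use XSLT to reshape them into a tabular format.

Using the sample data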

library(xml2)
xml <- read_xml('<GENERIC_ROUGHDRAFT>
     <HEADER compName="California" dateCreated="2018-08-07">
      <COMP_INFO/>
    </HEADER>
     <COVERSHEET>
      <ESTIMATE_INFO eName="MATTHEW_ANDERSON" iName="Matthew Anderson" priceList="MAY18" laborEff="Restoration/Service/Remodel" claimNumber="01" policyNumber="00000000000" typeOfLoss="Hail" deprNonMat="1" deprOandP="1" deprTaxes="1" onsite="1" recipientsXNAddress="California_BD" carrierId="111111" estimateType="Structural"/>
      <ADDRESSES>
       <ADDRESS type="Property" street="123 Street Cr" city="Idaho Falls" state="ID" zip="00000" primary="1"/>
       <ADDRESS type="Home" street="123 Street Cr" city="City" state="ID" zip="00000"/>
      </ADDRESSES>
  </COVERSHEET>
</GENERIC_ROUGHDRAFT>')

We can define an XSLT stylesheet that transforms the data into an HTML table, with one row per ADDRESS node

xsl <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/GENERIC_ROUGHDRAFT">
    <html>
    <table>
    <tr>
    <td>compName</td><td>dateCreated</td>
    <td>iName</td><td>type</td><td>street</td><td>state</td>
    </tr>
    <xsl:for-each select="//ADDRESS">
    <tr>
    <td><xsl:value-of select="../../../HEADER/@compName"/></td>
    <td><xsl:value-of select="../../../HEADER/@dateCreated"/></td>
    <td><xsl:value-of select="../../ESTIMATE_INFO/@iName"/></td>
    <td><xsl:value-of select="@type"/></td>
    <td><xsl:value-of select="@street"/></td>
    <td><xsl:value-of select="@state"/></td>
    </tr>
    </xsl:for-each>
    </table>
    </html>
    </xsl:template>
    </xsl:stylesheet>')

I made the output an HTML table so that rvest::html_table can turn it into a data.frame. That can be done with

library(xslt)
library(rvest)
data <- xml_xslt(xml, xsl) %>% html_table(header = TRUE)  %>% .[[1]]
#     compName dateCreated            iName     type        street state
# 1 California  2018-08-07 Matthew Anderson Property 123 Street Cr    ID
# 2 California  2018-08-07 Matthew Anderson     Home 123 Street Cr    ID
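
For comparison, here is a minimal sketch of a different technique: reading the attributes directly with xml2's XPath helpers (xml_find_first, xml_find_all, xml_attr), with no XSLT step. It is not the method used above. The names doc, header, estimate, addresses, direct and one_row are illustrative, and the sample is trimmed to the attributes that are actually used. It assumes the document has a single HEADER and a single ESTIMATE_INFO, so their values can be recycled across all ADDRESS rows.

```r
library(xml2)

# Trimmed copy of the sample document (only the attributes used below).
doc <- read_xml('<GENERIC_ROUGHDRAFT>
  <HEADER compName="California" dateCreated="2018-08-07"><COMP_INFO/></HEADER>
  <COVERSHEET>
    <ESTIMATE_INFO iName="Matthew Anderson" priceList="MAY18"/>
    <ADDRESSES>
      <ADDRESS type="Property" street="123 Street Cr" state="ID" primary="1"/>
      <ADDRESS type="Home" street="123 Street Cr" state="ID"/>
    </ADDRESSES>
  </COVERSHEET>
</GENERIC_ROUGHDRAFT>')

# Single nodes: assumed to occur exactly once in the document.
header   <- xml_find_first(doc, "/GENERIC_ROUGHDRAFT/HEADER")
estimate <- xml_find_first(doc, "/GENERIC_ROUGHDRAFT/COVERSHEET/ESTIMATE_INFO")

# Repeating nodes: one data.frame row per ADDRESS.
addresses <- xml_find_all(doc, "/GENERIC_ROUGHDRAFT/COVERSHEET/ADDRESSES/ADDRESS")

# Length-1 values are recycled to match the number of ADDRESS nodes.
direct <- data.frame(
  compName    = xml_attr(header, "compName"),
  dateCreated = xml_attr(header, "dateCreated"),
  iName       = xml_attr(estimate, "iName"),
  type        = xml_attr(addresses, "type"),
  street      = xml_attr(addresses, "street"),
  state       = xml_attr(addresses, "state"),
  stringsAsFactors = FALSE
)

# Keep only the address flagged primary="1"; a missing attribute is NA,
# which %in% treats as no match.
one_row <- direct[xml_attr(addresses, "primary") %in% "1", ]
```

If the single-row result from the question is what you need, one_row holds just the primary address. With several ESTIMATE_INFO or HEADER nodes the recycling assumption no longer holds, and the XSLT approach above (or a per-node loop) is likely the safer choice.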