I am trying to parse a nested XML file.
<GENERIC_ROUGHDRAFT>
<HEADER compName="California" dateCreated="2018-08-07">
<COMP_INFO/>
</HEADER>
<COVERSHEET>
<ESTIMATE_INFO eName="MATTHEW_ANDERSON" iName="Matthew Anderson" priceList="MAY18" laborEff="Restoration/Service/Remodel" claimNumber="01" policyNumber="00000000000" typeOfLoss="Hail" deprNonMat="1" deprOandP="1" deprTaxes="1" onsite="1" recipientsXNAddress="California_BD" carrierId="111111" estimateType="Structural"/>
<ADDRESSES>
<ADDRESS type="Property" street="123 Street Cr" city="Idaho Falls" state="ID" zip="00000" primary="1"/>
<ADDRESS type="Home" street="123 Street Cr" city="City" state="ID" zip="00000"/>
</ADDRESSES>
</COVERSHEET>
</GENERIC_ROUGHDRAFT>
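I am trying to extract information such as iName and priceList.
For my final product, I would like a data frame whose rows contain only the following columns:
compName | dateCreated | iName | Type | Street | State
There is very little documentation on how to extract several pieces of data from within a single element.
Any suggestions?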
Answer 0 (score: 3)
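XML files can be deeply nested, which makes them hard to convert directly into a data.frame. In my opinion, the easiest way to extract data from such files is to use xslt to reshape them into a tabular format.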
Using the sample data
library(xml2)
xml <- read_xml('<GENERIC_ROUGHDRAFT>
<HEADER compName="California" dateCreated="2018-08-07">
<COMP_INFO/>
</HEADER>
<COVERSHEET>
<ESTIMATE_INFO eName="MATTHEW_ANDERSON" iName="Matthew Anderson" priceList="MAY18" laborEff="Restoration/Service/Remodel" claimNumber="01" policyNumber="00000000000" typeOfLoss="Hail" deprNonMat="1" deprOandP="1" deprTaxes="1" onsite="1" recipientsXNAddress="California_BD" carrierId="111111" estimateType="Structural"/>
<ADDRESSES>
<ADDRESS type="Property" street="123 Street Cr" city="Idaho Falls" state="ID" zip="00000" primary="1"/>
<ADDRESS type="Home" street="123 Street Cr" city="City" state="ID" zip="00000"/>
</ADDRESSES>
</COVERSHEET>
</GENERIC_ROUGHDRAFT>')
we can define an XSLT stylesheet that transforms the data into an HTML table, with one row per ADDRESS element
xsl <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/GENERIC_ROUGHDRAFT">
<html>
<table>
<tr>
<td>compName</td><td>dateCreated</td>
<td>iName</td><td>type</td><td>street</td><td>state</td>
</tr>
<xsl:for-each select="//ADDRESS">
<tr>
<td><xsl:value-of select="../../../HEADER/@compName"/></td>
<td><xsl:value-of select="../../../HEADER/@dateCreated"/></td>
<td><xsl:value-of select="../../ESTIMATE_INFO/@iName"/></td>
<td><xsl:value-of select="@type"/></td>
<td><xsl:value-of select="@street"/></td>
<td><xsl:value-of select="@state"/></td>
</tr>
</xsl:for-each>
</table>
</html>
</xsl:template>
</xsl:stylesheet>')
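The relative paths in the stylesheet climb from each ADDRESS node up to its ancestors. As a rough check of that navigation, the sketch below evaluates the same paths with xml2 on a trimmed-down copy of the sample document. The names doc, addr, comp_name and i_name are illustrative and not part of the original answer.

```r
library(xml2)

# Trimmed-down copy of the sample document (illustrative only)
doc <- read_xml('<GENERIC_ROUGHDRAFT>
  <HEADER compName="California" dateCreated="2018-08-07"/>
  <COVERSHEET>
    <ESTIMATE_INFO iName="Matthew Anderson"/>
    <ADDRESSES>
      <ADDRESS type="Property" state="ID"/>
    </ADDRESSES>
  </COVERSHEET>
</GENERIC_ROUGHDRAFT>')

# Start from an ADDRESS node, as the xsl:for-each loop does
addr <- xml_find_first(doc, "//ADDRESS")

# Each ".." step moves one level up the tree
parent_1 <- xml_name(xml_find_first(addr, ".."))        # ADDRESSES
parent_2 <- xml_name(xml_find_first(addr, "../.."))     # COVERSHEET
parent_3 <- xml_name(xml_find_first(addr, "../../.."))  # GENERIC_ROUGHDRAFT

# The same relative paths the stylesheet uses to reach the attributes
comp_name <- xml_attr(xml_find_first(addr, "../../../HEADER"), "compName")
i_name    <- xml_attr(xml_find_first(addr, "../../ESTIMATE_INFO"), "iName")
```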
The stylesheet emits an HTML table so that rvest::html_table can be used to turn the result into a data.frame. This can be done with
library(xslt)
library(rvest)
data <- xml_xslt(xml, xsl) %>% html_table(header = TRUE) %>% .[[1]]
#     compName dateCreated            iName     type        street state
# 1 California  2018-08-07 Matthew Anderson Property 123 Street Cr    ID
# 2 California  2018-08-07 Matthew Anderson     Home 123 Street Cr    ID
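For a document this small, the same table can probably be built without XSLT by reading the attributes directly with xml2. The following is only a sketch of that alternative, not the method used in this answer. It assumes a single HEADER and a single ESTIMATE_INFO per file, and the names doc, header, estimate, addresses and result are illustrative. It also pulls priceList, which the question mentions.

```r
library(xml2)

# Shortened copy of the sample document (attributes trimmed for brevity)
doc <- read_xml('<GENERIC_ROUGHDRAFT>
  <HEADER compName="California" dateCreated="2018-08-07"><COMP_INFO/></HEADER>
  <COVERSHEET>
    <ESTIMATE_INFO iName="Matthew Anderson" priceList="MAY18"/>
    <ADDRESSES>
      <ADDRESS type="Property" street="123 Street Cr" state="ID"/>
      <ADDRESS type="Home" street="123 Street Cr" state="ID"/>
    </ADDRESSES>
  </COVERSHEET>
</GENERIC_ROUGHDRAFT>')

# Assumption: exactly one HEADER and one ESTIMATE_INFO per document
header    <- xml_find_first(doc, "/GENERIC_ROUGHDRAFT/HEADER")
estimate  <- xml_find_first(doc, "//ESTIMATE_INFO")
addresses <- xml_find_all(doc, "//ADDRESS")

# xml_attr() is vectorised over the ADDRESS node set; the single-valued
# header and estimate attributes are recycled across the address rows
result <- data.frame(
  compName    = xml_attr(header, "compName"),
  dateCreated = xml_attr(header, "dateCreated"),
  iName       = xml_attr(estimate, "iName"),
  priceList   = xml_attr(estimate, "priceList"),
  type        = xml_attr(addresses, "type"),
  street      = xml_attr(addresses, "street"),
  state       = xml_attr(addresses, "state"),
  stringsAsFactors = FALSE
)
```

The XSLT route tends to scale better once several levels of repeated elements have to be joined. The direct approach is easier to read when only one element repeats.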