In R

Date: 2016-06-17 12:42:09

Tags: r xml dataframe

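I have multiple XML files from which I want to extract certain parts at a specific level and store the values in a data.frame. The level always has the same name: "Invoice".

I want to extract the data from the "Invoice" level. Each child of this level should become one row entity. For each row entity, the value, confidence, and zone should be extracted.

The only problem is that the number of entities differs from document to document.

The data.frame should look like this: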

Doc. Nr.    Entity             Value         Zone               Confidence
doc1        OcrText            Text example  19 101 941 2625    76
doc1        InvoiceDate        17/06/2016    105 8 862 1555     100
doc1        InvoiceDate__day   17            105 8 862 1555     100

With the packages rvest and XML, I can extract the zone:

read_xml(xmlfile) %>% xml_nodes("Invoice") %>% xml_nodes("zone") %>% xml_text()

But I cannot extract the value and confidence, nor the names of all the children of the "Invoice" level.

Here is an example of one of the XML files:

<?xml version="1.0" encoding="utf-8"?>
<DOKuStar baseType="documentType" state="Ok" confidence="0" version="2.0">
  <Invoice baseType="documentType" state="Ok" confidence="0" producer="DOKuStar">
    <sources>
      <image guid=" fec8" />
    </sources>
    <OcrText baseType="fieldType" state="Reject" confidence="76">
      <value> Text example
      </value>
      <zone>19 101 941 2625</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </OcrText>
    <InvoiceDate baseType="fieldType" state="Empty" confidence="100" class="dateType">
      <value>17-06-2016
      </value>
      <zone>105 8 862 1555</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceDate>
    <annotations>
      <annotation key="FileOutputPath">E:\..\Outgoing\</annotation>
    </annotations>
    <InvoiceDate__day baseType="fieldType" state="Empty" confidence="100">
      <value>17
      </value>
      <zone>105 8 862 1555</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceDate__day>
    <InvoiceDate__month baseType="fieldType" state="Empty" confidence="100">
      <value>06
      </value>
      <zone>105 8 862 1555</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceDate__month>
    <InvoiceDate__year baseType="fieldType" state="Empty" confidence="100">
      <value>2016
      </value>
      <zone>105 8 862 1555</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceDate__year>
    <InvoiceNumber baseType="fieldType" state="Empty" confidence="100">
      <value>12365
      </value>
      <zone>105 80 862 1555</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceNumber>
    <InvoiceTotalsTotalAmount baseType="fieldType" state="Ok" confidence="87">
      <value>21.98</value>
      <zone>595 2062 77 34</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceTotalsTotalAmount>
    <InvoiceTotalsNetAmount baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceTotalsNetAmount>
    <InvoiceTotalsVatAmount baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceTotalsVatAmount>
    <InvoiceTotalsCurrency baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceTotalsCurrency>
    <InvoiceTotals baseType="tableType" state="Ok" confidence="87">
      <value>21.98                  </value>
      <zone>595 2062 77 34</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
      <row baseType="tableRowType" state="Ok" confidence="0">
        <TotalAmount baseType="fieldType" state="Ok" confidence="100">
          <value>3.10</value>
          <zone>596 2029 63 30</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </TotalAmount>
        <NetAmount baseType="fieldType" state="Ok" confidence="69">
          <value>2.56</value>
          <zone>287 2031 64 31</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </NetAmount>
        <VatAmount baseType="fieldType" state="Ok" confidence="78">
          <value>0.54</value>
          <zone>444 2030 59 31</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </VatAmount>
        <VatRate baseType="fieldType" state="Ok" confidence="83">
          <value>21.00</value>
          <zone>141 2035 30 26</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </VatRate>
        <Currency baseType="fieldType" state="Empty" confidence="0">
          <value>
          </value>
          <zone>0 8 967 2974</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </Currency>
        <Type baseType="fieldType" state="Ok" confidence="0">
          <value>Vat</value>
        </Type>
      </row>
      <row baseType="tableRowType" state="Ok" confidence="0">
        <TotalAmount baseType="fieldType" state="Ok" confidence="56">
          <value>18.88</value>
          <zone>603 1993 73 33</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </TotalAmount>
        <NetAmount baseType="fieldType" state="Empty" confidence="0">
          <value>
          </value>
          <zone>0 8 967 2974</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </NetAmount>
        <VatAmount baseType="fieldType" state="Ok" confidence="57">
          <value>2.99</value>
          <zone>653 1311 62 33</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </VatAmount>
        <VatRate baseType="fieldType" state="Empty" confidence="0">
          <value>
          </value>
          <zone>0 8 967 2974</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </VatRate>
        <Currency baseType="fieldType" state="Empty" confidence="0">
          <value>
          </value>
          <zone>0 8 967 2974</zone>
          <sources>
            <image guid=" fec8" />
          </sources>
        </Currency>
        <Type baseType="fieldType" state="Ok" confidence="0">
          <value>Vat</value>
        </Type>
      </row>
    </InvoiceTotals>
    <Address baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address>
    <Address__firstname baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__firstname>
    <Address__lastname baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__lastname>
    <Address__city baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__city>
    <Address__cityline baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__cityline>
    <Address__nameline baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__nameline>
    <Address__streetline baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__streetline>
    <Address__streetname baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__streetname>
    <Address__streetnumber baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__streetnumber>
    <Address__zipcode baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Address__zipcode>
    <Postcode baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Postcode>
    <BankAccountNumber baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>215 15 1 1</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </BankAccountNumber>
    <InvoiceAcceptgiroCode baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>215 15 1 1</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </InvoiceAcceptgiroCode>
    <Website baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </Website>
    <EmailAddress baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </EmailAddress>
    <BICCode baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>215 15 1 1</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </BICCode>
    <CoCNumber baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>215 15 1 1</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </CoCNumber>
    <DebtorNumber baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>215 15 1 1</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </DebtorNumber>
    <IBANCode baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </IBANCode>
    <IsCreditNote baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>105 8 862 1555</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </IsCreditNote>
    <IsKvKInvoice baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </IsKvKInvoice>
    <VATNumber baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>0 8 967 2974</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </VATNumber>
    <ScanFormAdministration baseType="fieldType" state="Empty" confidence="0">
      <value>
      </value>
      <zone>215 15 1 1</zone>
      <sources>
        <image guid=" fec8" />
      </sources>
    </ScanFormAdministration>
  </Invoice>
  <sourceInstances>
  </sourceInstances>
  <annotations>
  </annotations>
 </DOKuStar>

2 Answers:

Answer 0: (score: 1)

Here is another version, without any error checking:

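library(xml2)

# All direct children of the <Invoice> node
invoices <- xml_find_all(read_xml("xmltext2.xml"), "//Invoice")
children <- xml_children(invoices)

Entity <- xml_name(children)
Confidence <- xml_attr(children, "confidence")
df <- data.frame(Entity, Confidence)
# Drop children without a confidence attribute (sources, annotations)
df <- df[complete.cases(df), ]
# Assumes every remaining child has exactly one <value> and one <zone>
Value <- xml_text(xml_find_all(children, "value"))
Zone <- xml_text(xml_find_all(children, "zone"))
df <- cbind(df, Value, Zone)
# Strip the line breaks and padding around the values
df$Value <- trimws(df$Value)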

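This works for the sample provided. Stray nodes such as sources and annotations are dropped. With more work, it could be extended to capture the invoice subtotals.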
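The code above relies on the Value and Zone vectors having the same length as the filtered rows, which holds for the sample but may not hold for every document. A minimal sketch of a more defensive variant follows, assuming the xml2 package is installed. The helper name `extract_invoice_fields`, the `doc_id` argument, and the trimmed `sample_xml` string are hypothetical and exist only for illustration. It uses `xml_find_first()`, which returns one result per input node (a missing node where nothing matches), so a field without a `<zone>` yields NA instead of shifting the column.

```r
library(xml2)

# Hypothetical helper: one row per child of <Invoice> that carries a
# "confidence" attribute (this skips <sources> and <annotations>).
extract_invoice_fields <- function(x, doc_id) {
  doc    <- read_xml(x)  # x may be a file path or literal XML
  fields <- xml_find_all(doc, "/DOKuStar/Invoice/*[@confidence]")
  data.frame(
    Doc        = rep(doc_id, length(fields)),
    Entity     = xml_name(fields),
    # xml_find_first() keeps the output the same length as `fields`,
    # so the columns stay aligned even when a child is absent.
    Value      = trimws(xml_text(xml_find_first(fields, "./value"))),
    Zone       = trimws(xml_text(xml_find_first(fields, "./zone"))),
    Confidence = as.integer(xml_attr(fields, "confidence")),
    stringsAsFactors = FALSE
  )
}

# Trimmed, made-up sample in the same shape as the question's file
sample_xml <- '<DOKuStar>
  <Invoice confidence="0">
    <sources><image guid="fec8"/></sources>
    <OcrText confidence="76"><value> Text example
    </value><zone>19 101 941 2625</zone></OcrText>
    <Type confidence="0"><value>Vat</value></Type>
  </Invoice>
</DOKuStar>'

df <- extract_invoice_fields(sample_xml, "doc1")
```

For several files, the same helper could be applied to each path and the results stacked, for example with `do.call(rbind, Map(extract_invoice_fields, files, ids))`, where `files` and `ids` are vectors of paths and document labels that you supply.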

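Answer 1: (score: 0)

Consider transforming the original XML with XSLT, a special-purpose declarative language for restructuring XML files to meet end-use needs. Once the data is flattened and simplified, you can read it into R with a simple xmlToDataFrame().

Although R does not have a universal XSLT 1.0 processor among its popular packages, R can use system() calls to leverage XSLT processors through other general-purpose languages (Java, Python, PHP, VB), command-line interpreters (Bash, PowerShell), or dedicated installed XSLT processors (Xalan, Saxon):

XSLT script (save as .xsl and use in the external call/script of one of the programs above)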

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

    <xsl:template match="DOKuStar">
        <data>
            <xsl:apply-templates select="Invoice/*" />
        </data>
    </xsl:template>

    <!-- Extracts all children of Invoice Element -->
    <xsl:template match="Invoice/*">
        <row>
            <doc.nr.>doc1</doc.nr.>
            <entity><xsl:value-of select="local-name()"/></entity>
            <value><xsl:apply-templates select="value"/></value>
            <zone><xsl:apply-templates select="zone"/></zone>
            <confidence><xsl:value-of select="@confidence"/></confidence>
        </row>
    </xsl:template>

    <!-- Remove multiple white space in value and zone text values -->
    <xsl:template match="value|zone">       
          <xsl:value-of select="normalize-space(.)"/>       
    </xsl:template>

</xsl:transform>

XML output

<?xml version='1.0' encoding='UTF-8'?>
<data>
  <row>
    <doc.nr.>doc1</doc.nr.>
    <entity>sources</entity>
    <value/>
    <zone/>
    <confidence/>
  </row>
  <row>
    <doc.nr.>doc1</doc.nr.>
    <entity>OcrText</entity>
    <value>Text example</value>
    <zone>19 101 941 2625</zone>
    <confidence>76</confidence>
  </row>
  <row>
    <doc.nr.>doc1</doc.nr.>
    <entity>InvoiceDate</entity>
    <value>17-06-2016</value>
    <zone>105 8 862 1555</zone>
    <confidence>100</confidence>
  </row>
  <row>
    <doc.nr.>doc1</doc.nr.>
    <entity>annotations</entity>
    <value/>
    <zone/>
    <confidence/>
  </row>
...

R script (runs the Python XSLT transformation script, which outputs a file that R then reads in)

library(XML)

system('python "C:\\Path\\To\\TransformScript.py"')
df <- xmlToDataFrame("C:\\Path\\To\\Output.xml", nodes = "row")

df

#    doc.nr.                   entity        value            zone confidence
# 1     doc1                  sources
# 2     doc1                  OcrText Text example 19 101 941 2625         76
# 3     doc1              InvoiceDate   17-06-2016  105 8 862 1555        100
# 4     doc1              annotations
# 5     doc1         InvoiceDate__day           17  105 8 862 1555        100
# 6     doc1       InvoiceDate__month           06  105 8 862 1555        100
# 7     doc1        InvoiceDate__year         2016  105 8 862 1555        100
# 8     doc1            InvoiceNumber        12365 105 80 862 1555        100
# 9     doc1 InvoiceTotalsTotalAmount        21.98  595 2062 77 34         87
# 10    doc1   InvoiceTotalsNetAmount                 0 8 967 2974          0
# 11    doc1   InvoiceTotalsVatAmount                 0 8 967 2974          0
# ...
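
As an alternative to an external call, the xslt package on CRAN (a libxslt wrapper that works on xml2 documents) can apply an XSLT 1.0 stylesheet from within R. The following is a minimal sketch, assuming the xslt and xml2 packages are installed. The stylesheet is a cut-down variant of the one above that only matches Invoice children with a confidence attribute, and the inline input is a trimmed, made-up sample rather than the full file.

```r
library(xml2)
library(xslt)

# Cut-down stylesheet: flattens each Invoice child that has a
# confidence attribute into a <row> element.
style <- read_xml('<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/DOKuStar">
    <data><xsl:apply-templates select="Invoice/*[@confidence]"/></data>
  </xsl:template>
  <xsl:template match="Invoice/*">
    <row>
      <entity><xsl:value-of select="local-name()"/></entity>
      <value><xsl:value-of select="normalize-space(value)"/></value>
      <zone><xsl:value-of select="normalize-space(zone)"/></zone>
      <confidence><xsl:value-of select="@confidence"/></confidence>
    </row>
  </xsl:template>
</xsl:transform>')

# Trimmed sample input in the same shape as the question's file
input <- read_xml('<DOKuStar><Invoice confidence="0">
  <sources/>
  <OcrText confidence="76"><value> Text example
  </value><zone>19 101 941 2625</zone></OcrText>
</Invoice></DOKuStar>')

# Apply the stylesheet, then read the flat <row> elements into a data.frame
flat <- xml_xslt(input, style)
rows <- xml_find_all(flat, "//row")
df_xslt <- data.frame(
  entity     = xml_text(xml_find_first(rows, "entity")),
  value      = xml_text(xml_find_first(rows, "value")),
  zone       = xml_text(xml_find_first(rows, "zone")),
  confidence = xml_text(xml_find_first(rows, "confidence")),
  stringsAsFactors = FALSE
)
```

Filtering on `@confidence` in the stylesheet is a design choice of this sketch: it avoids the empty sources and annotations rows that appear in the output above.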