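I have multiple XML files from which I want to extract certain parts at a specific level and store the values in a data.frame. The level always has the same name, "Invoice".
Every child of the Invoice level should become one row (entity). For each row, the value, confidence and zone should be extracted.
The only complication is that the number of entities varies from document to document.
The data.frame should look like this: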
Doc. Nr.  Entity            Value         Zone             Confidence
doc1      OcrText           Text example  19 101 941 2625  76
doc1      InvoiceDate       17/06/2016    105 8 862 1555   100
doc1      InvoiceDate__day  17            105 8 862 1555   100
With the help of the rvest and XML packages, I can extract zone:
read_xml(xmlfile) %>% xml_nodes("Invoice") %>% xml_nodes("zone") %>% xml_text()
But I cannot extract value, confidence, or the names of all the children of the Invoice level.
Here is an example of the XML file:
<?xml version="1.0" encoding="utf-8"?>
<DOKuStar baseType="documentType" state="Ok" confidence="0" version="2.0">
<Invoice baseType="documentType" state="Ok" confidence="0" producer="DOKuStar">
<sources>
<image guid=" fec8" />
</sources>
<OcrText baseType="fieldType" state="Reject" confidence="76">
<value> Text example
</value>
<zone>19 101 941 2625</zone>
<sources>
<image guid=" fec8" />
</sources>
</OcrText>
<InvoiceDate baseType="fieldType" state="Empty" confidence="100" class="dateType">
<value>17-06-2016
</value>
<zone>105 8 862 1555</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceDate>
<annotations>
<annotation key="FileOutputPath">E:\..\Outgoing\</annotation>
</annotations>
<InvoiceDate__day baseType="fieldType" state="Empty" confidence="100">
<value>17
</value>
<zone>105 8 862 1555</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceDate__day>
<InvoiceDate__month baseType="fieldType" state="Empty" confidence="100">
<value>06
</value>
<zone>105 8 862 1555</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceDate__month>
<InvoiceDate__year baseType="fieldType" state="Empty" confidence="100">
<value>2016
</value>
<zone>105 8 862 1555</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceDate__year>
<InvoiceNumber baseType="fieldType" state="Empty" confidence="100">
<value>12365
</value>
<zone>105 80 862 1555</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceNumber>
<InvoiceTotalsTotalAmount baseType="fieldType" state="Ok" confidence="87">
<value>21.98</value>
<zone>595 2062 77 34</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceTotalsTotalAmount>
<InvoiceTotalsNetAmount baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceTotalsNetAmount>
<InvoiceTotalsVatAmount baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceTotalsVatAmount>
<InvoiceTotalsCurrency baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceTotalsCurrency>
<InvoiceTotals baseType="tableType" state="Ok" confidence="87">
<value>21.98 </value>
<zone>595 2062 77 34</zone>
<sources>
<image guid=" fec8" />
</sources>
<row baseType="tableRowType" state="Ok" confidence="0">
<TotalAmount baseType="fieldType" state="Ok" confidence="100">
<value>3.10</value>
<zone>596 2029 63 30</zone>
<sources>
<image guid=" fec8" />
</sources>
</TotalAmount>
<NetAmount baseType="fieldType" state="Ok" confidence="69">
<value>2.56</value>
<zone>287 2031 64 31</zone>
<sources>
<image guid=" fec8" />
</sources>
</NetAmount>
<VatAmount baseType="fieldType" state="Ok" confidence="78">
<value>0.54</value>
<zone>444 2030 59 31</zone>
<sources>
<image guid=" fec8" />
</sources>
</VatAmount>
<VatRate baseType="fieldType" state="Ok" confidence="83">
<value>21.00</value>
<zone>141 2035 30 26</zone>
<sources>
<image guid=" fec8" />
</sources>
</VatRate>
<Currency baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Currency>
<Type baseType="fieldType" state="Ok" confidence="0">
<value>Vat</value>
</Type>
</row>
<row baseType="tableRowType" state="Ok" confidence="0">
<TotalAmount baseType="fieldType" state="Ok" confidence="56">
<value>18.88</value>
<zone>603 1993 73 33</zone>
<sources>
<image guid=" fec8" />
</sources>
</TotalAmount>
<NetAmount baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</NetAmount>
<VatAmount baseType="fieldType" state="Ok" confidence="57">
<value>2.99</value>
<zone>653 1311 62 33</zone>
<sources>
<image guid=" fec8" />
</sources>
</VatAmount>
<VatRate baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</VatRate>
<Currency baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Currency>
<Type baseType="fieldType" state="Ok" confidence="0">
<value>Vat</value>
</Type>
</row>
</InvoiceTotals>
<Address baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address>
<Address__firstname baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__firstname>
<Address__lastname baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__lastname>
<Address__city baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__city>
<Address__cityline baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__cityline>
<Address__nameline baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__nameline>
<Address__streetline baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__streetline>
<Address__streetname baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__streetname>
<Address__streetnumber baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__streetnumber>
<Address__zipcode baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Address__zipcode>
<Postcode baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Postcode>
<BankAccountNumber baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>215 15 1 1</zone>
<sources>
<image guid=" fec8" />
</sources>
</BankAccountNumber>
<InvoiceAcceptgiroCode baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>215 15 1 1</zone>
<sources>
<image guid=" fec8" />
</sources>
</InvoiceAcceptgiroCode>
<Website baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</Website>
<EmailAddress baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</EmailAddress>
<BICCode baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>215 15 1 1</zone>
<sources>
<image guid=" fec8" />
</sources>
</BICCode>
<CoCNumber baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>215 15 1 1</zone>
<sources>
<image guid=" fec8" />
</sources>
</CoCNumber>
<DebtorNumber baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>215 15 1 1</zone>
<sources>
<image guid=" fec8" />
</sources>
</DebtorNumber>
<IBANCode baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</IBANCode>
<IsCreditNote baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>105 8 862 1555</zone>
<sources>
<image guid=" fec8" />
</sources>
</IsCreditNote>
<IsKvKInvoice baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</IsKvKInvoice>
<VATNumber baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>0 8 967 2974</zone>
<sources>
<image guid=" fec8" />
</sources>
</VATNumber>
<ScanFormAdministration baseType="fieldType" state="Empty" confidence="0">
<value>
</value>
<zone>215 15 1 1</zone>
<sources>
<image guid=" fec8" />
</sources>
</ScanFormAdministration>
</Invoice>
<sourceInstances>
</sourceInstances>
<annotations>
</annotations>
</DOKuStar>
Answer 0 (score: 1)
Here is another version, without any error checking:
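library(xml2)   # read_xml(), xml_children(), xml_find_all(), ...
library(rvest)  # loaded here for the %>% pipe

# Select the <Invoice> element and its direct children
invoices <- read_xml("xmltext2.xml") %>% xml_find_all("//Invoice")
children <- xml_children(invoices)
# Element name and confidence attribute of every child
Entity <- xml_name(children)
Confidence <- xml_attr(children, "confidence")
df <- data.frame(Entity, Confidence)
# Drop children without a confidence attribute (sources, annotations)
df <- df[complete.cases(df), ]
# Text of each child's <value> and <zone>; assumes every remaining child has both
Value <- xml_find_all(children, "value") %>% xml_text()
Zone <- xml_find_all(children, "zone") %>% xml_text()
df <- cbind(df, Value, Zone)
df$Value <- trimws(df$Value)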
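This works for the sample provided. Stray nodes such as sources and annotations have no confidence attribute, so the complete.cases() filter drops them. With more work, this could be extended to capture the invoice subtotals.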
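Since the version above does no error checking, here is a minimal sketch of a more defensive variant. It assumes the xml2 package and the structure of the sample file; the helper name invoice_to_df and its arguments doc_nr and xpath are hypothetical, not part of the original answer. It relies on xml_find_first(), which returns one result per input node, so a child without a value or zone yields NA instead of shifting the columns.

```r
library(xml2)

# Hypothetical helper: one row per element matched by `xpath`.
# By default that is every direct child of <Invoice> with a confidence attribute.
invoice_to_df <- function(path,
                          doc_nr = basename(path),
                          xpath = "/DOKuStar/Invoice/*[@confidence]") {
  doc      <- read_xml(path)
  children <- xml_find_all(doc, xpath)
  data.frame(
    Doc.Nr     = rep(doc_nr, length(children)),
    Entity     = xml_name(children),
    # A missing <value> or <zone> becomes NA, so all columns keep the same length
    Value      = trimws(xml_text(xml_find_first(children, "value"))),
    Zone       = xml_text(xml_find_first(children, "zone")),
    Confidence = as.integer(xml_attr(children, "confidence")),
    stringsAsFactors = FALSE
  )
}

# Usage sketch for several files (the directory name is an assumption):
# files <- list.files("path/to/xml", pattern = "\\.xml$", full.names = TRUE)
# df    <- do.call(rbind, lapply(files, invoice_to_df))
```

Under the same assumptions, calling it with xpath = "/DOKuStar/Invoice/InvoiceTotals/row/*" should return the fields nested inside the table rows, which is one way to approach the subtotals.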
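Answer 1 (score: 0)
Consider transforming the original XML with XSLT, a special-purpose declarative language designed to transform XML files into other structures to suit end-use needs. Once the file is flattened and simplified, you can read it into R with a simple xmlToDataFrame().
Base R has no built-in XSLT 1.0 processor, but R can call one with system(): through another general-purpose language (Java, Python, PHP, VB), a command-line interpreter (Bash, PowerShell), or a dedicated installed XSLT processor (Xalan, Saxon).
XSLT script (save as .xsl and use it in the external call/script for one of the programs above)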
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="DOKuStar">
<data>
<xsl:apply-templates select="Invoice/*" />
</data>
</xsl:template>
<!-- Extracts all children of Invoice Element -->
<xsl:template match="Invoice/*">
<row>
<doc.nr.>doc1</doc.nr.>
<entity><xsl:value-of select="local-name()"/></entity>
<value><xsl:apply-templates select="value"/></value>
<zone><xsl:apply-templates select="zone"/></zone>
<confidence><xsl:value-of select="@confidence"/></confidence>
</row>
</xsl:template>
<!-- Remove multiple white space in value and zone text values -->
<xsl:template match="value|zone">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:transform>
XML output
<?xml version='1.0' encoding='UTF-8'?>
<data>
<row>
<doc.nr.>doc1</doc.nr.>
<entity>sources</entity>
<value/>
<zone/>
<confidence/>
</row>
<row>
<doc.nr.>doc1</doc.nr.>
<entity>OcrText</entity>
<value>Text example</value>
<zone>19 101 941 2625</zone>
<confidence>76</confidence>
</row>
<row>
<doc.nr.>doc1</doc.nr.>
<entity>InvoiceDate</entity>
<value>17-06-2016</value>
<zone>105 8 862 1555</zone>
<confidence>100</confidence>
</row>
<row>
<doc.nr.>doc1</doc.nr.>
<entity>annotations</entity>
<value/>
<zone/>
<confidence/>
</row>
...
R script (runs a Python XSLT transformation script, which writes the file that R then reads in)
library(XML)
system('python "C:\\Path\\To\\TransformScript.py"')
df <- xmlToDataFrame("C:\\Path\\To\\Output.xml", nodes = "row")
df
# doc.nr. entity value zone confidence
# 1 doc1 sources
# 2 doc1 OcrText Text example 19 101 941 2625 76
# 3 doc1 InvoiceDate 17-06-2016 105 8 862 1555 100
# 4 doc1 annotations
# 5 doc1 InvoiceDate__day 17 105 8 862 1555 100
# 6 doc1 InvoiceDate__month 06 105 8 862 1555 100
# 7 doc1 InvoiceDate__year 2016 105 8 862 1555 100
# 8 doc1 InvoiceNumber 12365 105 80 862 1555 100
# 9 doc1 InvoiceTotalsTotalAmount 21.98 595 2062 77 34 87
# 10 doc1 InvoiceTotalsNetAmount 0 8 967 2974 0
# 11 doc1 InvoiceTotalsVatAmount 0 8 967 2974 0
# ...
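As an alternative to the external call, the same stylesheet can probably be applied from within R. The following is only a sketch under assumptions: it uses the CRAN package xslt (not used in the answer above) together with xml2, and the helper name xslt_to_df and its arguments are hypothetical. Because the stylesheet hard-codes doc1, the document label is set in R instead.

```r
library(xml2)
library(xslt)  # assumption: the CRAN package xslt is installed

# Hypothetical helper: apply the stylesheet in-process and
# turn each <row> of the result into one data.frame row.
xslt_to_df <- function(xml_path, xsl_path, doc_nr = basename(xml_path)) {
  flat   <- xml_xslt(read_xml(xml_path), read_xml(xsl_path))
  rows   <- xml_find_all(flat, "/data/row")
  fields <- c("entity", "value", "zone", "confidence")
  # One character vector per field, each with one entry per <row>
  cols <- lapply(fields, function(f) xml_text(xml_find_first(rows, f)))
  names(cols) <- fields
  data.frame(doc.nr = rep(doc_nr, length(rows)), cols,
             stringsAsFactors = FALSE)
}

# Usage sketch (both paths are assumptions):
# df <- xslt_to_df("C:\\Path\\To\\Input.xml", "C:\\Path\\To\\Transform.xsl")
```

Rows for sources and annotations still appear with empty fields, as in the output above; they can be dropped afterwards, for example with df[df$confidence != "", ].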