Convert an XML file to a data frame in R and trace back the source

Asked: 2018-07-09 17:27:44

Tags: r xml

I have an XML file that contains:

<?xml version="1.0" encoding="UTF-8" ?>
<Repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DECLARE>
<PhysicalColumn name="Department" parentName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;" parentId="3001:129" parentUid="80ca6538-0bb9-0000-714b-e31d00000000" id="3003:484" uid="80ca6539-0bbb-0000-714b-e31d00000000" dataType="VARCHAR" precision="20" extName="//Table/SAMP_OFFICES_D/DEPARTMENT" specialType="none">
<SourceColumn>
<RefPhysicalColumn id="3003:427" uid="80ca64f9-0bbb-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;SAMP_OFFICES_D&quot;.&quot;Department&quot;"/>
</SourceColumn>
</PhysicalColumn>

<LogicalTable name="D2 Offices" parentName="&quot;SampleApp Lite&quot;" parentId="2000:42377" parentUid="80cb6802-07d0-0000-714b-e31d00000000" id="2035:42562" uid="80cb68bb-07f3-0000-714b-e31d00000000" x="938" y="669">
<Description><![CDATA[This logical table maps to the physical Office Dimension table with various attributes.]]></Description>
<Columns>
<RefLogicalColumn id="2006:42563" uid="80cb68bc-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office&quot;"/>
<RefLogicalColumn id="2006:42564" uid="80cb68bd-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Key&quot;"/>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
<RefLogicalColumn id="2006:42566" uid="80cb68bf-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Dept Key&quot;"/>
<RefLogicalColumn id="2006:42567" uid="80cb68c0-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Organization&quot;"/>
<RefLogicalColumn id="2006:42568" uid="80cb68c1-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Org Key&quot;"/>
<RefLogicalColumn id="2006:42569" uid="80cb68c2-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company&quot;"/>
<RefLogicalColumn id="2006:42570" uid="80cb68c3-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company Key&quot;"/>
<RefLogicalColumn id="2006:42571" uid="80cb68c4-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Sequence&quot;"/>
</Columns>
<TableSources>
<RefLogicalTableSource id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;LTS1 Offices&quot;"/>
</TableSources>
</LogicalTable>

<LogicalTableSource name="LTS1 Offices" parentName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<RefPhysicalTable id="3001:129" uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;"/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>

<PresentationColumn name="Department" parentName="&quot;Sample Targets Lite&quot;..&quot;Offices&quot;" parentId="4008:43412" parentUid="80cb6c16-0fa8-0000-714b-e31d00000000" id="4010:43649" uid="80cb6d77-0faa-0000-714b-e31d00000000" hasDispName="false" hasDispDescription="false" overrideLogicalName="false">
<Description><![CDATA[Returns the Department description from the Office dimension. Naturally drills into Office Column.]]></Description>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
</PresentationColumn>
</DECLARE>
</Repository>

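From there I need to find the source of a PresentationColumn, that is, the physical column name and the physical table, by following a chain of different IDs. For example, the PresentationColumn with name="Department" has RefLogicalColumn id="2006:42565".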

<PresentationColumn name="Department" parentName="&quot;Sample Targets Lite&quot;..&quot;Offices&quot;" parentId="4008:43412" parentUid="80cb6c16-0fa8-0000-714b-e31d00000000" id="4010:43649" uid="80cb6d77-0faa-0000-714b-e31d00000000" hasDispName="false" hasDispDescription="false" overrideLogicalName="false">
<Description><![CDATA[Returns the Department description from the Office dimension. Naturally drills into Office Column.]]></Description>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
</PresentationColumn>

Using RefLogicalColumn id="2006:42565", we search the LogicalTable elements for a RefLogicalColumn with the same id.

<LogicalTable name="D2 Offices" parentName="&quot;SampleApp Lite&quot;" parentId="2000:42377" parentUid="80cb6802-07d0-0000-714b-e31d00000000" id="2035:42562" uid="80cb68bb-07f3-0000-714b-e31d00000000" x="938" y="669">
<Description><![CDATA[This logical table maps to the physical Office Dimension table with various attributes.]]></Description>
<Columns>
<RefLogicalColumn id="2006:42563" uid="80cb68bc-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office&quot;"/>
<RefLogicalColumn id="2006:42564" uid="80cb68bd-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Key&quot;"/>
<RefLogicalColumn id="2006:42565" uid="80cb68be-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Department&quot;"/>
<RefLogicalColumn id="2006:42566" uid="80cb68bf-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Dept Key&quot;"/>
<RefLogicalColumn id="2006:42567" uid="80cb68c0-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Organization&quot;"/>
<RefLogicalColumn id="2006:42568" uid="80cb68c1-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Org Key&quot;"/>
<RefLogicalColumn id="2006:42569" uid="80cb68c2-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company&quot;"/>
<RefLogicalColumn id="2006:42570" uid="80cb68c3-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Company Key&quot;"/>
<RefLogicalColumn id="2006:42571" uid="80cb68c4-07d6-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;Office Sequence&quot;"/>
</Columns>
<TableSources>
<RefLogicalTableSource id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" qualifiedName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;.&quot;LTS1 Offices&quot;"/>
</TableSources>
</LogicalTable>

Then, using RefLogicalTableSource id="2037:43058", we search the LogicalTableSource elements by id.

<LogicalTableSource name="LTS1 Offices" parentName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<RefPhysicalTable id="3001:129" uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;"/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>

Then, using RefPhysicalTable id="3001:129", we search the PhysicalColumn elements by parentId.

<PhysicalColumn name="Department" parentName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;" parentId="3001:129" parentUid="80ca6538-0bb9-0000-714b-e31d00000000" id="3003:484" uid="80ca6539-0bbb-0000-714b-e31d00000000" dataType="VARCHAR" precision="20" extName="//Table/SAMP_OFFICES_D/DEPARTMENT" specialType="none">
<SourceColumn>
<RefPhysicalColumn id="3003:427" uid="80ca64f9-0bbb-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;SAMP_OFFICES_D&quot;.&quot;Department&quot;"/>
</SourceColumn>
</PhysicalColumn>

Here, what we need is PhysicalColumn name="Department" and extName="//Table/SAMP_OFFICES_D/DEPARTMENT".

My first problem is converting the XML file to a data frame; the second is tracing back the source.
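The lookup chain described above can be sketched directly with xml2 queries, without first flattening everything into one data frame. This is only a minimal sketch under stated assumptions: `trace_presentation_column()` is a hypothetical helper (not part of xml2), the inline XML is a trimmed copy of the sample with unused attributes dropped, and the final filter assumes the physical column carries the same name as the presentation column, as it does in the sample.

```r
library(xml2)

# Trimmed copy of the sample repository; only attributes used by the trace are kept.
xml_txt <- '<Repository>
<DECLARE>
<PhysicalColumn name="Department" parentId="3001:129" id="3003:484" extName="//Table/SAMP_OFFICES_D/DEPARTMENT"/>
<LogicalTable name="D2 Offices" id="2035:42562">
<Columns>
<RefLogicalColumn id="2006:42564"/>
<RefLogicalColumn id="2006:42565"/>
</Columns>
<TableSources>
<RefLogicalTableSource id="2037:43058"/>
</TableSources>
</LogicalTable>
<LogicalTableSource name="LTS1 Offices" id="2037:43058">
<Link><StartNode><RefPhysicalTable id="3001:129"/></StartNode></Link>
</LogicalTableSource>
<PresentationColumn name="Department" id="4010:43649">
<RefLogicalColumn id="2006:42565"/>
</PresentationColumn>
</DECLARE>
</Repository>'
doc <- read_xml(xml_txt)

# Hypothetical helper: follows the id chain for one presentation column name.
trace_presentation_column <- function(doc, pres_name) {
  # Step 1: PresentationColumn -> ids of the logical columns it references.
  pres <- xml_find_all(doc, ".//PresentationColumn")
  pres <- pres[xml_attr(pres, "name") %in% pres_name]
  lc_ids <- xml_attr(xml_find_all(pres, "./RefLogicalColumn"), "id")

  # Step 2: LogicalTable listing one of those columns -> its table source ids.
  lc_refs <- xml_find_all(doc, ".//LogicalTable/Columns/RefLogicalColumn")
  lc_refs <- lc_refs[xml_attr(lc_refs, "id") %in% lc_ids]
  lts_ids <- xml_attr(
    xml_find_all(lc_refs, "../../TableSources/RefLogicalTableSource"), "id")

  # Step 3: LogicalTableSource with that id -> physical table ids.
  lts <- xml_find_all(doc, ".//LogicalTableSource")
  lts <- lts[xml_attr(lts, "id") %in% lts_ids]
  pt_ids <- xml_attr(xml_find_all(lts, ".//RefPhysicalTable"), "id")

  # Step 4: PhysicalColumn whose parentId is one of those tables.
  # Assumption: the physical column has the same name as the presentation column.
  pc <- xml_find_all(doc, ".//PhysicalColumn")
  pc <- pc[xml_attr(pc, "parentId") %in% pt_ids &
           xml_attr(pc, "name") %in% pres_name]

  data.frame(
    presentation_column = rep(pres_name, length(pc)),
    physical_column     = xml_attr(pc, "name"),
    ext_name            = xml_attr(pc, "extName"),
    stringsAsFactors    = FALSE
  )
}

res <- trace_presentation_column(doc, "Department")
print(res)
```

Filtering with `%in%` in R rather than interpolating values into the XPath string avoids quoting problems when a name contains an apostrophe, and it returns an empty data frame instead of failing when any step finds nothing.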

2 Answers:

Answer 0 (score: 0)

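xml2::read_xml will help you read the file in. The rest will be harder, because it looks like you have 3 relational tables. See this page, and possibly this one as well, although when I tried it, merging everything into a single table made it messy.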

library(xml2)
library(tidyverse)
# Parse the file once; every query below runs against this document object.
dfxml1 <- xml2::read_xml("C:/foo/bar.xml")

# Clean column names: lower-case, punctuation/whitespace to "_", made unique.
mcga <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  colnames(tbl) <- x
  tbl
}

# One row per element, one column per attribute name.
dfxml2 <- xml_find_all(dfxml1, ".//*") %>%
  map_df(~{
   xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

Or split them into 3 tables.

LogicalTable <- xml_find_all(dfxml1, ".//LogicalTable//*") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

PhysicalTable <- xml_find_all(dfxml1, ".//PhysicalColumn") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

LogTable <- xml_find_all(dfxml1, ".//LogicalTableSource//*") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

How do you want to trace these?
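One possible way to trace across the separate tables is to build small "edge" tables that pair each reference id with the element that owns it, and then join them. The following is a sketch under assumptions, not a tested solution for the full repository: `edges()` is a hypothetical helper, the column names are made up for illustration, the inline XML is a trimmed copy of the sample, and the last filter assumes matching presentation and physical column names. It relies on `xml_find_first()` returning a result of the same length as its input when applied to a nodeset.

```r
library(xml2)

# Trimmed copy of the sample repository; only attributes used by the joins are kept.
xml_txt <- '<Repository>
<DECLARE>
<PhysicalColumn name="Department" parentId="3001:129" id="3003:484" extName="//Table/SAMP_OFFICES_D/DEPARTMENT"/>
<LogicalTable name="D2 Offices" id="2035:42562">
<Columns>
<RefLogicalColumn id="2006:42564"/>
<RefLogicalColumn id="2006:42565"/>
</Columns>
<TableSources>
<RefLogicalTableSource id="2037:43058"/>
</TableSources>
</LogicalTable>
<LogicalTableSource name="LTS1 Offices" id="2037:43058">
<Link><StartNode><RefPhysicalTable id="3001:129"/></StartNode></Link>
</LogicalTableSource>
<PresentationColumn name="Department" id="4010:43649">
<RefLogicalColumn id="2006:42565"/>
</PresentationColumn>
</DECLARE>
</Repository>'
doc <- read_xml(xml_txt)

# Hypothetical helper: one row per reference node, pairing an attribute of the
# enclosing (owner) element with the id of the reference itself.
edges <- function(doc, ref_xpath, owner_axis, owner_attr, owner_col, ref_col) {
  refs  <- xml_find_all(doc, ref_xpath)
  owner <- xml_find_first(refs, owner_axis)  # same length as refs
  out <- data.frame(xml_attr(owner, owner_attr), xml_attr(refs, "id"),
                    stringsAsFactors = FALSE)
  names(out) <- c(owner_col, ref_col)
  out
}

pres <- edges(doc, ".//PresentationColumn/RefLogicalColumn",
              "ancestor::PresentationColumn", "name",
              "presentation_column", "logical_column_id")
lcol <- edges(doc, ".//LogicalTable/Columns/RefLogicalColumn",
              "ancestor::LogicalTable", "id",
              "logical_table_id", "logical_column_id")
ltab <- edges(doc, ".//LogicalTable/TableSources/RefLogicalTableSource",
              "ancestor::LogicalTable", "id",
              "logical_table_id", "lts_id")
ltsp <- edges(doc, ".//LogicalTableSource//RefPhysicalTable",
              "ancestor::LogicalTableSource", "id",
              "lts_id", "physical_table_id")

pc <- xml_find_all(doc, ".//PhysicalColumn")
pcol <- data.frame(physical_table_id = xml_attr(pc, "parentId"),
                   physical_column   = xml_attr(pc, "name"),
                   ext_name          = xml_attr(pc, "extName"),
                   stringsAsFactors  = FALSE)

# merge() joins on the column name each consecutive pair of tables shares.
lineage <- Reduce(merge, list(pres, lcol, ltab, ltsp, pcol))

# Assumption: keep only physical columns named like the presentation column.
lineage <- lineage[lineage$presentation_column == lineage$physical_column, ]
print(lineage[, c("presentation_column", "physical_column", "ext_name")])
```

Keeping each relationship in its own two-column table avoids the wide, sparse result that comes from binding the attributes of every element into a single data frame.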

Answer 1 (score: 0)

Can this code be modified so that the attribute values of the child elements also end up in the same data frame?

LogTable <- xml_find_all(dfxml1, ".//LogicalTableSource//*") %>% 
  map_df(~{
    xml_attrs(.x) %>% 
      as.list()
  }) %>% 
  mcga()

The XML is:

<LogicalTableSource name="LTS1 Offices" parentName="&quot;SampleApp Lite&quot;.&quot;D2 Offices&quot;" parentId="2035:42562" parentUid="80cb68bb-07f3-0000-714b-e31d00000000" id="2037:43058" uid="80cb6a2c-07f5-0000-714b-e31d00000000" isActive="true">
<Link>
<StartNode>
<RefPhysicalTable id="3001:129" uid="80ca6538-0bb9-0000-714b-e31d00000000" qualifiedName="&quot;Sample App Lite Data&quot;...&quot;D20 Offices&quot;"/>
</StartNode>
</Link>
<WhereClause>
<Expr></Expr>
</WhereClause>
<GroupBy>
<Expr><![CDATA[ GROUPBYLEVEL("SampleApp Lite"."H2 Offices"."Offices Detail")]]></Expr>
</GroupBy>
<FragmentContent>
<Expr></Expr>
</FragmentContent>
</LogicalTableSource>

Currently I get the values for LogicalTableSource, but I need the RefPhysicalTable values in the same data frame as well. Any help is much appreciated.
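A minimal sketch of one way to get both levels into one data frame: select the RefPhysicalTable nodes, look up the enclosing LogicalTableSource of each one, and read attributes from both. The variable and column names (`lts_with_ref`, `lts_name`, `physical_table_id`, and so on) are illustrative assumptions, and the inline XML is a trimmed copy of the fragment above. Note that a LogicalTableSource without any RefPhysicalTable would not appear in the result.

```r
library(xml2)

# Trimmed copy of the LogicalTableSource fragment, wrapped in a root element.
xml_txt <- '<DECLARE>
<LogicalTableSource name="LTS1 Offices" parentId="2035:42562" id="2037:43058" isActive="true">
<Link>
<StartNode>
<RefPhysicalTable id="3001:129" uid="80ca6538-0bb9-0000-714b-e31d00000000"/>
</StartNode>
</Link>
<WhereClause><Expr></Expr></WhereClause>
</LogicalTableSource>
</DECLARE>'
doc <- read_xml(xml_txt)

# Child nodes first, then the enclosing parent of each child.
# xml_find_first() on a nodeset returns one result per input node,
# so `refs` and `lts` stay aligned row by row.
refs <- xml_find_all(doc, ".//LogicalTableSource//RefPhysicalTable")
lts  <- xml_find_first(refs, "ancestor::LogicalTableSource")

lts_with_ref <- data.frame(
  lts_name           = xml_attr(lts, "name"),
  lts_id             = xml_attr(lts, "id"),
  lts_parent_id      = xml_attr(lts, "parentId"),
  physical_table_id  = xml_attr(refs, "id"),
  physical_table_uid = xml_attr(refs, "uid"),
  stringsAsFactors   = FALSE
)
print(lts_with_ref)
```

Naming the columns explicitly keeps the parent's `id` and the child's `id` apart; binding the raw attribute lists of both elements together would otherwise produce two columns with the same name.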