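XML Python parser - looping over nested nodes

Time: 2021-01-19 07:05:59

Tags: python xml pandas parsing xml-parsing

With help from the community I got an XML parser that builds a pandas DataFrame. I noticed one issue that still needs fixing: in the sample data below, there is a case where one dept has more than one owner.

The current loop keeps only the last owner; I need every currentowner node under owners.

Data: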

<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        date="2021-01-15">
 <dept dept_id="00001" 
            col_two="00001value" 
            col_three="00001false"
            name = "some_name">     
    <owners>
      <currentowner col_four="00001value" 
                    col_five="00001value" 
                    col_six="00001false"
                    name = "some_name">
        <addr col_seven="00001value" 
                col_eight="00001value" 
                col_nine="00001false"/>
      </currentowner>
      <currentowner col_four="00001bvalue" 
                    col_five="00001bvalue" 
                    col_six="00001bfalse"
                    name = "some_name">
        <addr col_seven="00001bvalue" 
                col_eight="00001bvalue" 
                col_nine="00001bfalse"/>
      </currentowner>
    </owners>
  </dept>
  <dept dept_id="00002" 
            col_two="00002value" 
            col_three="00002value"
            name = "some_name">
    <owners>
      <currentowner col_four="00002value" 
                    col_five="00002value" 
                    col_six="00002false"
                    name = "some_name">
        <addr col_seven="00002value" 
                col_eight="00002value" 
                col_nine="00002false"/>
      </currentowner>
    </owners>
  </dept> 
</depts>

Current code:

import xml.etree.ElementTree as element_tree
import pandas
import fnmatch
import os

file_path = 'file_dir'
root = element_tree.parse(file_path).getroot()

#namespace directory iterator
name_space = {node[0]: node[1] for _, node in element_tree.iterparse(file_path, events=['start-ns'])}
for key, value in name_space.items():   
    element_tree.register_namespace(key, value)

#xml parse, need to iterate through all owners
data_frame = pandas.DataFrame([{**{f"{d.tag.split('}')[1]}_{k}":v for k,v in d.items()}, 
  **{f"{co.tag.split('}')[1]}_{k}":v for co in d.findall("owners/currentowner", name_space) for k,v in co.items()},
  **{f"{addr.tag.split('}')[1]}_{k}":v for addr in d.findall("owners/currentowner/addr", name_space)
     for k,v in addr.items()} 
              }
 for d in root.findall("dept", name_space)
])

print(data_frame)

Current result:

  dept_dept_id dept_col_two dept_col_three  dept_name currentowner_col_four  ... currentowner_col_six currentowner_name addr_col_seven addr_col_eight addr_col_nine
0        00001   00001value     00001false  some_name           00001bvalue  ...          00001bfalse         some_name    00001bvalue    00001bvalue   00001bfalse
1        00002   00002value     00002value  some_name            00002value  ...           00002false         some_name     00002value     00002value    00002false
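
A minimal illustration of why only the last owner shows up in this result, assuming the cause is duplicate dictionary keys. The `owners` list and `row` dictionary are made-up stand-ins for the attributes of the two currentowner nodes, not part of the original code:

```python
# Two owners of the same dept produce the same column names.
owners = [
    {"col_four": "00001value"},
    {"col_four": "00001bvalue"},
]

# Same shape as the comprehension in the question: every owner writes to
# the key "currentowner_col_four", so later owners overwrite earlier ones.
row = {f"currentowner_{k}": v for owner in owners for k, v in owner.items()}

print(row)  # {'currentowner_col_four': '00001bvalue'}
```

Because each dept yields exactly one dictionary, it can only ever yield one row, whichever owner was written last.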

Expected result:

  dept_dept_id dept_col_two dept_col_three  dept_name currentowner_col_four  ... currentowner_col_six currentowner_name addr_col_seven addr_col_eight addr_col_nine
0        00001   00001value     00001false  some_name            00001value  ...           00001false         some_name     00001value     00001value    00001false
1        00001   00001value     00001false  some_name           00001bvalue  ...          00001bfalse         some_name    00001bvalue    00001bvalue   00001bfalse
2        00002   00002value     00002value  some_name            00002value  ...           00002false         some_name     00002value     00002value    00002false

1 answer:

Answer 0: (score: 0)

This problem was already solved in an earlier thread (Loop through XML in Python).

Solution:

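import xml.etree.ElementTree as ET
import pandas as pd

# `xml` is assumed to hold the XML document from the question as a string
root = ET.fromstring(xml)
ns = {'ns0': 'http://SOMELINK'}
# One row per currentowner: the dept attributes repeat for each of its owners,
# and addr is looked up under that owner only.
# Column names here use '.' as the separator, e.g. dept.dept_id
data_frame = pd.DataFrame([{**{f"{d.tag.split('}')[1]}.{k}":v for k,v in d.items()}, 
  **{f"{co.tag.split('}')[1]}.{k}":v  for k,v in co.items()}, 
  **{f"{addr.tag.split('}')[1]}.{k}":v for addr in co.findall("ns0:addr", ns) for k,v in addr.items()} }
 for d in root.findall("ns0:dept", ns)
 for co in d.findall("ns0:owners/ns0:currentowner", ns)
])

print(data_frame)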
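
The same idea can be written as explicit loops, which may be easier to follow. This is only a sketch under stated assumptions: the XML is a trimmed copy of the sample data with fewer attributes, `prefixed` is a hypothetical helper introduced here, and the column names use `_` to match the expected result in the question rather than the `.` used above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample data from the question (fewer attributes).
XML = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
      <currentowner col_four="00001bvalue" name="some_name">
        <addr col_seven="00001bvalue"/>
      </currentowner>
    </owners>
  </dept>
  <dept dept_id="00002" name="some_name">
    <owners>
      <currentowner col_four="00002value" name="some_name">
        <addr col_seven="00002value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""

NS = {"ns0": "http://SOMELINK"}


def prefixed(element):
    """Return the element's attributes keyed as '<localname>_<attribute>'."""
    local_name = element.tag.split("}")[-1]  # drop the '{namespace}' part
    return {f"{local_name}_{key}": value for key, value in element.items()}


root = ET.fromstring(XML)
rows = []
for dept in root.findall("ns0:dept", NS):
    # The owner loop is the outer unit of a row, so each owner gets its own dict.
    for owner in dept.findall("ns0:owners/ns0:currentowner", NS):
        row = {**prefixed(dept), **prefixed(owner)}
        # addr is looked up under this owner only, so values cannot leak
        # between owners of the same dept.
        for addr in owner.findall("ns0:addr", NS):
            row.update(prefixed(addr))
        rows.append(row)

print(len(rows))  # 3
```

Passing `rows` to `pandas.DataFrame(rows)` should then give one row per owner, with the dept columns repeated for dept 00001.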