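With help from the community, I now have an XML parser that loads the data into a pandas DataFrame. I noticed one issue that still needs fixing: in the sample data below, a dept can have more than one owner.
The current loop keeps only the last owner, but I need every currentowner node under owners.
Data: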
<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false"
name = "some_name">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false"
name = "some_name">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
<currentowner col_four="00001bvalue"
col_five="00001bvalue"
col_six="00001bfalse"
name = "some_name">
<addr col_seven="00001bvalue"
col_eight="00001bvalue"
col_nine="00001bfalse"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value"
name = "some_name">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false"
name = "some_name">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>
Current code:
import xml.etree.ElementTree as element_tree
import pandas
import fnmatch
import os
file_path = 'file_dir'
root = element_tree.parse(file_path).getroot()
#namespace directory iterator
name_space = {node[0]: node[1] for _, node in element_tree.iterparse(file_path, events=['start-ns'])}
for key, value in name_space.items():
    element_tree.register_namespace(key, value)
#xml parse, need to iterate through all owners
data_frame = pandas.DataFrame([{**{f"{d.tag.split('}')[1]}_{k}": v for k, v in d.items()},
                                **{f"{co.tag.split('}')[1]}_{k}": v
                                   for co in d.findall("owners/currentowner", name_space)
                                   for k, v in co.items()},
                                **{f"{addr.tag.split('}')[1]}_{k}": v
                                   for addr in d.findall("owners/currentowner/addr", name_space)
                                   for k, v in addr.items()}
                                }
                               for d in root.findall("dept", name_space)
                               ])
print(data_frame)
Current result:
dept_dept_id dept_col_two dept_col_three dept_name currentowner_col_four ... currentowner_col_six currentowner_name addr_col_seven addr_col_eight addr_col_nine
0 00001 00001value 00001false some_name 00001bvalue ... 00001bfalse some_name 00001bvalue 00001bvalue 00001bfalse
1 00002 00002value 00002value some_name 00002value ... 00002false some_name 00002value 00002value 00002false
Expected result:
dept_dept_id dept_col_two dept_col_three dept_name currentowner_col_four ... currentowner_col_six currentowner_name addr_col_seven addr_col_eight addr_col_nine
0 00001 00001value 00001false some_name 00001value ... 00001false some_name 00001value 00001value 00001false
1 00001 00001value 00001false some_name 00001bvalue ... 00001bfalse some_name 00001bvalue 00001bvalue 00001bfalse
2 00002 00002value 00002value some_name 00002value ... 00002false some_name 00002value 00002value 00002false
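Answer 0 (score: 0)
This problem was already solved in an earlier thread (Loop through XML in Python). The fix is to iterate over each currentowner as well as each dept, so that every owner produces its own row.
Solution: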
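A likely reason the question's code keeps only the last owner is that all currentowner attributes of a dept are merged into a single dict, and a dict keeps only the last value written to a given key. The sketch below illustrates this with hypothetical attribute dicts standing in for the parsed elements; the names `owners`, `merged`, and `per_owner` are made up for the illustration.

```python
# Hypothetical attribute dicts standing in for two <currentowner> elements
# of the same <dept> (values borrowed from the sample data).
owners = [
    {"col_four": "00001value", "col_six": "00001false"},
    {"col_four": "00001bvalue", "col_six": "00001bfalse"},
]

# Merging every owner into ONE dict: the second owner writes the same keys
# as the first, so only the last owner's values survive.
merged = {f"currentowner_{k}": v for owner in owners for k, v in owner.items()}

# Building ONE dict PER owner keeps each owner as a separate row.
per_owner = [
    {f"currentowner_{k}": v for k, v in owner.items()} for owner in owners
]

print(merged)
print(per_owner)
```

The second form is the shape the solution relies on: one dict per currentowner rather than one per dept.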
import xml.etree.ElementTree as ET
import pandas as pd

# `xml` is assumed to hold the XML document from the question as a string
root = ET.fromstring(xml)
ns = {'ns0': 'http://SOMELINK'}

# one row per <currentowner>, repeating the attributes of its parent <dept>
df = pd.DataFrame([{**{f"{d.tag.split('}')[1]}.{k}": v for k, v in d.items()},
                    **{f"{co.tag.split('}')[1]}.{k}": v for k, v in co.items()},
                    **{f"{addr.tag.split('}')[1]}.{k}": v
                       for addr in co.findall("ns0:addr", ns) for k, v in addr.items()}}
                   for d in root.findall("ns0:dept", ns)
                   for co in d.findall("ns0:owners/ns0:currentowner", ns)
                   ])
print(df)
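For readers who want to try the approach without the original file, here is a minimal self-contained sketch of the same idea using only the standard library. It is not the answer's exact code: the XML is a trimmed copy of the sample data (fewer attributes, same structure), the helpers `local_name`, `prefixed_attrs`, and `owner_rows` are hypothetical names introduced for readability, and the `_` separator is chosen to match the column names in the question's expected result.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sample data (assumption: fewer attributes).
XML = """<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
      <currentowner col_four="00001bvalue" name="some_name">
        <addr col_seven="00001bvalue"/>
      </currentowner>
    </owners>
  </dept>
  <dept dept_id="00002" name="some_name">
    <owners>
      <currentowner col_four="00002value" name="some_name">
        <addr col_seven="00002value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""

# Map a prefix of our choosing to the document's default namespace.
NS = {"ns0": "http://SOMELINK"}


def local_name(element):
    """Return the tag without its '{namespace}' part."""
    return element.tag.split("}", 1)[-1]


def prefixed_attrs(element):
    """Return the element's attributes keyed as '<tag>_<attribute>'."""
    return {f"{local_name(element)}_{k}": v for k, v in element.items()}


def owner_rows(root):
    """Build one dict per <currentowner>, repeating its <dept> attributes."""
    rows = []
    for dept in root.findall("ns0:dept", NS):
        for owner in dept.findall("ns0:owners/ns0:currentowner", NS):
            row = {**prefixed_attrs(dept), **prefixed_attrs(owner)}
            # Only the <addr> children of THIS owner are merged into the row.
            for addr in owner.findall("ns0:addr", NS):
                row.update(prefixed_attrs(addr))
            rows.append(row)
    return rows


# Parsing from bytes lets the parser honor the declared encoding.
rows = owner_rows(ET.fromstring(XML.encode("utf-8")))
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` should then give one row per owner, with the dept columns repeated for depts that have several owners.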