I have the following XML (it is a sample):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<HDR_DONNEES xmlns="http://ERABLE_HDR.com/ns1">
<Dates>
<Date valeur="14032019">
<Depart ACR_DepartHTA="BDX" ACR_PosteSource="BDX" GdoDepart="V.LOTC0018" Nom_DepartHTA="BOURLANG" PS_DepartHTA="V.LOT" NomPosteSource="VILLELOT">
<M H="1150" UTM="20850" ITM="94" IFg="0" UNB="1" INB="1"/>
</Depart>
<Depart ACR_DepartHTA="BDX" ACR_PosteSource="BDX" GdoDepart="V.LOTC0005" Nom_DepartHTA="MARCHE G" PS_DepartHTA="V.LOT" NomPosteSource="VILLELOT">
<M H="1150" UTM="20850" ITM="41" IFg="0" UNB="1" INB="1"/>
</Depart>
<Depart ACR_DepartHTA="NTS" ACR_PosteSource="NTS" GdoDepart="PALLUC2703" Nom_DepartHTA="FROIDFON" PS_DepartHTA="PALLU" NomPosteSource="PALLUAU">
<M H="1140" UTM="0" ITM="0" IFg="100" UNB="0" INB="1"/>
</Depart>
</Date>
</Dates>
</HDR_DONNEES>
How can I parse this XML into a DataFrame with the following structure?
 |-- acrDeparthta: string (nullable = true)
 |-- acrPostesource: string (nullable = true)
 |-- gdodepart: string (nullable = true)
 |-- nomDeparthta: string (nullable = true)
 |-- psDeparthta: string (nullable = true)
 |-- nompostesource: string (nullable = true)
 |-- creationDate: string (nullable = true)
 |-- m: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- h: string (nullable = true)
 |    |    |-- utm: string (nullable = true)
 |    |    |-- ufg: string (nullable = true)
 |    |    |-- itm: string (nullable = true)
 |    |    |-- ifg: string (nullable = true)
 |    |    |-- unb: string (nullable = true)
 |    |    |-- inb: string (nullable = true)
Every attribute of an "M" element should become part of the "m" array.
Any help would be greatly appreciated, thanks!
EDIT:
I tried:
import xml.etree.ElementTree as ET
tree = ET.parse('testtest.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
But what I get is: {http://ERABLE_HDR.com/ns1}Dates {}
If I go one level deeper by repeating the same loop
for child in child:
    print(child.tag, child.attrib)
I get this: {http://ERABLE_HDR.com/ns1}Date {'valeur': '14032019'}
And so on for each level..
Answer 0 (score: 1)
I suggest BeautifulSoup with the lxml parser (if I understood your request correctly):
from bs4 import BeautifulSoup
import pandas as pd
xml=b"""\
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<HDR_DONNEES xmlns="http://ERABLE_HDR.com/ns1">
<Dates>
<Date valeur="14032019">
<Depart ACR_DepartHTA="BDX" ACR_PosteSource="BDX" GdoDepart="V.LOTC0018" Nom_DepartHTA="BOURLANG" PS_DepartHTA="V.LOT" NomPosteSource="VILLELOT">
<M H="1150" UTM="20850" ITM="94" IFg="0" UNB="1" INB="1"/>
</Depart>
<Depart ACR_DepartHTA="BDX" ACR_PosteSource="BDX" GdoDepart="V.LOTC0005" Nom_DepartHTA="MARCHE G" PS_DepartHTA="V.LOT" NomPosteSource="VILLELOT">
<M H="1150" UTM="20850" ITM="41" IFg="0" UNB="1" INB="1"/>
</Depart>
<Depart ACR_DepartHTA="NTS" ACR_PosteSource="NTS" GdoDepart="PALLUC2703" Nom_DepartHTA="FROIDFON" PS_DepartHTA="PALLU" NomPosteSource="PALLUAU">
<M H="1140" UTM="0" ITM="0" IFg="100" UNB="0" INB="1"/>
</Depart>
</Date>
</Dates>
</HDR_DONNEES>"""
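# The lxml HTML parser lowercases tag and attribute names
soup = BeautifulSoup(xml, features="lxml")
data = {}
for i, depart in enumerate(soup.find_all('depart')):
    # Start each row from the <Depart> attributes
    data[i] = depart.attrs
    for m in depart.findChildren():
        # Store the <M> attribute values as a list
        data[i]['m'] = list(m.attrs.values())
df = pd.DataFrame.from_dict(data, orient='index')
print(df)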
This returns:
   acr_departhta  acr_postesource   gdodepart  nom_departhta  ps_departhta  nompostesource                           m
0            BDX              BDX  V.LOTC0018       BOURLANG         V.LOT        VILLELOT  [1150, 20850, 94, 0, 1, 1]
1            BDX              BDX  V.LOTC0005       MARCHE G         V.LOT        VILLELOT  [1150, 20850, 41, 0, 1, 1]
2            NTS              NTS  PALLUC2703       FROIDFON         PALLU        VILLELOT  [1140, 0, 0, 100, 0, 1]
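Note that the loop above keeps only the values of the last M child of each Depart. If a Depart can hold several M elements, or if the attribute names should be kept, something like the following may be closer to the structure in the question. This is only a minimal sketch using the standard library's ElementTree. The helper name `departs_to_rows` and the prefix `h` are made up for illustration, and it assumes that `creationDate` is meant to come from the `valeur` attribute of the enclosing Date element. The XML is a trimmed copy of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML from the question (two of the three Depart elements).
XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<HDR_DONNEES xmlns="http://ERABLE_HDR.com/ns1">
<Dates>
<Date valeur="14032019">
<Depart ACR_DepartHTA="BDX" ACR_PosteSource="BDX" GdoDepart="V.LOTC0018" Nom_DepartHTA="BOURLANG" PS_DepartHTA="V.LOT" NomPosteSource="VILLELOT">
<M H="1150" UTM="20850" ITM="94" IFg="0" UNB="1" INB="1"/>
</Depart>
<Depart ACR_DepartHTA="NTS" ACR_PosteSource="NTS" GdoDepart="PALLUC2703" Nom_DepartHTA="FROIDFON" PS_DepartHTA="PALLU" NomPosteSource="PALLUAU">
<M H="1140" UTM="0" ITM="0" IFg="100" UNB="0" INB="1"/>
</Depart>
</Date>
</Dates>
</HDR_DONNEES>"""

# "h" is an arbitrary local alias for the document's default namespace.
NS = {"h": "http://ERABLE_HDR.com/ns1"}


def departs_to_rows(xml_bytes):
    """Return one dict per Depart, with its M children as a list of dicts."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for date in root.findall("h:Dates/h:Date", NS):
        for depart in date.findall("h:Depart", NS):
            # Attributes are not namespaced, so their names stay as written in the XML.
            row = dict(depart.attrib)
            # Assumption: creationDate corresponds to the Date element's valeur attribute.
            row["creationDate"] = date.get("valeur")
            # One dict per M element, keyed by attribute name.
            row["m"] = [dict(m.attrib) for m in depart.findall("h:M", NS)]
            rows.append(row)
    return rows


rows = departs_to_rows(XML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`. Unlike the lxml HTML parser, ElementTree does not lowercase names, so the columns would need renaming to match the target schema exactly.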