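How do I convert an XML file to a pandas DataFrame?

Time: 2019-08-08 12:27:10

Tags: python xml dataframe beautifulsoup xmltodict

I can't convert XML to a pandas DataFrame

Could you please help me parse this XML into a pandas DataFrame? I can't seem to get it working. This is how far I got: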

import xmltodict 
import pandas as pd
import requests
from bs4 import BeautifulSoup
def get_xml():
    url="http://energywatch.natgrid.co.uk/EDP-PublicUI/PublicPI/InstantaneousFlowWebService.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}
    body ="""<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
                <soap12:Body>
                <GetInstantaneousFlowData xmlns="http://www.NationalGrid.com/EDP/UI/" />
                </soap12:Body>
                </soap12:Envelope>"""

    response = requests.post(url,data=body,headers=headers)
    return response.content

response = get_xml()
soup = BeautifulSoup(response, 'lxml')
table_columns = []
for item in soup.find_all(['EDPObjectName'.lower()]):
    table_columns.append(item.text)
table_columns=pd.DataFrame(table_columns)
table_rows=[]
for item in soup.find_all(['applicableat']):
    table_rows.append(item.text) 
df1=pd.DataFrame(table_rows).drop_duplicates() 
#df1=pd.to_datetime(df1)
table=[]
for item in soup.find_all(['flowrate']):
    table.append(item.text) 
df=pd.DataFrame(table)
df_final=pd.DataFrame(df, columns=table_columns, index=df1)

This is the result I am looking for:

                        ALDBROUGH   AVONMOUTH   BACTON BBL  …
    2019-08-08T13:00:00 0           1.23        5.1         …
    2019-08-08T13:02:00 0           1.23        5.1         …
    2019-08-08T13:04:00 0           3.23        5.1         …
    2019-08-08T13:06:00 0           3.23        5.1         …
    2019-08-08T13:08:00 0           3.23        5.23        …
    2019-08-08T13:10:00 0           4.23        5.204       …

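2 Answers:

Answer 0 (score: 0)

This problem is very similar to other XML parsing problems: you have a hierarchical data structure and need to flatten it. The solution I propose turns timestamp, place, and flow rate into columns, and makes each log entry a row. I also followed the principle of parsimony, that is, I tried to parse the XML so that the flattened data is already in the format that is easiest to convert into a DataFrame. The variable "data" is a dictionary with one key per column. Each value is a list, and an entry's position in the list indicates which row it belongs to: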

import pandas as pd
import requests
from bs4 import BeautifulSoup
def get_xml():
    url="http://energywatch.natgrid.co.uk/EDP-PublicUI/PublicPI/InstantaneousFlowWebService.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}
    body ="""<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope"><soap12:Body><GetInstantaneousFlowData xmlns="http://www.NationalGrid.com/EDP/UI/" /></soap12:Body></soap12:Envelope>"""
    response = requests.post(url,data=body,headers=headers)
    return response.content

response = get_xml()
soup = BeautifulSoup(response, 'lxml')

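# One key per column; each list holds that column's values, one per row.
data = {'timestamp':[], 'place':[], 'flowrate':[]}

# The 'lxml' HTML parser lower-cases tag names, hence the lower-case lookups.
for group in soup.find_all('edpobjectbe'):
    place = group.find('edpobjectname').text
    # Each 'edpenergydatabe' element is one log entry for this place.
    for xml in group.find_all('edpenergydatabe'):
        data['place'].append(place)
        data['timestamp'].append(xml.find('applicableat').text)
        data['flowrate'].append(xml.find('flowrate').text)

df = pd.DataFrame(data)
df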

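Note that I call find_all() on the parent element "edpenergydatabe", which ensures that each timestamp is paired with the flow rate from the same entry. If you want the rows and columns arranged differently, you can now use pandas functions such as transpose() to do that. I hope this helps you get on the right track!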
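For reference, here is a minimal offline sketch of the same flattening idea, followed by one possible way to reach the wide layout shown in the question. Several things here are assumptions rather than part of the answer above: the sample XML is made up (no SOAP envelope or namespaces, and only EDPObjectName's casing is confirmed by the question), the standard library's xml.etree.ElementTree stands in for BeautifulSoup so no network call is needed, the helper names flatten and to_wide are hypothetical, and pivot() is used instead of transpose().

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up sample that mimics the nesting used above. Assumption: the real
# response wraps these elements in a SOAP envelope with namespaces.
SAMPLE_XML = """
<root>
  <EDPObjectBE>
    <EDPObjectName>ALDBROUGH</EDPObjectName>
    <EDPEnergyDataBE><ApplicableAt>2019-08-08T13:00:00</ApplicableAt><FlowRate>0</FlowRate></EDPEnergyDataBE>
    <EDPEnergyDataBE><ApplicableAt>2019-08-08T13:02:00</ApplicableAt><FlowRate>0</FlowRate></EDPEnergyDataBE>
    <EDPEnergyDataBE><ApplicableAt>2019-08-08T13:04:00</ApplicableAt><FlowRate>0</FlowRate></EDPEnergyDataBE>
  </EDPObjectBE>
  <EDPObjectBE>
    <EDPObjectName>AVONMOUTH</EDPObjectName>
    <EDPEnergyDataBE><ApplicableAt>2019-08-08T13:00:00</ApplicableAt><FlowRate>1.23</FlowRate></EDPEnergyDataBE>
    <EDPEnergyDataBE><ApplicableAt>2019-08-08T13:02:00</ApplicableAt><FlowRate>1.23</FlowRate></EDPEnergyDataBE>
    <EDPEnergyDataBE><ApplicableAt>2019-08-08T13:04:00</ApplicableAt><FlowRate>3.23</FlowRate></EDPEnergyDataBE>
  </EDPObjectBE>
</root>
"""


def flatten(xml_text):
    """Hypothetical helper: turn the nested XML into a dict of column lists."""
    data = {'timestamp': [], 'place': [], 'flowrate': []}
    root = ET.fromstring(xml_text)
    for group in root.iter('EDPObjectBE'):
        place = group.findtext('EDPObjectName')
        # One row per log entry; the place name is repeated on every row.
        for entry in group.iter('EDPEnergyDataBE'):
            data['place'].append(place)
            data['timestamp'].append(entry.findtext('ApplicableAt'))
            data['flowrate'].append(entry.findtext('FlowRate'))
    return data


def to_wide(data):
    """Hypothetical helper: one row per timestamp, one column per place."""
    df = pd.DataFrame(data)
    # The parsed values are strings, so convert them before reshaping.
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['flowrate'] = pd.to_numeric(df['flowrate'])
    return df.pivot(index='timestamp', columns='place', values='flowrate')


wide = to_wide(flatten(SAMPLE_XML))
print(wide)
```

pivot() raises an error if the same timestamp/place pair occurs more than once. If the real data contains such repeats, dropping duplicates first or using pivot_table() with an aggregation function may be a better fit.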

Answer 1 (score: 0)

Try using:

G

Check whether it works for you!