Question

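https://drive.google.com/open?id=1M66WaMkwfkDoFW41MSwuKG4ZHyuSxFrO Folks, this is my first time working with XML. I have read a lot of posts but still cannot process my data. The link contains a portion of my data (the file includes 3 entities (mensaje); the original has about 35,000). From this data I need to create a pandas DataFrame.

Each row of the DataFrame should correspond to one <mensaje>. The first column must be <numerosolicitud>********</numerosolicitud> and the second <codigocliente>**********</codigocliente>. Then I need one column for each <cuestionario><pregunta cod=***. I think the 98 "cod" values are the same across all "mensajes". I need those "cod" values as the column headers and, where a pregunta contains text, that text as the cell value.

I assume this is a basic task, but after several days of reading tutorials and posts I still need help. Any advice would be highly appreciated.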

Answer 1

I made a package for a similar use case. It should work here too.

pip install pandas_read_xml

You can do something like this:

import pandas_read_xml as pdx

df = pdx.read_xml('filename.xml', ['data', 'mensaje'])

To flatten the result, you can use

df = pdx.flatten(df)

or

df = pdx.fully_flatten(df)

Answer 2

I found a way to solve my problem. Someone may be able to make it more efficient, but this code works for me.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse(r"C:\path\of\your\file")
root = tree.getroot()
df = pd.DataFrame()
counter = 1
for mensaje in root.iter('mensaje'):
    # findtext returns None when the element is missing instead of raising AttributeError
    df.loc[counter, 'numerosolicitud'] = mensaje.findtext(".//numerosolicitud")
    df.loc[counter, 'codigocliente'] = mensaje.findtext(".//codigocliente")
    df.loc[counter, 'riesgocb'] = mensaje.findtext(".//riesgocb")
    # One column per <pregunta>, named after its "cod" attribute
    for child in mensaje.findall(".//pregunta"):
        df.loc[counter, str(child.attrib["cod"])] = child.text
    print(counter)  # progress indicator
    counter += 1

df.to_excel("output.xlsx")
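Since the answer asks whether this can be made more efficient: one common option is to collect each <mensaje> as a dict and build the DataFrame once at the end, rather than growing it cell by cell with df.loc. The following is only a sketch under assumptions. SAMPLE_XML and its <data> root are hypothetical, reconstructed from the tag names mentioned in the question (the real file was not inspected), and mensajes_to_frame is an illustrative helper name.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: the layout is assumed from the tag names in the question.
SAMPLE_XML = """<data>
  <mensaje>
    <numerosolicitud>1001</numerosolicitud>
    <codigocliente>C-01</codigocliente>
    <cuestionario>
      <pregunta cod="P1">yes</pregunta>
      <pregunta cod="P2">no</pregunta>
    </cuestionario>
  </mensaje>
  <mensaje>
    <numerosolicitud>1002</numerosolicitud>
    <codigocliente>C-02</codigocliente>
    <cuestionario>
      <pregunta cod="P1">no</pregunta>
      <pregunta cod="P2"/>
    </cuestionario>
  </mensaje>
</data>"""


def mensajes_to_frame(source):
    """Return one row per <mensaje> and one column per pregunta 'cod'."""
    root = ET.parse(source).getroot()
    rows = []
    for mensaje in root.iter("mensaje"):
        # findtext gives None if the element is absent
        row = {
            "numerosolicitud": mensaje.findtext(".//numerosolicitud"),
            "codigocliente": mensaje.findtext(".//codigocliente"),
        }
        # Each pregunta becomes a column keyed by its "cod" attribute
        for pregunta in mensaje.iterfind(".//pregunta"):
            row[pregunta.get("cod")] = pregunta.text
        rows.append(row)
    # Building the frame once avoids repeated enlargement of the DataFrame
    return pd.DataFrame(rows)


df = mensajes_to_frame(io.StringIO(SAMPLE_XML))
print(df)
```

For the real data, a file path can be passed in place of the io.StringIO object. Any "cod" that is missing from a given mensaje ends up as a missing value in that row.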

Convert XML data to a pandas DataFrame

2 Answers: