Create a pandas DataFrame from a nested XML file

Date: 2017-12-18 18:15:25

Tags: python xml pandas lxml

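This is a small part of the XML file. I want to build a database from it in which every tag gets a unique column name and the data is not repeated.

I tried lxml, and the best I have managed so far is a DataFrame that looks like this: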

"    
SRCSGT
DATE    11112017
AGENCY  Department of Veterans Affairs
OFFICE  Canandaigua VAMC   
LOCATION    Department of Veterans Affairs Medical Center
ZIP 14424
etc, etc, "

xml

<?xml version="1.0" encoding="UTF-8"?>
<NOTICES>
  <SRCSGT>
    <DATE>11112017</DATE>
    <AGENCY><![CDATA[Department of Veterans Affairs]]></AGENCY>
    <OFFICE><![CDATA[Canandaigua VAMC]]></OFFICE>
    <LOCATION><![CDATA[Department of Veterans Affairs Medical Center]]></LOCATION>
    <ZIP>14424</ZIP>
    <CLASSCOD>H</CLASSCOD>
    <NAICS>238210</NAICS>
    <OFFADD><![CDATA[Department of Veterans Affairs;400 Fort Hill Ave.;Canandaigua NY  14424]]></OFFADD>
    <SUBJECT><![CDATA[H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQUIPMENT AT THE SYRACUSE VA MEDICAL CENTER AND THE ROME COMMUNITY BASED OUTPATIENT CLINIC  ]]></SUBJECT>
    <SOLNBR><![CDATA[9069]]></SOLNBR>
    <RESPDATE>11172017</RESPDATE>
    <ARCHDATE>12172017</ARCHDATE>
    <CONTACT><![CDATA[COiyiyS, JUhhiuN<a href="mailto:Juggyui@va.gov">CONTRACT SPECIALIST</a>]]></CONTACT>
    <DESC><![CDATA[This is a Sources Sought Notice. (a) The Government does not intend to award a contract on the basis of this Sources Sought or to otherwise pay for the information solicited.(b) Although "proposal," "offeror," contractor, and "offeror" may be used in this sources sought notice, any response will be treated as information only. It shall not be used as a proposal.Attachment(s) if applicable. ]]></DESC>
    <LINK><![CDATA[https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532/9069/listing.html]]></LINK>
    <EMAIL>
      <ADDRESS><![CDATA[Jigjhgjas@va.gov]]></ADDRESS>
      <DESC><![CDATA[CONTRACT SPECIALIST]]></DESC>
    </EMAIL>
    <SETASIDE>N/A</SETASIDE>
    <RECOVERY_ACT>N</RECOVERY_ACT>
    <DOCUMENT_PACKAGES>
      <PACKAGE><![CDATA[Attachment]]></PACKAGE>
    </DOCUMENT_PACKAGES>
  </SRCSGT>
</NOTICES>
The code I wrote:

from lxml import etree as et
import pandas as pd

trees = et.parse('test.xml') #get xml file
root = trees.getroot() #get to root of file

tags = [] #list for holding all tags
datas = [] #list for holding all data in tags


for child in root: #root is a list of all elements in the xml file
    #print(child.tag)
    tt = child.tag #reads xml tag
    tags.append(tt)
    datas.append(child.text) #read xml tag data
    for c in child.findall('./'): # ./ finds children
        tt1 = c.tag
        tags.append(str(tt1))
        datas.append(c.text)
        for i in c.findall('./'): #each child node loads a new list of elements
            tt2 = i.tag
            tags.append(str(tt1)+ '_' + str(tt2))
            datas.append(i.text)
            for j in i.findall('./'):
                tt3 = j.tag
                tags.append(str(tt1)+ '_' + str(tt2) + '_' + str(tt3))
                datas.append(j.text)
                for k in j.findall('./'):
                    tt4 = k.tag
                    tags.append(str(tt1)+ '_' + str(tt2) + '_' + str(tt3) + '_' + str(tt4))
                    datas.append(k.text)

df = pd.DataFrame({"tags": tags,"values": datas})

The ideal result would look like this:

 date agency office
1/1/10  A1    O1
1/1/10  A2    O2
1/1/10  A3    O3

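So basically the tags should become the column headers and the rows should be filled with the data. Column names should not repeat, so that I can create a standard database table.

1 Answer:

Answer 0 (score: 1)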

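Consider nested XPath loops: first iterate over every <SRCSGT> node, then collect the text of all of that node's descendants into an inner dictionary, and append each dictionary to a list that is passed to the DataFrame constructor: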

from lxml import etree as et
import pandas as pd

trees = et.parse('test.xml')

d = []
for srcsgt in trees.xpath('//SRCSGT'):     # ITERATE THROUGH EACH <SRCSGT> RECORD
    inner = {}
    for elem in srcsgt.xpath('.//*'):      # './/*' LIMITS THE SEARCH TO THIS RECORD'S DESCENDANTS
        if elem.text and elem.text.strip():  # KEEP ONLY NODES WITH NON-EMPTY TEXT (text MAY BE None)
            inner[elem.tag] = elem.text

    d.append(inner)

df = pd.DataFrame(d)

Output

print(df)

#             ADDRESS                          AGENCY  ARCHDATE CLASSCOD  \
# 0  Jigjhgjas@va.gov  Department of Veterans Affairs  12172017        H   

#                                              CONTACT      DATE  \
# 0  COiyiyS, JUhhiuN<a href="mailto:Juggyui@va.gov...  11112017   

#                   DESC                                               LINK  \
# 0  CONTRACT SPECIALIST  https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532...   

#                                         LOCATION   NAICS  \
# 0  Department of Veterans Affairs Medical Center  238210   

#                                               OFFADD            OFFICE  \
# 0  Department of Veterans Affairs;400 Fort Hill A...  Canandaigua VAMC   

#       PACKAGE RECOVERY_ACT  RESPDATE SETASIDE SOLNBR  \
# 0  Attachment            N  11172017      N/A   9069   

#                                              SUBJECT    ZIP  
# 0  H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQ...  14424
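
Note that in the output above the DESC column holds "CONTRACT SPECIALIST": the nested <EMAIL><DESC> shares a tag name with the top-level <DESC> and overwrites it in the dictionary. Since the question asks for unique column names, one possible variation is to prefix each nested tag with its parent tags. The following is only a minimal sketch under that assumption; `flatten_record` is a hypothetical helper name, and the inline `SAMPLE` is a trimmed-down stand-in for test.xml.

```python
from lxml import etree as et
import pandas as pd

# Trimmed-down stand-in for test.xml (assumption: same shape as the question's file).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<NOTICES>
  <SRCSGT>
    <DATE>11112017</DATE>
    <AGENCY><![CDATA[Department of Veterans Affairs]]></AGENCY>
    <DESC><![CDATA[This is a Sources Sought Notice.]]></DESC>
    <EMAIL>
      <ADDRESS><![CDATA[Jigjhgjas@va.gov]]></ADDRESS>
      <DESC><![CDATA[CONTRACT SPECIALIST]]></DESC>
    </EMAIL>
  </SRCSGT>
</NOTICES>"""


def flatten_record(elem, prefix="", row=None):
    """Hypothetical helper: map each descendant's tag path to its text."""
    if row is None:
        row = {}
    for child in elem:
        if not isinstance(child.tag, str):    # skip comments and processing instructions
            continue
        name = prefix + child.tag             # e.g. "EMAIL_" + "DESC" -> "EMAIL_DESC"
        text = (child.text or "").strip()     # child.text can be None
        if text:
            row[name] = text
        flatten_record(child, name + "_", row)  # recurse with the extended prefix
    return row


root = et.fromstring(SAMPLE)
df = pd.DataFrame([flatten_record(rec) for rec in root.xpath("//SRCSGT")])
print(df.columns.tolist())
```

With this naming, the top-level description stays in DESC and the e-mail description lands in EMAIL_DESC. Repeated siblings with the same tag (for example several <PACKAGE> elements under one record) would still overwrite each other and would need a counter suffix or a separate child table.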