我已经编写了一个小的python脚本来基于Python中的Liza Daly's blog解析XML数据。但是,我的代码无法解析所有节点。因此,例如,当一个人有多个地址时,它仅使用第一个可用地址。 XML树如下所示:
- lgs
- entities
- entity
- id
- name
- addressess
- address
- address1
- address
- address1
- entity
- id
(...)
这将是python脚本:
import os
import time
from datetime import datetime
import lxml.etree as ET
import pandas as pd
xml_file = '.\\FILE.XML'
file_name, file_extension = os.path.splitext(os.path.basename(xml_file))
def fast_iter(context, *args, **kwargs):
# iterate through the XML file and create a dictionary
xml_dict = { "Id":[]
, "name":[]
, "address":[]
, "type":[] }
def try_to_append(xml_column, node):
# find and append XML nodes to the empty dictionary
try:
xml_dict[xml_column].append(elem.find(node).text)
except:
xml_dict[xml_column].append('')
for event, elem in context:
try_to_append('Id', 'id')
try_to_append('name', 'name')
try_to_append('address', 'addresses/address/address1'
elem.clear()
for ancestor in elem.xpath("ancestor-or-self::*"):
while ancestor.getprevious() is not None:
del ancestor.getparent()[0]
del context
return xml_dict
context = ET.iterparse(xml_file, tag='entity')
xml_dict = fast_iter(context)
df = pd.DataFrame(xml_dict)
所以我得到的是:
ID | name | address | Type
1 | John Doe | Pythonstreet 4 | A
但是约翰·多伊(John Doe)居住的名字上有3个地址。所以我期望:
ID | name | address | Type
1 | John Doe | Pythonstreet 4 | A
1 | John Doe | SQL street 33 | A
1 | John Doe | C++ street 99 | A
如您所见,由于唯一的额外节点是地址,其余的节点不会改变,只有地址被额外增加。当您将OPENROWSET()与BULK一起使用时,SQL Server将自动执行此操作。但是,该文件大于3GB,因此无法正常工作。
<?xml version='1.0' encoding='UTF-8'?>
<ffl>
<version>12345</version>
<entities>
<entity id="1124353" version="12345">
<name>DAVID, Beckham</name>
<footId>1021</footId>
<footCode>FIF</footCode>
<createdDate>09/02/1991</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Leytonstone, London, United Kingdom</pob>
</pobs>
<dobs>
<dob>May 02, 1975</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address></address>
<city>London</city>
<country>UK</country>
<countryName>UNITED KINGDOM</countryName>
<postalCode>SE10 0JF</postalCode>
</address>
<address>
<address1>35-37 Parkgate Road</address1>
<city>London</city>
<country>UK</country>
<countryName>UNITED KINGDOM</countryName>
<postalCode>SW11 4NP</postalCode>
</address>
</addresses>
</entity>
<entity id="1184359" version="12345">
<name>CRISTIANO, Ronaldo</name>
<footId>1022</footId>
<footCode>FIF</footCode>
<createdDate>20/03/2003</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Funchal, Madeira, Portugal</pob>
</pobs>
<dobs>
<dob>February 05, 1985</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address>Avenida da Boavista 1837</address>
<city>Porto</city>
<country>PT</country>
<countryName>PORTUGAL</countryName>
<postalCode>4100-133</postalCode>
</address>
<address>
<address1>Extramuros 74</address1>
<city>Madrid</city>
<country>ES</country>
<countryName>Spain</countryName>
<postalCode>28400</postalCode>
</address>
<address>
<address1>Viale Certosa 29</address1>
<city>Turin</city>
<country>IT</country>
<countryName>Italy</countryName>
<postalCode>10093</postalCode>
</address>
</addresses>
</entity>
<entity id="1984359" version="12345">
<name>LIONEL, Messi</name>
<footId>1023</footId>
<footCode>FIF</footCode>
<createdDate>09/02/2008</createdDate>
<source>FIF</source>
<OriginalSource>FIFA</OriginalSource>
<pobs>
<pob>Rosario, Argentina</pob>
</pobs>
<dobs>
<dob>June 24, 1987</dob>
</dobs>
<titles>
<title>Football player</title>
</titles>
<addresses>
<address>
<address>Almeyra 2588</address>
<city>San Martin</city>
<state>Buenos Aires</state>
<country>AR</country>
<countryName>ARGENTINA</countryName>
<postalCode>N/A</postalCode>
</address>
<address>
<address1>Comandante Izarduy 67</address1>
<city>Barcelona</city>
<country>ES</country>
<countryName>SPAIN</countryName>
<postalCode>08940</postalCode>
</address>
<address>
<address1>Humahuaca 4425</address1>
<city>Buenos Aires</city>
<country>AR</country>
<countryName>ARGENTINA</countryName>
<postalCode>N/A</postalCode>
</address>
</addresses>
</entity>
</entities>
</ffl>
答案 0 :(得分:0)
评论:由于它现在仅输出结果
输出结果仅用于演示,跟踪和调试。
要将record
和addresses
写到SQL
数据库中(例如使用 sqlite3
),请执行以下操作:
c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
addr[1].update({'id': record['id']})
addresses.append(addr[1])
c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
为了大熊猫而扁平化
循环之外的前提: df = pd.DataFrame()
from copy import copy
addresses = copy(record['addresses'])
del record['addresses']
df_records = []
for addr in addresses:
record.update(addr[1])
df_records.append(record)
df = df.append(df_records, ignore_index=True)
问题:使用
etree.iterparse
将XML文件中的所有节点包括在内
以下class Entity
做:
XML
解析lxml.etree.iterparse
文件。 <entity>...</entity>
元素树在处理后被删除。 <entity>...</entity>
树构建dict {tag, value, ...}
。 generator objects
至yield
到dict
。 <addresses>/<address>
是元组[(address, {tag, text})...
的列表。 待办事项:
- 要整理成许多记录,请循环
record['addresses']
- 要等于不同的标签名称:
address
和address1
- 要展平,请使用序列标签,例如
<titels>
,<probs>
和<dobs>
from lxml import etree
class Entity:
def __init__(self, fh):
"""
Initialize 'iterparse' to only generate 'end' events on tag '<entity>'
:param fh: File Handle from the XML File to parse
"""
self.context = etree.iterparse(fh, events=("end",), tag=['entity'])
def _parse(self):
"""
Parse the XML File for all '<entity>...</entity>' Elements
Clear/Delete the Element Tree after processing
:return: Yield the current '<entity>...</entity>' Element Tree
"""
for event, elem in self.context:
yield elem
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
def sequence(self, elements):
"""
Expand a Sequence Element, e.g. <titels> to a Tuple ('titel', text).
If found a nested Sequence Element, e.g. <address>,
to a Tuple ('address', {tag, text})
:param elements: The Sequence Element
:return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value))
"""
_elements = []
for elem in elements:
if len(elem):
_elements.append((elem.tag, dict(self.sequence(elem))))
else:
_elements.append((elem.tag, elem.text))
return _elements
def __iter__(self):
"""
Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()
:return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
"""
for xml_entity in self._parse():
entity = {'id': xml_entity.attrib['id']}
for elem in xml_entity:
# if elem is Sequence
if len(elem):
# Append tuple(tag, value)
entity[elem.tag] = self.sequence(elem)
else:
entity[elem.tag] = elem.text
yield entity
if __name__ == "__main__":
with open('.\\FILE.XML', 'rb') as in_xml_
for record in Entity(in_xml):
print("record:{}".format(record))
for key, value in record.items():
if isinstance(value, (list)):
#print_list(key, value)
print("{}:{}".format(key, value))
else:
print("{}:{}".format(key, value))
输出:仅显示仅第一个“记录”和仅 4 个字段。
注意:存在一个具有唯一标记名称的陷阱:address
和address1
record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity) id:1124353 name:DAVID, Beckham titles:[('title', 'Football player')] addresses: address:{'city': 'London', 'address': None, 'post... (omitted for brevity) address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
使用Python测试:3.5-lxml.etree:3.7.1