迭代大XML,占用内存少,并获得所有甚至嵌套的序列元素

时间:2018-12-14 07:52:16

标签: python xml lxml sequence iterparse

我已经编写了一个小的python脚本来基于Python中的Liza Daly's blog解析XML数据。但是,我的代码无法解析所有节点。因此,例如,当一个人有多个地址时,它仅使用第一个可用地址。 XML树如下所示:

- lgs
    - entities
        - entity
            - id
            - name
            - addressess
                - address
                    - address1
                - address
                    - address1
        - entity
            - id
              (...)

这将是python脚本:

import os
import time
from datetime import datetime
import lxml.etree as ET
import pandas as pd

xml_file = '.\\FILE.XML'

file_name, file_extension = os.path.splitext(os.path.basename(xml_file))

def fast_iter(context, *args, **kwargs):
# iterate through the XML file and create a dictionary 
    xml_dict = {  "Id":[]
    , "name":[]
    , "address":[]
    , "type":[] }

    def try_to_append(xml_column, node):
        # find and append XML nodes to the empty dictionary
        try:
            xml_dict[xml_column].append(elem.find(node).text)
        except:
            xml_dict[xml_column].append('')

    for event, elem in context:
        try_to_append('Id', 'id')
        try_to_append('name', 'name')
        try_to_append('address', 'addresses/address/address1'
        elem.clear()

        for ancestor in elem.xpath("ancestor-or-self::*"):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
        del context
        return xml_dict

context = ET.iterparse(xml_file, tag='entity')
xml_dict = fast_iter(context)
df = pd.DataFrame(xml_dict)

所以我得到的是:

ID | name     | address        | Type
1  | John Doe | Pythonstreet 4 | A

但是约翰·多伊(John Doe)居住的名字上有3个地址。所以我期望:

ID | name     | address        | Type
1  | John Doe | Pythonstreet 4 | A
1  | John Doe | SQL street 33  | A
1  | John Doe | C++ street 99  | A

如您所见,由于唯一的额外节点是地址,其余的节点不会改变,只有地址被额外增加。当您将OPENROWSET()与BULK一起使用时,SQL Server将自动执行此操作。但是,该文件大于3GB,因此无法正常工作。

更新:XML示例(假地址)

<?xml version='1.0' encoding='UTF-8'?>
<ffl>
<version>12345</version>
<entities>

    <entity id="1124353" version="12345">
        <name>DAVID, Beckham</name>
        <footId>1021</footId>
        <footCode>FIF</footCode>
        <createdDate>09/02/1991</createdDate>

        <source>FIF</source>
        <OriginalSource>FIFA</OriginalSource>
        <pobs>
            <pob>Leytonstone, London, United Kingdom</pob>
        </pobs>
        <dobs>
            <dob>May 02, 1975</dob>
        </dobs>
        <titles>
            <title>Football player</title>
        </titles>
        <addresses>
            <address>
                <address></address>
                <city>London</city>
                <country>UK</country>
                <countryName>UNITED KINGDOM</countryName>
                <postalCode>SE10 0JF</postalCode>
            </address>
            <address>
                <address1>35-37 Parkgate Road</address1>
                <city>London</city>
                <country>UK</country>
                <countryName>UNITED KINGDOM</countryName>
                <postalCode>SW11 4NP</postalCode>
            </address>
        </addresses>
    </entity>

    <entity id="1184359" version="12345">
        <name>CRISTIANO, Ronaldo</name>
        <footId>1022</footId>
        <footCode>FIF</footCode>
        <createdDate>20/03/2003</createdDate>
        <source>FIF</source>
        <OriginalSource>FIFA</OriginalSource>
        <pobs>
            <pob>Funchal, Madeira, Portugal</pob>
        </pobs>
        <dobs>
            <dob>February 05, 1985</dob>
        </dobs>
        <titles>
            <title>Football player</title>
        </titles>
        <addresses>
            <address>
                <address>Avenida da Boavista 1837</address>
                <city>Porto</city>
                <country>PT</country>
                <countryName>PORTUGAL</countryName>
                <postalCode>4100-133</postalCode>
            </address>
            <address>
                <address1>Extramuros 74</address1>
                <city>Madrid</city>
                <country>ES</country>
                <countryName>Spain</countryName>
                <postalCode>28400</postalCode>
            </address>
            <address>
                <address1>Viale Certosa 29</address1>
                <city>Turin</city>
                <country>IT</country>
                <countryName>Italy</countryName>
                <postalCode>10093</postalCode>
            </address>
        </addresses>
    </entity>

    <entity id="1984359" version="12345">
        <name>LIONEL, Messi</name>
        <footId>1023</footId>
        <footCode>FIF</footCode>
        <createdDate>09/02/2008</createdDate>
        <source>FIF</source>
        <OriginalSource>FIFA</OriginalSource>
        <pobs>
            <pob>Rosario, Argentina</pob>
        </pobs>
        <dobs>
            <dob>June 24, 1987</dob>
        </dobs>
        <titles>
            <title>Football player</title>
        </titles>
        <addresses>
            <address>
                <address>Almeyra 2588</address>
                <city>San Martin</city>
                <state>Buenos Aires</state>
                <country>AR</country>
                <countryName>ARGENTINA</countryName>
                <postalCode>N/A</postalCode>
            </address>
            <address>
                <address1>Comandante Izarduy 67</address1>
                <city>Barcelona</city>
                <country>ES</country>
                <countryName>SPAIN</countryName>
                <postalCode>08940</postalCode>
            </address>
            <address>
                <address1>Humahuaca 4425</address1>
                <city>Buenos Aires</city>
                <country>AR</country>
                <countryName>ARGENTINA</countryName>
                <postalCode>N/A</postalCode>
            </address>
        </addresses>
    </entity>

</entities>

</ffl>

1 个答案:

答案 0 :(得分:0)

  

评论:由于它现在仅输出结果

输出结果仅用于演示,跟踪和调试。
要将recordaddresses写到SQL数据库中(例如使用 sqlite3 ),请执行以下操作:

c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})
    addresses.append(addr[1])
c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)

为了大熊猫而扁平化
循环之外的前提: df = pd.DataFrame()

from copy import copy

addresses = copy(record['addresses'])
del record['addresses']

df_records = []
for addr in addresses:
    record.update(addr[1])
    df_records.append(record)

df = df.append(df_records, ignore_index=True)

  

问题:使用etree.iterparse将XML文件中的所有节点包括在内

以下class Entity做:

  • 使用XML解析lxml.etree.iterparse文件。
  • 没有文件大小限制,因为<entity>...</entity>元素树在处理后被删除
  • 从每棵<entity>...</entity>树构建dict {tag, value, ...}
  • 使用generator objectsyielddict
  • 序列元素,例如<addresses>/<address>是元组[(address, {tag, text})...的列表。
  

待办事项

     
      
  • 要整理成许多记录,请循环record['addresses']
  •   
  • 要等于不同的标签名称:addressaddress1
  •   
  • 要展平,请使用序列标签,例如<titels><probs><dobs>
  •   

from lxml import etree

class Entity:
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        """
        self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

    def _parse(self):
        """
        Parse the XML File for all '<entity>...</entity>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current '<entity>...</entity>' Element Tree
        """
        for event, elem in self.context:
            yield elem

            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def sequence(self, elements):
        """
        Expand a Sequence Element, e.g. <titels> to a Tuple ('titel', text).
        If found a nested Sequence Element, e.g. <address>,
          to a Tuple ('address', {tag, text})

        :param elements: The Sequence Element
        :return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value))
        """
        _elements = []
        for elem in elements:
            if len(elem):
                _elements.append((elem.tag, dict(self.sequence(elem))))
            else:
                _elements.append((elem.tag, elem.text))

        return _elements

    def __iter__(self):
        """
        Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
        """
        for xml_entity in self._parse():
            entity = {'id': xml_entity.attrib['id']}

            for elem in xml_entity:
                # if elem is Sequence
                if len(elem):
                    # Append tuple(tag, value)
                    entity[elem.tag] = self.sequence(elem)
                else:
                    entity[elem.tag] = elem.text

            yield entity

if __name__ == "__main__":
    with open('.\\FILE.XML', 'rb') as in_xml_
        for record in Entity(in_xml):
            print("record:{}".format(record))

            for key, value in record.items():
                if isinstance(value, (list)):
                    #print_list(key, value)
                    print("{}:{}".format(key, value))
                else:
                    print("{}:{}".format(key, value))
  

输出:仅显示第一个“记录”和仅 4 个字段。
  注意:存在一个具有唯一标记名称的陷阱addressaddress1

record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
    address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
    address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)

使用Python测试:3.5-lxml.etree:3.7.1