Python - Converting a very large (6.4 GB) XML file to JSON

Time: 2013-10-10 02:37:01

Tags: python xml json

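Basically, I have a 6.4 GB XML file that I want to convert to JSON and then save to disk. I am running OS X 10.8.4 on an i7 2700k with 16 GB of RAM, with 64-bit Python (double-checked). I get an error saying there is not enough memory to allocate. How do I get around this?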

import json
import xmltodict

print 'Opening'
f = open('large.xml', 'r')
data = f.read()
f.close()

print 'Converting'
newJSON = xmltodict.parse(data)

print 'Json Dumping'
newJSON = json.dumps(newJSON)

print 'Saving'
f = open('newjson.json', 'w')
f.write(newJSON)
f.close()

Error:

Python(2461) malloc: *** mmap(size=140402048315392) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/Users/user/Git/Resources/largexml2json.py", line 10, in <module>
    data = f.read()
MemoryError

3 answers:

Answer 0 (score: 8)

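Many Python XML libraries support parsing XML sub-elements incrementally, for example xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. Functions like these are usually called "XML stream parsers".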

The xmltodict library you are using also has a streaming mode. I think it may solve your problem:

https://github.com/martinblech/xmltodict#streaming-mode
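To make the incremental approach concrete, here is a minimal sketch that uses only the standard library's iterparse. It assumes the file is one root element holding many flat record elements (the tag name "item" and the helper names stream_records and xml_to_json_lines are made up for illustration), and it writes one JSON object per line instead of a single large JSON document.

```python
import io
import json
import xml.etree.ElementTree as ET


def stream_records(source, record_tag):
    """Yield one dict per <record_tag> element without keeping the whole tree.

    Assumption: records are direct children of the root and contain only
    attributes and simple text children.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            record = dict(elem.attrib)
            for child in elem:
                record[child.tag] = (child.text or "").strip()
            yield record
            root.clear()  # drop already-processed records from the tree


def xml_to_json_lines(source, out, record_tag):
    """Write each record as one JSON object per line; return the record count."""
    count = 0
    for record in stream_records(source, record_tag):
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


# Small in-memory demo; for a real file pass a filename and an open output file.
sample = io.BytesIO(
    b"<items><item id='1'><name>foo</name></item>"
    b"<item id='2'><name>bar</name></item></items>"
)
demo_out = io.StringIO()
xml_to_json_lines(sample, demo_out, "item")
print(demo_out.getvalue(), end="")
```

Because each record is written and then discarded, memory use depends on the size of one record rather than on the size of the file.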

Answer 1 (score: 2)

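Instead of trying to read the file in one go and then process it, you want to read it in chunks and process each chunk as it is loaded. This is a fairly common situation when dealing with large XML files, and it is covered by the Simple API for XML (SAX) standard, which specifies a callback API for parsing XML streams. Both xml.sax.parse and xml.etree.ElementTree, mentioned above, are part of the Python standard library.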
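For illustration, the SAX callback style looks roughly like the sketch below. The RecordHandler class, the "item" tag and the flat record layout are assumptions made for the example; only xml.sax.parse and xml.sax.ContentHandler come from the standard library.

```python
import io
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Builds a dict for each <record_tag> element and passes it to a callback."""

    def __init__(self, record_tag, on_record):
        super().__init__()
        self.record_tag = record_tag
        self.on_record = on_record
        self.record = None  # dict being built, or None outside a record
        self.field = None   # name of the child element currently open
        self.chunks = []    # text pieces; characters() may fire several times

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.record = dict(attrs.items())
        elif self.record is not None:
            self.field = name
            self.chunks = []

    def characters(self, content):
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.record_tag:
            self.on_record(self.record)  # hand the finished record over
            self.record = None
        elif self.record is not None and name == self.field:
            self.record[name] = "".join(self.chunks).strip()
            self.field = None


# Small in-memory demo; xml.sax.parse also accepts a filename.
records = []
sample = io.BytesIO(
    b"<items><item id='1'><name>foo</name></item>"
    b"<item id='2'><name>bar</name></item></items>"
)
xml.sax.parse(sample, RecordHandler("item", records.append))
print(records)
```

The parser never builds a tree here. The handler keeps only the record currently being read, so the callback can write each one out (for example with json.dumps) as soon as it arrives.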

Here is a quick XML-to-JSON converter:

from collections import defaultdict
import json
import sys
import xml.etree.ElementTree as ET

def parse_xml(file_name):
    events = ("start", "end")
    context = ET.iterparse(file_name, events=events)

    return pt(context)

def pt(context, cur_elem=None):
    items = defaultdict(list)

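    # An Element with no children is falsy, and at the "start" event no
    # children exist yet, so test against None to actually copy attributes.
    if cur_elem is not None: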
        items.update(cur_elem.attrib)

    text = ""

    for action, elem in context:
        # print("{0:>6} : {1:20} {2:20} '{3}'".format(action, elem.tag, elem.attrib, str(elem.text).strip()))

        if action == "start":
            items[elem.tag].append(pt(context, elem))
        elif action == "end":
            text = elem.text.strip() if elem.text else ""
            break

    if len(items) == 0:
        return text

    return { k: v[0] if len(v) == 1 else v for k, v in items.items() }

if __name__ == "__main__":
    json_data = parse_xml("large.xml")
    print(json.dumps(json_data, indent=2))

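If you are doing a lot of XML processing, take a look at the lxml library. It offers many useful features beyond the standard modules and is also easier to use.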

http://lxml.de/tutorial.html

Answer 2 (score: 0)

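Here is a Python 3 script that uses xmltodict's streaming feature to convert XML of a certain structure to JSON. The script keeps very little in memory, so there is no limit on the size of the input. It makes a lot of assumptions, but it may get you started. Your mileage may vary; I hope it helps.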

#!/usr/bin/env python3
"""
Converts an XML file with a single outer list element
and a repeated list member element to JSON on stdout.
Processes large XML files with minimal memory using the
streaming feature of https://github.com/martinblech/xmltodict
which is required ("pip install xmltodict").

Expected input structure (element names are just examples):
  <mylist attr="a">
    <myitem name="foo"></myitem>
    <myitem name="bar"></myitem>
    <myitem name="baz"></myitem>
  </mylist>

Output:
  {
    "mylist": {
      "attr": "a",
      "myitem": [
        {
          "name": "foo"
        },
        {
          "name": "bar"
        },
        {
          "name": "baz"
        }
      ]
    }
  }
"""
import json
import os
import sys
import xmltodict


ROOT_SEEN = False


def handle_item(path, element):
    """
    Called by xmltodict on every item found at the specified depth.
    This requires a depth >= 2.
    """
    # print("path {} -> element: {}".format(path, element))
    global ROOT_SEEN
    if path is None and element is None:
        # after element n
        print(']')  # list of items
        print('}')  # outer list
        print('}')  # root
        return False
    elif ROOT_SEEN:
        # element 2..n
        print(",")
    else:
        # element 1
        ROOT_SEEN = True
        print('{')  # root
        # each path item is a tuple (name, OrderedDict)
        print('"{}"'.format(path[0][0]) + ': {')  # outer list
        # emit any root element attributes
        if path[0][1] is not None and len(path[0][1]) > 0:
            for key, value in path[0][1].items():
                print('"{}":"{}",'.format(key, value))
        # use the repeated element name for the JSON list
        print('"{}": ['.format(path[1][0]))  # list of items

    # Emit attributes and contents by merging the contents into
    # the ordered dict of attributes so the attr appear first.
    if path[1][1] is not None and len(path[1][1]) > 0:
        ordict = path[1][1]
        ordict.update(element)
    else:
        ordict = element
    print(json.dumps(ordict, indent=2))
    return True


def usage(args, err=None):
    """
    Emits a message and exits.
    """
    if err:
        print("{}: {}".format(args[0], err), file=sys.stderr)
    print("Usage: {} <xml-file-name>".format(args[0]), file=sys.stderr)
    sys.exit(1)  # non-zero status: usage() is only called on an error


if __name__ == '__main__':
    if len(sys.argv) != 2:
        usage(sys.argv)
    xmlfile = sys.argv[1]
    if not os.path.isfile(xmlfile):
        usage(sys.argv, 'Not found or not a file: {}'.format(xmlfile))
    with open(xmlfile, 'rb') as f:
        # Set item_depth to turn on the streaming feature
        # Do not prefix attribute keys with @
        xmltodict.parse(f, item_depth=2, attr_prefix='', item_callback=handle_item)
    handle_item(None, None)
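The comma bookkeeping that the script does with ROOT_SEEN can also be isolated into a small helper. The sketch below is one possible way to write any iterable of items as a single JSON array without holding the full list in memory. The name write_json_array is made up for this example and is not part of xmltodict or the json module.

```python
import io
import json


def write_json_array(items, out):
    """Write an iterable of JSON-serializable items to `out` as one JSON array.

    Items are serialized one at a time, so `items` can be a generator that
    produces records lazily from a streaming XML parser.
    """
    out.write("[")
    first = True
    for item in items:
        if not first:
            out.write(",")  # separator goes before every item except the first
        out.write("\n" + json.dumps(item))
        first = False
    out.write("\n]\n")


# Demo with a generator standing in for a streaming parser.
demo = io.StringIO()
write_json_array(({"name": n} for n in ("foo", "bar", "baz")), demo)
print(demo.getvalue(), end="")
```

Writing the separator before each item rather than after it avoids a trailing comma, which JSON does not allow, and it needs no special handling for the last element.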