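Basically, I have a 6.4 GB XML file that I want to convert to JSON and then save to disk. I am running OS X 10.8.4 on an i7 2700k with 16 GB of RAM, using 64-bit Python (double-checked). I get an error saying there is not enough memory to allocate. How can I work around this?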
import json
import xmltodict

print 'Opening'
f = open('large.xml', 'r')
data = f.read()  # reads the whole 6.4 GB file into memory at once
f.close()
print 'Converting'
newJSON = xmltodict.parse(data)
print 'Json Dumping'
newJSON = json.dumps(newJSON)
print 'Saving'
f = open('newjson.json', 'w')
f.write(newJSON)
f.close()
Error:
Python(2461) malloc: *** mmap(size=140402048315392) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/Users/user/Git/Resources/largexml2json.py", line 10, in <module>
    data = f.read()
MemoryError
Answer 0 (score: 8)
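Many Python XML libraries support parsing XML sub-elements incrementally, for example xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. Functions like these are usually called "XML stream parsers".
The xmltodict library you are using also has a streaming mode. I think it may solve your problem.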
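As a rough sketch of what incremental parsing with xml.etree.ElementTree.iterparse can look like: this assumes, hypothetically, that the file is a flat list of repeated record elements whose children hold plain text. The names iter_records and xml_to_json_lines and the tag "item" are made up for illustration and are not part of any library.

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element, then free that element."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            record = dict(elem.attrib)
            # Assumption: each child element holds plain text only.
            for child in elem:
                record[child.tag] = (child.text or "").strip()
            yield record
            # Drop the element's children, text and attributes. The parent
            # still keeps an empty shell, so memory grows only slowly.
            elem.clear()


def xml_to_json_lines(source, out, record_tag):
    """Write one JSON document per line; return the number of records."""
    count = 0
    for record in iter_records(source, record_tag):
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    sample = io.BytesIO(b'<items><item id="1"><name>foo</name></item></items>')
    result = io.StringIO()
    xml_to_json_lines(sample, result, "item")
    print(result.getvalue(), end="")
```

For a real file you would pass open file objects instead of the in-memory buffers. Note that the output is one JSON object per line rather than a single JSON document, which may or may not suit your use case.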
Answer 1 (score: 2)
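Rather than trying to read the file in one go and then process it, you want to read it in chunks and process each chunk as it is loaded. This is a fairly common situation when handling large XML files, and it is covered by the Simple API for XML (SAX) standard, which specifies a callback API for parsing XML streams. It is part of the Python standard library: see xml.sax.parse, mentioned above, and xml.etree.ElementTree.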
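To illustrate the callback style that SAX uses, here is a minimal sketch with the standard library's xml.sax. It only writes out the attributes of each matching element, one JSON object per line. RecordHandler, sax_attributes_to_json_lines and the tag "myitem" are hypothetical names chosen for this example.

```python
import io
import json
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Write the attributes of every <record_tag> element as a JSON line."""

    def __init__(self, record_tag, out):
        super().__init__()
        self.record_tag = record_tag
        self.out = out
        self.count = 0

    def startElement(self, name, attrs):
        # The parser calls this once per opening tag. Nothing is retained
        # after the method returns, so memory use stays flat.
        if name == self.record_tag:
            self.out.write(json.dumps(dict(attrs.items())) + "\n")
            self.count += 1


def sax_attributes_to_json_lines(source, out, record_tag):
    """Stream-parse source (a file name or file object); return the count."""
    handler = RecordHandler(record_tag, out)
    xml.sax.parse(source, handler)
    return handler.count


if __name__ == "__main__":
    sample = io.BytesIO(b'<mylist><myitem name="foo"/><myitem name="bar"/></mylist>')
    result = io.StringIO()
    sax_attributes_to_json_lines(sample, result, "myitem")
    print(result.getvalue(), end="")
```

Capturing element text as well would need the characters and endElement callbacks, which is why the iterparse approach is often more convenient for conversions like this one.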
Here is a quick XML to JSON converter:
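from collections import defaultdict
import json
import sys
import xml.etree.ElementTree as ET


def parse_xml(file_name):
    events = ("start", "end")
    context = ET.iterparse(file_name, events=events)
    return pt(context)


def pt(context, cur_elem=None):
    # Attributes are already available when the "start" event fires.
    # Compare with None: an Element without children is falsy.
    attrs = dict(cur_elem.attrib) if cur_elem is not None else {}
    items = defaultdict(list)
    text = ""
    for action, elem in context:
        # print("{0:>6} : {1:20} {2:20} '{3}'".format(action, elem.tag, elem.attrib, str(elem.text).strip()))
        if action == "start":
            # The nested call consumes events up to this child's "end".
            items[elem.tag].append(pt(context, elem))
        elif action == "end":
            # This "end" closes cur_elem, so its text is complete now.
            text = elem.text.strip() if elem.text else ""
            elem.clear()  # release the parsed subtree
            break
    if not attrs and not items:
        return text
    # Text of an element that also has attributes or children is dropped.
    result = attrs
    result.update({k: v[0] if len(v) == 1 else v for k, v in items.items()})
    return result


if __name__ == "__main__":
    json_data = parse_xml("large.xml")
    print(json.dumps(json_data, indent=2))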
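Note that the converter above still holds the complete result in memory before json.dumps serialises it, which may be too much for a multi-gigabyte input. One possible workaround, sketched here under the assumption that the data can be consumed as a sequence of independent records, is to write the JSON array piece by piece. write_json_array is a hypothetical helper, not part of the json module.

```python
import io
import json


def write_json_array(records, out):
    """Write an iterable of JSON-serialisable records as one JSON array.

    Only one record is held in memory at a time, so records can be a
    generator that parses the XML lazily.
    """
    out.write("[")
    count = 0
    for record in records:
        if count:
            out.write(",")  # separator before every record except the first
        out.write("\n  " + json.dumps(record))
        count += 1
    out.write("\n]\n")
    return count


if __name__ == "__main__":
    buf = io.StringIO()
    write_json_array(({"id": i} for i in range(3)), buf)
    print(buf.getvalue(), end="")
```

The trade-off is that the output is no longer pretty-printed as a nested structure, but it remains a single valid JSON document.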
If you do a lot of XML processing, take a look at the lxml library. It offers many useful things beyond the standard modules and is also easier to use.
Answer 2 (score: 0)
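Here is a Python 3 script that uses xmltodict's streaming feature to convert XML of a particular structure to JSON. The script keeps very little in memory, so there is no limit on the size of the input. It makes a lot of assumptions, but it may get you started. Your mileage may vary; I hope it helps.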
#!/usr/bin/env python3
"""
Converts an XML file with a single outer list element
and a repeated list member element to JSON on stdout.
Processes large XML files with minimal memory using the
streaming feature of https://github.com/martinblech/xmltodict
which is required ("pip install xmltodict").

Expected input structure (element names are just examples):

    <mylist attr="a">
        <myitem name="foo"></myitem>
        <myitem name="bar"></myitem>
        <myitem name="baz"></myitem>
    </mylist>

Output:

    {
      "mylist": {
        "attr": "a",
        "myitem": [
          {
            "name": "foo"
          },
          {
            "name": "bar"
          },
          {
            "name": "baz"
          }
        ]
      }
    }
"""
import json
import os
import sys

import xmltodict

ROOT_SEEN = False


def handle_item(path, element):
    """
    Called by xmltodict on every item found at the specified depth.
    This requires a depth >= 2.
    """
    # print("path {} -> element: {}".format(path, element))
    global ROOT_SEEN
    if path is None and element is None:
        # after element n
        print(']')  # list of items
        print('}')  # outer list
        print('}')  # root
        return False
    elif ROOT_SEEN:
        # element 2..n
        print(",")
    else:
        # element 1
        ROOT_SEEN = True
        print('{')  # root
        # each path item is a tuple (name, OrderedDict)
        print('"{}"'.format(path[0][0]) + ': {')  # outer list
        # emit any root element attributes
        if path[0][1] is not None and len(path[0][1]) > 0:
            for key, value in path[0][1].items():
                # json.dumps quotes and escapes the key and the value
                print('{}: {},'.format(json.dumps(key), json.dumps(value)))
        # use the repeated element name for the JSON list
        print('"{}": ['.format(path[1][0]))  # list of items
    # Emit attributes and contents by merging the contents into
    # the ordered dict of attributes so the attr appear first.
    if path[1][1] is not None and len(path[1][1]) > 0:
        ordict = path[1][1]
        # element may be None if the item has no content of its own
        if element is not None:
            ordict.update(element)
    else:
        ordict = element
    print(json.dumps(ordict, indent=2))
    return True


def usage(args, err=None):
    """
    Emits a message and exits.
    """
    if err:
        print("{}: {}".format(args[0], err), file=sys.stderr)
    print("Usage: {} <xml-file-name>".format(args[0]), file=sys.stderr)
    sys.exit(1)  # non-zero status signals the usage error


if __name__ == '__main__':
    if len(sys.argv) != 2:
        usage(sys.argv)
    xmlfile = sys.argv[1]
    if not os.path.isfile(xmlfile):
        usage(sys.argv, 'Not found or not a file: {}'.format(xmlfile))
    with open(xmlfile, 'rb') as f:
        # Set item_depth to turn on the streaming feature
        # Do not prefix attribute keys with @
        xmltodict.parse(f, item_depth=2, attr_prefix='', item_callback=handle_item)
    # Close the JSON list, the outer list object and the root object
    handle_item(None, None)