At the MongoNYC 2013 conference, a speaker mentioned that they had used a copy of Wikipedia to test their full-text search. I tried to reproduce this myself, and I'm finding it non-trivial due to the file's size and formatting.
Here's what I'm doing:
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2
$ python
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('enwiki-latest-pages-articles.xml')
Killed
Python gets killed (out of memory) when I try to parse the file with the standard XML parser, since ET.parse builds the entire tree in memory. Does anyone have any other suggestions for how to convert a 9GB XML file into something JSON-like that I can load into MongoDB?
Update 1

Following Sean's suggestion below, I also tried the iterative element tree:
>>> import xml.etree.ElementTree as ET
>>> context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
>>> context = iter(context)
>>> event, root = context.next()
>>> for i in context[0:10]:
... print(i)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
>>> for event, elem in context[0:10]:
... if event == "end" and elem.tag == "record":
... print(elem)
... root.clear()
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
Again, no luck.
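For what it's worth, the slicing fails because iterparse returns a plain iterator, not a sequence, so it has no __getitem__; itertools.islice can take the first few events instead. A minimal sketch (same file as above):

import itertools
import xml.etree.ElementTree as ET

context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
# islice consumes only the first 10 (event, element) pairs, without loading the rest
for event, elem in itertools.islice(context, 10):
    print(event, elem.tag)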
Update 2

Following up on Asya Kamsky's suggestions below.
Here is the attempt with xml2json:
$ git clone https://github.com/hay/xml2json.git
$ ./xml2json/xml2json.py -t xml2json -o enwiki-latest-pages-articles.json enwiki-latest-pages-articles.xml
Traceback (most recent call last):
File "./xml2json/xml2json.py", line 199, in <module>
main()
File "./xml2json/xml2json.py", line 181, in main
input = open(arguments[0]).read()
MemoryError
Here is xmlutils:
$ pip install xmlutils
$ xml2json --input "enwiki-latest-pages-articles.xml" --output "enwiki-latest-pages-articles.json"
xml2sql by Kailash Nadh (http://nadh.in)
--help for help
Wrote to enwiki-latest-pages-articles.json
But the contents were just one record. It didn't iterate.
xmltodict also looked promising, since it advertises iterative Expat parsing and being good for Wikipedia. But it, too, ran out of memory after 20 minutes or so:
>>> import xmltodict
>>> f = open('enwiki-latest-pages-articles.xml')
>>> doc = xmltodict.parse(f)
Killed
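In hindsight, a bare xmltodict.parse() buffers the whole document; the library's streaming mode, driven by item_depth and item_callback, hands you one subtree at a time instead. A sketch, assuming the <page> elements sit at depth 2 under the <mediawiki> root:

import xmltodict

def handle_page(path, page):
    # 'page' is a dict for one <page> subtree; process it, then let it be discarded
    print(page.get('title'))
    return True  # returning True tells xmltodict to keep streaming

with open('enwiki-latest-pages-articles.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, item_callback=handle_page)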
Update 3

This is in response to Ross's answer below, modeling my parser on the link he mentions:
from lxml import etree

file = 'enwiki-latest-pages-articles.xml'

def page_handler(page):
    try:
        print page.get('title', '').encode('utf-8')
    except:
        print page
        print "error"

class page_handler(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True if tag == 'title' else False
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text.append(data.encode('utf-8'))
    def close(self):
        return self.text

def fast_iter(context, func):
    for event, elem in context:
        print(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

process_element = etree.XMLParser(target=page_handler())

context = etree.iterparse(file, tag='item')
fast_iter(context, process_element)
The error was:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in fast_iter
File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112653)
File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113223)
File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83186)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 22, column 1
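One note on the code above: it mixes lxml's two parsing interfaces, since the target parser (process_element) is passed into fast_iter but never invoked by it. The target interface can also be used on its own, without iterparse. A rough sketch under that assumption (TitleCollector is a hypothetical stand-in for page_handler above):

from lxml import etree

class TitleCollector(object):
    # hypothetical target-parser handler that accumulates page titles
    def __init__(self):
        self.titles = []
        self.in_title = False
    def start(self, tag, attrib):
        # tags arrive namespace-qualified, e.g. '{...}title'
        self.in_title = tag.endswith('title')
    def end(self, tag):
        self.in_title = False
    def data(self, data):
        if self.in_title:
            self.titles.append(data)
    def close(self):
        return self.titles

# with a target set, parse() returns whatever the target's close() returns
parser = etree.XMLParser(target=TitleCollector())
titles = etree.parse('enwiki-latest-pages-articles.xml', parser)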
Answer 0 (score: 1)
You need to iterate with iterparse instead of loading the entire file into memory. As for how to convert to JSON, or even to a Python object to store in the db, see: https://github.com/knadh/xmlutils.py/blob/master/xmlutils/xml2json.py
An example of using iterparse while keeping the memory footprint low:

Try a variant of Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.
from lxml import etree

def fast_iter(context, func):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        print(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)
Daly's article is really excellent, especially if you are processing large XML files.
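Putting the pieces together, here is a minimal sketch of streaming the dump into MongoDB with iterparse plus pymongo. The namespace URI and the database/collection names are assumptions, not part of the original answer; check the xmlns on the dump's <mediawiki> root element before relying on it:

from lxml import etree
from pymongo import MongoClient

# assumed namespace for 2013-era dumps; verify against the actual file
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

def iter_pages(path):
    context = etree.iterparse(path, events=('end',), tag=NS + 'page')
    for event, elem in context:
        yield {
            'title': elem.findtext(NS + 'title'),
            'text': elem.findtext('{0}revision/{0}text'.format(NS)),
        }
        # same cleanup as fast_iter: drop the element and its older siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

coll = MongoClient().wiki.pages  # hypothetical database/collection names
for doc in iter_pages('enwiki-latest-pages-articles.xml'):
    coll.insert_one(doc)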
Answer 1 (score: 1)