I have a file named Books.xml. It is huge (2 GB), and its structure is similar to this:
<Books>
<Book>
<Detail ID="67">
<BookName>Code Complete 2</BookName>
<Author>Steve McConnell</Author>
<Pages>960</Pages>
<ISBN>0735619670</ISBN>
<BookName>Application Architecture Guide 2</BookName>
<Author>Microsoft Team</Author>
<Pages>496</Pages>
<ISBN>073562710X</ISBN>
</Detail>
</Book>
<Book>
<Detail ID="87">
<BookName>Rocking Python</BookName>
<Author>Guido Rossum</Author>
<Pages>960</Pages>
<ISBN>0735619690</ISBN>
<BookName>Python Rocks</BookName>
<Author>Microsoft Team</Author>
<Pages>496</Pages>
<ISBN>073562710X</ISBN>
</Detail>
</Book>
</Books>
I tried to split it on the Book tag like this:
import xml.etree.cElementTree as etree
filename = r'D:\test\Books.xml'
context = iter(etree.iterparse(filename, events=('start', 'end')))
_, root = next(context)
for event, elem in context:
    if event == 'start' and elem.tag == 'Book':
        print(etree.dump(elem))
        root.clear()
I got this result:
<Book>
<Detail ID="67">
<BookName>Code Complete 2</BookName>
<Author>Steve McConnell</Author>
<Pages>960</Pages>
<ISBN>0735619670</ISBN>
<BookName>Application Architecture Guide 2</BookName>
<Author>Microsoft Team</Author>
<Pages>496</Pages>
<ISBN>073562710X</ISBN>
</Detail>
</Book>
None
<Book>
<Detail ID="87">
<BookName>Rocking Python</BookName>
<Author>Guido Rossum</Author>
<Pages>960</Pages>
<ISBN>0735619690</ISBN>
<BookName>Python Rocks</BookName>
<Author>Microsoft Team</Author>
<Pages>496</Pages>
<ISBN>073562710X</ISBN>
</Detail>
</Book>
None
Answer 0 (score: 0)
Here is how to use celery for inter-process queuing and lxml to manipulate, serialize, and pretty-print the given XML:
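# tasks.py file
from lxml import etree
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def print_book(book_xml):
    # Rebuild the <Book> element from the serialized XML it was sent as.
    book = etree.fromstring(book_xml)
    # do something interesting ...
    # encoding='unicode' returns str instead of bytes, so print shows readable XML.
    print(etree.tostring(book, pretty_print=True, encoding='unicode'))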
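# caller.py file
from tasks import print_book
from lxml import etree

# With tag="Book", lxml's iterparse yields each <Book> on its 'end' event, fully parsed.
for _, book in etree.iterparse('Books.xml', tag="Book"):
    # Serialize to str so the task argument is JSON-serializable.
    book_xml = etree.tostring(book, encoding='unicode')
    print_book.delay(book_xml)
    # Drop the element's contents so memory stays bounded on a huge file.
    book.clear()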
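
A note on the `None` lines in the question's output: `etree.dump(elem)` writes the element to stdout itself and returns `None`, so wrapping it in `print()` prints an extra `None` after each book. Also, on a `'start'` event the element is not guaranteed to be fully parsed yet; waiting for `'end'` is safer. If you don't need celery, here is a minimal sketch of the same split using only the standard library. The helper name `iter_books` and the small in-memory sample are my own assumptions for illustration; swap in your real file path.

```python
import io
import xml.etree.ElementTree as ET

def iter_books(source):
    """Yield each <Book> element as a serialized string, one at a time."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the 'start' of the root <Books>
    for event, elem in context:
        # Only on 'end' is the element guaranteed to be fully parsed.
        if event == 'end' and elem.tag == 'Book':
            yield ET.tostring(elem, encoding='unicode')
            # Drop already-processed children so memory stays bounded.
            root.clear()

# Small in-memory sample standing in for the real 2 GB file (hypothetical data).
sample = io.BytesIO(
    b'<Books>'
    b'<Book><Detail ID="67"><BookName>Code Complete 2</BookName></Detail></Book>'
    b'<Book><Detail ID="87"><BookName>Rocking Python</BookName></Detail></Book>'
    b'</Books>'
)

books = list(iter_books(sample))
for book_xml in books:
    print(book_xml)
```

For the real file you would pass the path instead, e.g. `iter_books(r'D:\test\Books.xml')`, and write or hand off each string inside the loop rather than collecting them all in a list.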