Split and pretty-print XML to store in a list

Time: 2014-03-07 10:10:45

Tags: python lxml

I have a file named Books.xml. It is huge (2 GB), and its structure is similar to this:

<Books>
    <Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
            <Author>Steve McConnell</Author>
            <Pages>960</Pages>
            <ISBN>0735619670</ISBN>        
            <BookName>Application Architecture Guide 2</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
    <Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
            <Author>Guido Rossum</Author>
            <Pages>960</Pages>
            <ISBN>0735619690</ISBN>
            <BookName>Python Rocks</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
</Books>

I am trying to split it on the Book tag like this:

import xml.etree.cElementTree as etree
filename = r'D:\test\Books.xml'
context = iter(etree.iterparse(filename, events=('start', 'end')))
_, root = next(context)
for event, elem in context:
    if event == 'start' and elem.tag == 'Book':
        print(etree.dump(elem))
        root.clear()

I get this result:

<Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
            <Author>Steve McConnell</Author>
            <Pages>960</Pages>
            <ISBN>0735619670</ISBN>
            <BookName>Application Architecture Guide 2</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>

None
<Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
            <Author>Guido Rossum</Author>
            <Pages>960</Pages>
            <ISBN>0735619690</ISBN>
            <BookName>Python Rocks</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
None
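  1. How do I get rid of the None?
  2. I want to store the split-out Book fragments in some kind of queue and then let another program dequeue them.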

1 answer:

Answer 0 (score: 0)

Here is how to use celery for inter-process queueing and lxml for manipulating, serializing, and pretty-printing the given XML:

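# tasks.py file
from celery import Celery
from lxml import etree

app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def print_book(book_xml):
    # Rebuild the element from the serialized XML sent by the caller.
    book = etree.fromstring(book_xml)
    # do something interesting ...
    # encoding='unicode' makes tostring() return str, so print() shows
    # readable text instead of a bytes repr on Python 3.
    print(etree.tostring(book, pretty_print=True, encoding='unicode'))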

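# caller.py file
from lxml import etree
from tasks import print_book

# With tag="Book" and the default 'end' event, each Book element is
# complete by the time it is yielded.
for _, book in etree.iterparse('Books.xml', tag="Book"):
    # Serialize to str so the task argument is plain text.
    book_xml = etree.tostring(book, encoding='unicode')
    print_book.delay(book_xml)
    # Release the element's content to keep memory usage down on a large file.
    book.clear()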
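
On the first point of the question: etree.dump() writes the element to stdout and returns None, so wrapping it in print() prints that None as well. Below is a minimal standard-library sketch that collects each Book as a pretty-printed string in a list instead. It acts on 'end' events, since an element is only guaranteed to be complete at that point. The SAMPLE data and the split_books helper are made up for illustration, and ET.indent needs Python 3.9 or later.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for Books.xml, same shape as the file in the question.
SAMPLE = b"""<Books>
    <Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
        </Detail>
    </Book>
    <Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
        </Detail>
    </Book>
</Books>"""


def split_books(source, tag="Book"):
    """Return one pretty-printed string per <Book> element in source."""
    fragments = []
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the 'start' of the root
    for event, elem in context:
        # Only at 'end' is the element guaranteed to be fully parsed.
        if event == "end" and elem.tag == tag:
            elem.tail = None        # drop whitespace that follows </Book>
            ET.indent(elem, space="  ")  # re-indent in place (Python 3.9+)
            # tostring() returns the markup; dump() would only write it to
            # stdout and return None.
            fragments.append(ET.tostring(elem, encoding="unicode"))
            root.clear()            # detach already-processed children
    return fragments


books = split_books(io.BytesIO(SAMPLE))
print(len(books))
print(books[0])
```

For the real file, pass the path from the question instead of the BytesIO object. With a 2 GB input, handing each string to a queue as it is produced, rather than keeping them all in one list, should use far less memory.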